Boring setup: Suppose we have a probability mass function which we choose to represent as a vector $\vec{p} = \left( p_0, p_1, p_2, \cdots, p_n \right)$. Notice that from any non-zero vector $\vec{r}$ we can create a valid probability mass function by squaring the entries of the vector and then normalizing, i.e.

$$p(\vec{r}) = \frac{\vec{r}^2}{\vec{r}^T\vec{r}}$$

We're also going to want the gradient, so we calculate it quickly as follows*:

$$\frac{\partial p(\vec{r})_j}{\partial r_k} = \frac{(\vec{r}^T\vec{r}) \cdot 2 r_k \cdot \delta_k(j) - r_j^2 \cdot 2 r_k}{(\vec{r}^T\vec{r})^2}$$

which, when simplified and assembled into a vector-by-vector derivative, becomes**:

$$\frac{\partial \vec{p}(\vec{r})}{\partial \vec{r}} = \frac{2}{(\vec{r}^T\vec{r})} \cdot \left(\Lambda \left(\vec{r}\right) - \frac{\vec{r}^2 \otimes \vec{r}}{(\vec{r}^T\vec{r})}\right)$$
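To keep ourselves honest, here's a quick numeric sanity check of that Jacobian against central finite differences (a throwaway sketch assuming numpy; the helper names are mine):

```python
import numpy as np

def p_of_r(r):
    """Squared-and-normalized parametrization: p_j = r_j^2 / (r^T r)."""
    return r**2 / (r @ r)

def jacobian_p(r):
    """Closed-form Jacobian (2 / r^T r) * (diag(r) - p outer r); entry (j, k) = dp_j / dr_k."""
    rr = r @ r
    return (2.0 / rr) * (np.diag(r) - np.outer(r**2 / rr, r))

rng = np.random.default_rng(0)
r = rng.normal(size=5)
eps = 1e-6

# Central finite differences; column k holds dp / dr_k.
fd = np.stack(
    [(p_of_r(r + eps * e) - p_of_r(r - eps * e)) / (2 * eps) for e in np.eye(5)],
    axis=1,
)
print(np.allclose(fd, jacobian_p(r), atol=1e-6))  # True
```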
Loss functions: When performing log-likelihood maximization we take a collection of observed events and try to maximize the sum of the log probabilities we assigned to them. If our guess for the probabilities is $\hat{p}$ and the true distribution is $p$, then since each outcome shows up in the sample in proportion to its true probability*** the expected per-sample log-likelihood is $\mathbb{E}\left[\log \left( \hat{p} \right)\right] = \sum p_k \log \left( \hat{p}_k \right)$.
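To make that concrete, here's a throwaway sketch (numpy assumed, the distributions are made up) showing that sampling from the true $p$ and averaging the log of our guess lands on the same number as the closed-form sum:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])        # true distribution
p_hat = np.array([0.4, 0.4, 0.2])    # our (wrong) guess

# Draw events from the true distribution, then score each one with our guessed log probability.
samples = rng.choice(len(p), size=200_000, p=p)
monte_carlo = np.log(p_hat)[samples].mean()

closed_form = np.sum(p * np.log(p_hat))
print(monte_carlo, closed_form)  # agree to roughly three decimal places
```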
Log-likelihood techniques try to maximize this expectation, and randomly sampling data and adjusting works because the global maximum of $\mathbb{E}\left[\log \left( \hat{p} \right)\right]$ is located at $\hat{p} = p$. This presents the obvious question: are there other functions $S$, where $S_k(\hat{p})$ is the loss we incur when outcome $k$ is observed so that $\mathbb{E}\left[S \left( \hat{p} \right)\right] = \sum p_k S_k(\hat{p})$, with the property $\underset{\hat{p}}{\arg\min}\, \mathbb{E}\left[S \left( \hat{p} \right)\right] = p$? (We switch from maximizing to minimizing here since $S$ plays the role of a loss; flip the sign if you prefer.)
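As a sanity check that the log case really does peak at $\hat{p} = p$, here's a crude gradient ascent through the $\vec{r}$ parametrization from the setup above (a sketch only; numpy assumed, step size and iteration count picked arbitrarily):

```python
import numpy as np

def p_of_r(r):
    return r**2 / (r @ r)

p = np.array([0.5, 0.3, 0.2])   # true distribution
r = np.ones(3)                  # start the guess at uniform

# Gradient ascent on E[log p_hat] = sum_k p_k * log(p_of_r(r)_k).
for _ in range(2000):
    rr = r @ r
    J = (2.0 / rr) * (np.diag(r) - np.outer(p_of_r(r), r))  # J[j, m] = dp_j / dr_m
    grad = (p / p_of_r(r)) @ J                               # chain rule: dE / dr
    r = r + 0.1 * grad

print(p_of_r(r))  # ≈ [0.5, 0.3, 0.2], i.e. the guess converges to the true p
```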
Luckily we're well equipped to handle this problem if we have at least a 3rd grade education and once sniffed an algebra textbook. Since we want to find minima we'll need to find some gradients, and to avoid the ugly edge cases that probability distributions have we're going to work in $\mathbb{R}^{n+1}$ via the $\vec{r}$ transformation we discussed earlier. Our criterion (stationarity at $\hat{p} = p$) is:

$$\frac{d \mathbb{E}\left[S \left( \hat{p}(\vec{r}) \right)\right]}{d \vec{r}} = \vec{0}$$

Now we can do our classically brain-dead expansions, with the crucial, subtle note that the $p$ weighting the expectation is fixed; we are just evaluating at the point where $\hat{p}(\vec{r}) = p$****:

$$\vec{0} = \frac{d}{d \vec{r}} \sum p_k S_k \left( p(\vec{r}) \right) = \sum p_k \frac{d S_k \left( p(\vec{r}) \right)}{d \vec{r}} = \sum p(\vec{r})_k \frac{d S_k \left( p(\vec{r}) \right)}{d \vec{r}} = \frac{d S\left( p(\vec{r}) \right)}{d \vec{r}}^T p(\vec{r})$$

where swapping $p_k$ for $p(\vec{r})_k$ is legal precisely because we are evaluating at the point where $p(\vec{r}) = p$. Next we apply the chain rule to the $\frac{d S\left( p(\vec{r}) \right)}{d \vec{r}}$ term to get:

$$\frac{d S\left( p(\vec{r}) \right)}{d \vec{r}} = \frac{d S\left( p \right)}{d p}\frac{d p(\vec{r})}{d \vec{r}}$$

Plugging in our earlier gradient we then get:

$$\frac{d S\left( p(\vec{r}) \right)}{d \vec{r}} = \frac{d S\left( p \right)}{d p}\frac{2}{(\vec{r}^T\vec{r})} \cdot \left(\Lambda \left(\vec{r}\right) - \frac{\vec{r}^2 \otimes \vec{r}}{(\vec{r}^T\vec{r})}\right)$$

We then drop the constant factor, note that $\frac{\vec{r}^2}{\vec{r}^T\vec{r}} = p$ at our evaluation point, and plug back into the original constraint to get:

$$p^T\frac{d S\left( p \right)}{d p}\left(\Lambda \left(\vec{r}\right) - p \otimes \vec{r}\right)=\vec{0}$$

Now: a diagonal matrix with non-zero diagonal entries has full rank (true here as long as no $r_k$ is zero, i.e. $p$ has no zero entries), subtracting a rank-1 outer product can drop that rank by at most one, and $\left(\Lambda \left(\vec{r}\right) - p \otimes \vec{r}\right)$ does kill constant vectors on the left, since $\vec{1}^T\Lambda \left(\vec{r}\right) = \vec{r}^T = \vec{1}^T\left(p \otimes \vec{r}\right)$. So constant vectors are the only row vectors that can multiply it to reach the zero condition, meaning the criterion is satisfied for general probability vectors iff:

$$\exists \lambda \in \mathbb{R}\text{ s.t. }p^T\frac{d S\left( p \right)}{d p} = \lambda \cdot \vec{1}$$

The question now becomes: what non-trivial solutions exist? $\frac{d S\left( p \right)}{d p}$ need not be diagonal, meaning we have a surprising amount of flexibility. Furthermore, if we examine the $\lambda=0$ case, each $S_k$ is roughly independent, granting a lot of wiggle room. A lazy solution to the PDE gives us the following, though:

$$S_k = f\left(\frac{p_0}{p_k}, \frac{p_1}{p_k}, \cdots, \frac{p_n}{p_k}\right) \quad \forall f \text{ s.t. }f\text{ is everywhere differentiable}$$

I am sorry for this, because it includes a vast variety of the most hideous loss functions you could imagine. For example, $S_k = \sum_{j=0}^n \sin\left(\frac{p_j}{p_k}\right)$ is a loss function (albeit one that only converges in a very VERY! small region). I'm going to close out this post here; I genuinely hate that this worked. Obviously the two extensions here are making this work for continuous PDFs and finding less awful loss functions.
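For what it's worth, the criterion itself is easy to poke at numerically. Here's a sketch (numpy assumed, finite differences standing in for the analytic Jacobian) checking it for the familiar log score $S_k(\hat{p}) = \log \hat{p}_k$, which lands on $\lambda = 1$:

```python
import numpy as np

# Numerically check p^T (dS/dp) = lambda * 1 for the log score S_k(p_hat) = log(p_hat_k).
p = np.array([0.5, 0.3, 0.2])

def S(p_hat):
    return np.log(p_hat)

# Finite-difference Jacobian, A[k, j] = dS_k / dp_j.
eps = 1e-6
A = np.stack([(S(p + eps * e) - S(p - eps * e)) / (2 * eps) for e in np.eye(len(p))], axis=1)

print(p @ A)  # ≈ [1. 1. 1.], a constant vector, so the criterion holds with lambda = 1
```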
Also $S_k = \frac{1}{p_k}$ is a loss???? (Take $f$ to be the sum of its arguments: $\sum_j \frac{p_j}{p_k} = \frac{1}{p_k}$.)
*Here $\delta_k(j)$ is the Kronecker delta: a function equal to $1$ when $k=j$ and $0$ otherwise.
**Here $\Lambda \left(\vec{r}\right)$ is used to denote a diagonal matrix constructed from the vector $\vec{r}$.
***This isn't a controversial statement, don't overthink it
****Basically, just convince yourself that the $p$ out front is a fixed vector while $p(\vec{r})$ is a function of $\vec{r}$; only the latter gets differentiated.