Practical PDFs for optimization

Having explored the landscape, I've settled on what is likely the most efficient way to do this. Via Hilbert and co. there was a lot of old-timey exploration into polynomials that are positive everywhere. A polynomial of one variable is nonnegative everywhere iff there exists a rank-2 symmetric positive semidefinite matrix $M$ such that, for a Vandermonde-type vector $z=\left[1,x,x^2,x^3,\dots, x^n\right]$, the quadratic form $z^TMz$ is our polynomial of degree $2n$. We could go about working out what the coefficients would be in a basis of degree $2n$, but that would cost $2n$ per evaluation, compared to two evaluations of cost $n$ followed by squaring. It would also require larger basis matrices to be stored and processed, and could introduce slight numerical error, a terrible thing when working with logs, as is common in this context. This is not to say it would be bad if a smart person were to tackle it, but it is a chewy problem which requires a good deal more thought than I am able to justify. There is also the idea that maybe $R$ could be chosen in such a way that it would force positive definiteness under certain conditions. Alas, I don't desire this.
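As a toy instance of that Gram-matrix form (my own example, not from the discussion above): take $z=\left[1,x\right]^T$ and$$x^2 - 2x + 2 = \begin{bmatrix}1 & x\end{bmatrix}\begin{bmatrix}2 & -1 \\ -1 & 1\end{bmatrix}\begin{bmatrix}1 \\ x\end{bmatrix} = \left(x-1\right)^2 + 1^2,$$where factoring $M = L^T L$ with $L = \begin{bmatrix}-1 & 1\\ 1 & 0\end{bmatrix}$ exhibits the polynomial directly as a sum of two squares, which is exactly the structure the squared-basis form below exploits.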

Anyway, long story short, we have landed on the following definition as the one that is easiest for the computer to handle in the context of optimization problems:$$ P\left(x|C,\lambda\right) = \left( B\left(x\right)^T C\right )^2 \begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix} \text{ for } \lambda \in \left[0,1\right], C \in S^n \times S^n$$Easy enough, huh? Derivatives play nicely as well (here $D$ is the differentiation matrix acting on the coefficients, and $*$ is the elementwise product):$$\frac{d P\left(x|C,\lambda\right)}{d x} = 2 \left( B\left(x\right)^T C * B\left(x\right)^T D C\right )\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$Given the Leibniz rule applied to a square,$$\frac{d^n}{dx^n} \left(f(x)\right)^2 = \sum_{k=0}^{n} \binom{n}{k} \frac{d^k f(x)}{dx^k} \cdot \frac{d^{n-k} f(x)}{dx^{n-k}},$$we can then say:$$\frac{d^n}{dx^n} P\left(x|C,\lambda\right) = \left(\sum_{k=0}^{n} \binom{n}{k} B\left(x\right)^T D^{k} C * B\left(x\right)^T D^{n-k} C\right)\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$(the $n=1$ case recovers the factor of $2$ above, since the $k=0$ and $k=1$ terms coincide). In practice, when taking logs over $N$ data points it is useful to exploit the log1p function: $\sum_i \text{log1p}\!\left(\text{pdf}_i/\epsilon\right) + N\log\epsilon = \sum_i \log\left(\text{pdf}_i+\epsilon\right)$, which protects you from very small negative values due to numerical error.
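To make the bookkeeping concrete, here is a minimal NumPy sketch. It assumes a plain monomial basis for $B$ and the matching coefficient-space differentiation matrix $D$; the function names (basis, diff_matrix, pdf_and_derivs, safe_log_pdf), the basis choice, and the toy coefficients are mine, not from the post, and any normalization constraint on $C$ is ignored.

```python
import numpy as np
from math import comb

def basis(x, n):
    """B(x): m x n matrix of the first n monomials 1, x, ..., x^(n-1) at the points x."""
    return np.vander(np.atleast_1d(x), n, increasing=True)

def diff_matrix(n):
    """D: n x n matrix mapping monomial coefficients to derivative coefficients."""
    D = np.zeros((n, n))
    D[np.arange(n - 1), np.arange(1, n)] = np.arange(1, n)  # d/dx x^k = k x^(k-1)
    return D

def pdf_and_derivs(x, C, lam, n_derivs=2):
    """Return [P, P', ..., P^(n_derivs)] evaluated at the points x."""
    n = C.shape[0]
    B, D = basis(x, n), diff_matrix(n)
    w = np.array([lam, 1.0 - lam])
    # G[k] holds B D^k C: the k-th derivative values of the two polynomials (m x 2).
    G = [B @ np.linalg.matrix_power(D, k) @ C for k in range(n_derivs + 1)]
    out = [(G[0] ** 2) @ w]  # P itself: squared polynomial values, mixed by lambda
    for d in range(1, n_derivs + 1):
        # Leibniz rule for (f^2)^(d) = sum_k C(d,k) f^(k) f^(d-k), applied columnwise.
        out.append(sum(comb(d, k) * G[k] * G[d - k] for k in range(d + 1)) @ w)
    return out

def safe_log_pdf(p, eps=1e-16):
    """log(P + eps) computed as log1p(P/eps) + log(eps); guards tiny negative round-off."""
    return np.log1p(p / eps) + np.log(eps)

# Toy usage: 5 coefficients per column, lambda = 0.3, evaluated at 7 points.
C = np.random.default_rng(0).normal(size=(5, 2))
P, dP, d2P = pdf_and_derivs(np.linspace(-1.0, 1.0, 7), C, 0.3)
```

The precomputed list of $B D^k C$ values is what keeps each additional derivative order down to one binomial-weighted elementwise pass rather than fresh matrix products.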

If we really cared about performance, we could also be a little sneakier and look at $B$. You'll notice that $B$ is very ill-conditioned if the evaluation points $x$ are spread over a small region. This is a feature and not a bug! Taking the SVD of $B$ and dropping small singular values, we can derive
$$B^{m \times n} \approx U^{m \times k}\, \Sigma^{k \times k}\, \left(V^{n \times k}\right)^{T}$$While our small singular values are very small, our large ones are very well behaved, and the error from doing all this is rather low. For example, over the interval $[-1.5,1.5]$, the relative error norm for $k=40$ when $n$ is $200$ is $\sim 1.5 \times 10^{-13}$. Since differentiation is all done spectrally in the coefficients, the error does not grow with differentiation.
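For a rough feel of the truncation, here is a small sketch with illustrative sizes and an assumed monomial basis, so the printed numbers need not match the $\sim 1.5 \times 10^{-13}$ figure exactly:

```python
import numpy as np

m, n, k = 400, 200, 40
x = np.linspace(-1.5, 1.5, m)
B = np.vander(x, n, increasing=True)        # assumed basis matrix over [-1.5, 1.5]

U, s, Vt = np.linalg.svd(B, full_matrices=False)
B_k = (U[:, :k] * s[:k]) @ Vt[:k, :]        # keep only the k largest singular values

print("relative error:", np.linalg.norm(B - B_k) / np.linalg.norm(B))
print("sigma_k / sigma_0:", s[k] / s[0])
```

The singular value profile doubles as the rescaling diagnostic in the next paragraph: if the tail of s does not drop off sharply, $B$ has no effective null space to exploit.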

This is also a good guideline for deciding whether your data needs to be rescaled: if $B$ has no (numerical) null space, you are unlikely to be able to fit your data well.
