March thoughts

Let's start by taking any system of ordinary differential equations. We can of course convert this to a first-order system by creating stand-in variables for derivatives (e.g. if $y'' = y$ we can create the equivalent system $y' = w$ and $w' = y$).
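
As a tiny illustration of that conversion (a minimal sketch using scipy's solve_ivp; the initial conditions are just for the example):

import numpy as np
from scipy.integrate import solve_ivp

# y'' = y rewritten as the first-order system y' = w, w' = y
def rhs(t, state):
    y, w = state
    return [w, y]

# with y(0) = 1, y'(0) = -1 the exact solution is e^{-t}
sol = solve_ivp(rhs, (0.0, 1.0), [1.0, -1.0])
print(sol.y[0, -1], np.exp(-1))   # should roughly agree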

We define a full model:
A full model for a system should exactly determine the evolution of the system based on its current state.
This means that unique solutions exist, which seems fair to us. It also means that when we take a solution curve in the phase plane, it must not intersect another curve except at a limit point.



We ask that our derivatives all be finite. It is ok if our paths are jagged, but we cannot jump between points in our system. This is the most controversial thing I'll ask to be assumed. There are many reasons why you might not want this to be the case, but all the ones I can think of rely on us being at the molecular level.

We can say another thing. There is energy associated with our system, and it is lost over time. This is more controversial, but only to a small subset of schizophrenics. For the rest of us, we can happily claim that there are no periodic orbits in our phase planes.




If we aim to create a model that fully describes the evolution of a system, we require:
  1. Bounded variation: $ \|y'\| < \infty $
  2. Unique solutions that exist: Picard–Lindelöf theorem satisfied
  3. Non time-dependent: phase-portrait stationarity
  4. No perpetual motion: aperiodic
2 and 3 are interesting together. Since solutions are unique, the phase portraits have no bifurcations. That is, the level sets are a mille-feuille-style set of stacked, deformed $\mathbb{R}^{d-1}$ planes.

This is a very strong condition, and is aided by the phase lines being continuous. Small perturbations in one place create small perturbations everywhere (in epsilon-delta style).

Limited hidden variables
Full hidden variables
Constraints on those
Neural Network Analogues

Is this loss? All possible loss functions for probability estimation

Boring setup: Suppose we have a probability mass function which we choose to represent as a vector $\vec{p} = \{ p_0, p_1, p_2, \cdots, p_n \}$. We see that for any vector $\vec{r}$ we can create a valid probability mass function by squaring the values of the vector and then normalizing the result, i.e.$$p(r) = \frac{\vec{r}^2}{\vec{r}^T\vec{r}}$$We're also going to want the gradient, so we calculate it quickly as follows*:$$\frac{\partial p(r)_j}{\partial r_k} = \frac{(\vec{r}^T\vec{r}) \cdot 2 \cdot r_k \cdot \delta_k(j) - r_j^2 \cdot 2 \cdot r_k}{(\vec{r}^T\vec{r})^2}$$which, when simplified and extended to a vector-by-vector operation, becomes**:$$\frac{\partial \vec{p}(r)}{\partial \vec{r}} = \frac{2}{(\vec{r}^T\vec{r})} \cdot \left(\Lambda \left(\vec{r}\right) - \frac{\vec{r}^2 \otimes \vec{r}}{(\vec{r}^T\vec{r})}\right)$$
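
A quick numerical sanity check of the transform and its gradient (a sketch of my own; the helper names p_of_r and jacobian_p are just for illustration):

import numpy as np

def p_of_r(r):
    # p = r^2 / (r^T r): elementwise square, then normalize
    return r**2 / (r @ r)

def jacobian_p(r):
    # dp/dr = (2/(r^T r)) * (diag(r) - outer(r^2, r)/(r^T r))
    rtr = r @ r
    return (2.0 / rtr) * (np.diag(r) - np.outer(r**2, r) / rtr)

r = np.array([0.3, -1.2, 0.7, 2.0])
J = jacobian_p(r)

# compare against central finite differences
eps = 1e-6
J_fd = np.zeros_like(J)
for k in range(len(r)):
    dr = np.zeros_like(r); dr[k] = eps
    J_fd[:, k] = (p_of_r(r + dr) - p_of_r(r - dr)) / (2 * eps)
print(np.max(np.abs(J - J_fd)))   # should be tiny (~1e-9 or below)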


Loss functions: When performing log-likelihood maximization we take a group of events and try to maximize the sum of what we think their log probabilities are. If our guess for the probabilities is $\hat{p}$ and the true distribution is $p$, then since data is represented in the sample proportionally to its probability*** we can calculate the expectation of our log-likelihood as $\mathbb{E}\left[\log \left( \hat{p} \right)\right] = \sum p_k \log \left( \hat{p}_k \right)$.

Log-likelihood techniques try to maximize this expectation, and randomly sampling data and adjusting works because the global maximum of $\mathbb{E}\left[\log \left( \hat{p} \right)\right]$ is located at $\hat{p} = p$. This presents the obvious question: are there other functions $S$ that have the property $\underset{\hat{p}}{\arg\min}\, \mathbb{E}\left[S \left( \hat{p} \right)\right] = p$ (minimizing now, since we are thinking of $S$ as a loss)?

Luckily we're well equipped to handle this problem if we have at least a 3rd grade education and once sniffed an algebra textbook. Since we want to find minima we'll need some gradients, and to avoid the ugly edge cases that probability distributions have we're going to work in $\mathbb{R}^n$ via the $\vec{r}$ transformation we discussed earlier. Our criterion is:$$\frac{d \mathbb{E}\left[S \left( \hat{p}(\vec{r}) \right)\right]}{d \vec{r}} = \vec{0}$$Now we can do our classically brain-dead expansions, with the crucial, subtle note that the $p$ in the expectation calculation is fixed; we are just evaluating at the point where $\hat{p}=p$****:$$\vec{0} = \frac{d}{d \vec{r}} \sum p_k S_k \left( p(r) \right) = \sum p(r)_k \frac{d S_k \left( p(r) \right)}{d \vec{r}} = \frac{d S\left( p(r) \right)}{d \vec{r}}^Tp(r)$$Next we apply the chain rule to the $\frac{d S\left( p(r) \right)}{d \vec{r}}$ term to get:$$\frac{d S\left( p(r) \right)}{d \vec{r}} = \frac{d S\left( p \right)}{d p}\frac{d p(r)}{d r}$$Plugging in our earlier gradient we then get:$$\frac{d S\left( p(r) \right)}{d \vec{r}} = \frac{d S\left( p \right)}{d p}\frac{2}{(\vec{r}^T\vec{r})} \cdot \left(\Lambda \left(\vec{r}\right) - \frac{\vec{r}^2 \otimes \vec{r}}{(\vec{r}^T\vec{r})}\right)$$We then cancel constant factors and plug back into the original constraint to get:$$p^T\frac{d S\left( p \right)}{d p}\left(\Lambda \left(\vec{r}\right) - p \otimes \vec{r}\right)=\vec{0}$$

Now, since any diagonal matrix with non-zero and non-equal entries has full rank, and since outer products have rank 1, and since $\left(\Lambda \left(\vec{r}\right) - p \otimes \vec{r}\right)$ has a left null-space that admits constant vectors, we see that only constant vectors can multiply it to reach the zero condition, meaning the criterion is satisfied for general probability vectors iff:$$\exists \lambda \in \mathbb{C}\text{ s.t. }p^T\frac{d S\left( p \right)}{d p} = \lambda \cdot \vec{1}$$

The question now becomes: what non-trivial solutions exist? $\frac{d S\left( p \right)}{d p}$ need not be diagonal, meaning we have a surprising amount of flexibility. Furthermore, if we examine the $\lambda=0$ case, each $S_k$ is roughly independent, granting a lot of wiggle room. A lazy solution to the PDE gives us the following though:$$S_k = f\left(\frac{p_0}{p_k}, \frac{p_1}{p_k}, \cdots, \frac{p_n}{p_k}\right) \text{ for any } f \text{ that is everywhere differentiable}$$I am sorry for this, because it includes a vast variety of the most hideous loss functions you could imagine. For example, $S_k = \sum_{j=0}^n \sin\left(\frac{p_j}{p_k}\right)$ is a loss function (albeit one that only converges in a very, VERY small region). I'm going to close out this post here; I genuinely hate that this worked. Obviously the two extensions here are making this continuous for PDFs and finding less awful loss functions.
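
To make the condition concrete, here is a small numerical check (again my own sketch, reusing p_of_r and jacobian_p from the sketch above) that the familiar log score satisfies it: $p^T\frac{dS}{dp}$ comes out as a constant vector, and the gradient of the expectation with respect to $\vec{r}$ vanishes at the point where $\hat{p}(\vec{r}) = p$.

import numpy as np
# reuses p_of_r and jacobian_p from the earlier sketch

# a fixed "true" distribution p, and an r with p(r) = p
p = np.array([0.1, 0.25, 0.4, 0.25])
r = np.sqrt(p)

# log score: S_k(p_hat) = log(p_hat_k), so dS/dp_hat = diag(1/p_hat)
dS_dp = np.diag(1.0 / p_of_r(r))

# p^T dS/dp should be a constant vector (lambda * 1); here it is all ones
print(p @ dS_dp)

# gradient of E[S(p_hat(r))] = sum_k p_k S_k(p_hat(r)) with respect to r, at p_hat = p
grad = (dS_dp @ jacobian_p(r)).T @ p
print(grad)   # ~0, confirming stationarity at p_hat = p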

Also $\frac{1}{p}$ is loss????

*Here $\delta_k(j)$ is a function that is equal to $1$ when $k=j$ and $0$ otherwise.
**Here $\Lambda \left(\vec{r}\right)$ is used to denote a diagonal matrix constructed from the vector $\vec{r}$.
***This isn't a controversial statement, don't overthink it
****Basically just convince yourself that $p$ is a vector and $p(r)$ is a function

Fiber-Net ramble

I've had this idea for a while, but haven't built one. Essentially, the common version of the universal approximation theorem states that we can approximate any function arbitrarily well given unlimited linear operations and a suitable nonlinearity of our choice.

Signals in optical fiber can be filtered, meaning their amplitude is decreased by an arbitrary percentage. They can be split and merged, and non-linear phenomena like the Kerr effect exist in practice (https://www.nature.com/articles/s41467-023-41377-5). My suspicion is that this is enough to approximate any function arbitrarily well up to a scalar constant, and that it could be shown by taking a standard neural-network construction and simply scaling the outputs down by a constant factor at each layer, so that any 'increasing' components stay at the same or lower magnitude at the output.

What you would get in practice is a neural network capable of evaluating instantly and with almost unlimited throughput at no cost. Particularly with CNNs and other image-ingestion tools I can see this working very well.

Practical PDFs for optimization

Having explored the landscape, I've settled on what is likely to be the most efficient way to do this. Via Hilbert and co there was a lot of old-timey exploration into polynomials that are positive everywhere. For a polynomial of one variable, the polynomial is positive everywhere iff there exists a rank-2 symmetric positive semidefinite matrix $M$ such that for Vandermonde-type $z$, $z=\left[1,x,x^2,x^3,\dots, x^n\right]$, we have $z^TMz$ as our positive polynomial of order $2n$. We could go about the process of determining what the coefficients would be in the context of a basis of order $2n$, but this would result in a cost of $2n$ for every evaluation, as compared to two evaluations of cost $n$ followed by squaring. It would require larger basis matrices to be stored and processed and could result in slight numerical error, a terrible thing when working with logs as is common in this context. This is not to say it would be bad if a smart person were to tackle it, but it is a chewy problem which requires a good deal more thought than I am able to justify. There is also the idea that maybe $R$ could be chosen in such a way that it would force positive definiteness in certain conditions. Alas, I don't desire this.
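
As a toy illustration of the Gram-matrix direction we are not taking (my own sketch): any rank-2 symmetric PSD matrix $M = aa^T + bb^T$ gives $z^TMz = (a^Tz)^2 + (b^Tz)^2$, an order-$2n$ polynomial that is nonnegative everywhere.

import numpy as np

n = 3                        # z = [1, x, x^2, x^3], so z^T M z has degree 2n = 6
a = np.random.randn(n + 1)   # coefficient vectors of two degree-n polynomials
b = np.random.randn(n + 1)
M = np.outer(a, a) + np.outer(b, b)         # rank-2 symmetric PSD

x = np.linspace(-5, 5, 1001)
Z = np.vander(x, n + 1, increasing=True)    # rows are [1, x, x^2, x^3]
poly_vals = np.einsum('ij,jk,ik->i', Z, M, Z)   # z^T M z at each x
print(poly_vals.min() >= 0)                 # True: nonnegative at every sampled x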

Anyway, long story short, we have landed on the following definition as that which is easiest for the computer to handle in the context of optimization problems:$$ P\left(x|C,\lambda\right) = \left( B\left(x\right)^T C\right )^2 \begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix} \text{ for } \lambda \in \left[0,1\right], C \in S^n \times S^n$$Easy enough huh? Derivatives play nicely as well:$$\frac{d P\left(x|C,\lambda\right)}{d x} = 2 \left( B\left(x\right)^T C * B\left(x\right)^T D C\right )\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$Given the general Leibniz rule for a square:$$\frac{d^n}{dx^n} (f(x))^2 = \sum_{k=0}^{n} \binom{n}{k} \frac{d^k f(x)}{dx^k} \cdot \frac{d^{n-k} f(x)}{dx^{n-k}}$$we can then say:$$\frac{d^n}{dx^n} P\left(x|C,\lambda\right) = \left(\sum_{k=0}^{n} \binom{n}{k} B\left(x\right)^T D^{k} C * B\left(x\right)^T D^{n-k} C\right)\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$(for $n=1$ the $k=0$ and $k=1$ terms are equal, recovering the factor of $2$ above). In practice when taking logs it will be useful to exploit the log1p function using the form log1p(pdf/eps)+N*log(eps), which will protect you from any very small numerical errors.
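
Here is a sketch of how the definition and its first derivative evaluate in code, reusing basis_gen and differential_matrix from the spectral-differentiation post further down; treating the two columns of $C$ as unit vectors is my reading of $C \in S^n \times S^n$.

import numpy as np
# assumes basis_gen(t, N) and differential_matrix(N) from the spectral-differentiation post below

N, lam = 40, 0.3
x = np.linspace(-10, 10, 4001)
B = basis_gen(x, N)             # shape (N, len(x)); rows are the basis functions
D = differential_matrix(N)

C = np.random.randn(N, 2)
C /= np.linalg.norm(C, axis=0)  # each column on the unit sphere
w = np.array([lam, 1 - lam])

P  = ((B.T @ C)**2) @ w                     # density values at each x
dP = 2 * ((B.T @ C) * (B.T @ (D @ C))) @ w  # first derivative, per the formula above

print(P.min() >= 0)      # nonnegative by construction
print(np.trapz(P, x))    # ~1 when the sample range covers the basis support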

If we really cared about performance, we could also be a little sneakier and look at $B$. You'll notice that $B$ is very ill-conditioned if the evaluation points $x$ are spread over a small region. This is a feature and not a bug! Taking the SVD of $B$ and dropping the small singular values, we get the truncated approximation
$$B^{m \times n} \approx U^{m \times k} \Sigma^{k \times k} \left(V^{n \times k}\right)^T$$While the discarded singular values are very small, the large ones are very well behaved, and the error in doing all this is rather low. For example, over the interval $[-1.5,1.5]$, the relative error norm for $k=40$ when $n$ is $200$ is $\sim 1.5 \times 10^{-13}$. Since differentiation is all done spectrally in the coefficients, the error does not grow with differentiation.
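
A sketch of that truncation check (the exact numbers will depend on the grid; basis_gen is from the spectral-differentiation post further down):

import numpy as np
# assumes basis_gen(t, N) from the spectral-differentiation post below

n, k = 200, 40
x = np.linspace(-1.5, 1.5, 3001)
B = basis_gen(x, n).T                         # m x n evaluation matrix

U, s, Vt = np.linalg.svd(B, full_matrices=False)
B_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the k largest singular values

print(np.linalg.norm(B - B_k) / np.linalg.norm(B))   # relative error of the truncation
print(s[:k].min() / s[0])                            # the retained spectrum is well behaved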

This is also a good guideline for determining whether your data needs to be rescaled: if $B$ has no (numerical) null-space then you're likely to be unable to fit your data well.

Infinite basis non-square properties

It occurred to me that since the terms have finite integrals, we might be able to determine the integral of the function based on a linear combination of the coefficients. Using the tools from the method of moments we can pretty quickly confirm that the odd values are 0 and that the even values are of the form:$$\int_{-\infty}^{\infty}f_{2k}\left(x\right) dx = \sqrt{2 \sqrt{\pi} \frac{\left(2k-1\right)!!}{\left(2k \right)!!}}$$Since we know that $k!=k!!\,(k-1)!!$ we can recast $\frac{\left(2k-1\right)!!}{\left(2k \right)!!}$ as $\frac{\left(2k\right)!}{\left(2k\right)!!^2}$. Since $\left(2k\right)!!=k!\,2^{k}$ we can then convert this to ${2k\choose k} 2^{-2k}$. Thus we get our nice bit of code:

import numpy as np
from scipy.special import comb

def integral_table(N):
    # Whole-line integrals of the first N basis functions (the odd ones are zero).
    k = np.arange(0,N)
    k[1::2] = 0   # zero out odd indices; the -2*(k==0) term below then sends them to 0
    # (1-k%4) carries the sign convention of the basis; the magnitude is sqrt(C(k,k/2)*2^(1-k)*sqrt(pi))
    Ita = (1-(k)%4)*np.sqrt((comb(k,k/2)*(2.0**(1-k)))-2*(k==0))*np.sqrt(np.sqrt(np.pi))
    Ita[0] = np.sqrt(2*np.sqrt(np.pi))   # the k=0 entry is fixed up separately
    return Ita

Thus we can just multiply the vector from the integration table $\phi$ by the vector of coefficients to get the integral of the whole function as $\phi^Tc$. This means if we could determine conditions that forced positivity, we could ensure integrability to $1$.
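
A quick check of the table against brute-force quadrature (a sketch; it reuses integral_table from above and basis_gen from the next post down, and the test function is an arbitrary choice):

import numpy as np
# assumes basis_gen(t, N) from the post below and integral_table(N) from above

N = 60
x = np.linspace(-12, 12, 10001)
B = basis_gen(x, N)

f = np.exp(-(x - 0.5)**2)                    # an arbitrary fast-decaying function
c = np.linalg.lstsq(B.T, f, rcond=None)[0]   # expand it in the basis

phi = integral_table(N)
print(phi @ c)           # integral of the reconstruction, from the coefficient table
print(np.trapz(f, x))    # brute-force integral, should agree closely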

Spectral differentiation for PDF basis with infinite support

Using the family of functions we defined a couple days ago of the form $f_n\left(x\right) = \sum^n_{k=0} c_{k,n} x^k e^{-\frac{x^2}{2}}$ we can produce some nice numerically stable methods for generating the basis and taking the derivative in spectral fashion. First, the code, so that you can verify it works.

import numpy as np
import matplotlib.pyplot as plt

def basis_gen(t,N):
    w = np.exp(-(t**2)/2)/(np.sqrt(np.sqrt(np.pi)))
    B = np.zeros(shape=(N,len(w)))
    B[:2,:] = np.vstack((w,-np.sqrt(2)*t*w))
    for k in range(2,N):
        B[k,:] = np.sqrt(1-(1/k))*B[k-2,:]
        if k%2:
            B[k,:] -= t*np.sqrt(2/k)*B[k-1,:]
        else:
            B[k,:] += t*np.sqrt(2/k)*B[k-1,:]
    return B

def differential_matrix(N):
    H = np.diag((0.5-np.arange(N-1)%2)*np.sqrt(np.arange(1,N)*2),k=1)
    H -= H.T
    return -H

dt = 0.001
N = 101
t = np.arange(-8,8,dt)
B = basis_gen(t,N)
D = differential_matrix(N)

f = np.sin(t)
a = np.linalg.lstsq(B.T,f,rcond=None)[0]   # fit coefficients in the basis
a[-3:] = 0                                 # zero the top coefficients (padding for differentiation)
plt.figure()
plt.plot(t,B.T@a,label='input function')
plt.plot(t,B.T@(D@a),label='Derivative')
plt.legend()
plt.show()

The basis generation follows neatly from our prior recurrence relation, but the differential matrix was more complicated. Largely, it follows from:$$\frac{d}{d x}\sum^n_{k=0} c_{k,n}x^k e^{-\frac{x^2}{2}} = \sum^n_{k=0} c_{k,n} \left(k x^{k-1} - x^{k+1}\right) e^{-\frac{x^2}{2}}$$Rearranging terms we see:$$\frac{d f_n\left(x\right)}{dx}+x f_n\left(x\right) = \sum^n_{k=0} c_{k,n} k x^{k-1} e^{-\frac{x^2}{2}}$$Using the recurrence $f_{n}\left(x\right) = \left(-1\right)^{n} \cdot \sqrt{\frac{2}{n}} \cdot x \cdot f_{n-1}\left(x\right) + \sqrt{1-\frac{1}{n}} \cdot f_{n-2}\left(x\right)$ we can calculate the $x \cdot f_{n}\left(x\right)$ term, and eventually, via some careful elimination, produce:$$\frac{d f_n\left(x\right)}{dx} = \frac{\left(-1\right)^{n}}{\sqrt{2}} \left(\sqrt{n} \cdot f_{n-1}+\sqrt{n+1} \cdot f_{n+1}\right)$$I also want to touch on one more trick. If we didn't have this algebra available to us, we could note that $\frac{d}{dx}\left(x^n e^{-\frac{x^2}{2}}\right) = n x^{n-1} e^{-\frac{x^2}{2}} - x^{n+1} e^{-\frac{x^2}{2}}$. We could easily construct a differentiation matrix $D$ with which to differentiate a Vandermonde-type matrix, $V$, of these weighted monomials. In our case we note that our family of functions, $F$, can be found by multiplying such a matrix by its associated matrix of coefficients, $VC = F$. Therefore, $VDC$ would be equal to the matrix of derivatives of our functions. This is inefficient though, and hard to work with. Instead, what we want is a differentiation matrix $H$ such that $VCH$ is equal to the derivatives of our functions. As $VDC=VCH$, and since $C$ is invertible, we can see $H = C^{-1}DC$, which we can then numerically evaluate if needed, or even use to work out the analytic result.
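
A quick numerical check of the derivative identity (my own sketch), comparing it against a finite-difference derivative of a single basis function:

import numpy as np
# assumes basis_gen(t, N) from above

N = 30
t = np.linspace(-8, 8, 8001)
B = basis_gen(t, N)

n = 7
identity = ((-1)**n / np.sqrt(2)) * (np.sqrt(n)*B[n-1, :] + np.sqrt(n+1)*B[n+1, :])
numeric = np.gradient(B[n, :], t)            # finite-difference derivative of f_n
print(np.max(np.abs(identity - numeric)))    # small, limited by the finite-difference step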

One key note: the last coefficient should be 0 when a differentiation operation is undertaken. This padding is needed because the powers of $x$ involved increase by $1$ with every differentiation. I would also strongly recommend being careful when using the inner product trick to fit coefficients. If your sample range isn't large enough you won't well approximate the integral over the whole of the reals!
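
For concreteness, a sketch of that failure mode (my own example): fitting coefficients by discretized inner products works when the sample range covers the weight's support, and quietly degrades when it doesn't.

import numpy as np
# assumes basis_gen(t, N) from above

N = 40
f = lambda x: np.exp(-x**2)          # target with fast decay

for lo, hi in [(-8, 8), (-2, 2)]:    # wide vs too-narrow sample range
    x = np.linspace(lo, hi, 4001)
    dx = x[1] - x[0]
    B = basis_gen(x, N)
    c = B @ f(x) * dx                # inner-product (Riemann sum) coefficients
    # evaluate the reconstruction on a wide grid and compare
    xw = np.linspace(-8, 8, 4001)
    err = np.max(np.abs(basis_gen(xw, N).T @ c - f(xw)))
    print((lo, hi), err)             # the narrow range gives a much worse fit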

Questions top of mind

  1.  Given a polynomial of the form $\sum a_n x^n$, are we able to guarantee the existence of a second polynomial $\sum b_n x^n$ such that $\sum b_n x^n = |\sum a_n x^n|$? This is of interest when considering using squared polynomials to force a function to have positive values. We may want to talk about the log of a squared polynomial, and for this we definitely do not want negative values.
  2. Does there exist a set of basis functions $h_n(x)$ such that when given a linear combination $\sum c_n h_n(x)$ we precisely know the minimum value?
  3. Taking a pdf and multiplying it by $e^{x^2}$ and then taking the square root, we have something that our basis functions from yesterday can interpolate. What does this modified pdf on its own tell us? This is a nice way to map PDFs to general functions, and a bijection to $C^{\infty}$ when the pdf is in $C^{\infty}$
