Fiber-Net ramble

I've had this idea for a while, but haven't built one. Essentially, the common version of the universal approximation theorem states that we can approximate any continuous function arbitrarily well, given unlimited linear operations and a nonlinearity of choice.

Signals in optical fiber can be filtered, meaning their amplitude can be decreased by an arbitrary percentage. They can be split and merged, and non-linear phenomena like the Kerr effect exist in practice (https://www.nature.com/articles/s41467-023-41377-5). My suspicion is that this is enough to approximate any function arbitrarily well up to a scalar constant, and that it can be proven by taking a standard neural network construction and scaling the output of each layer down by a scalar factor, so that no component ever needs to grow in magnitude on its way to the output.

What you would get in practice is a neural network capable of evaluating essentially instantly and with almost unlimited throughput at no cost. Particularly with CNNs and other image-ingestion tools, I can see this working very well.

Practical PDFs for optimization

Having explored the landscape, I've settled on what is likely to be the most efficient way to do this. Via Hilbert and co. there was a lot of old-timey exploration into polynomials that are positive everywhere. For a polynomial of one variable, the polynomial is nonnegative everywhere iff there exists a symmetric positive semidefinite matrix $M$ of rank at most $2$ such that, for the Vandermonde-type vector $z=\left[1,x,x^2,x^3,\dots, x^n\right]$, $z^TMz$ is our nonnegative polynomial of order $2n$. We could go about the process of determining what the coefficients would be in the context of a basis of order $2n$, but this would cost $2n$ per evaluation, compared to two evaluations of cost $n$ followed by squaring. It would require larger basis matrices to be stored and processed and could introduce slight numerical error, a terrible thing when working with logs, as is common in this context. This is not to say it would be bad if a smart person were to tackle it, but it is a chewy problem which requires a good deal more thought than I am able to justify. There is also the idea that maybe $R$ could be chosen in such a way that it would force positive definiteness under certain conditions. Alas, I don't desire this.
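
As a quick sanity check of that Gram-matrix view, here is a minimal sketch: the random $M$ below is just an arbitrary PSD matrix of rank at most two, and the resulting $z^TMz$ is nonnegative by construction.

import numpy as np

n = 4                                      # z = [1, x, ..., x^n], so the polynomial has order 2n
rng = np.random.default_rng(0)
L = rng.standard_normal((n + 1, 2))
M = L @ L.T                                # symmetric PSD of rank (at most) 2
x = np.linspace(-10, 10, 10001)
Z = np.vander(x, n + 1, increasing=True)   # rows are the z vectors [1, x, ..., x^n]
p = np.einsum('ij,jk,ik->i', Z, M, Z)      # z^T M z at every grid point
print(p.min())                             # >= 0 up to floating-point roundoff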

Anyway, long story short, we have landed on the following definition as the one that is easiest for the computer to handle in the context of optimization problems:
$$ P\left(x|C,\lambda\right) = \left( B\left(x\right)^T C\right )^2 \begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix} \text{ for } \lambda \in \left[0,1\right],\; C \in S^n \times S^n$$
Easy enough, huh? Derivatives play nicely as well:
$$\frac{d P\left(x|C,\lambda\right)}{d x} = 2 \left( B\left(x\right)^T C * B\left(x\right)^T D C\right )\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$
Given the Leibniz rule
$$\frac{d^n}{dx^n} \left(f(x)\right)^2 = \sum_{k=0}^{n} \binom{n}{k} \frac{d^k f(x)}{dx^k} \cdot \frac{d^{n-k} f(x)}{dx^{n-k}}$$
we can then say
$$\frac{d^n}{dx^n} P\left(x|C,\lambda\right) = \left(\sum_{k=0}^{n} \binom{n}{k} B\left(x\right)^T D^{k} C * B\left(x\right)^T D^{n-k} C\right)\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$
(for $n=1$ the two terms in the sum are equal and reproduce the factor of $2$ in the first-derivative formula). In practice, when taking logs it is useful to exploit the log1p function: $\log(\text{pdf} + \epsilon) = \mathrm{log1p}(\text{pdf}/\epsilon) + \log(\epsilon)$, which over $N$ samples becomes $\sum_i \mathrm{log1p}(\text{pdf}_i/\epsilon) + N\log(\epsilon)$ and protects you from any very small numerical errors.
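
Here is a rough sketch of how this evaluates in code, using the basis_gen and differential_matrix routines from the spectral-differentiation section further down. The random $C$ and the choice of $\lambda$ are just for illustration, and the last row of $C$ is zeroed per the padding note in that section.

import numpy as np

dt = 1e-3
x = np.arange(-10, 10, dt)
N = 20
B = basis_gen(x, N)                          # rows: basis functions sampled at x
D = differential_matrix(N)                   # spectral differentiation matrix

rng = np.random.default_rng(0)
C = rng.standard_normal((N, 2))
C[-1, :] = 0                                 # pad so differentiation stays exact
C /= np.linalg.norm(C, axis=0)               # each column on the unit sphere
lam = np.array([0.3, 0.7])                   # [lambda, 1 - lambda]

G = B.T @ C                                  # B(x)^T C at every grid point
P = (G**2) @ lam                             # P(x | C, lambda)
dP = 2 * (G * (B.T @ (D @ C))) @ lam         # 2 (B^T C * B^T D C) [lambda, 1-lambda]

print(np.sum(P) * dt)                        # ~1, since each column of C has unit norm
print(np.max(np.abs(dP[1:-1] - np.gradient(P, dt)[1:-1])))   # small; agrees with finite differences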

If we really cared about performance, we could also be a little sneakier and look at $B$. You'll notice that $B$ is very ill-conditioned if the evaluation points $x$ are spread over a small region. This is a feature and not a bug! Taking the SVD of $B$ and dropping the small singular values, we can write
$$B_{m \times n} \approx U_{m \times k}\, \Sigma_{k \times k}\, V^{T}_{k \times n}$$
While the discarded singular values are very small, the retained ones are very well behaved, and the error from doing all this is rather low. For example, over the interval $[-1.5,1.5]$, the relative error norm for $k=40$ when $n$ is $200$ is $\sim 1.5 \times 10^{-13}$. Since differentiation is all done spectrally in the coefficients, the error does not grow with differentiation.
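
A sketch of that truncation, again leaning on basis_gen from the later section, with the interval and sizes quoted above:

import numpy as np

dt = 1e-3
x = np.arange(-1.5, 1.5, dt)                 # narrow window, so B is very ill-conditioned
n = 200
B = basis_gen(x, n).T                        # m x n: grid points by basis functions
U, S, Vt = np.linalg.svd(B, full_matrices=False)
k = 40
B_k = (U[:, :k] * S[:k]) @ Vt[:k, :]         # rank-k truncation
print(np.linalg.norm(B - B_k) / np.linalg.norm(B))   # relative error of the truncation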

This is also a good guideline for determining whether your data needs to be rescaled. If $B$ has no (numerical) null space, you're probably not going to be able to fit your data well.

Infinite basis non-square properties

It occurred to me that since the terms have finite integrals, we might be able to determine the integral of the function from a linear combination of the coefficients. Using the tools from the method of moments we can pretty quickly confirm that the odd values are $0$ and that the even values are (with the sign coming from the sign convention in our recurrence):$$\int_{-\infty}^{\infty}f_{2k}\left(x\right) dx = \left(-1\right)^{k}\sqrt{2 \sqrt{\pi}\, \frac{\left(2k-1\right)!!}{\left(2k \right)!!}}$$Since we know that $k!=k!!\,(k-1)!!$, we can recast $\frac{\left(2k-1\right)!!}{\left(2k \right)!!}$ as $\frac{\left(2k\right)!}{\left(\left(2k\right)!!\right)^2}$. Since $\left(2k\right)!!=k!\,2^{k}$, we can then convert this to ${2k\choose k}\, 4^{-k}$. Thus we get our nice bit of code:

import numpy as np
from scipy.special import comb

def integral_table(N):
    k = np.arange(0,N)
    k[1::2] = 0                 # odd-order functions integrate to zero
    # (1-k%4) gives the alternating sign (-1)^(k/2) for the even orders;
    # -2*(k==0) zeroes the entries where k was set to 0 (the odd slots and
    # the k=0 slot, which is fixed up below)
    Ita = (1-(k)%4)*np.sqrt((comb(k,k/2)*(2.0**(1-k)))-2*(k==0))*np.sqrt(np.sqrt(np.pi))
    Ita[0] = np.sqrt(2*np.sqrt(np.pi))
    return Ita

Thus we can just multiply the vector from the integration table, $\phi$, by the vector of coefficients to get the integral of the whole function as $\phi^Tc$. This means that if we could determine conditions that forced positivity, we could ensure the function integrates to $1$.
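
A quick numerical check of this, using integral_table above together with the basis_gen routine from the next section; the coefficient vector is arbitrary:

import numpy as np

dt = 1e-3
t = np.arange(-12, 12, dt)
N = 20
B = basis_gen(t, N)                          # rows: basis functions sampled on t
phi = integral_table(N)                      # integral of each basis function

c = np.random.randn(N)                       # arbitrary coefficients
print(phi @ c)                               # integral via the table
print(np.sum(B.T @ c) * dt)                  # integral via a Riemann sum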

Spectral differentiation for PDF basis with infinite support

Using the family of functions we defined a couple days ago of the form $f_n\left(x\right) = \sum^n_{k=0} c_{k,n} x^k e^{-\frac{x^2}{2}}$ we can produce some nice numerically stable methods for generating the basis and taking the derivative in spectral fashion. First, the code, so that you can verify it works.

import numpy as np
import matplotlib.pyplot as plt

def basis_gen(t,N):
    # f_0 and f_1 evaluated on the grid t
    w = np.exp(-(t**2)/2)/(np.sqrt(np.sqrt(np.pi)))
    B = np.zeros(shape=(N,len(w)))
    B[:2,:] = np.vstack((w,-np.sqrt(2)*t*w))
    # three-term recurrence; the sign of the x term alternates with k
    for k in range(2,N):
        B[k,:] = np.sqrt(1-(1/k))*B[k-2,:]
        if k%2:
            B[k,:] -= t*np.sqrt(2/k)*B[k-1,:]
        else:
            B[k,:] += t*np.sqrt(2/k)*B[k-1,:]
    return B

def differential_matrix(N):
    # spectral differentiation matrix: column k holds the coefficients of f_k'
    H = np.diag((0.5-np.arange(N-1)%2)*np.sqrt(np.arange(1,N)*2),k=1)
    H -= H.T
    return -H

dt = 0.001
N = 101
t = np.arange(-8,8,dt)
B = basis_gen(t,N)
D = differential_matrix(N)

f = np.sin(t)
a = np.linalg.lstsq(B.T,f,rcond=None)[0]
a[-3:] = 0                      # zero the trailing coefficients before differentiating (see note below)
plt.figure()
plt.plot(t,B.T@a,label='input function')
plt.plot(t,B.T@(D@a),label='Derivative')
plt.legend()

The basis generation follows neatly from our prior recurrence relation, but the differential matrix was more complicated. Largely, it follows from:$$\frac{d}{d x}\sum^n_{k=0} c_{k,n}x^k e^{-\frac{x^2}{2}} = \sum^n_{k=0} c_{k,n} \left(k x^{k-1} - x^{k+1}\right) e^{-\frac{x^2}{2}}$$Rearranging terms we see:$$\frac{d f_n\left(x\right)}{dx}+x f_n\left(x\right) = \sum^n_{k=0} c_{k,n} k x^{k-1} e^{-\frac{x^2}{2}}$$Using the identity $f_{n}\left(x\right) = \left(-1\right)^{n} \cdot \sqrt{\frac{2}{n}} \cdot x \cdot f_{n-1}\left(x\right) + \sqrt{1-\frac{1}{n}} \cdot f_{n-2}\left(x\right)$ we can calculate the $x \cdot f_{n}\left(x\right)$ term, and eventually, via some careful elimination, produce:$$\frac{d f_n\left(x\right)}{dx} = \left(-1\right)^{n} \frac{1}{\sqrt{2}} \left(\sqrt{n} \cdot f_{n-1}+\sqrt{n+1} \cdot f_{n+1}\right)$$

I also want to touch on one more trick. If we didn't have this algebra available to us, we could note that $\frac{d}{dx}\left(x^n e^{-\frac{x^2}{2}}\right) = n x^{n-1} e^{-\frac{x^2}{2}} - x^{n+1} e^{-\frac{x^2}{2}}$. We could easily construct a differentiation matrix $D$ with which to differentiate a Vandermonde-type matrix, $V$, of these monomials. In our case we note that our family of functions, $F$, can be found by multiplying such a matrix by its associated matrix of coefficients, $VC = F$. Therefore, $VDC$ would be equal to the matrix of derivatives of our functions. This is inefficient though, and hard to work with. Instead, what we want is a differentiation matrix $H$ such that $VCH$ is equal to the derivatives of our functions. As $VDC=VCH$, and since $C$ is invertible, we can see $H = C^{-1}DC$, which we can then numerically evaluate if needed, or even use to work out the analytic result.
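
Here is a small numerical check of that last construction, as a sketch only: $C$ is built column-by-column from the recurrence, $D$ is the monomial differentiation matrix, and $N$ is kept small because $C$ becomes badly conditioned. The last column is excluded from the comparison, since the derivative of $f_{N-1}$ needs $f_N$, which falls outside the truncated basis.

import numpy as np

N = 8

# C[j, k] = coefficient of x^j e^{-x^2/2} in f_k, built from the recurrence
C = np.zeros((N, N))
C[0, 0] = np.pi**-0.25
C[1, 1] = -np.sqrt(2) * np.pi**-0.25
for k in range(2, N):
    C[:, k] = np.sqrt(1 - 1/k) * C[:, k-2]
    sgn = -1 if k % 2 else 1
    C[1:, k] += sgn * np.sqrt(2/k) * C[:-1, k-1]    # multiplying by x shifts coefficients up one slot

# D for the monomial basis: d/dx (x^j e^{-x^2/2}) = j x^{j-1} e^{-x^2/2} - x^{j+1} e^{-x^2/2}
D = np.zeros((N, N))
for j in range(N):
    if j > 0:
        D[j-1, j] = j
    if j + 1 < N:
        D[j+1, j] = -1

H = np.linalg.solve(C, D @ C)                       # H = C^{-1} D C
print(np.allclose(H[:, :-1], differential_matrix(N)[:, :-1]))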

There is one key note: the last coefficient should be zeroed whenever a differentiation operation is undertaken. This padding is needed because the powers of $x$ involved increase by $1$ with every differentiation. I would also strongly recommend being careful when using the inner-product trick to fit coefficients. If your sample range isn't large enough, you won't approximate the integral over the whole real line well!

Questions top of mind

  1.  Given a polynomial of the form $\sum a_n x^n$, are we able to guarantee the existence of a second polynomial $\sum b_n x^n$ such that $\sum b_n x^n = |\sum a_n x^n|$? This is of interest when considering using squared polynomials to force a function to have positive values. We may want to talk about the log of a squared polynomial, and for this we definitely do not want negative values.
  2. Does there exist a set of basis functions $h_n(x)$ such that when given a linear combination $\sum c_n h_n(x)$ we precisely know the minimum value?
  3. Taking a pdf, multiplying it by $e^{x^2}$, and then taking the square root, we get something that our basis functions from yesterday can interpolate. What does this modified pdf on its own tell us? This is a nice way to map PDFs to general functions, and a bijection to $C^{\infty}$ when the pdf is in $C^{\infty}$.

PDF with infinite basis

I used my usual tricks to show that you can define a family of orthonormal functions via the following recurrence relation:$$f_{n}\left(x\right) = \left(-1\right)^{n} \cdot \sqrt{\frac{2}{n}} \cdot x \cdot f_{n-1}\left(x\right) + \sqrt{1-\frac{1}{n}} \cdot f_{n-2}\left(x\right) $$$$f_{0}\left(x\right) = \pi^{-\frac{1}{4}} \cdot e^{-\frac{x^2}{2}}, \quad f_{1}\left(x\right) = - \pi^{-\frac{1}{4}} \cdot \sqrt{2} \cdot x \cdot e^{-\frac{x^2}{2}}$$This is great for probability density functions as $\int^{\infty}_{-\infty} \left(\sum c_k f_k\right)^2 dx = \sum c_k^2$, meaning that if the norm of the coefficient vector is 1, then the square of that combination is guaranteed to be a pdf.

The code is below, but I'd like to go through the solution method I used to derive this. First, I used the formula for the $n$-th moment of a Gaussian (with the appropriate $\sigma$ and $\mu$) to derive the inner products of every pair of functions of the form $x^n \cdot e^{-\frac{x^2}{2}}$. These are given by $\int^{\infty}_{-\infty} x^i \cdot e^{-\frac{x^2}{2}} \cdot x^j \cdot e^{-\frac{x^2}{2}}\, dx = \Gamma\left(\frac{i+j+1}{2}\right)\cdot \left(1-\left((i+j)\bmod 2\right)\right)$. I created an $N \times N$ matrix of these terms to get the inner products they share with each other. I then applied the QR-reshaping code from the other post to solve:

u,s,v = np.linalg.svd(inner_products)
h = (u/np.sqrt(s))
Q,R = flipper(h,'tr',flipped=True)
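
For completeness, the inner_products matrix consumed above can be built straight from the moment formula; a minimal sketch (the function name is mine):

import numpy as np
from scipy.special import gamma

def inner_product_matrix(N):
    i = np.arange(N)
    S = i[:, None] + i[None, :]              # i + j
    G = gamma((S + 1) / 2.0)                 # Gamma((i+j+1)/2)
    G[S % 2 == 1] = 0.0                      # odd total power integrates to zero
    return G

inner_products = inner_product_matrix(30)    # feed this into the SVD / flipper steps above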

$R$ is the set of desired coefficient vectors needed to create this basis. It is, however, ugly, and to avoid using my brain I did the following: first, I saw a square-root-of-two term, so I squared the matrix of coefficients. From there, I converted these representations to fractions. Seeing that this worked well, I then identified the integer each row could be multiplied by to force its entries to be integers. With that removed, it was simply a matter of identifying patterns in the integer rows. Knowing that we wanted an efficient solution in the form of a recurrence relation, and seeing repeated values, I shifted those out and used the row before; this produced the solution used in the code.

import numpy as np
import matplotlib.pyplot as plt

dt = 0.001
theta = 0.5
N = 1000
t = np.arange(-8,8,dt)
w = np.exp(-(t**2)*theta)/(np.sqrt(np.sqrt(np.pi)))   # f_0
B = np.zeros(shape=(N,len(w)))
B[:2,:] = np.vstack((w,-np.sqrt(2)*t*w))              # f_0 and f_1
for k in range(2,N):
    B[k,:] = np.sqrt(1-(1/k))*B[k-2,:]
    if k%2:
        B[k,:] -= t*np.sqrt(2/k)*B[k-1,:]
    else:
        B[k,:] += t*np.sqrt(2/k)*B[k-1,:]
# orthonormality check (high orders need a grid wider than [-8,8) to pass)
print(np.allclose(B @ B.T,np.eye(N)/dt))
f = np.sin(t)
b = B @ f * dt                                        # inner-product fit of the coefficients
plt.plot(t,B.T@b)
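
As a quick check of the pdf claim, continuing with B, N, and dt from the block above (only low-order functions are mixed so that they are well contained in the [-8, 8) grid):

c = np.zeros(N)
c[:30] = np.random.randn(30)
c /= np.linalg.norm(c)                        # unit-norm coefficient vector
pdf = (B.T @ c)**2
print(np.sum(pdf) * dt)                       # ~1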

General estimates (short)

Just a quick note: I find it disappointing that so many point-wise estimators are used in general. If we have a given error metric, $d\left(y,\hat{y}\right)$, and a prior and likelihood, it seems intuitive that optimization should focus on solving:$$\underset{\hat{y}}{\mathrm{argmin}} \int d\left(y,\hat{y}\right) p\left(x|y\right)p\left(y\right)dy$$But you usually don't see it cast like this, which is a shame, because you can end up on the wrong peak with a bad overfit. Plus, you can put all sorts of fun regularizers in the prior, such as variation minimizers. Anyway, bad post, it was just sitting on my mind.
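
A minimal sketch of what I mean, with a made-up prior, likelihood, and loss on a grid (everything here is a placeholder, just to show the shape of the computation):

import numpy as np

dy = 0.005
y = np.arange(-5, 5, dy)                       # grid over the quantity being estimated
prior = np.exp(-0.5 * y**2)                    # placeholder prior p(y)
x_obs = 1.3                                    # placeholder observation
likelihood = np.exp(-(x_obs - y)**2)           # placeholder likelihood p(x|y)
w = likelihood * prior                         # unnormalized posterior weight

def d(y, y_hat):                               # the error metric d(y, y_hat)
    return np.abs(y - y_hat)

# argmin over candidate estimates of the integral of d(y, y_hat) p(x|y) p(y) dy
risk = np.array([np.sum(d(y, y_hat) * w) * dy for y_hat in y])
print(y[np.argmin(risk)])

With absolute error as the metric this lands on the posterior median rather than the mode, which is exactly the kind of distinction the point-wise estimators gloss over.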

Least squares with fixed coefficient magnitude

Quick and easy: here's some code that performs least squares with the condition that the coefficient vector has a particular magnitude. I offer no proof for this, I just had the intuition that it would work and it did. It's an intermediate step in some other code, but it's useful all by itself.

import numpy as np

def sphere_least_squares(X,b,magnitude=1):
    # least squares in the SVD basis, truncated and rescaled so that the
    # squared coefficient norm comes out exactly equal to `magnitude`
    u,s,vt = np.linalg.svd(X, full_matrices=False)
    scale_coeffs = (u.T @ b)/s
    sq_scale_coeffs = (scale_coeffs)**2
    # first coeff exceeding magnitude (assumes the unconstrained solution's squared norm exceeds it)
    lc = np.argwhere(np.cumsum((sq_scale_coeffs))>magnitude)[0,0]
    scale_coeffs[lc] *= np.sqrt((magnitude-np.sum(sq_scale_coeffs[:lc]))/sq_scale_coeffs[lc])
    c = vt.T[:,:(lc+1)] @ scale_coeffs[:(lc+1)]
    return c
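
Hypothetical usage (the design matrix and ground-truth coefficients below are made up; the ground truth is given norm 2 so that the unconstrained solution exceeds the requested magnitude):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
c_true = rng.standard_normal(20)
c_true *= 2.0 / np.linalg.norm(c_true)
b = X @ c_true + 0.01 * rng.standard_normal(200)

c = sphere_least_squares(X, b, magnitude=1)
print(np.linalg.norm(c))                       # 1.0 (the code pins the squared coefficient norm to `magnitude`)
print(np.linalg.norm(X @ c - b))               # residual of the constrained fit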

March thoughts

Let's start by taking any system of ordinary differential equations. We can of course convert this to a first order system by creating stand-...