I've had this idea for a while, but haven't built one. Essentially, the common version of the universal approximation theorem states that we can approximate any continuous function arbitrarily well, given unlimited linear operations and a nonlinearity of choice.
Signals in optical fiber can be filtered, meaning their amplitude can be attenuated by an arbitrary percentage. They can be split and merged, and non-linear phenomena like the Kerr effect exist in practice (https://www.nature.com/articles/s41467-023-41377-5). My suspicion is that this is enough to approximate any function arbitrarily well up to a scalar constant, and that it can be proven by taking the standard neural-network construction and simply scaling each layer's output down by a scalar factor, so that any 'amplifying' components stay at the same or lower magnitude, with the accumulated scale absorbed into the constant at the output.
What you would get in practice is a neural network capable of evaluating essentially instantly and with almost unlimited throughput at no cost. Particularly with CNNs and other image-ingestion tools, I can see this working very well.
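To make the scaling argument concrete, here is a tiny numeric sketch. It uses ReLU as a stand-in nonlinearity, since ReLU is positively homogeneous and per-layer rescaling then commutes exactly; whether the Kerr nonlinearity cooperates the same way is the open question above. All names and sizes here are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(16, 8)), rng.normal(size=(4, 16))]   # a toy bias-free 2-layer net
x = rng.normal(size=8)

def forward(weights, x):
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)        # ReLU hidden layers
    return weights[-1] @ x                # linear output layer

# Rescale every layer so no entry exceeds unit magnitude (attenuate/split only).
scales = [np.max(np.abs(W)) for W in Ws]
Ws_atten = [W / s for W, s in zip(Ws, scales)]

y_full = forward(Ws, x)
y_atten = forward(Ws_atten, x)
# The attenuated network reproduces the original output up to one known scalar.
print(np.allclose(y_full, y_atten * np.prod(scales)))   # True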
Fiber-Net ramble
Practical PDFs for optimization
Having explored the landscape, I've settled on what is likely to be the most efficient way to do this. Via Hilbert and co. there was a lot of old-timey exploration into polynomials that are positive everywhere. For a polynomial of one variable, the polynomial is nonnegative everywhere iff there exists a symmetric positive semidefinite matrix $M$ of rank at most two such that, for the Vandermonde-type vector $z=\left[1,x,x^2,x^3,\dots, x^n\right]$, $z^TMz$ is our nonnegative polynomial of order $2n$. We could go about determining what the coefficients would be in a basis of order $2n$, but this would cost $2n$ per evaluation, compared with two evaluations of cost $n$ followed by a squaring. It would also require larger basis matrices to be stored and processed, and could introduce slight numerical error, a terrible thing when working with logs, as is common in this context. This is not to say it would be bad if a smart person were to tackle it, but it is a chewy problem which requires a good deal more thought than I am able to justify. There is also the idea that maybe $R$ could be chosen in such a way that it would force positive definiteness under certain conditions. Alas, I don't desire this.
Anyway, long story short, we have landed on the following definition as that which is easiest for the computer to handle in the context of optimization problems:
$$ P\left(x\mid C,\lambda\right) = \left( B\left(x\right)^T C\right )^2 \begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix} \text{ for } \lambda \in \left[0,1\right], C \in S^n \times S^n$$
(the two columns of $C$ are unit-norm coefficient vectors, so that, in an orthonormal basis $B$, each squared term integrates to one and $P$ is a convex combination of two pdfs). Easy enough, huh? Derivatives play nicely as well:
$$\frac{d P\left(x\mid C,\lambda\right)}{d x} = 2 \left( B\left(x\right)^T C * B\left(x\right)^T D C\right )\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$
Given the Leibniz rule applied to the square of a function,
$$\frac{d^n}{dx^n} \left(f(x)\right)^2 = \sum_{k=0}^{n} \binom{n}{k} \frac{d^k f(x)}{dx^k} \cdot \frac{d^{n-k} f(x)}{dx^{n-k}}$$
we can then say
$$\frac{d^n}{dx^n} P\left(x\mid C,\lambda\right) = \left(\sum_{k=0}^{n} \binom{n}{k} B\left(x\right)^T D^{k} C * B\left(x\right)^T D^{n-k} C\right)\begin{bmatrix}\lambda \\ 1-\lambda \end{bmatrix}$$
In practice, when taking logs it is useful to exploit the log1p function: summing the log-density over $N$ points, use sum(log1p(pdf/eps)) + N*log(eps), which protects you from any very small numerical errors.
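A minimal sketch of these formulas, using a plain monomial basis and its coefficient-space differentiation matrix $D$ in place of whatever basis $B$ is used in practice (the algebra is the same); all function names here are illustrative.

import numpy as np

def poly_basis(x, n):
    # B(x): rows are evaluation points, columns are 1, x, ..., x^n
    return np.vander(np.atleast_1d(x), n + 1, increasing=True)

def diff_matrix(n):
    # D maps monomial coefficients to the coefficients of the derivative
    D = np.zeros((n + 1, n + 1))
    for k in range(1, n + 1):
        D[k - 1, k] = k
    return D

def P(x, C, lam):
    # (B(x)^T C)^2 [lam, 1-lam]^T, with C holding two coefficient columns
    B = poly_basis(x, C.shape[0] - 1)
    return (B @ C) ** 2 @ np.array([lam, 1.0 - lam])

def dP(x, C, lam):
    # d/dx of (B^T c)^2 = 2 (B^T c)(B^T D c), applied column-wise
    n = C.shape[0] - 1
    B, D = poly_basis(x, n), diff_matrix(n)
    return 2.0 * ((B @ C) * (B @ D @ C)) @ np.array([lam, 1.0 - lam])

def safe_log(pdf_vals, eps=1e-30):
    # log(pdf + eps) = log(eps) + log1p(pdf/eps); summed over N points the
    # constant contributes N*log(eps), as noted above
    return np.log1p(pdf_vals / eps) + np.log(eps)

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 2))
lam, x0, h = 0.3, 0.7, 1e-6
# derivative formula agrees with a central finite difference
print(np.allclose(dP(x0, C, lam),
                  (P(x0 + h, C, lam) - P(x0 - h, C, lam)) / (2 * h)))
print(safe_log(P(np.array([0.0, 0.5, 1.0]), C, lam)))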
If we really cared about performance, we could also be a little sneakier and look at $B$. You'll notice that $B$ is very ill-conditioned if the evaluation points $x$ are spread over a small region. This is a feature and not a bug! Taking the SVD of $B$ and dropping the small singular values, we can write $$B^{m \times n} \approx U^{m \times k}\, \Sigma^{k \times k}\, \left(V^{n \times k}\right)^T$$ The discarded singular values are tiny, the retained ones are very well behaved, and the error incurred by the truncation is rather low. For example, over the interval $[-1.5,1.5]$, the relative error norm for $k=40$ when $n$ is $200$ is $\sim 1.5 \times 10^{-13}$. Since differentiation is all done spectrally in the coefficients, the error does not grow with differentiation.
This is also a good guideline for deciding whether your data needs to be rescaled: if $B$ has no numerical null space, you are unlikely to be able to fit your data well.
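A sketch of the truncation, assuming the $B$ in question is the orthonormal Gaussian-weighted basis from the "PDF with infinite basis" post further down, rebuilt here so the snippet is self-contained.

import numpy as np

n, k = 200, 40
x = np.linspace(-1.5, 1.5, 2000)
w = np.exp(-x**2 / 2) / np.pi**0.25

# build the basis with the recurrence from the later post
B = np.zeros((n, x.size))
B[0], B[1] = w, -np.sqrt(2.0) * x * w
for j in range(2, n):
    sign = 1.0 if j % 2 == 0 else -1.0   # + for even order, - for odd order
    B[j] = sign * np.sqrt(2.0 / j) * x * B[j - 1] + np.sqrt(1.0 - 1.0 / j) * B[j - 2]

U, S, Vt = np.linalg.svd(B, full_matrices=False)
B_k = (U[:, :k] * S[:k]) @ Vt[:k, :]
# relative truncation error; the post quotes ~1.5e-13 for these sizes
print(np.linalg.norm(B - B_k) / np.linalg.norm(B))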
Infinite basis non-square properties
import numpy as np
from scipy.special import comb

def integral_table(N):
    # Analytic values of the whole-line integrals of the first N orthonormal
    # basis functions f_k from the "PDF with infinite basis" post further down;
    # odd orders integrate to zero by symmetry.
    k = np.arange(0, N)
    k[1::2] = 0
    Ita = (1 - k % 4) * np.sqrt((comb(k, k / 2) * (2.0**(1 - k))) - 2 * (k == 0)) * np.sqrt(np.sqrt(np.pi))
    Ita[0] = np.sqrt(2 * np.sqrt(np.pi))
    return Ita
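A quick sanity check, assuming the B, t, and dt from the "PDF with infinite basis" code further down have been built: the analytic table should agree with direct numerical integration for the low orders, which are well contained in that grid.

Ita = integral_table(10)
# max disagreement with a plain Riemann sum; small for these low orders
print(np.abs(Ita - (B[:10] * dt).sum(axis=1)).max())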
Spectral differentiation for PDF basis with infinite support
Questions top of mind
- Given a polynomial of the form $\sum a_n x^n$, are we able to guarantee the existence of a second polynomial $\sum b_n x^n$ such that $\sum b_n x^n = |\sum a_n x^n|$? This is of interest when considering using squared polynomials to force a function to have positive values. We may want to talk about the log of a squared polynomial, and for this we definitely do not want negative values.
- Does there exist a set of basis functions $h_n(x)$ such that when given a linear combination $\sum c_n h_n(x)$ we precisely know the minimum value?
- Taking a pdf and multiplying it by $e^{x^2}$ and then taking the square root, we have something that our basis functions from yesterday can interpolate. What does this modified pdf on its own tell us? This is a nice way to map PDFs to general functions, and a bijection to $C^{\infty}$ when the pdf is in $C^{\infty}$
PDF with infinite basis
I used my usual tricks to show that you can define a family of orthonormal functions via the following recurrence relation:
$$f_{n}\left(x\right) = \left(-1\right)^{n} \cdot \sqrt{\frac{2}{n}} \cdot x \cdot f_{n-1}\left(x\right) + \sqrt{1-\frac{1}{n}} \cdot f_{n-2}\left(x\right)$$
$$f_{0}\left(x\right) = \pi^{-\frac{1}{4}} \cdot e^{-\frac{x^2}{2}}, \qquad f_{1}\left(x\right) = - \pi^{-\frac{1}{4}} \cdot \sqrt{2} \cdot x \cdot e^{-\frac{x^2}{2}}$$
(the sign on the $x$ term is $+$ for even $n$ and $-$ for odd $n$, matching the code below). This is great for probability density functions: since $\int^{\infty}_{-\infty} \left(\sum c_k f_k\right)^2 dx = \sum c_k^2$, if the norm of the coefficient vector is 1 then the square of that combination is guaranteed to be a pdf.
The code is below, but I'd like to go through the solution method I used to derive this. First, I used the formula for the $n$-th moment of a Gaussian (with the appropriate $\sigma$ and $\mu$) to derive the inner products of every function of the form $x^k \cdot e^{-\frac{x^2}{2}}$:$$\int^{\infty}_{-\infty} x^i e^{-\frac{x^2}{2}} \cdot x^j e^{-\frac{x^2}{2}}\, dx = \Gamma\left(\frac{i+j+1}{2}\right)\cdot \left(1-\left(\left(i+j\right)\bmod 2\right)\right)$$I created an $N \times N$ matrix of these terms to get the inner products they share with each other. I then applied the QR-reshaping code from the other post to solve:
# u, s: factors from the decomposition (SVD) of the inner-product matrix above;
# flipper: the QR-reshaping routine from the other post
h = u / np.sqrt(s)
Q, R = flipper(h, 'tr', flipped=True)
$R$ is the set of desired coefficient vectors to create this basis. It is, however, ugly, and to avoid using my brain I did the following: first, I saw a square-root-of-two term, so I squared the matrix of coefficients. From there, I converted these representations to fractions. Seeing that this worked well, I then identified the integer each row could be multiplied by to force its entries to be integers. With that removed, it was simply a matter of identifying patterns in the integer rows. Knowing that we wanted an efficient solution in the form of a recurrence relation, and seeing repeated values, I shifted those out and used the row before; this produced the recurrence used in the code.
import numpy as np
import matplotlib.pyplot as plt

theta = 0.5            # Gaussian decay rate, exp(-theta * x^2)
dt = 0.01              # grid spacing (value assumed; it was not set in the original snippet)
N = 1000               # number of basis functions
t = np.arange(-8, 8, dt)
w = np.exp(-(t**2) * theta) / (np.sqrt(np.sqrt(np.pi)))   # f_0 on the grid

B = np.zeros(shape=(N, len(w)))
B[:2, :] = np.vstack((w, -np.sqrt(2) * t * w))            # f_0 and f_1
for k in range(2, N):
    # recurrence: the x-term carries a - sign for odd k and a + sign for even k
    B[k, :] = np.sqrt(1 - (1 / k)) * B[k - 2, :]
    if k % 2:
        B[k, :] -= t * np.sqrt(2 / k) * B[k - 1, :]
    else:
        B[k, :] += t * np.sqrt(2 / k) * B[k - 1, :]

# orthonormality check; on this finite grid it only holds for orders whose
# functions are contained in [-8, 8] (reduce N or widen the grid to see True)
print(np.allclose(B @ B.T, np.eye(N) / dt))

# project sin(t) onto the basis and plot the reconstruction
f = np.sin(t)
b = B @ f * dt
plt.plot(t, B.T @ b)
plt.show()
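As a self-contained cross-check of the Gram-matrix step described above, here is a sketch in which a plain Cholesky orthonormalization stands in for the SVD plus QR-reshaping (flipper) route; it recovers the same family up to signs.

import numpy as np
from scipy.special import gamma

N = 8
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
# <x^i e^{-x^2/2}, x^j e^{-x^2/2}> = Gamma((i+j+1)/2) for i+j even, 0 otherwise
G = gamma((i + j + 1) / 2.0) * (1 - (i + j) % 2)

# row n of C holds coefficients c_{k,n} of f_n(x) = sum_k c_{k,n} x^k e^{-x^2/2}
C = np.linalg.inv(np.linalg.cholesky(G))
# orthonormal under the Gram inner product; deviation is tiny, limited by cond(G)
print(np.abs(C @ G @ C.T - np.eye(N)).max())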
General estimates (short)
Just a quick note: I find it disappointing that so many point-wise estimators are used in general. If we have a given error metric $d\left(y,\hat{y}\right)$, and a prior and likelihood, it seems intuitive that optimization should focus on solving$$\underset{\hat{y}}{\mathrm{argmin}} \int d\left(y,\hat{y}\right) p\left(x\mid y\right)p\left(y\right)dy$$But you usually don't see it cast like this, which is a shame, because a point estimate can end up on the wrong peak and overfit badly. Plus, you can put all sorts of fun regularizers in the prior, such as variation minimizers. Anyway, bad post; it's just something that has been living in my mind.
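For what it's worth, here is a tiny grid-based sketch of that estimator, with a made-up bimodal prior and Gaussian likelihood, next to the MAP point it would replace.

import numpy as np

y = np.linspace(-6, 6, 2001)
prior = 0.7 * np.exp(-0.5 * (y + 2) ** 2) + 0.3 * np.exp(-0.5 * ((y - 2) / 0.3) ** 2)
x_obs = 0.5
lik = np.exp(-0.5 * (x_obs - y) ** 2)          # p(x | y), unit variance
post = prior * lik                              # unnormalized posterior

d = np.abs(y[None, :] - y[:, None])             # d(y, y_hat) = |y - y_hat|
risk = d @ post                                 # expected loss for each candidate y_hat
y_bayes = y[np.argmin(risk)]                    # minimizer of the integral above
y_map = y[np.argmax(post)]                      # the usual point-wise answer
print(y_bayes, y_map)                           # these generally disagree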
Least squares with fixed coefficient magnitude
Quick and easy: here's some code that performs least squares with the condition that the coefficient vector has a particular magnitude. I offer no proof for this; I just had the intuition that it would work, and it did. It's an intermediate step in some other code, but it's useful all by itself.
import numpy as np

def fixed_norm_least_squares(X, b, magnitude):
    # least squares for X c ~ b, with the squared norm of c forced to equal `magnitude`
    # (wrapped as a function for convenience; X is the design matrix, b the targets)
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    scale_coeffs = (u.T @ b) / s                 # unconstrained LS coefficients in the V basis
    sq_scale_coeffs = scale_coeffs**2
    # first coefficient at which the cumulative squared norm exceeds `magnitude`
    lc = np.argwhere(np.cumsum(sq_scale_coeffs) > magnitude)[0, 0]
    # shrink that coefficient so the total squared norm is exactly `magnitude`, drop the rest
    scale_coeffs[lc] *= np.sqrt((magnitude - np.sum(sq_scale_coeffs[:lc])) / sq_scale_coeffs[lc])
    c = vt.T[:, :(lc + 1)] @ scale_coeffs[:(lc + 1)]
    return c
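A hypothetical usage sketch (random data; fixed_norm_least_squares is just the wrapper name given to the snippet above).

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
b = X @ rng.normal(size=10)                       # consistent system with a large-norm solution
c = fixed_norm_least_squares(X, b, magnitude=0.5)
print(np.sum(c**2))                               # 0.5 by construction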
March thoughts
Let's start by taking any system of ordinary differential equations. We can of course convert this to a first order system by creating stand-...
-
Ok, here's the idea. Suppose I had a function $$f:X \to Y$$ and distance metrics $d_X:X \times X \to \mathbb{R}$ and $d_Y:Y \times Y \to \mathbb{R}$. If ...
-
Here's some code to generate those basis functions from yesterday. As you might imagine, you need to increase the decay rate as you ask ...
-
Using the family of functions we defined a couple days ago of the form $f_n\left(x\right) = \sum^n_{k=0} c_{k,n} x^k e^{-\frac{x^2}{2}}$ we ...