Weighted decay code

Here's some code to generate those basis functions from yesterday. As you might imagine, you need to increase the decay rate as you ask for more basis functions. I didn't least-squares it; I just applied the usual inner-product trick to get the true parameter values for the basis functions. If you were anything other than the most truly lazy of God's creations, you would derive an upper-triangular form for c iteratively. You might also have the good sense to do any of a million things that would make this more stable, such as using x^k/k! as the elements we multiply the coefficients against, or exploiting the alpha parameters in the gamma function so this scales right.

My own failings aside, you would be wrong to criticize this on the condition number of G in comparison to any polynomial-based method. We're integrating over an infinite interval, not the finite intervals a polynomial asks of us; it's a different problem. If you call the fit bad in the final plot, I'll ask you to have some grace: this is recovery of the true coefficients of the infinite series, not the line of best fit.



import numpy as np
import scipy.special as sp

#Generate matrix of inner products of all functions up to order n-1
#of form x^k * e^(-theta*x) over [0, inf)
n = 10
x = np.arange(n)
X = x[:,np.newaxis] + x[np.newaxis,:] + 1
theta = 1
Gl = sp.gammaln(X)-(X*np.log(2*theta))
G = np.exp(Gl)

#Use matrix of inner products to determine coefficients for
#orthonormal basis
u,s,v = np.linalg.svd(G)

c = (u/np.sqrt(s))

#Evaluate orthonormal basis functions
dt = 0.001
t = np.arange(dt,30,dt)
V = np.vander(t,N=n,increasing=True)
w = np.exp(-t*theta)
P = V*w[:,np.newaxis]
y = P @ (c)

#Plot basis functions
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 2)
axs[0, 0].plot(t,y)
axs[0, 0].set_title('Basis functions')
print('integral', np.sum(y**2,axis=0)*dt)

#Demonstrate fitting property
f = np.sin(t)
b = y.T @ f * dt

#Plot fitted data
axs[1, 0].plot(t[:int(len(t)/3)],(y@b)[:int(len(t)/3)])
axs[1, 0].plot(t[:int(len(t)/3)],f[:int(len(t)/3)])
axs[1, 0].set_title('Local fit')
axs[1, 1].plot(t,(y@b))
axs[1, 1].plot(t,f)
axs[1, 1].set_title('\n Full forecast')
plt.show()


Decaying approximation

I'm trying to approximate a function $f:\mathbb{R}^+\to\mathbb{R}$. I would like my approximation to be well behaved and have bounded derivatives of all orders. Is there a good reason why I should not use a family of functions of the form $P(x)e^{-x}$ to create a basis?

I know that it can form a basis: for any analytic $f(x)$, $f(x)e^{x}$ is also analytic, so if $\sum c_n x^n = f(x)e^x$ then $\sum c_n x^n e^{-x} = f(x)$. It can approximate anything, but then what happens around 0? This also produces an issue where we weight points closer to 0 more heavily, but perhaps in some cases this is not a problem. We can also expand the series definition of $e^{x}$ to get the actual coefficient values, so no real problems there in the analysis.
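
For concreteness, here is a quick numerical sketch of that argument. The example function $f(x)=\sin(x)$, the truncation order, and the interval are my own choices; the coefficients $c_n$ come from the Cauchy product of the sine series with the series for $e^{x}$.

import numpy as np
from math import factorial

# Taylor coefficients of sin(x), then the Cauchy product with the series of
# e^x (coefficients 1/j!) gives the c_n of sin(x)e^x.  Multiplying the
# truncated sum by e^{-x} should then give sin(x) back on [0, 10].
N = 50
sin_c = [0.0 if k % 2 == 0 else (-1.0) ** ((k - 1) // 2) / factorial(k)
         for k in range(N)]
c = [sum(sin_c[i] / factorial(k - i) for i in range(k + 1)) for k in range(N)]

x = np.linspace(0.0, 10.0, 200)
approx = sum(cn * x ** n for n, cn in enumerate(c)) * np.exp(-x)
print(np.max(np.abs(approx - np.sin(x))))   # residual should be small for N = 50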

Still, I don't like it, except maybe as a means to construct probability distributions, given the relationship to the gamma distribution and the fact that integrating to $1$ is a linear condition. I feel like this could be used to generate them by building Chebyshev-style polynomials that cancel out in a clean way. But then again, why do I care? Nothing is analytic these days.

Problem-setup: point-distribution grading

Having finished the hull-finding examples, I started wondering whether we really need them. I care about them mostly in the context of knowing when data is interpolable, but that may not be the best measure of whether something can be well approximated. With scattered data, a point can be outside the hull yet close to a lot of dense information, or inside the hull and far from anything. Settling for density metrics isn't enough either, because we benefit from a wide distribution of samples: many taken at the same nearby point don't give us as much information as a smaller number dispersed through the local region.

I looked far and wide and honestly couldn't come up with a good candidate measure for n dimensions. What I can do instead is try to draft a clear example of what I mean, so that I don't suffer a bunch of smart people telling me to use a bunch of methods that don't quite work for hard-to-explain reasons.

To give ourselves a nice easy case, we're going to assume that we're working in the space of analytic functions $\mathbb{R}^n \to \mathbb{R}$. Since we care quite a bit about the dispersion of data, we're going to do something a little rogue and work with polar coordinates. Note that $\arccos\left(\frac{x}{\|x\|}\right) = \theta$ (applied elementwise), where $\theta$ is the vector of angles of the point $x$ relative to the standard basis vectors. Translating in and out of polar coordinates in an n-dimensional setting isn't as awful as it sounds.
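
A quick sketch of that round trip (the dimension and test point are arbitrary choices):

import numpy as np

# theta is the elementwise arccos of the direction cosines; translating back
# out is just r * cos(theta).
rng = np.random.default_rng(0)
x = rng.normal(size=5)

r = np.linalg.norm(x)
theta = np.arccos(x / r)        # angle of x against each standard basis vector
x_back = r * np.cos(theta)      # and back out again

print(np.allclose(x, x_back))   # True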

Since everything is analytic, we can express it as $f(r,\theta) = \sum_k r^k g_k(\theta)$. No need to worry that we have one too many thetas; it doesn't increase our degrees of freedom anywhere, so it keeps us locked in. If we had some sample values $y$ at a set of points given by a vector of $r$'s and a matrix of $\theta$'s, we can put together $y = V(\vec{r}) \cdot G_{\gamma}\left(\theta\right)$, where $V$ is an (infinitely fat) Vandermonde matrix of those $r$'s and $G$ is an (equally fat) matrix of the various $g_k$ values for the involved $\theta$'s. Note that $g_k: S^{n-1} \to \mathbb{R}$ takes a vector of inputs and maps it to one number, so no tensors crop up.

Anyway, if you want to fit $\gamma$ to it, you do the standard trick where you take the derivative of the squared norm of the difference and set it to zero:

$$\displaystyle E(\gamma) = \|V(\vec{r}) \cdot G_{\gamma}\left(\theta\right) - y\|^2$$

$$\displaystyle  \frac{\partial E}{\partial \gamma}= 0 = \left(V(\vec{r}) \cdot G_{\gamma}\left(\theta\right) - y\right)^T\left(V(\vec{r}) \cdot \frac{\partial}{\partial \gamma}G_{\gamma}\left(\theta\right)\right)$$
Now at least we have a framework for what our approximation error can start to look like. If we're feeling frisky, we can also note that $g_0\left(\theta\right) = c$ and $g_1\left(\theta\right) = \vec{a} \cdot \cos\left(\theta\right)$ if we want to preserve our nice manifold behavior, which would otherwise be disrupted at the origin. Tomorrow, if I get a moment, we can try the simple case to establish worst-case behavior for a linear approximation and certain bounds on the variation of the $g_k$ terms. (See how their being periodic will come in handy? Bounded variation means bounded norms, which means a whole lot of delta-epsilon n-ball stuff.)
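
To make this concrete, here is a small sketch of the lowest-order version of the fit, keeping only $g_0(\theta) = c$ and $g_1(\theta) = \vec{a}\cdot\cos(\theta)$ so that the model is linear in $\gamma$ and the gradient condition above reduces to ordinary least squares. The test function, sampling, and sizes are my own choices.

import numpy as np

# Model: f(r, theta) ~ c + r * (a . cos(theta)), fit by least squares.
rng = np.random.default_rng(0)
n, N = 3, 200
X = rng.normal(size=(N, n))                      # scattered sample points
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.sum(X**2, axis=1)

r = np.linalg.norm(X, axis=1)
theta = np.arccos(X / r[:, None])                # angles against each axis

# Design matrix: [1, r*cos(theta_1), ..., r*cos(theta_n)].
# Note r*cos(theta_i) = x_i, so the k <= 1 truncation is just an affine fit,
# which is exactly the "nice behavior at the origin" observation above.
D = np.hstack([np.ones((N, 1)), r[:, None] * np.cos(theta)])
gamma, *_ = np.linalg.lstsq(D, y, rcond=None)
print(gamma)   # roughly [1.8, 2.0, -1.0, 0.5]: the constant absorbs the mean of the r^2 term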

Wait, could this be the wrong approach? We shouldn't be trying to find the linear function minimizing error everywhere; we only want to minimize error at our point of interest. Does the model that best fits the data give the best estimate of a particular parameter? Could there be systematic bias offsetting other terms, like the $r^2$ one? What if all the $g_k$'s after that were zero for odd $k$ and positive for even $k$? We could theoretically fit a higher-dimensional model and then pass along its leading terms, which would improve things in all cases.

Faster ReLUs

I'll expand upon this a little more later, but I've used the following trick in the past to speed up computations.

When evaluating a ReLU layer, each output essentially takes the form:

$$\displaystyle \text{ReLU}\left(\vec{x}\right)_j = h\left(\langle \vec{a}_j | \vec{x} \rangle + b_j\right)$$

where $h(z) = z$ if $z>0$ and $0$ otherwise. A ReLU being non-zero implies the input point $x$ lies in a certain half-space. If we know that several ReLUs are a mix of zeros and non-zeros, the point lies in a specific region given by the intersection of those half-spaces. If other ReLUs we haven't evaluated have their half-space entirely outside that region, then we know they will evaluate to 0 before the evaluation has taken place, meaning we don't need to perform them.

In effect, we evaluate any given ReLU block in stages, where each stage tells us which of the remaining operations do and don't need to be performed. Creating this map of dependencies is computationally expensive, but only expensive once: after it has been constructed, it speeds up all future ReLU evaluations, making it useful for production.
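
Here is a minimal sketch of what that bookkeeping might look like for a single dense ReLU layer. The dependency map below is built by crude sampling rather than by certifying the half-space containments, and all sizes and names are my own choices, so treat it as an illustration of the staging, not the real construction.

import numpy as np

# y_j = max(<a_j, x> + b_j, 0).  Rule of thumb built from reference samples:
# if unit j was never active on any sample where "pilot" unit i was inactive,
# skip j whenever i is inactive.
rng = np.random.default_rng(0)
d, m = 8, 64
A = rng.normal(size=(m, d))
b = rng.normal(size=m)
A[32:40] = A[:8]          # plant some genuine dependencies: these units share
b[32:40] = b[:8] - 1.0    # a pilot's direction with a lower bias, so
                          # pilot i inactive  =>  unit 32+i inactive

pilot = list(range(8))                        # units evaluated in stage one
samples = rng.normal(size=(10_000, d))
acts = samples @ A.T + b > 0                  # sampled activation patterns

# skip_if_off[i] = units never seen active while pilot unit i was off
skip_if_off = {i: set(np.flatnonzero(~acts[~acts[:, i]].any(axis=0)))
               for i in pilot}

def staged_relu(x):
    y = np.zeros(m)
    z_pilot = A[pilot] @ x + b[pilot]
    y[pilot] = np.maximum(z_pilot, 0)
    skipped = set(pilot)                      # already handled in stage one
    for i, z in zip(pilot, z_pilot):
        if z <= 0:
            skipped |= skip_if_off[i]         # predicted zero; skip evaluation
    rest = [j for j in range(m) if j not in skipped]
    y[rest] = np.maximum(A[rest] @ x + b[rest], 0)
    return y

x = rng.normal(size=d)
print(np.allclose(staged_relu(x), np.maximum(A @ x + b, 0)))   # matches the full evaluation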


Direct pseudoinverse on a different decomposition

We hold the following truths to be self-evident (for real matrices):

$$\displaystyle A = U \Sigma V^T$$

$$\displaystyle A^+=V \Sigma^{-1} U^T$$

$A^+$ is (up to quibbling) unique, and we see that it has condition number $\kappa(A^+)=\kappa(\Sigma^{-1})$. When calculating a pseudoinverse from an SVD, we multiply all these components together, but all our instability is baked into the central $\Sigma^{-1}$ term. Now, we might speculate: suppose we calculated $\Sigma^{-0.5}$ instead; its condition number is the square root of that of $\Sigma^{-1}$. With $U$ and $V$ being unitary, we can* see that $V \Sigma^{-0.5}$ and $\Sigma^{-0.5} U^T$ have these same square-root condition numbers and have only undergone a multiplication at that square-root condition number. Multiplying these two factors together then gives a much more numerically stable calculation. A lot of handwaving went on there, but feel free to verify in Python that:

$$\displaystyle \|A(V \Sigma^{-0.5})(\Sigma^{-0.5} U^T)A - A\| \leq \|A(V \Sigma^{-1} U^T)A-A\|$$

To reiterate, the two products above are equal in exact arithmetic, but numerically there is a difference. Because pseudoinverses are often calculated from SVDs anyway, it might make more sense to directly calculate $(V \Sigma^{-0.5})$ and $(\Sigma^{-0.5} U^T)$. Let us denote $L = V \Sigma^{-0.5}$ and $R = \Sigma^{-0.5} U^T$.
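
For what it's worth, a quick sketch of that verification; the test matrix (random orthogonal factors with logarithmically spaced singular values, so $\kappa(A)\approx 10^{10}$) is my own choice.

import numpy as np

rng = np.random.default_rng(0)
n = 30
Q1, _ = np.linalg.qr(rng.normal(size=(n, n)))
Q2, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Q1 @ np.diag(np.logspace(0, -10, n)) @ Q2.T

U, s, Vt = np.linalg.svd(A)
L = Vt.T / np.sqrt(s)                             # L = V Sigma^{-1/2}
R = (U / np.sqrt(s)).T                            # R = Sigma^{-1/2} U^T
pinv_split  = L @ R                               # (V Sigma^{-1/2})(Sigma^{-1/2} U^T)
pinv_direct = Vt.T @ ((1.0 / s)[:, None] * U.T)   # V Sigma^{-1} U^T

print(np.linalg.norm(A @ pinv_split  @ A - A))    # reconstruction error, split form
print(np.linalg.norm(A @ pinv_direct @ A - A))    # reconstruction error, direct form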

We can calculate $L$ and $R$ directly by noting that $RAL=I$ and $L^TL = R R^T = \Sigma^{-1}$. We are also aided by the fact that $\|\mathrm{row}(R,k)\| = \|\mathrm{col}(L,k)\|$.

Modifying the standard Householder-based SVD algorithm, we can actually constrain it to avoid calculating the central bidiagonal matrix, instead creating the identity via successive updates to $R$ and $L$ that diminish the deviation $RAL-I$. Even if this is done in a fashion where $R$ and $L$ no longer come from a diagonalization or split the singular values evenly between them, as long as $RAL=I$ we find that $LR$ still satisfies $LRx = A^+ x$.

There are some nice group properties to go along with this. When $A$ is square and nonsingular we see that $RAL = LRA = ALR = I$, so there are exact inverses for $R$, $A$, and $L$. Now the trick would be seeing if we can form a ring out of this. I have a sneaking suspicion that we can, and I have a copy of Hungerford that probably has the right lemma for it somewhere. I'll examine the algebraic properties in a later post.
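
A small sanity check of the square nonsingular case, using the SVD-derived $L$ and $R$ on an arbitrary well-conditioned random matrix:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
U, s, Vt = np.linalg.svd(A)
L = Vt.T / np.sqrt(s)          # V Sigma^{-1/2}
R = (U / np.sqrt(s)).T         # Sigma^{-1/2} U^T
I = np.eye(5)
print(np.allclose(R @ A @ L, I), np.allclose(L @ R @ A, I), np.allclose(A @ L @ R, I))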

*(I shouldn't need to prove this but if I must, just note that $V\Sigma^{-0.5} = V\Sigma^{-0.5}I$, which is of course a standard SVD format that we know carries the condition numbers in the central term.)

Fast semi-orthogonal matrix from a vector

Given a vector $\vec{x}$ with unit magnitude, we want to find an $m \times n$ matrix $B$ such that $BB^T = I$, with first row $\vec{B}_0 = \vec{x}$. To do this, we calculate $\vec{p}=\frac{\vec{x}+\vec{e}_0}{\sqrt{1+x_0}}$, where $x_0$ is the first element of $\vec{x}$ and $\vec{e}_0$ is the first standard basis vector. We then calculate $B$ with an outer-product operation as $\vec{p}_{k \leq m}\vec{p}^T - I$, where $\vec{p}_{k \leq m}$ is $\vec{p}$ truncated to the first $m$ elements and the identity is likewise truncated to its first $m$ rows.

$$\displaystyle \vec{p}=\frac{\vec{x}+\vec{e}_0}{\sqrt{1+x_0}}$$

$$\displaystyle B=\vec{p}_{k \leq m}\vec{p}^T - I$$

This is obviously on the order of $(nm)/2$ operations, but by making creative use of pointers it can be even faster. Note that each column is independently calculable as a scalar multiple of $\vec{p}$ followed by a single element change. Passing that scalar through to the following operations lets us work with just $\vec{p}$ plus a single element modification, effectively making the method an $O(m)$ cost when applied right.
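
Here is a minimal sketch checking the construction and the column-at-a-time evaluation; the dimensions and test vector are arbitrary choices, and note the formula degenerates when $x_0$ is close to $-1$ (the same caveat as a Householder reflector).

import numpy as np

rng = np.random.default_rng(0)
n, m = 7, 4
x = rng.normal(size=n)
x /= np.linalg.norm(x)                    # unit vector, target first row

e0 = np.zeros(n); e0[0] = 1.0
p = (x + e0) / np.sqrt(1 + x[0])          # ||p||^2 = 2
B = np.outer(p[:m], p) - np.eye(n)[:m]    # B = p_{k<=m} p^T - I (identity truncated too)

print(np.allclose(B @ B.T, np.eye(m)))    # semi-orthogonal
print(np.allclose(B[0], x))               # first row is x

# Column j is a scalar multiple of p[:m] with (at most) one element changed.
def column(j):
    col = p[j] * p[:m]
    if j < m:
        col[j] -= 1.0
    return col

print(all(np.allclose(column(j), B[:, j]) for j in range(n)))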

March thoughts

Let's start by taking any system of ordinary differential equations. We can of course convert this to a first-order system by creating stand-...