Differential equation guarantees for function inverses

Suppose we had an invertible function $f:\mathbb{R} \to \mathbb{R}$, and suppose further we knew there was a differential equation that governed it, such that $N\left[f\right] = 0$. A natural question to ask is "Does this mean a differential equation also governs the inverse?". Solving this is pretty straightforward. If $f(x) = y$ and $f^{-1}(y) = x$ we know that:$$\left[f^{-1}\right]'(y) = \frac{1}{f'(f^{-1}(y))}$$From here we invert both sides to express $f'(f^{-1}(y))$ as a function of $\left[f^{-1}\right]'(y)$:$$f'(f^{-1}(y)) = \frac{1}{\left[f^{-1}\right]'(y)}$$From the simple chain rule we know that $\left(f^{(n)}(f^{-1}(y))\right)' = f^{(n+1)}(f^{-1}(y))\left[f^{-1}\right]'(y)$. Rearranging, this gives the recursion:$$f^{(n+1)}(f^{-1}(y)) = \frac{1}{\left[f^{-1}\right]'(y)}\left(f^{(n)}(f^{-1}(y))\right)'$$Starting from $f'(f^{-1}(y)) = \frac{1}{\left[f^{-1}\right]'(y)}$ and applying this repeatedly gives us the means to generate any derivative of $f$, evaluated at $f^{-1}(y)$, as a function of the derivatives of the inverse. In particular, differentiating a ratio of polynomials in the derivatives of the inverse and dividing by $\left[f^{-1}\right]'(y)$ again yields such a ratio, so every $f^{(n+1)}(f^{-1}(y))$ generated this way is a ratio of polynomials formed from powers of derivatives of the inverse. That means that if $N[f]=0$ is a polynomial of derivatives of $f$, then so is the differential equation that governs $f^{-1}$ (after substituting $f(f^{-1}(y)) = y$ for the zeroth-order term and clearing denominators). Furthermore, it will be of the same order, giving us a certain 'niceness'. Note as well that $N[f]=0$ holds at every point of the domain, so in particular it holds along $x = f^{-1}(y)$, which is what justifies the substitution.
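
As a concrete check (my own example, not part of the argument above): $f = \exp$ satisfies $N[f] = f' - f = 0$. Substituting $f'(f^{-1}(y)) = \frac{1}{g'(y)}$ and $f(f^{-1}(y)) = y$ for $g = f^{-1} = \ln$ turns this into $\frac{1}{g'(y)} - y = 0$, i.e. $y\,g'(y) - 1 = 0$: a polynomial ODE of the same (first) order. A quick verification, assuming sympy is available:

import sympy as sp

y = sp.symbols('y', positive=True)
g = sp.log(y)                              # inverse of f = exp
# N[f] = f' - f = 0 becomes 1/g'(y) - y = 0 under the substitution
print(sp.simplify(1 / sp.diff(g, y) - y))  # prints 0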

N.B. When I talk about a 'polynomial of derivatives', I mean a linear combination of terms of the form $\prod_{k \geq 0} \left(f^{(k)}\left(x\right)\right)^{\alpha_k}$, so something like $\left(f\right)^3\left(f'\right)^2-3.2\left(f''\right)+7f^9$

Distribution recovery (pt.1)

This series is going to be about a method for identifying the family of distributions that a set of random samples has been drawn from. It has to be multi-part because too many different things are involved, but here's a general table of contents.

  1. Sort the data
  2. Note that the distance between sequential data-points is inversely related to the density function
  3. Note then that the changes in distance between successive points provides information on the derivative of the underlying density function
  4. Examine the data in the context of its CDF, allowing us to derive the probability of particular spacings given a particular PDF
  5. Compose a finite difference scheme and use it to construct a matrix $F$ whose entries have provable expected values $\left[p(x),p'(x),p''(x), \cdots \right]$
  6. Use the knowledge that particular probability distributions have specific families of ODEs that govern them, such that $N\left[p\right] = 0$
  7. Use the smallest singular vector of $F$ to identify an estimate of the specific differential equation governing the sample distribution
  8. Match this singular vector to one of the manifolds that contain a family of distributions
  9. Write angry letters to the guys at fitter
Seems simple enough, right? Intuitively: "The difference between sample points encodes information about the density and how it changes. When the density is governed by a differential equation, this will be reflected in an observable restriction on the differences between sample points."

Let's start with a set of $N$ samples from a distribution with a pdf $p: \mathbb{R} \to \mathbb{R}^+$ and a cdf $c: \mathbb{R} \to \left[0,1\right]$. We index these samples based on their value and use this to create an ordered list $x = \left(x_1,x_2,\cdots , x_N \right)$ where $x_n \geq x_m$ if $n>m$.

Since we know that the values of $c(x)$ for our sample are uniformly distributed, it makes sense to look at the conditional probability of the values of $c_{n+1}$ given $c_n$ (note that here we use the notation $c(x_n) = c_n$ for compactness). Asking for the probability of the next point being greater than a value $Z$ is equivalent to asking the probability that every following cdf sample, which we know to lie between $c_n$ and $1$, does not lie in the region between $c_n$ and $Z$. The count of samples in that region is binomial, and we need it to be zero, so we get: $$P(c_{n+1} \geq Z|c_n) = \left(\frac{1-Z}{1-c_n}\right)^{N-n}$$Differentiating with respect to $Z$ (and negating), we then find the conditional density$$P(c_{n+1} = Z|c_n) = \left(N-n\right)\frac{\left(1-Z\right)^{N-n-1}}{\left(1-c_n\right)^{N-n}}$$This is then enough to create:$$P(c_{n+k},c_{n+k-1},\cdots , c_{n+1}|c_n) = \prod^{k}_{j=1} P(c_{n+j}|c_{n+j-1})$$ $$ \prod^{k}_{j=1} P(c_{n+j}|c_{n+j-1}) = \left(\prod^{k-1}_{j=0}\left(N-n-j\right)\right) \prod^{k}_{j=1}\frac{\left(1-c_{n+j}\right)^{N-n-j}}{\left(1-c_{n+j-1}\right)^{N-n-j+1}}$$ $$\left(\prod^{k-1}_{j=0}\left(N-n-j\right)\right) \prod^{k}_{j=1}\frac{\left(1-c_{n+j}\right)^{N-n-j}}{\left(1-c_{n+j-1}\right)^{N-n-j+1}} = \frac{\left(N-n\right)!}{\left(N-n-k\right)!}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}}$$It might seem unintuitive that the intermediary samples don't contribute to the probability of the sequence, but it's a natural result of any particular ordered sequence from a uniform distribution being equally likely.

The good part comes from the following question: "If we don't care about the intermediate values at all, how does this influence the results?". That is, what is $P\left(c_{n+k}|c_n\right)$?  $$P\left(c_{n+k}|c_n\right) = \int P(c_{n+k},c_{n+k-1},\cdots , c_{n+1}|c_n) \, d c_{n+1} \cdots d c_{n+k-1}$$ $$=\frac{\left(N-n\right)!}{\left(N-n-k\right)!}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}} \int 1 \, d c_{n+1} \cdots d c_{n+k-1}$$This integral is the $(k-1)$-volume of the feasible region that preserves the ordering, i.e. the set of $c_{n+1}, \cdots , c_{n+k-1}$ such that $c_n \leq c_{n+1} \leq \cdots \leq c_{n+k-1} \leq c_{n+k}$. Writing $V_m$ for this volume with $m$ free variables, it can be calculated inductively for $c_n = 0, c_{n+k} = 1$ via $V_{m+1} = \frac{V_{m}}{m+1}$ with $V_1 = 1$. This gives us $V_{m} = \frac{1}{m!}$, which when rescaled for the actual range of $c_n$ to $c_{n+k}$ via the factor $\left(c_{n+k}-c_n\right)^{k-1}$ gives us the following solution:$$P\left(c_{n+k}|c_n\right) =\frac{\left(N-n\right)!}{\left(k-1\right)!\left(N-n-k\right)!}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}}\left(c_{n+k}-c_n\right)^{k-1}$$ $$=k {N - n \choose k}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}}\left(c_{n+k}-c_n\right)^{k-1}$$
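
A quick Monte Carlo sanity check of this result (my own check, assuming numpy and scipy are available): after rescaling, the density above says that $\frac{c_{n+k}-c_n}{1-c_n}$ given $c_n$ is $\mathrm{Beta}(k,\, N-n-k+1)$, independent of the value of $c_n$:

import numpy as np
from scipy import stats

N, n, k = 50, 10, 5
u = np.sort(np.random.rand(200_000, N), axis=1)          # order statistics of N uniforms
g = (u[:, n + k - 1] - u[:, n - 1]) / (1 - u[:, n - 1])  # rescaled gap (columns are 1-indexed stats)
print(stats.kstest(g, stats.beta(k, N - n - k + 1).cdf)) # large p-value expected
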
Now we can recast this in terms of the samples $x$. Recall that if $z = g(x)$ and $g$ is monotonically increasing, then densities transform as $$p_Z(z) = p_X(g^{-1}(z)) \frac{d}{dz} g^{-1}(z)$$Using this, and noting that $c_n = c(x_n)$ and $x_n = c^{-1}(c_n)$ (N.B. $c$ is invertible wherever $p > 0$), we can change variables from $P\left(c_{n+k}|c_n\right)$ to $P\left(x_{n+k}|x_n\right)$. This grants us:$$P\left(x_{n+k}|x_n\right) = k {N - n \choose k}\frac{\left(1-c(x_{n+k})\right)^{N-n-k}}{\left(1-c(x_n)\right)^{N-n}}\left(c(x_{n+k})-c(x_n)\right)^{k-1}  \frac{d}{d x_{n+k}} c(x_{n+k})$$ $$ = k {N - n \choose k}\frac{\left(1-c(x_{n+k})\right)^{N-n-k}}{\left(1-c(x_n)\right)^{N-n}}\left(c(x_{n+k})-c(x_n)\right)^{k-1}  p(x_{n+k})$$
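
We can also check the change-of-variables rule itself numerically (my own example): push standard normal samples through the monotone map $g(x) = e^x$ and compare a histogram of the results against $p_X(\ln z)\,\frac{1}{z}$:

import numpy as np
from scipy import stats

z = np.exp(np.random.randn(200_000))                 # z = g(x) with x ~ N(0,1)
hist, edges = np.histogram(z, bins=50, range=(0.5, 3.0), density=True)
centers = (edges[:-1] + edges[1:]) / 2
formula = stats.norm.pdf(np.log(centers)) / centers  # p_X(g^{-1}(z)) * d/dz g^{-1}(z)
print(np.max(np.abs(hist - formula)))                # small, sampling noise only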

And now the pdf appears explicitly. In the next post we will start from the above $P\left(c_{n+k}|c_n\right)$ in terms of the cdf values and exploit the fact that the expectation of a function $f(c_{n+k})$ can be reformulated as a Jacobi integral transform in order to work with it.

Arbitrarily reshaped QR code

I didn't post it earlier, but here's the code that creates arbitrarily shaped QR-type decompositions. The code is on a GitHub gist here, but I'm putting the full code below because there's always a risk GitHub removes the gist functionality or the gist is somehow lost.

You can read the docstring to try to understand exactly what this function does, but I'd say it's easier just to run it and look at the output matrices to understand what happens. Note that if 'flipped' is set to True then, while the function will still return the decomposed components in the order Q, R, it will be the case that A = RQ.

import numpy as np

def flipper(A,corner,flipped=False):
    """
    Arbitrary shape QR style decompositions

    Parameters
    ----------
    A : array_like
        Input matrix to be decomposed.
    corner : str
        The 'dense' corner of the triangular matrix, e.g. top-right in a
        standard upper triangular matrix. One of 'bl','tr','br','tl',
        meaning bottom left, top right, etc.
    flipped : bool
        Determines if the decomposition should be Orthogonal-Triangular
        (A = QR) or, if True, Triangular-Orthogonal (A = RQ).

    Returns
    -------
    Q : array_like
        Orthogonal matrix.
    R : array_like
        Triangular matrix.
    """
    if flipped:
        # Work on A^T so that transposing the QR factors at the end gives A = RQ.
        A = A.T
        if 't' in corner:
            A = np.flip(A,axis=1)
        Q,R = np.linalg.qr(A)
        Q = Q.T
        R = R.T
        if 'r' in corner:
            Q = np.flip(Q,axis=0)
            R = np.flip(R,axis=1)
        if 't' in corner:
            R = np.flip(R,axis=0)
    else:
        if 'l' in corner:
            # QR of the column-reversed matrix puts the dense corner on the left.
            Q,R = np.linalg.qr(np.flip(A,axis=1))
            R = np.flip(R,axis=1)
        else:
            Q,R = np.linalg.qr(A)
        if 'b' in corner:
            # Flipping Q's columns and R's rows moves the dense corner downward.
            Q = np.flip(Q,axis=1)
            R = np.flip(R,axis=0)
    return Q,R
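
A quick demonstration (hypothetical usage; any random matrix works), checking reconstruction, orthogonality, and the orientation of the triangle:

A = np.random.randn(5, 5)
Q, R = flipper(A, 'bl')
print(np.allclose(Q @ R, A))            # A = QR
print(np.allclose(Q.T @ Q, np.eye(5)))  # Q is orthogonal
print(np.round(R, 2))                   # dense corner at the bottom left

Q, R = flipper(A, 'tr', flipped=True)
print(np.allclose(R @ Q, A))            # A = RQ when flipped=True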

Function estimation

Ok, here's the idea. Suppose I had a function $$f:X \to Y$$ and distance metrics $d_X:X \times X \to \mathbb{R}$ and $d_Y:Y \times Y \to \mathbb{R}$. If I assumed this function was Lipschitz continuous with Lipschitz constant $L$, I could say $$d_Y(f(x),f(z)) \leq L\, d_X(x,z)$$This isn't very controversial. Suppose I didn't know much about $f$ and thus sampled it at $N$ points, $x_1$ through $x_N$. If I wanted to know about the value of $f$ at a point $z$, I could say that $d_Y(f(x_n),f(z)) \leq L\, d_X(x_n,z)\ \forall n$. Still not very controversial, but hey, we have bounds.

Let's think now: what happens if $Y$ is standard one-dimensional Euclidean space? Well, then we have $$\left|f(x_n)-f(z)\right| \leq L\, d_X(x_n,z)\ \forall n$$Unpacking the absolute value, $-L\, d_X(x_n,z) \leq f(z)-f(x_n) \leq L\, d_X(x_n,z)$. If we add $f(x_n)$ to all parts we then get: $$f(x_n)-L\, d_X(x_n,z) \leq f(z) \leq f(x_n) + L\, d_X(x_n,z)$$Now, since we have these inequalities for all $n$ we can say $$\max_n \left(f(x_n)-L\, d_X(x_n,z)\right) \leq f(z) \leq \min_n \left(f(x_n) + L\, d_X(x_n,z)\right)$$Since all parts of this inequality are calculable, this presents us with a clean way to bound our estimate.
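
Here's a minimal sketch of these bounds in one dimension (my own illustration; the names are made up):

import numpy as np

def lipschitz_bounds(xs, fs, L, z):
    # Provable envelope for f(z) given samples fs = f(xs) and constant L.
    d = np.abs(xs - z)  # d_X(x_n, z) in one dimension
    return np.max(fs - L * d), np.min(fs + L * d)

xs = np.array([0.0, 1.0, 2.5])
print(lipschitz_bounds(xs, np.sin(xs), 1.0, 1.7))  # sin is 1-Lipschitz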

Now we can start having a little fun. Supposing we know nothing about the underlying function, if we assume our samples are not 'special' relative to the underlying function (and we assume a bunch of other very general things which I don't want to get into now because I need to sleep), it's valid to treat the underlying function as a set of observations of a Brownian motion with variance parameter $L$. From this we can at a minimum identify an MLE for $L$ and use that within the above bounds calculation.
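
A sketch of that estimate under the (loudly assumed) Brownian model: for sorted inputs the increments satisfy $f(x_{i+1})-f(x_i) \sim \mathcal{N}\left(0, L\left(x_{i+1}-x_i\right)\right)$, so the MLE of $L$ is the mean squared increment per unit gap:

import numpy as np

def mle_variance(xs, fs):
    # MLE of the Brownian variance parameter from irregularly spaced samples.
    order = np.argsort(xs)
    x, f = xs[order], fs[order]
    return np.mean(np.diff(f) ** 2 / np.diff(x))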

QL decomposition

A couple of days ago I encountered a really neat solution to creating a basis using a Cholesky decomposition. Particularly, when we have a set of basis functions $Y = \{y_0,y_1,\cdots , y_n\}$, we will want a set of coefficient vectors $C = \{\vec{c}_0,\vec{c}_1,\cdots , \vec{c}_n\}$ such that $\langle \sum_j^n c_{k,j} y_j,  \sum_i^n c_{l,i} y_i\rangle = \delta_{k,l}$, and ideally we also want the constraint $c_{k,j} = 0\ \forall j>k$. More succinctly: an orthonormal basis where there are only order $k$ or lower terms in the $k$-th basis element. Putting that into matrix terms (and using some abusive notation for the matrix of inner products) we want to find a triangular $R$ such that $R^TY^TYR = I$.

The thinking man would use a Cholesky decomposition and an inverse to say that if $Y^TY = LL^T$ then $L^{-1}Y^TYL^{-T} = I$, and as $L^{-T}$ is upper triangular, we're finished.
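
In code, the thinking man's route is a few lines (a sketch; scipy's solve_triangular avoids forming the inverse explicitly):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

Y = np.random.randn(50, 6)
L = cholesky(Y.T @ Y, lower=True)                   # Y^T Y = L L^T
R = solve_triangular(L, np.eye(6), lower=True).T    # R = L^{-T}, upper triangular
print(np.allclose(R.T @ (Y.T @ Y) @ R, np.eye(6)))  # R^T Y^T Y R = I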

I however am not a thinking man, just a mindless automaton, driven by a basic response to stimuli and only distantly related to the chordates as a whole. As such I have only two tools at my disposal: an SVD and a QR decomposition. How then may I solve this?

Given symmetry we know that if $Y^TY = U \Sigma V^T$ then $U = V$ and thus $Y^TY = U \Sigma U^T$. Multiplying by $\Sigma^{-1/2}U^T$ on the left and $U \Sigma^{-1/2}$ on the right, we get $\Sigma^{-1/2}U^TY^TYU \Sigma^{-1/2}= I$. This is nice, but not upper triangular. We note here that if we found $Q$ and $R$ such that $U \Sigma^{-1/2}=RQ$, we could then multiply $Q^TR^TY^TYRQ$ by $Q$ on the left and $Q^T$ on the right to get $R^TY^TYR = QQ^TR^TY^TYRQQ^T = QQ^T = I$.

How then do we find $R$ and $Q$? Well, we see this is equivalent to finding $Q$ and $R$ such that $\Sigma^{-1/2}U^T = Q^T R^T$. This is a QR decomposition, but with a lower rather than an upper triangular factor.

Because common sense can not and will not prevail, we do the obvious thing, which is to look at what happens when we 'flip' a matrix. Formally, $flip(A^{n \times m},0)_{i,j} = A_{n+1-i,j}$ and $flip(A^{n \times m},1)_{i,j} = A_{i,m+1-j}$, that is, flipping the matrix up and down, and left to right, respectively.

When we apply this flip operation to a matrix product, we see $flip(AB,0) = flip(A,0) B$ and $flip(AB,1) = A flip(B,1)$.
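
Both identities are easy to check numerically (my own quick check):

import numpy as np

A, B = np.random.randn(3, 4), np.random.randn(4, 5)
print(np.allclose(np.flip(A @ B, axis=0), np.flip(A, axis=0) @ B))
print(np.allclose(np.flip(A @ B, axis=1), A @ np.flip(B, axis=1)))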

Suppose then that we took a QR decomposition of $flip(\Sigma^{-1/2}U^T,1) = \tilde{Q} \tilde{R}$. We see then that $\tilde{Q} ^T flip(\Sigma^{-1/2}U^T,1) = \tilde{R}$, then that $flip(\tilde{Q} ^T flip(\Sigma^{-1/2}U^T,1),0) = flip(\tilde{Q} ^T,0) flip(\Sigma^{-1/2}U^T,1) = flip(\tilde{R},0)$, and then $flip(flip(\tilde{Q} ^T,0) flip(\Sigma^{-1/2}U^T,1),1) =flip(\tilde{Q} ^T,0) \Sigma^{-1/2}U^T = flip(flip(\tilde{R},0),1)$. Thus, as $flip(flip(\tilde{R},0),1)$ is lower triangular, we get a solution for $\Sigma^{-1/2}U^T = Q^T R^T$ where $R = flip(flip(\tilde{R},0),1)^T$ and $Q = flip(\tilde{Q}^T,0)$. Here's some code you can run to verify it for yourself.

import numpy as np

Y = np.random.randn(50, 6)                  # any full-column-rank Y works
u, s, v = np.linalg.svd(Y.T @ Y)
c = u / np.sqrt(s)                          # U Sigma^{-1/2}
q, r = np.linalg.qr(np.flip(c.T, axis=1))
c = np.flip(np.flip(r, axis=0), axis=1).T   # upper triangular coefficients
print(np.allclose(c.T @ (Y.T @ Y) @ c, np.eye(6)))  # R^T Y^T Y R = I

This process can be used to generate decompositions for every orientation of triangular matrix, with the triangular factor on either side of the orthonormal one.

March thoughts

Let's start by taking any system of ordinary differential equations. We can of course convert this to a first-order system by creating stand-...