This series is about a method for identifying the family of distributions that a set of random samples has been drawn from. It has to be multi-part because too many different things are involved, but here's a general table of contents.
- Sort the data
- Note that the distance between sequential data-points is inversely related to the density function
- Note then that the changes in distance between successive points provide information on the derivative of the underlying density function
- Examine the data in the context of its CDF, allowing us to derive the probability of particular spacings given a particular PDF
- Compose a finite difference scheme and use it to construct a matrix $F$ whose entries provably have expected values $\left[p(x),p'(x),p''(x), \cdots \right]$
- Use the knowledge that particular probability distributions are governed by specific families of ODEs, i.e. $N\left[p\right] = 0$ for some differential operator $N$
- Use the smallest singular vector of $F$ to identify an estimate of the specific differential equation governing the sample distribution
- Match this singular vector to one of the manifolds that contain a family of distributions
- Write angry letters to the guys at fitter
Seems simple enough, right? Intuitively: "The difference between sample points encodes information about the density and how it changes. When the density is governed by a differential equation, this will be reflected in an observable restriction on the differences between sample points."
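As a quick numerical sanity check of that intuition, here's a minimal sketch, assuming a standard normal purely as an example distribution (nothing beyond numpy is used): the gap between successive sorted samples is roughly $1/(N\,p(x))$, so the reciprocal of a local average of gaps tracks the density.

```python
import numpy as np

# Minimal sketch: near a point x, the gap between successive sorted samples is
# roughly 1 / (N * p(x)), so k consecutive gaps span roughly k / (N * p(x)).
rng = np.random.default_rng(0)
N, k = 100_000, 50
x = np.sort(rng.standard_normal(N))      # example distribution: standard normal

span = x[k:] - x[:-k]                    # width covered by k consecutive spacings
density_est = k / (N * span)             # crude local density estimate
mid = 0.5 * (x[k:] + x[:-k])             # where each estimate lives

true_density = np.exp(-mid ** 2 / 2) / np.sqrt(2 * np.pi)
rel_err = np.abs(density_est - true_density) / true_density
# Typically around 0.1 for this k; a single raw gap would be far noisier.
print(np.median(rel_err))
```

Individual gaps are far too noisy to use on their own; some local averaging (or the finite difference scheme built later in the series) is needed to get stable numbers.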
Let's start with a set of $N$ samples from a distribution with a pdf $p: \mathbb{R} \to \mathbb{R}^+$ and a cdf $c: \mathbb{R} \to \left[0,1\right]$. We index these samples by their value and use this to create an ordered list $x = \left(x_1,x_2,\cdots , x_N \right)$ where $x_n \geq x_m$ whenever $n>m$.
Since we know that the values of $c(x)$ for our sample are uniformly distributed, it makes sense to look at the conditional probability of the values of $c_{n+1}$ given $c_n$ (note that here we use the notation $c(x_n) = c_n$ for compactness). Asking for the probability that the next point is greater than a value $Z$ is equivalent to asking for the probability that every one of the remaining cdf samples, which we know lie between $c_n$ and $1$, avoids the region between $c_n$ and $Z$. Each does so independently with probability $\frac{1-Z}{1-c_n}$, so we get:
$$P(c_{n+1} \geq Z|c_n) = \left(\frac{1-Z}{1-c_n}\right)^{N-n}$$
Taking the negative derivative with respect to $Z$, we then find the density
$$P(c_{n+1} = Z|c_n) = \left(N-n\right)\frac{\left(1-Z\right)^{N-n-1}}{\left(1-c_n\right)^{N-n}}$$
This is then enough to create:
$$P(c_{n+k},c_{n+k-1},\cdots , c_{n+1}|c_n) = \prod^{k}_{j=1} P(c_{n+j}|c_{n+j-1}) = \left(\prod^{k-1}_{j=0}\left(N-n-j\right)\right) \prod^{k}_{j=1}\frac{\left(1-c_{n+j}\right)^{N-n-j}}{\left(1-c_{n+j-1}\right)^{N-n-j+1}}$$
$$= \frac{\left(N-n\right)!}{\left(N-n-k\right)!}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}}$$
(the second product telescopes, leaving only the endpoints). This result might seem unintuitive, that the intermediate samples don't contribute to the probability of the sequence, but it's a natural consequence of any particular ordering of uniform samples being equally likely.

The good part comes from the following question: if we don't care about the intermediate values at all, how does that change the result? That is, what is $P\left(c_{n+k}|c_n\right)$?
$$P\left(c_{n+k}|c_n\right) = \int P(c_{n+k},c_{n+k-1},\cdots , c_{n+1}|c_n)\, d c_{n+1} \cdots d c_{n+k-1} =\frac{\left(N-n\right)!}{\left(N-n-k\right)!}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}} \int 1\, d c_{n+1} \cdots d c_{n+k-1}$$
The remaining integral is the $(k-1)$-volume of the feasible region that preserves the ordering, i.e. the set of $c_{n+1}, \cdots, c_{n+k-1}$ with $c_n \leq c_{n+1} \leq \cdots \leq c_{n+k-1} \leq c_{n+k}$. For $c_n = 0$ and $c_{n+k} = 1$ this volume $V_m$ (in dimension $m = k-1$) can be calculated inductively, with $V_{m+1} = \frac{V_{m}}{m+1}$ and $V_1 = 1$, giving $V_{m} = \frac{1}{m!}$. Rescaling to the actual range from $c_n$ to $c_{n+k}$ multiplies this by the factor $\left(c_{n+k}-c_n\right)^{k-1}$, which gives us the following solution:
$$P\left(c_{n+k}|c_n\right) =\frac{\left(N-n\right)!}{\left(k-1\right)!\left(N-n-k\right)!}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}}\left(c_{n+k}-c_n\right)^{k-1} = k {N - n \choose k}\frac{\left(1-c_{n+k}\right)^{N-n-k}}{\left(1-c_n\right)^{N-n}}\left(c_{n+k}-c_n\right)^{k-1}$$
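Since it's easy to get lost in the algebra, here's a small simulation sketch of that last formula. The parameter values below are arbitrary choices for illustration: conditioned on $c_n$, the remaining $N-n$ cdf values are i.i.d. uniform on $(c_n, 1)$, so we can draw $c_{n+k}$ directly and compare its histogram to the derived density.

```python
import numpy as np
from scipy.special import comb

# Arbitrary illustrative parameters (assumptions for this sketch only).
N, n, k = 20, 5, 3
c_n = 0.3            # assumed value of the n-th cdf order statistic
m = N - n            # number of samples known to lie to the right of c_n

rng = np.random.default_rng(0)
# Conditioned on c_n, the remaining m cdf values are i.i.d. Uniform(c_n, 1),
# and c_{n+k} is the k-th smallest of them.
remaining = rng.uniform(c_n, 1.0, size=(200_000, m))
c_nk = np.sort(remaining, axis=1)[:, k - 1]

# The density derived above: k * C(N-n, k) * (1-z)^(N-n-k) * (z-c_n)^(k-1) / (1-c_n)^(N-n)
def derived_density(z):
    return (k * comb(m, k)
            * (1.0 - z) ** (m - k)
            * (z - c_n) ** (k - 1)
            / (1.0 - c_n) ** m)

hist, edges = np.histogram(c_nk, bins=50, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
# Relative to the peak of the density, the deviation should be a few per cent.
print(np.max(np.abs(hist - derived_density(centres))) / derived_density(centres).max())
```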
Now we can recast this in terms of the samples $x$. Recall that if $z = g(x)$ with $g$ monotonically increasing, then $$p_Z(z) = p_X(g^{-1}(z)) \frac{d}{dz} g^{-1}(z)$$Using this, and noting that $c_n = c(x_n)$ and $x_n = c^{-1}(c_n)$ (N.B. $c$ is invertible wherever $p > 0$), we can change variables to take $P\left(c_{n+k}|c_n\right)$ to $P\left(x_{n+k}|x_n\right)$. This grants us:$$P\left(x_{n+k}|x_n\right) = k {N - n \choose k}\frac{\left(1-c(x_{n+k})\right)^{N-n-k}}{\left(1-c(x_n)\right)^{N-n}}\left(c(x_{n+k})-c(x_n)\right)^{k-1} \frac{d}{d x_{n+k}} c(x_{n+k})$$ $$ = k {N - n \choose k}\frac{\left(1-c(x_{n+k})\right)^{N-n-k}}{\left(1-c(x_n)\right)^{N-n}}\left(c(x_{n+k})-c(x_n)\right)^{k-1} p(x_{n+k})$$
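The same kind of simulation check works after the change of variables. Below is a sketch for a unit exponential, an assumed example chosen only because conditioning is easy: by memorylessness, conditioned on $x_n$ each remaining sample is just $x_n$ plus a fresh exponential draw, so we can simulate $x_{n+k}$ and compare against the expression above with $c(x) = 1 - e^{-x}$ and $p(x) = e^{-x}$.

```python
import numpy as np
from scipy.special import comb

# Arbitrary illustrative parameters; the unit exponential is just an assumed example.
N, n, k = 20, 5, 3
x_n = 0.4            # assumed value of the n-th order statistic
m = N - n

c = lambda x: 1.0 - np.exp(-x)   # cdf of the unit exponential
p = lambda x: np.exp(-x)         # pdf of the unit exponential

rng = np.random.default_rng(1)
# Conditioned on x_n, each remaining sample comes from the exponential truncated
# to (x_n, inf); by memorylessness that is x_n plus a fresh Exp(1) draw.
remaining = x_n + rng.exponential(1.0, size=(200_000, m))
x_nk = np.sort(remaining, axis=1)[:, k - 1]

# The conditional density derived above, including the p(x_{n+k}) Jacobian factor.
def derived_density(x):
    return (k * comb(m, k)
            * (1.0 - c(x)) ** (m - k)
            / (1.0 - c(x_n)) ** m
            * (c(x) - c(x_n)) ** (k - 1)
            * p(x))

hist, edges = np.histogram(x_nk, bins=50, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
# Again, the deviation should be a few per cent of the peak density.
print(np.max(np.abs(hist - derived_density(centres))) / derived_density(centres).max())
```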
Notice that the pdf now appears explicitly in the spacing distribution. In the next post we will start from the above $P\left(c_{n+k}|c_n\right)$, expressed in terms of the cdf values, and exploit the fact that the expectation of a function $f(c_{n+k})$ can be reformulated as a Jacobi integral transform in order to work with it.