*This post was originally published at Towards Data Science*

`Gaussian Processes` have been around for a while, but it’s only really in the past 5–10 years that there’s been a big resurgence of interest in them. This is partly due to their computational complexity: fitting a GP model requires a matrix inversion, which is `O(n³)`, and it’s tough to do much better. Because of this they were intractable for a long time, as computational power was too weak, but in the past few years, with so much research and money behind `ML`, they’ve become a lot more practical.
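To make the `O(n³)` claim concrete, here is a minimal sketch of GP regression with NumPy. The RBF kernel, the toy data, and the noise level are all illustrative choices of mine, not from any particular library; the point is that the Cholesky factorisation of the n × n Gram matrix is the cubic-cost step.

```python
import numpy as np

# Minimal GP-regression sketch (illustrative kernel and data, not from the post).
def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two sets of 1-D points.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=20)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=20)
x_test = np.linspace(-3, 3, 50)

K = rbf_kernel(x_train, x_train) + 1e-2 * np.eye(20)  # noisy n x n Gram matrix
K_star = rbf_kernel(x_test, x_train)

# The Cholesky factorisation below is the O(n^3) step that dominates GP inference.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
posterior_mean = K_star @ alpha  # predictive mean at the test points
```

For n in the tens this is instant; for n in the hundreds of thousands it is where the intractability mentioned above comes from.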

One of the cool features of `Gaussian Processes` is that they’re very, very similar to `neural networks`. So similar, in fact, that it’s relatively well known that `Gaussian Processes (GPs)` are equivalent to single-layer fully-connected `neural networks`, in the limit of infinite width, with an i.i.d. prior over their parameters.

I’d like to make this point clear: the proof below is quite simple, but it has far-reaching consequences. The `central limit theorem` can unify visibly complicated phenomena in such a way that, in this case, one of the best-performing models can be seen to be a special case of a machine learning model whose field has not fully matured yet.

Yes, research into `GPs` is long-standing, but only in the past few years have researchers developed `deep gaussian processes`, which can characterise non-linear patterns (like jumps), which DNNs were made to do (specifically, being able to model the XOR logic). So from this we can see there’s so much more to gain.

I always wanted to look into this proof, and it turned out to be surprisingly simple. The following has been taken from the paper by Lee et al. at Google Brain, so I’d like to thank them for making it so accessible.

## A bit of notation

Note: you can’t do subscript for everything on Medium, so if you see an underscore (i.e. M_l), assume this means M with l as a subscript. So M_i = Mᵢ.

Consider an L-layer fully-connected neural network with `hidden layers` of width N_l (for layer l). Let x ∈ R^(d_in) denote the input to the `network` and let zˡ denote its output at layer l. The i’th components of the `activations` in the l’th layer are denoted xˡᵢ (post-activation) and zˡᵢ (pre-activation). The `weight` and `bias` parameters at the l’th layer have components Wˡᵢⱼ and bˡᵢ, which are iid, and we assume them to have zero `mean` and `variances` of σ²_w/N_l for the weights and σ²_b for the biases.

## The `Neural Network`

Now we know that the i’th component of the `neural network` output, zˡᵢ(x), is computed as follows:

zˡᵢ(x) = bˡᵢ + Σⱼ Wˡᵢⱼ xˡⱼ(x),  with xˡⱼ(x) = φ(zˡ⁻¹ⱼ(x)) for some point-wise non-linearity φ,

where we have shown the dependence on the input x. As the `weight` and `bias` parameters are assumed to be iid, the post-activations xˡⱼ and xˡⱼ’ are independent for j ≠ j’.
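The computation above can be sketched directly in NumPy. This is my own illustration under the parameterisation stated in the notation section (weights with variance σ²_w/N, biases with variance σ²_b); tanh is an illustrative choice of φ, not something the proof depends on.

```python
import numpy as np

# Sketch of z^l_i(x) = b^l_i + sum_j W^l_ij * x^l_j(x), with x^l_j = phi(z^(l-1)_j).
# Parameter variances follow the text: weights sw2/N_l, biases sb2.
rng = np.random.default_rng(3)
d_in, N1, sw2, sb2 = 3, 100, 1.0, 1.0

x = rng.normal(size=d_in)                                  # network input
W0 = rng.normal(0.0, np.sqrt(sw2 / d_in), size=(N1, d_in)) # variance sw2/d_in
b0 = rng.normal(0.0, np.sqrt(sb2), size=N1)
z0 = W0 @ x + b0              # pre-activations at the hidden layer
x1 = np.tanh(z0)              # post-activations x^1_j(x); tanh is illustrative

W1 = rng.normal(0.0, np.sqrt(sw2 / N1), size=(1, N1))      # variance sw2/N1
b1 = rng.normal(0.0, np.sqrt(sb2), size=1)
z1 = W1 @ x1 + b1             # the network output z^1_i(x)
print(z1)
```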

Now, as zˡᵢ(x) is a sum of iid terms, it follows from the `central limit theorem` that in the limit of infinite width (N_l → ∞), zˡᵢ(x) is `gaussian distributed`.
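This limit is easy to check empirically. The sketch below, which is my own and not code from the paper, draws many random one-hidden-layer networks (tanh as the illustrative non-linearity, σ²_w = σ²_b = 1) and measures the excess kurtosis of the scalar output at a fixed input; it should shrink toward 0, the Gaussian value, as the width grows.

```python
import numpy as np

# Empirical check of the CLT step: the output distribution of a random
# one-hidden-layer network approaches a Gaussian as the width N grows.
rng = np.random.default_rng(1)

def sample_outputs(width, n_samples=20000, x=0.5, sw2=1.0, sb2=1.0):
    # Layer-0 parameters: W0 has variance sw2/d_in (here d_in = 1), b0 has sb2.
    W0 = rng.normal(0.0, np.sqrt(sw2 / 1.0), size=(n_samples, width))
    b0 = rng.normal(0.0, np.sqrt(sb2), size=(n_samples, width))
    h = np.tanh(W0 * x + b0)  # post-activations x1_j(x)
    # Layer-1 parameters: W1 has variance sw2/width, as in the text.
    W1 = rng.normal(0.0, np.sqrt(sw2 / width), size=(n_samples, width))
    b1 = rng.normal(0.0, np.sqrt(sb2), size=n_samples)
    return (W1 * h).sum(axis=1) + b1  # one draw of z1(x) per network

for width in (1, 10, 1000):
    z = sample_outputs(width)
    # Excess kurtosis is 0 for a Gaussian; it should shrink as width grows.
    excess_kurtosis = np.mean(((z - z.mean()) / z.std()) ** 4) - 3.0
    print(f"width={width:5d}  excess kurtosis={excess_kurtosis:+.3f}")
```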

## The Gaussian Process

Likewise, from the `multidimensional` CLT, we can infer that any finite collection of the variables zˡᵢ will be jointly `multivariate gaussian`, which happens to be the exact definition of a `Gaussian Process`.

Therefore we can conclude that zˡᵢ(x) ~ GP(µ¹, K¹), where GP denotes a `Gaussian Process` with mean µ¹ and covariance K¹, which are themselves independent of i. As the parameters have zero mean, we have that µ¹ = 0, but K¹(x, x’) is as follows:

K¹(x, x’) = E[z¹ᵢ(x) z¹ᵢ(x’)] = σ²_b + σ²_w E[x¹ᵢ(x) x¹ᵢ(x’)]

where this `covariance` is obtained by integrating against the distribution of W⁰ and b⁰. Note now that as any two zˡᵢ and zˡⱼ for i ≠ j are `jointly Gaussian` and have zero covariance, they are guaranteed to be independent despite utilising the same features produced by the hidden layer.
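Both claims can be illustrated numerically. The sketch below (my own, with tanh as the illustrative activation and σ²_w = σ²_b = 1) estimates the integral over W⁰ and b⁰ by Monte Carlo, and checks that two output units, which share the hidden features but have independent weight rows, have near-zero empirical covariance.

```python
import numpy as np

# Monte Carlo estimate of K1(x, x') = sb2 + sw2 * E[x1(x) * x1(x')],
# integrating over the layer-0 parameters W0, b0 as described in the text.
rng = np.random.default_rng(2)
sw2, sb2 = 1.0, 1.0
x, x_prime = 0.5, -1.0
n_samples = 200000

W0 = rng.normal(0.0, np.sqrt(sw2), size=n_samples)  # d_in = 1
b0 = rng.normal(0.0, np.sqrt(sb2), size=n_samples)
post_x = np.tanh(W0 * x + b0)         # x1(x) for each parameter draw
post_xp = np.tanh(W0 * x_prime + b0)  # x1(x')

K1 = sb2 + sw2 * np.mean(post_x * post_xp)
print(f"K1(x, x') ~ {K1:.3f}")

# Independence of two output units: they reuse the same hidden features h
# but have independent weight rows, so cov(z1_i, z1_j) should be ~0.
N, n_nets = 50, 20000
W0m = rng.normal(0.0, np.sqrt(sw2), size=(n_nets, N))
b0m = rng.normal(0.0, np.sqrt(sb2), size=(n_nets, N))
h = np.tanh(W0m * x + b0m)
W1 = rng.normal(0.0, np.sqrt(sw2 / N), size=(n_nets, 2, N))  # two output units
b1 = rng.normal(0.0, np.sqrt(sb2), size=(n_nets, 2))
z = np.einsum('nuj,nj->nu', W1, h) + b1
cov_ij = np.cov(z[:, 0], z[:, 1])[0, 1]
print("cov(z_i, z_j) ~", cov_ij)
```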

Some proofs are simple and logical, and the magic of the central limit theorem is that it unifies everything under a `Gaussian Distribution`. `Gaussian distributions` are great because `marginalising` over and `conditioning` on a variable (or dimension) both result in a `Gaussian distribution`, and their functional form is quite simple, so things can be condensed into closed-form solutions (so optimisation techniques are rarely required).
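The closed-form conditioning alluded to above is the standard multivariate-Gaussian result: if [x₁, x₂] are jointly Gaussian with means m₁, m₂ and covariance blocks S₁₁, S₁₂, S₂₂, then p(x₁ | x₂) is Gaussian with mean m₁ + S₁₂S₂₂⁻¹(x₂ − m₂) and covariance S₁₁ − S₁₂S₂₂⁻¹S₂₁. A toy 2-D sketch with made-up numbers:

```python
import numpy as np

# Closed-form Gaussian conditioning in 2-D (scalar blocks, so division
# stands in for the matrix solve you would use in higher dimensions).
m = np.array([0.0, 1.0])       # means of x1 and x2 (made-up values)
S = np.array([[2.0, 0.8],
              [0.8, 1.0]])     # joint covariance matrix
x2_observed = 2.0

cond_mean = m[0] + S[0, 1] / S[1, 1] * (x2_observed - m[1])
cond_var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]
print(cond_mean, cond_var)     # approximately 0.8 and 1.36
```

No iterative optimisation is needed anywhere: the posterior is written down directly, which is exactly what makes GP inference so clean.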
