Gaussian process regression

Jihong Ju on February 14, 2019

A big advantage of Bayesian regression models over other regression methods, such as logistic regression, is that a Bayesian model yields the probability distribution of the parameters $\theta$ over the entire parameter space, whereas logistic regression only finds a single choice of parameters that maximizes the likelihood.

In this post, we will discuss an extension of the Bayesian linear regression model: the Gaussian process regression model.

Multivariate Gaussian

Before we move forward, let us quickly recap some basics of the Gaussian distribution.

We denote a set of random variables $x \in R^n$ sampled from a Gaussian distribution with mean vector $\mu \in R^n$ and covariance matrix $\Sigma \in S^{n}_{++}$ as:

$$ x \sim \mathcal{N}(\mu, \Sigma) $$
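
For concreteness, here is a minimal NumPy sketch of drawing samples from such a distribution (the mean and covariance values are made up for illustration):

```python
import numpy as np

mu = np.array([0.0, 1.0])                      # mean vector, n = 2
sigma = np.array([[1.0, 0.8],                  # covariance matrix, must be
                  [0.8, 1.0]])                 # symmetric positive definite
x = np.random.multivariate_normal(mu, sigma, size=1000)  # 1000 draws of x ~ N(mu, Sigma)
```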

Suppose the variables are partitioned into two sets, $x_A$ and $x_B$:

$$ x = \begin{bmatrix} x_A \\ x_B \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \right) $$

The conditional probability distribution of $x_A$ given $x_B$ is

$$ x_A \vert x_B \sim \mathcal{N}\left( \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B),\ \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right) $$

and similarly for $x_B \vert x_A$.
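
As a sketch of how this rule translates to code, the helper below (a hypothetical `conditional_gaussian`, not from any particular library) computes the conditional mean and covariance with NumPy:

```python
import numpy as np

def conditional_gaussian(mu_a, mu_b, sigma_aa, sigma_ab, sigma_bb, x_b):
    """Parameters of p(x_A | x_B = x_b) for jointly Gaussian (x_A, x_B)."""
    # Solve against Sigma_BB rather than inverting it explicitly.
    solved = np.linalg.solve(sigma_bb, x_b - mu_b)     # Sigma_BB^{-1} (x_b - mu_B)
    mu_cond = mu_a + sigma_ab @ solved                 # conditional mean
    sigma_cond = sigma_aa - sigma_ab @ np.linalg.solve(sigma_bb, sigma_ab.T)
    return mu_cond, sigma_cond
```

Using `np.linalg.solve` instead of forming $\Sigma_{BB}^{-1}$ directly is both faster and more numerically stable.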

Gaussian process

A Gaussian process is an extension of the Gaussian distribution. Recall that a Gaussian distribution is a distribution over random variables. A Gaussian process, by contrast, models a distribution over a stochastic process, namely over functions that can be defined by a collection of random variables.

Let

$$ X = \{x_1, \dots, x_m\} $$

be a finite collection of elements $x \in R^n$ and

$$ \{f(x_1), \dots, f(x_m)\} $$

be a collection of random variables, where

$$ f: R^n \to R $$

is a function that maps each element $x_i$ to $f(x_i)$.

There could be infinitely many such functions $f$. To represent a particular function, $f_0$ for example, we use a vector representation

$$ \vec{f_0} = \begin{bmatrix} f_0(x_1) & f_0(x_2) & \dots & f_0(x_m) \end{bmatrix}^T $$
Any finite sub-collection of random variables sampled from a Gaussian process with mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot)$ has

$$ \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_m) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_m) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \dots & k(x_m, x_m) \end{bmatrix} \right) $$

denoted as

$$ f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)) $$
where $m(\cdot)$ can be any real-valued function, but $k(\cdot, \cdot)$ must be such that the resulting kernel matrix $K$ is a valid covariance matrix, i.e., $K \in S^{m}_{++}$.

An example of such a kernel is the squared exponential kernel:

$$ k(x, x^{\prime}) = \exp\left( -\frac{\lVert x - x^{\prime} \rVert^2}{2\tau^2} \right) $$

This kernel has the property of local smoothness (a code sketch follows the list below) because:

  • High correlation: $k(x, x^{\prime}) \to 1$ as $x$ and $x^{\prime}$ become close to each other, i.e., $\lVert x - x^{\prime} \rVert \to 0$
  • Low correlation: $k(x, x^{\prime}) \to 0$ as $x$ and $x^{\prime}$ move far apart, i.e., $\lVert x - x^{\prime} \rVert \to \infty$
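
To make this concrete, here is a minimal sketch of the squared exponential kernel and of drawing functions from the zero-mean GP prior it induces; the helper name `squared_exponential` and the lengthscale `tau` are illustrative choices, not fixed by the post:

```python
import numpy as np

def squared_exponential(xa, xb, tau=1.0):
    """Squared exponential kernel matrix between row-wise point sets xa, xb."""
    # Pairwise squared Euclidean distances, shape (len(xa), len(xb)).
    sq_dists = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * tau ** 2))

# Draw three functions from a zero-mean GP prior, evaluated on a grid.
x = np.linspace(-5, 5, 100).reshape(-1, 1)
K = squared_exponential(x, x)
# Small jitter keeps K numerically positive semi-definite despite rounding.
samples = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```

Each row of `samples` is the vector representation of one function drawn from the prior, evaluated on the grid `x`.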

Gaussian process regression

Let

$$ S = \{(x_i, y_i)\}_{i=1}^{m} $$

be the training set of i.i.d. examples from an unknown distribution.

Let

$$ S_* = \{(x_{*,i}, y_{*,i})\}_{i=1}^{m_*} $$

be the test set of i.i.d. examples from the same distribution.

We have the Gaussian process regression model

$$ y_i = f(x_i) + \epsilon_i $$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ models the stochastic noise by drawing i.i.d. samples from an independent Gaussian distribution.

With a zero-mean Gaussian process prior

$$ f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot)) $$

the vector representations of $f$ evaluated on the training inputs $X$ and the test inputs $X_*$ satisfy

$$ \begin{bmatrix} \vec{f} \\ \vec{f_*} \end{bmatrix} \Bigg\vert X, X_* \sim \mathcal{N}\left( \vec{0}, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right) $$

Given our Gaussian process regression model and the i.i.d. noise assumption, the observed outputs $\vec{y}$ and $\vec{y_*}$ are then jointly Gaussian as well:

$$ \begin{bmatrix} \vec{y} \\ \vec{y_*} \end{bmatrix} \Bigg\vert X, X_* \sim \mathcal{N}\left( \vec{0}, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{bmatrix} \right) $$

Now, using the conditional distribution rule for Gaussians recapped at the beginning of the post, we can compute the posterior predictive distribution analytically:

$$ \vec{y_*} \vert \vec{y}, X, X_* \sim \mathcal{N}(\mu_*, \Sigma_*) $$

where

$$ \mu_* = K(X_*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} \vec{y} $$

$$ \Sigma_* = K(X_*, X_*) + \sigma^2 I - K(X_*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} K(X, X_*) $$
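
Putting the pieces together, a minimal NumPy sketch of the posterior predictive computation might look as follows. It reuses the `squared_exponential` helper defined earlier; the function name, signature, and toy data are illustrative, not a definitive implementation:

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, kernel, noise_var=0.1):
    """Posterior predictive mean and covariance for zero-mean GP regression."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))   # K(X, X) + sigma^2 I
    K_s = kernel(x_test, x_train)                                     # K(X_*, X)
    K_ss = kernel(x_test, x_test) + noise_var * np.eye(len(x_test))   # K(X_*, X_*) + sigma^2 I
    # Cholesky factorization is cheaper and more stable than a direct matrix inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))         # (K + sigma^2 I)^{-1} y
    mu_star = K_s @ alpha                                             # posterior mean
    v = np.linalg.solve(L, K_s.T)
    sigma_star = K_ss - v.T @ v                                       # posterior covariance
    return mu_star, sigma_star

# Usage: noisy observations of a sine function, predictions on a grid.
x_train = np.random.uniform(-4, 4, (20, 1))
y_train = np.sin(x_train).ravel() + 0.1 * np.random.randn(20)
x_test = np.linspace(-5, 5, 100).reshape(-1, 1)
mu_star, sigma_star = gp_posterior(x_train, y_train, x_test, squared_exponential)
```

The diagonal of `sigma_star` gives the predictive variance at each test point, which is exactly the quantity that point-estimate regression methods cannot provide.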