Gaussian process vs kernel ridge regression
10 Sep 2017

Consider the prediction problem: given a dataset $\mathcal{D} = \{ (\mathbf{x}_i, y_i) \}_{i=1}^{N}$ of pairs of inputs $\mathbf{x}_i \in \mathbb{R}^n$ and outputs $y_i \in \mathbb{R}$, one wishes to predict the output $y$ for a new input $\mathbf{x}$. We first solve this problem using plain ridge regression, also known as weight decay or Tikhonov regularization. After that, we transform the solution by means of the matrix inversion lemma into a form amenable to kernelization: at this point, inner products of feature vectors can be replaced by a kernel function, leading to kernel ridge regression. Direct comparison of the kernel ridge regression solution with the mean of the predictive distribution returned by a Gaussian process with the same kernel establishes their equivalence. Thus, if the uncertainty of a prediction is irrelevant, Gaussian process regression and kernel ridge regression can be used interchangeably.
Ridge regression
Consider the model $p(y|\mathbf{w}) = \mathcal{N}\left(y | \mathbf{w}^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\right)$, linear in the features $\boldsymbol{\phi}(\mathbf{x}) \in \mathbb{R}^M$, with the output $y$ corrupted by zero-mean Gaussian noise of precision $\beta$. Assembling the $N$ outputs into a vector $\mathbf{y} \in \mathbb{R}^N$ and assuming that the data points are drawn independently, the likelihood splits into a product of independent terms (see Bishop, Formula 3.10)
$$
p(\mathbf{y}|\mathbf{w}) = \prod_{i=1}^{N} \mathcal{N}\left( y_i \,\middle|\, \mathbf{w}^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}_i), \beta^{-1} \right).
$$
Given a Gaussian prior over the parameters $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I})$, the mean of the posterior distribution $p(\mathbf{w}|\mathbf{y})$ is a linear function of the targets $\mathbf{y}$ (see Bishop, Formula 3.53)
$$
\boldsymbol{\mu}_{\mathbf{w}|\mathbf{y}} = \left( \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\mathrm{T} \mathbf{y},
$$
where $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$ is the design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_i)^\mathrm{T}$ and $\lambda = \alpha / \beta$ is a regularization parameter. When a test input $\mathbf{x}$ arrives, the maximum a posteriori estimate of the output is given by the mean of the predictive distribution (see Bishop, Formula 3.58)
\begin{equation} \label{RR}
\mu_{y|\mathbf{y}} = \boldsymbol{\phi}^\mathrm{T} \left( \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\mathrm{T} \mathbf{y},
\end{equation}
where $\boldsymbol{\phi} = \boldsymbol{\phi}(\mathbf{x})$ is the feature vector of the test input $\mathbf{x}$. Formula \eqref{RR} reveals an interesting property of linear regression: the predictive mean $\mu_{y|\mathbf{y}}$ is a linear combination of the targets $y_i$ from the dataset with weights
$$
\omega_i = \boldsymbol{\phi}^\mathrm{T} \left( \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\phi}(\mathbf{x}_i).
$$
That is, $\mu_{y | \mathbf{y}} = \sum_{i=1}^N \omega_i y_i$. Regression functions such as this, which make predictions by taking linear combinations of the training-set target values, are known as linear smoothers.
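To make this concrete, here is a minimal NumPy sketch of Formula \eqref{RR} and of the linear-smoother view. The polynomial feature map, the toy data, and the value of $\lambda$ are arbitrary choices for illustration, not part of the derivation above.

```python
import numpy as np

def phi(x, M=4):
    """Hypothetical feature map: polynomial features 1, x, ..., x^(M-1)."""
    return np.stack([x**m for m in range(M)], axis=-1)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)                    # N = 20 training inputs
y = np.sin(3 * X) + 0.1 * rng.standard_normal(20)  # noisy training targets
lam = 1e-2                                         # lambda = alpha / beta

Phi = phi(X)                                       # design matrix, N x M
A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])       # M x M matrix to invert

phi_test = phi(np.array(0.3))                      # feature vector of a test input

# Formula (RR): mu = phi^T (Phi^T Phi + lambda I)^{-1} Phi^T y
mu = phi_test @ np.linalg.solve(A, Phi.T @ y)

# Linear-smoother view: the same prediction as a weighted sum of training targets
omega = phi_test @ np.linalg.solve(A, Phi.T)       # weights omega_i
assert np.isclose(mu, omega @ y)
```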
Kernel ridge regression
In the classical regime, the dimensionality of the feature space $M$ is smaller than the number of data points $N$. The matrix $\boldsymbol{\Phi}$ is then skinny, and it is advisable to use Formula \eqref{RR} because it involves inversion of the small $M \times M$ matrix $\boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} + \lambda\mathbf{I}$. However, one may wish to use more expressive high-dimensional representations for which there is no hope of obtaining enough data to reach the classical regime. In this case, $M > N$ and the matrix $\boldsymbol{\Phi}$ is fat. One can still apply Formula \eqref{RR}, but it may be extremely impractical. Luckily, the matrix inversion lemma allows one to shift the transposition sign and invert the $N \times N$ matrix $\boldsymbol{\Phi}\boldsymbol{\Phi}^\mathrm{T}$ instead, which may be significantly cheaper under these circumstances. With the help of the identity
$$
\left( \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\mathrm{T} = \boldsymbol{\Phi}^\mathrm{T} \left( \boldsymbol{\Phi} \boldsymbol{\Phi}^\mathrm{T} + \lambda \mathbf{I} \right)^{-1},
$$
the predictive mean \eqref{RR} can be computed as
\begin{equation} \label{RR_MIL}
\mu_{y|\mathbf{y}} = \boldsymbol{\phi}^\mathrm{T} \boldsymbol{\Phi}^\mathrm{T} \left( \boldsymbol{\Phi} \boldsymbol{\Phi}^\mathrm{T} + \lambda \mathbf{I} \right)^{-1} \mathbf{y}.
\end{equation}
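The push-through identity used in this step is easy to verify numerically; the following sketch checks it for a random fat $\boldsymbol{\Phi}$ (the dimensions and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 5, 8, 0.1                    # more features than data points: Phi is fat
Phi = rng.standard_normal((N, M))

# (Phi^T Phi + lambda I)^{-1} Phi^T, computed by inverting an M x M matrix
left = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T)
# Phi^T (Phi Phi^T + lambda I)^{-1}, computed by inverting an N x N matrix
right = Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(N))

assert np.allclose(left, right)
```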
Formula \eqref{RR_MIL} is remarkable not only because it allows us to invert a smaller matrix but also because it can be expressed entirely in terms of inner products of feature vectors, without requiring access to the feature vectors themselves. This is the basis of the so-called kernel trick. By introducing the kernel function
$$
k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}'),
$$
we can rewrite \eqref{RR_MIL} in the kernelized form as
\begin{equation} \label{KRR}
\mu_{y|\mathbf{y}} = \mathbf{k}^\mathrm{T} \left( \mathbf{K} + \lambda \mathbf{I} \right)^{-1} \mathbf{y},
\end{equation}
where $\mathbf{k} = \mathbf{k}(\mathbf{x}) \in \mathbb{R}^N$ is a vector with elements $k_i(\mathbf{x}) = k(\mathbf{x}_i, \mathbf{x})$ and $\mathbf{K} \in \mathbb{R}^{N \times N}$ is the Gram matrix of the set of vectors $\boldsymbol{\phi}_i = \boldsymbol{\phi}(\mathbf{x}_i)$ with elements $K_{ij} = \boldsymbol{\phi}_i^\mathrm{T} \boldsymbol{\phi}_j$. Formula \eqref{KRR}, referred to as kernel ridge regression, has a wider scope of applicability than the ridge regression formulas \eqref{RR} and \eqref{RR_MIL} we started with. Indeed, Formula \eqref{KRR} does not restrict one to finite-dimensional feature vectors but allows for infinite-dimensional ones, opening the door to the beautiful theory of reproducing kernel Hilbert spaces.
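As a sanity check, the sketch below evaluates Formula \eqref{KRR} using only kernel evaluations and confirms that it reproduces the weight-space prediction \eqref{RR}; the finite-dimensional polynomial feature map and the toy data are the same arbitrary choices as in the earlier sketch.

```python
import numpy as np

def phi(x, M=4):
    """Hypothetical finite-dimensional feature map (polynomial), used only for the check."""
    return np.stack([x**m for m in range(M)], axis=-1)

def kernel(x1, x2):
    """k(x, x') = phi(x)^T phi(x'), the kernel induced by the feature map above."""
    return phi(x1) @ phi(x2).T

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(20)
lam = 1e-2
x_test = np.array(0.3)

# Formula (KRR): mu = k^T (K + lambda I)^{-1} y -- only kernel evaluations are needed
K = kernel(X, X)                                   # Gram matrix, N x N
k_vec = kernel(X, x_test)                          # k_i = k(x_i, x)
mu_krr = k_vec @ np.linalg.solve(K + lam * np.eye(len(X)), y)

# Formula (RR) in weight space gives the same prediction
Phi = phi(X)
mu_rr = phi(x_test) @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
assert np.isclose(mu_krr, mu_rr)
```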
Gaussian process regression
The final piece of the puzzle is to derive the formula for the predictive mean in the Gaussian process model and convince ourselves that it coincides with the prediction \eqref{KRR} given by kernel ridge regression. Starting with the likelihood
$$
p(\mathbf{y}|\mathbf{f}) = \mathcal{N}\left( \mathbf{y} \,\middle|\, \mathbf{f}, \lambda \mathbf{I} \right),
$$
where $\mathbf{f} = (f_1,\dots,f_N)^\mathrm{T}$ is a vector of evaluations $f_i = f(\mathbf{x}_i)$ of the function $f$ at every point in the dataset, and combining it with a Gaussian process prior over functions $f$ concisely expressed as $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K})$, we find the marginal distribution $p(\mathbf{y}) = \mathcal{N}\left( \mathbf{y} | \mathbf{0}, \mathbf{K} + \lambda \mathbf{I} \right)$, which is crucial for making predictions.
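For completeness, the marginal follows from the standard result for marginalizing a linear-Gaussian model (see Bishop, Formula 2.115):
$$
p(\mathbf{y}) = \int \mathcal{N}\left( \mathbf{y} \,\middle|\, \mathbf{f}, \lambda \mathbf{I} \right) \mathcal{N}\left( \mathbf{f} \,\middle|\, \mathbf{0}, \mathbf{K} \right) \mathrm{d}\mathbf{f} = \mathcal{N}\left( \mathbf{y} \,\middle|\, \mathbf{0}, \mathbf{K} + \lambda \mathbf{I} \right).
$$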
Prediction in the Gaussian process model is done via conditioning. In order to find the predictive distribution $p(y|\mathbf{y})$, one conditions the joint distribution
$$
p\left( \begin{bmatrix} \mathbf{y} \\ y \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} \mathbf{y} \\ y \end{bmatrix} \,\middle|\, \mathbf{0}, \begin{bmatrix} \mathbf{K} + \lambda \mathbf{I} & \mathbf{k} \\ \mathbf{k}^\mathrm{T} & k(\mathbf{x}, \mathbf{x}) + \lambda \end{bmatrix} \right)
$$
on the observed targets $\mathbf{y}$, which yields (see Bishop, Formula 6.66)
\begin{equation} \label{GPR}
\mu_{y|\mathbf{y}} = \mathbf{k}^\mathrm{T} \left( \mathbf{K} + \lambda \mathbf{I} \right)^{-1} \mathbf{y}
\end{equation}
for the predictive mean. Needless to say, Formula \eqref{GPR} for Gaussian process regression is exactly the same as Formula \eqref{KRR} for kernel ridge regression.
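To close the loop numerically, the sketch below conditions the joint Gaussian above for a concrete kernel and toy data (a squared-exponential kernel with an arbitrary length scale, chosen purely for illustration): the predictive mean is computed by the very expression \eqref{KRR}/\eqref{GPR}, while the Gaussian process additionally returns a predictive variance, the extra information that kernel ridge regression does not provide.

```python
import numpy as np

def kernel(x1, x2, ell=0.5):
    """An example squared-exponential kernel with length scale ell."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(20)
lam = 1e-2
x_test = np.array([0.3])

K = kernel(X, X)                            # Gram matrix of the training inputs
k_vec = kernel(X, x_test)[:, 0]             # k_i = k(x_i, x)
k_ss = kernel(x_test, x_test)[0, 0]         # k(x, x)

C = K + lam * np.eye(len(X))                # covariance of the observed targets y

# Predictive mean, Formula (GPR) -- identical to kernel ridge regression (KRR)
mu = k_vec @ np.linalg.solve(C, y)

# Predictive variance -- the extra piece a Gaussian process provides on top of KRR
var = k_ss + lam - k_vec @ np.linalg.solve(C, k_vec)

print(f"prediction {mu:.3f} +/- {np.sqrt(var):.3f}")
```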
We conclude that Gaussian process conditioning results in kernel ridge regression for the conditional mean in the same way as plain Gaussian conditioning results in linear regression.