Kernel Least-Squares

  • Motivations

  • The kernel trick

  • Nonlinear case

  • Examples of kernels

  • Kernels in practice

Motivations

Consider a linear auto-regressive model for a time series, where y_t is a linear function of y_{t-1}, y_{t-2}

 y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, \;\; t=1,\ldots,T.

This can be written as y_t = w^T x_t, with x_t the ‘‘feature vectors’’

 x_t := \left( 1, y_{t-1}, y_{t-2}\right), \;\; t=1,\ldots,T.

We can fit this model based on historical data via least-squares:

 \min_w : \|X^T w - y\|_2^2,

where X is the matrix whose columns are the feature vectors x_1,\ldots,x_T, and y := (y_1,\ldots,y_T) is the vector of observed values.

The associated prediction rule is \hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
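
As an illustration, here is a minimal numpy sketch of this fit (the series values are made up; the rows of the matrix below are the feature vectors x_t^T, the transpose of the convention used in the text):

  import numpy as np

  # Made-up historical series (any 1-D array of observations would do here).
  y = np.array([1.0, 1.3, 1.1, 1.6, 1.4, 1.8, 1.7, 2.0])
  T = len(y)

  # Feature vectors x_t = (1, y_{t-1}, y_{t-2}), stacked as rows, with targets y_t.
  X = np.column_stack([np.ones(T - 2), y[1:T-1], y[0:T-2]])
  targets = y[2:]

  # Least-squares fit of the weights w.
  w, *_ = np.linalg.lstsq(X, targets, rcond=None)

  # One-step-ahead prediction: y_hat_{T+1} = w^T x_{T+1}, with x_{T+1} = (1, y_T, y_{T-1}).
  x_next = np.array([1.0, y[-1], y[-2]])
  y_hat = w @ x_next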

We can introduce a non-linear version, where y_t is a quadratic function of y_{t-1},y_{t-2}

 y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2 .

This can be written as y_t = w^T\phi(x_t), with \phi(x_t) the augmented feature vectors

 \phi(x_t) := \left( 1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2\right).

Everything is the same as before, with x replaced by \phi(x).
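
Continuing the same toy example, here is a short sketch of the quadratic case (same made-up series; phi below is the augmented feature map defined above):

  import numpy as np

  def phi(x):
      # Quadratic feature map: (1, a, b) -> (1, a, b, a^2, a*b, b^2).
      _, a, b = x
      return np.array([1.0, a, b, a * a, a * b, b * b])

  y = np.array([1.0, 1.3, 1.1, 1.6, 1.4, 1.8, 1.7, 2.0])
  T = len(y)
  X = np.column_stack([np.ones(T - 2), y[1:T-1], y[0:T-2]])   # rows are x_t
  targets = y[2:]

  # Same least-squares fit as before, with each x_t replaced by phi(x_t).
  Phi = np.vstack([phi(row) for row in X])
  w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
  y_hat = w @ phi(np.array([1.0, y[-1], y[-2]]))              # quadratic prediction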

However, the size of the least-squares problem grows quickly with the degree of the feature vectors. How do we solve it in a computationally efficient manner?

The kernel trick

We exploit a simple fact: in the regularized least-squares problem

 \min_w : \|X^T w - y\|_2^2 + \lambda \|w\|_2^2

(where \lambda > 0 is a regularization parameter), the optimal w lies in the span of the data points x_1,\ldots,x_m:

 w = X v

for some vector v \in \mathbf{R}^m. Indeed, from the fundamental theorem of linear algebra, every w \in \mathbf{R}^n can be written as the sum of two orthogonal vectors:

 w = Xv + r

where X^T r = 0 (that is, r is in the nullspace {\cal N}(X^T)). Since X^T r = 0, the component r does not change the residual, X^T w - y = X^T X v - y, while it can only increase the penalty term, since \|w\|_2^2 = \|Xv\|_2^2 + \|r\|_2^2; so r = 0 at the optimum.

Hence the least-squares problem depends only on K := X^T X:

 \min_v : \|Kv - y\|_2^2 + \lambda v^T K v.

The prediction rule depends only on the scalar products between the new point x and the data points x_1,\ldots,x_m:

 w^T x = v^T X^T x = v^T k, \;\; k := X^T x = (x^T x_1, \ldots, x^T x_m).

Once K is formed (this takes O(m^2 n) operations, each entry being a scalar product of two vectors in \mathbf{R}^n), the training problem has only m variables. When n \gg m, this leads to a dramatic reduction in problem size.
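
As a minimal sketch of the above (sizes and data are made up): for \lambda>0, any v solving (K+\lambda I)v = y is optimal for the kernelized problem, since the gradient 2K((K+\lambda I)v - y) then vanishes. The code forms K, solves that system, and predicts via v^T k:

  import numpy as np

  rng = np.random.default_rng(0)
  n, m, lam = 500, 20, 0.1                 # illustrative sizes, with n >> m
  X = rng.standard_normal((n, m))          # columns are the data points x_1, ..., x_m
  y = rng.standard_normal(m)

  # Kernel matrix of scalar products, K = X^T X (m x m).
  K = X.T @ X

  # Training: solve (K + lam*I) v = y, which minimizes ||Kv - y||^2 + lam * v^T K v.
  v = np.linalg.solve(K + lam * np.eye(m), y)

  # Prediction at a new point x: w^T x = v^T k, with k = X^T x.
  x_new = rng.standard_normal(n)
  y_pred = v @ (X.T @ x_new)

  # Sanity check against the primal weight vector w = X v.
  assert np.isclose(y_pred, (X @ v) @ x_new)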

Nonlinear case

In the nonlinear case, we simply replace the feature vectors x_i by some ‘‘augmented’’ feature vectors \phi(x_i), with \phi a non-linear mapping.

This leads to the modified kernel matrix

 K_{ij} = \phi(x_i)^T\phi(x_j), \;\; 1 \le i,j \le m.

The kernel function associated with the mapping \phi is

 k(x,z) = \phi(x)^T\phi(z).

It provides information about the metric in the feature space, e.g.:

 \|\phi(x)-\phi(z)\|_2^2 = k(x,x) - 2 k(x,z) + k(z,z).

The computational effort involved in

  1. solving the training problem;

  2. making a prediction,

depends only on our ability to quickly evaluate such scalar products. We cannot choose k arbitrarily: it has to satisfy k(x,z) = \phi(x)^T\phi(z) for some mapping \phi; equivalently, every kernel matrix K built from k must be positive semidefinite.
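
To make this concrete, here is a minimal sketch (made-up data and sizes; the polynomial kernel of degree 2 is used as an illustrative choice of k) in which training and prediction only touch the data through k(x,z), never through \phi:

  import numpy as np

  def k(x, z, d=2):
      # Polynomial kernel k(x, z) = (1 + x^T z)^d.
      return (1.0 + x @ z) ** d

  rng = np.random.default_rng(1)
  n, m, lam = 5, 30, 0.1
  xs = [rng.standard_normal(n) for _ in range(m)]    # data points x_1, ..., x_m
  y = rng.standard_normal(m)

  # Kernel matrix K_ij = k(x_i, x_j); phi is never formed explicitly.
  K = np.array([[k(xi, xj) for xj in xs] for xi in xs])

  # Training, exactly as in the linear case.
  v = np.linalg.solve(K + lam * np.eye(m), y)

  # Prediction at a new point x: v^T k_vec, with k_vec_i = k(x, x_i).
  x_new = rng.standard_normal(n)
  y_pred = v @ np.array([k(x_new, xi) for xi in xs])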

Examples of kernels

A variety of kernels are available. Some are adapted to the structure of data, for example text or images. Here are a few popular choices.

Polynomial kernels

Regression with quadratic functions involves feature vectors

  \phi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).

In fact, given two vectors x, z \in \mathbf{R}^2, and after rescaling some components of \phi (taking \phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2), which represents the same set of quadratic functions), we have

 \phi(x)^T\phi(z) = (1+x^Tz)^2.

More generally, when \phi(x) is the vector formed with all the monomials in the components of x \in \mathbf{R}^n up to degree d (again suitably scaled), then for any two vectors x, z \in \mathbf{R}^n,

 \phi(x)^T\phi(z) = (1+x^Tz)^d.

The computational effort to evaluate the kernel grows only linearly in n.

This represents a dramatic reduction in computational effort over the ‘‘brute force’’ approach:

  1. form \phi(x), \phi(z);

  2. evaluate \phi(x)^T\phi(z).

In the brute-force approach the computational effort grows as n^d, since the number of monomials of degree up to d grows at that rate. A quick numerical check of the degree-2 identity is sketched below.
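
As announced, a quick numerical check of the degree-2 identity (using the rescaled feature map mentioned above; the test vectors are arbitrary):

  import numpy as np

  def phi_scaled(x):
      # Scaled quadratic features in R^2, so that phi(x)^T phi(z) = (1 + x^T z)^2.
      s = np.sqrt(2.0)
      return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, s * x[0] * x[1], x[1] ** 2])

  x = np.array([0.3, -1.2])
  z = np.array([2.0, 0.5])

  lhs = phi_scaled(x) @ phi_scaled(z)
  rhs = (1.0 + x @ z) ** 2
  assert np.isclose(lhs, rhs)   # the two expressions agree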

Gaussian kernels

Gaussian kernel function:

 k(x,z) = \exp\left(-\frac{\|x-z\|_2^2}{2\sigma^2}\right),

where \sigma>0 is a scale parameter. This allows us to effectively ignore points that are too far apart. It corresponds to a non-linear mapping \phi into an infinite-dimensional feature space.
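
A minimal sketch of the Gaussian kernel and of the corresponding kernel matrix (the points and the value of sigma are made up):

  import numpy as np

  def gaussian_kernel(x, z, sigma=1.0):
      # k(x, z) = exp(-||x - z||_2^2 / (2 sigma^2)).
      return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

  rng = np.random.default_rng(2)
  xs = [rng.standard_normal(3) for _ in range(4)]    # a few points in R^3

  K = np.array([[gaussian_kernel(xi, xj, sigma=0.8) for xj in xs] for xi in xs])
  # Entries are close to 0 for pairs of points that are far apart relative to sigma.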

Other kernels

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).

Kernels in practice

  1. Kernels need to be chosen by the user.

  2. The choice is not always obvious; Gaussian and polynomial kernels are popular.

  3. We control over-fitting via cross-validation (to choose, say, the scale parameter \sigma of the Gaussian kernel, or the degree d of the polynomial kernel), as sketched below.
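
As an illustration of point 3, here is a minimal hold-out sketch (made-up data; a full K-fold cross-validation works the same way) that selects the scale parameter sigma of the Gaussian kernel by validation error, using the kernel least-squares estimator described above:

  import numpy as np

  def gaussian_K(A, B, sigma):
      # Kernel matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) between rows of A and B.
      d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
      return np.exp(-d2 / (2.0 * sigma ** 2))

  rng = np.random.default_rng(3)
  X = rng.standard_normal((60, 2))                     # rows are data points
  y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)  # made-up nonlinear target

  Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]  # simple hold-out split
  lam = 0.1

  best = None
  for sigma in [0.1, 0.3, 1.0, 3.0, 10.0]:
      K = gaussian_K(Xtr, Xtr, sigma)
      v = np.linalg.solve(K + lam * np.eye(len(ytr)), ytr)
      err = np.mean((gaussian_K(Xva, Xtr, sigma) @ v - yva) ** 2)
      if best is None or err < best[1]:
          best = (sigma, err)
  # best[0] is the selected scale parameter.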