Kernel Least-Squares

  • Motivations

  • The kernel trick

  • Nonlinear case

  • Examples of kernels

  • Kernels in practice

Motivations

Consider a linear auto-regressive model for a time series, where y_t is a linear function of y_{t-1}, y_{t-2}

 y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, \;\; t=1,\ldots,T.

This can be written as y_t = w^T x_t, with x_t the ‘‘feature vectors’’

 x_t := \left( 1, y_{t-1}, y_{t-2}\right), \;\; t=1,\ldots,T.

We can fit this model based on historical data via least-squares:

 \min_w : \|X^T w - y\|_2^2,

where X is the matrix whose columns are the feature vectors x_1,\ldots,x_T, and y := (y_1,\ldots,y_T) is the vector of observed values.

The associated prediction rule is \hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
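
As an illustration, here is a minimal numpy sketch of this fit (the series values are made up; the rows of the matrix below are the feature vectors x_t^T, the transpose of the convention used in the text):

  import numpy as np

  # Made-up historical series (any 1-D array of observations would do here).
  y = np.array([1.0, 1.3, 1.1, 1.6, 1.4, 1.8, 1.7, 2.0])
  T = len(y)

  # Feature vectors x_t = (1, y_{t-1}, y_{t-2}), stacked as rows, with targets y_t.
  X = np.column_stack([np.ones(T - 2), y[1:T-1], y[0:T-2]])
  targets = y[2:]

  # Least-squares fit of the weights w.
  w, *_ = np.linalg.lstsq(X, targets, rcond=None)

  # One-step-ahead prediction: y_hat_{T+1} = w^T x_{T+1}, with x_{T+1} = (1, y_T, y_{T-1}).
  x_next = np.array([1.0, y[-1], y[-2]])
  y_hat = w @ x_next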

We can introduce a non-linear version, where y_t is a quadratic function of y_{t-1},y_{t-2}

 y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2 .

This can be written as y_t = w^T\phi(x_t), with \phi(x_t) the augmented feature vectors

 \phi(x_t) := \left( 1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2\right).

Everything is the same as before, with x replaced by \phi(x).
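
Continuing the same toy example, here is a short sketch of the quadratic case (same made-up series; phi below is the augmented feature map defined above):

  import numpy as np

  def phi(x):
      # Quadratic feature map: (1, a, b) -> (1, a, b, a^2, a*b, b^2).
      _, a, b = x
      return np.array([1.0, a, b, a * a, a * b, b * b])

  y = np.array([1.0, 1.3, 1.1, 1.6, 1.4, 1.8, 1.7, 2.0])
  T = len(y)
  X = np.column_stack([np.ones(T - 2), y[1:T-1], y[0:T-2]])   # rows are x_t
  targets = y[2:]

  # Same least-squares fit as before, with each x_t replaced by phi(x_t).
  Phi = np.vstack([phi(row) for row in X])
  w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
  y_hat = w @ phi(np.array([1.0, y[-1], y[-2]]))              # quadratic prediction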

However, the size of the least-squares problem grows quickly with the degree of the feature vectors. How do we solve it in a computationally efficient manner?

The kernel trick

We exploit a simple fact: in the regularized least-squares problem

 \min_w : \|X^T w - y\|_2^2 + \lambda \|w\|_2^2

(where \lambda > 0 is a regularization parameter), the optimal w lies in the span of the data points x_1,\ldots,x_m:

 w = X v

for some vector v \in \mathbf{R}^m. Indeed, from the fundamental theorem of linear algebra, every w \in \mathbf{R}^n can be written as the sum of two orthogonal vectors:

 w = Xv + r

where X^T r = 0 (that is, r is in the nullspace {\cal N}(X^T)). Since X^T r = 0, the component r does not change the residual, X^T w - y = X^T X v - y, while it can only increase the penalty term, since \|w\|_2^2 = \|Xv\|_2^2 + \|r\|_2^2; so r = 0 at the optimum.

Hence the least-squares problem depends only on K := X^T X:

 \min_v : \|Kv - y\|_2^2 + \lambda v^T K v.

The prediction rule depends only on the scalar products between the new point x and the data points x_1,\ldots,x_m:

 w^T x = v^T X^T x = v^T k, \;\; k := X^T x = (x^T x_1, \ldots, x^T x_m).

Once K is formed (this takes O(m^2 n) operations, each entry being a scalar product of two vectors in \mathbf{R}^n), the training problem has only m variables. When n \gg m, this leads to a dramatic reduction in problem size.
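
As a minimal sketch of the above (sizes and data are made up): for \lambda>0, any v solving (K+\lambda I)v = y is optimal for the kernelized problem, since the gradient 2K((K+\lambda I)v - y) then vanishes. The code forms K, solves that system, and predicts via v^T k:

  import numpy as np

  rng = np.random.default_rng(0)
  n, m, lam = 500, 20, 0.1                 # illustrative sizes, with n >> m
  X = rng.standard_normal((n, m))          # columns are the data points x_1, ..., x_m
  y = rng.standard_normal(m)

  # Kernel matrix of scalar products, K = X^T X (m x m).
  K = X.T @ X

  # Training: solve (K + lam*I) v = y, which minimizes ||Kv - y||^2 + lam * v^T K v.
  v = np.linalg.solve(K + lam * np.eye(m), y)

  # Prediction at a new point x: w^T x = v^T k, with k = X^T x.
  x_new = rng.standard_normal(n)
  y_pred = v @ (X.T @ x_new)

  # Sanity check against the primal weight vector w = X v.
  assert np.isclose(y_pred, (X @ v) @ x_new)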

Nonlinear case

In the nonlinear case, we simply replace the feature vectors x_i by some ‘‘augmented’’ feature vectors \phi(x_i), with \phi a non-linear mapping.

This leads to the modified kernel matrix

 K_{ij} = \phi(x_i)^T\phi(x_j), \;\; 1 \le i,j \le m.

The kernel function associated with the mapping \phi is

 k(x,z) = \phi(x)^T\phi(z).

It provides information about the metric in the feature space, e.g.:

 \|\phi(x)-\phi(z)\|_2^2 = k(x,x) - 2 k(x,z) + k(z,z).

The computational effort involved in

  1. solving the training problem;

  2. making a prediction,

depends only on our ability to quickly evaluate such scalar products. We cannot choose k arbitrarily: it has to satisfy k(x,z) = \phi(x)^T\phi(z) for some mapping \phi; equivalently, every kernel matrix K built from k must be positive semidefinite.
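
To make this concrete, here is a minimal sketch (made-up data and sizes; the polynomial kernel of degree 2 is used as an illustrative choice of k) in which training and prediction only touch the data through k(x,z), never through \phi:

  import numpy as np

  def k(x, z, d=2):
      # Polynomial kernel k(x, z) = (1 + x^T z)^d.
      return (1.0 + x @ z) ** d

  rng = np.random.default_rng(1)
  n, m, lam = 5, 30, 0.1
  xs = [rng.standard_normal(n) for _ in range(m)]    # data points x_1, ..., x_m
  y = rng.standard_normal(m)

  # Kernel matrix K_ij = k(x_i, x_j); phi is never formed explicitly.
  K = np.array([[k(xi, xj) for xj in xs] for xi in xs])

  # Training, exactly as in the linear case.
  v = np.linalg.solve(K + lam * np.eye(m), y)

  # Prediction at a new point x: v^T k_vec, with k_vec_i = k(x, x_i).
  x_new = rng.standard_normal(n)
  y_pred = v @ np.array([k(x_new, xi) for xi in xs])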

Examples of kernels

A variety of kernels are available. Some are adapted to the structure of data, for example text or images. Here are a few popular choices.

Polynomial kernels

Regression with quadratic functions involves feature vectors

  \phi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).

In fact, given two vectors x, z \in \mathbf{R}^2, and after rescaling some components of \phi (taking \phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2), which represents the same set of quadratic functions), we have

 \phi(x)^T\phi(z) = (1+x^Tz)^2.

More generally, when \phi(x) is the vector formed with all the monomials in the components of x \in \mathbf{R}^n up to degree d (again suitably scaled), then for any two vectors x, z \in \mathbf{R}^n,

 \phi(x)^T\phi(z) = (1+x^Tz)^d.

The computational effort to evaluate the kernel grows only linearly in n.

This represents a dramatic reduction in computational effort over the ‘‘brute force’’ approach:

  1. form \phi(x), \phi(z);

  2. evaluate \phi(x)^T\phi(z).

In the brute-force approach the computational effort grows as n^d, since the number of monomials of degree up to d grows at that rate. A quick numerical check of the degree-2 identity is sketched below.
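
As announced, a quick numerical check of the degree-2 identity (using the rescaled feature map mentioned above; the test vectors are arbitrary):

  import numpy as np

  def phi_scaled(x):
      # Scaled quadratic features in R^2, so that phi(x)^T phi(z) = (1 + x^T z)^2.
      s = np.sqrt(2.0)
      return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, s * x[0] * x[1], x[1] ** 2])

  x = np.array([0.3, -1.2])
  z = np.array([2.0, 0.5])

  lhs = phi_scaled(x) @ phi_scaled(z)
  rhs = (1.0 + x @ z) ** 2
  assert np.isclose(lhs, rhs)   # the two expressions agree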

Gaussian kernels

Gaussian kernel function:

 k(x,z) = \exp\left(-\frac{\|x-z\|_2^2}{2\sigma^2}\right),

where \sigma>0 is a scale parameter. This allows us to effectively ignore points that are too far apart. It corresponds to a non-linear mapping \phi into an infinite-dimensional feature space.
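
A minimal sketch of the Gaussian kernel and of the corresponding kernel matrix (the points and the value of sigma are made up):

  import numpy as np

  def gaussian_kernel(x, z, sigma=1.0):
      # k(x, z) = exp(-||x - z||_2^2 / (2 sigma^2)).
      return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

  rng = np.random.default_rng(2)
  xs = [rng.standard_normal(3) for _ in range(4)]    # a few points in R^3

  K = np.array([[gaussian_kernel(xi, xj, sigma=0.8) for xj in xs] for xi in xs])
  # Entries are close to 0 for pairs of points that are far apart relative to sigma.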

Other kernels

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).

Kernels in practice

  1. Kernels need to be chosen by the user.

  2. The choice is not always obvious; Gaussian and polynomial kernels are popular.

  3. We control over-fitting via cross-validation (to choose, say, the scale parameter \sigma of the Gaussian kernel, or the degree d of the polynomial kernel), as sketched below.
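
As an illustration of point 3, here is a minimal hold-out sketch (made-up data; a full K-fold cross-validation works the same way) that selects the scale parameter sigma of the Gaussian kernel by validation error, using the kernel least-squares estimator described above:

  import numpy as np

  def gaussian_K(A, B, sigma):
      # Kernel matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) between rows of A and B.
      d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
      return np.exp(-d2 / (2.0 * sigma ** 2))

  rng = np.random.default_rng(3)
  X = rng.standard_normal((60, 2))                     # rows are data points
  y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)  # made-up nonlinear target

  Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]  # simple hold-out split
  lam = 0.1

  best = None
  for sigma in [0.1, 0.3, 1.0, 3.0, 10.0]:
      K = gaussian_K(Xtr, Xtr, sigma)
      v = np.linalg.solve(K + lam * np.eye(len(ytr)), ytr)
      err = np.mean((gaussian_K(Xva, Xtr, sigma) @ v - yva) ** 2)
      if best is None or err < best[1]:
          best = (sigma, err)
  # best[0] is the selected scale parameter.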