Linear Binary Classification

  • Binary classification

  • Linear binary classification

  • Encouraging sparsity

  • Robustness

Binary classification problems

Where do they arise?

Binary classification problems arise when we seek to separate two sets of data points in \mathbf{R}^n, each set corresponding to a given class. The two data sets are separated using simple ‘‘boundaries’’ in \mathbf{R}^n, typically hyperplanes. Once the boundary is found, we can use it to predict the class a new point belongs to, by simply checking on which side of the boundary it falls.

Such problems arise in many situations:

  • In recommendation systems, say for movies, we have information about a given user's movie preferences. Hence we may separate the whole movie database into two sets (movies that the user likes, and the others). The goal of classification is to determine whether the user will like a given new movie.

  • In spam filtering, we may have a set of emails which we know are spam, and another set of emails that are known to be legitimate.

  • In fault detection, we have a set of past signals, some of which are known to correspond to a fault, and others known to correspond to a fault-free situation. The goal of classification is to predict whether a new measured signal corresponds to a fault.

  • In document classification, we may be interested in determining whether a given news article is about a given topic of interest or not.

  • In time-series prediction, we may try to determine wether a future value (such as a stock price) will increase or decrease with respect to its current value.

Features

Each data point in the classification problem is a vector that contains a numerical representation of the features that we use to make a class prediction. Features may include:

  • frequencies (or some other numerical score) of given words in a dictionary (see the bag-of-words representation of text);

  • Boolean variables that indicate the presence or absence of a specific feature (such as whether a specific actor played in the movie that the data point represents);

  • Numerical values such as blood pressure, temperature, prices, etc.

Basics of linear classification

Assume we are given a collection of data points x_i \in \mathbf{R}^n, i=1,\ldots,m, each of which comes with a label y_i \in \{-1,+1\} that determines which class it belongs to.

The linear binary classification problem involves a ‘‘linear boundary’’, that is, a hyperplane. A hyperplane can be described via the equation

 a^T x = b,

for some a \in \mathbf{R}^n and b \in \mathbf{R}.

Such a hyperplane is said to correctly classify these two sets if all data points with y_i = +1 fall on one side (hence a^T x_i \ge b) and all the others on the other side (hence a^T x_i \le b). Hence, the affine inequalities on (a,b):

 y_i(a^T x_i - b) \ge 0, \;\; i=1,\ldots,m,

guarantee correct classification. The above constitute linear (in fact, affine) inequalities on the decision variable (a,b) \in \mathbf{R}^{n+1}. This fact is the basis on which we can build a linear programming solution to a classification problem.


Figure: a linear classifier separating two sets of points in \mathbf{R}^2. In two dimensions, a hyperplane corresponds to a simple line; in this example, the data sets are linearly separable.

Classification rule

Once a classifier (a,b) is found, we can classify a new point x \in \mathbf{R}^n by assigning to it the label

 \hat{y} := \mbox{\bf sign}(a^T x - b).

The above constitutes the classification rule.
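
As a concrete illustration, the classification rule amounts to a one-line computation. The sketch below uses numpy; the name X_new, holding the new points as rows, is a hypothetical convention for this example.

    import numpy as np

    def classify(X_new, a, b):
        # Predicted labels hat{y} = sign(a^T x - b), applied to each row x of X_new.
        return np.sign(X_new @ a - b)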

Strict separability

The data is strictly separable (meaning that the separating hyperplane does not contain any data points) if and only if the above inequalities can be made strict for some choice of (a,b). If that is the case, then we can always scale the variable (a,b) and obtain the inequalities:

 y_i(a^T x_i - b) \ge 1, \;\; i=1,\ldots,m.
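
As an illustration, the sketch below checks strict separability by solving the corresponding feasibility problem with the cvxpy modeling package (a tooling assumption; the synthetic data and variable names are hypothetical).

    import numpy as np
    import cvxpy as cp

    # Hypothetical data: m points in R^n (rows of X), labels y in {-1, +1}.
    rng = np.random.default_rng(0)
    m, n = 40, 2
    X = np.vstack([rng.uniform(1, 2, (m // 2, n)),      # class +1
                   rng.uniform(-2, -1, (m // 2, n))])   # class -1
    y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

    a = cp.Variable(n)
    b = cp.Variable()
    # Feasibility LP: find (a, b) with y_i (a^T x_i - b) >= 1 for all i.
    prob = cp.Problem(cp.Minimize(0), [cp.multiply(y, X @ a - b) >= 1])
    prob.solve()
    print(prob.status)   # 'optimal' exactly when the data is strictly separable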

Case of non-separable data sets

The previous constraints are feasible if and only if the data is strictly separable. If the data is not strictly separable, we can allow for errors in the inequalities, and minimize the sum of these errors.

We obtain the problem

 \min_{a,b,v} : \sum_{i=1}^m v_i ~:~ v_i \ge 0, \;\; y_i(a^T x_i - b) \ge 1 - v_i, \;\; i=1,\ldots,m.

The geometric interpretation of the above is that we are minimizing the sum of the distances from the wrongly classified points to the separating hyperplane.

In the above problem, we can eliminate the variable v and obtain a formulation involving the minimization of a polyhedral function:

 \min_{a,b} : \sum_{i=1}^m (1 - y_i (a^T x_i - b))_+

where the notation \alpha_+ := \max(\alpha,0) denotes the positive part of a real number \alpha.
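
The sketch below solves this polyhedral (hinge-loss) minimization with cvxpy; the synthetic, non-separable data and the names X, y are assumptions made for the example.

    import numpy as np
    import cvxpy as cp

    # Hypothetical noisy data: only the first feature carries the label signal.
    rng = np.random.default_rng(0)
    m, n = 100, 5
    X = rng.normal(size=(m, n))
    y = np.where(X[:, 0] + rng.normal(size=m) > 0, 1.0, -1.0)   # not separable

    a = cp.Variable(n)
    b = cp.Variable()
    margins = cp.multiply(y, X @ a - b)
    # Minimize the sum of the positive parts (1 - y_i (a^T x_i - b))_+ .
    prob = cp.Problem(cp.Minimize(cp.sum(cp.pos(1 - margins))))
    prob.solve()
    print(prob.value)   # total classification error, in the LP sense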

Feature selection

Motivation

In many cases, a sparse classifier, that is, a vector a with many zeros, is desirable.

Indeed, the classification rule \hat{y} := \mbox{\bf sign}(a^T x - b) involves the scalar product between the classifier vector a and a feature vector x. If a_i = 0 for some i, then the rule ignores the value of the i-th feature when making a prediction about the label. In this sense, the i-th feature is ‘‘not important’’ in the classification task. Thus, the classifier not only allows us to make a prediction, but also singles out the important features in the data. This is referred to as feature selection.

Feature selection via l_1-norm

Assume that the data is separable. We can try to minimize the cardinality (number of non-zero elements) of the classifier vector a, under the constraint that the classifier (a,b) makes no errors on the data set. This is a hard problem.

Instead, we can use the l_1-norm heuristic, which consists of replacing the cardinality objective by the l_1-norm of a. In the separable case, we end up with the constrained polyhedral minimization problem

 \min_{a,b} : \|a\|_1 ~:~ y_i(a^T x_i - b) \ge 1, \;\; i=1,\ldots,m.

The above can be formulated as an LP.
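
A sketch of this l_1-norm LP in cvxpy follows; the separable synthetic data, in which only the first feature is informative, is an assumption made for illustration.

    import numpy as np
    import cvxpy as cp

    # Hypothetical separable data: the label depends only on feature 0.
    rng = np.random.default_rng(0)
    m, n = 60, 20
    X = rng.normal(size=(m, n))
    y = np.where(X[:, 0] > 0, 1.0, -1.0)

    a = cp.Variable(n)
    b = cp.Variable()
    prob = cp.Problem(cp.Minimize(cp.norm(a, 1)),
                      [cp.multiply(y, X @ a - b) >= 1])
    prob.solve()
    print(np.sum(np.abs(a.value) > 1e-6))   # number of features actually used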

If the data is not separable, we need to trade off our concern for sparsity against the error function used above. This is done via the polyhedral minimization problem:

 \min_{a,b} : \sum_{i=1}^m (1 - y_i (a^T x_i - b))_+ + \lambda \|a\|_1,

where \lambda>0 is a penalty parameter that allows us to choose the trade-off.
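
In the non-separable case, the corresponding cvxpy sketch only changes the objective; the data generation and the value lam of the penalty parameter \lambda are assumptions.

    import numpy as np
    import cvxpy as cp

    # Hypothetical non-separable data with many irrelevant features.
    rng = np.random.default_rng(1)
    m, n = 100, 20
    X = rng.normal(size=(m, n))
    y = np.where(X[:, 0] + 0.5 * rng.normal(size=m) > 0, 1.0, -1.0)

    lam = 0.1                      # trade-off parameter lambda (assumed value)
    a = cp.Variable(n)
    b = cp.Variable()
    hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ a - b)))
    prob = cp.Problem(cp.Minimize(hinge + lam * cp.norm(a, 1)))
    prob.solve()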

Robustness

Against box uncertainty

In some cases, the data points are not known exactly. A typical uncertainty model is to assume that each data point x_i is only known to belong to a ‘‘box’’ around a given point \hat{x}_i:

 \|x_i - \hat{x}_i\|_\infty \le \rho,

where \rho>0 is a measure of the size of the uncertainty, and the \hat{x}_i's represent the (known) ‘‘nominal’’ values of the data points.

Let us assume that the nominal data points are strictly separable. A robust classifier corresponds to a hyperplane that separates not only the nominal points \hat{x}_i, but also the boxes around them. That is, a robust hyperplane satisfies, for every i = 1,\ldots,m:

 \forall \, x_i : \|x_i - \hat{x}_i\|_\infty \le \rho ~:~ y_i(a^T x_i - b) \ge 0.

Since the worst-case (smallest) value of y_i(a^T x_i - b) over the box is y_i(a^T \hat{x}_i - b) - \rho \|a\|_1, the above condition is equivalent to

 y_i(a^T \hat{x}_i - b) \ge \rho \|a\|_1, \;\; i=1,\ldots,m.

The above condition guarantees that the data points remain correctly classified even if we perturb the nominal ones, as long as the perturbed points stay in the boxes. This approach requires knowledge of the ‘‘uncertainty size’’ \rho.

If \rho is not known, we can simply find a classifier which maximizes the allowable perturbation level \rho. Using homogeneity, we can always enforce \rho \|a\|_1 = 1. Then, maximizing \rho = 1/\|a\|_1 is equivalent to minimizing the l_1-norm, that is:

 \min_{a,b} : \|a\|_1 ~:~ y_i(a^T \hat{x}_i - b) \ge 1, \;\; i=1,\ldots,m.

This model is the same as the one we saw for feature selection based on the l_1-norm heuristic.

Against spherical uncertainty

We can consider another data uncertainty model, in which the data points are only known to belong to hyper-spheres:

 \|x_i - \hat{x}_i\|_2 \le \rho,

where \rho>0 is a measure of the size of the uncertainty.

Assume again that the nominal data points are strictly separable. The same reasoning as before applies, with the l_1-norm replaced by the l_2-norm: the worst-case value of y_i(a^T x_i - b) over the sphere is y_i(a^T \hat{x}_i - b) - \rho \|a\|_2. Maximizing the allowable perturbation level then leads to the problem

 \min_{a,b} : \|a\|_2 ~:~ y_i(a^T \hat{x}_i - b) \ge 1, \;\; i=1,\ldots,m.

The above can be formulated as a QP, since minimizing \|a\|_2 is equivalent to minimizing \|a\|_2^2.
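
The sketch below solves this problem with cvxpy, using the squared l_2-norm as the equivalent QP objective; the separable nominal data is again a hypothetical example.

    import numpy as np
    import cvxpy as cp

    # Hypothetical strictly separable nominal points (rows of X) and labels y.
    rng = np.random.default_rng(0)
    m, n = 40, 2
    X = np.vstack([rng.uniform(1, 2, (m // 2, n)),
                   rng.uniform(-2, -1, (m // 2, n))])
    y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

    a = cp.Variable(n)
    b = cp.Variable()
    # Minimizing ||a||_2^2 is equivalent to minimizing ||a||_2 and keeps the problem a QP.
    prob = cp.Problem(cp.Minimize(cp.sum_squares(a)),
                      [cp.multiply(y, X @ a - b) >= 1])
    prob.solve()
    rho_max = 1.0 / np.linalg.norm(a.value)   # largest spherical perturbation tolerated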

If the data is not separable, we need to trade off our concern for robustness against the error function used above. This is done via the QP:

 \min_{a,b} : \sum_{i=1}^m (1 - y_i (a^T x_i - b))_+ + \lambda \|a\|_2^2,

where \lambda>0 is a penalty parameter that allows us to choose the trade-off.
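
A cvxpy sketch of this trade-off QP (hinge loss plus squared-norm penalty) is given below; the data generation and the value lam of the penalty parameter \lambda are assumptions.

    import numpy as np
    import cvxpy as cp

    # Hypothetical non-separable data.
    rng = np.random.default_rng(2)
    m, n = 100, 5
    X = rng.normal(size=(m, n))
    y = np.where(X[:, 0] + rng.normal(size=m) > 0, 1.0, -1.0)

    lam = 1.0                      # trade-off parameter lambda (assumed value)
    a = cp.Variable(n)
    b = cp.Variable()
    hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ a - b)))
    prob = cp.Problem(cp.Minimize(hinge + lam * cp.sum_squares(a)))
    prob.solve()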