Linear Binary Classification
Binary classification problems

Where do they arise?

Binary classification problems arise when we seek to separate two sets of data points in $\mathbb{R}^n$, each corresponding to a given class. We seek to separate the two data sets using simple ''boundaries'' in $\mathbb{R}^n$, typically hyperplanes. Once the boundary is found, we can use it to predict the class a new point belongs to, by simply checking on which side of the boundary it falls. Such problems arise in many situations.
Features

Each data point in the classification problem is a vector $x \in \mathbb{R}^n$ that contains a numerical representation of the features we use to make a class prediction. The choice of features is problem-dependent; they may include, for example, word counts in a document or pixel intensities in an image.
Basics of linear classification

Assume we are given a collection of data points $x_i \in \mathbb{R}^n$, $i = 1, \ldots, m$, each of which comes with a label $y_i \in \{-1, +1\}$ that determines which class it belongs to. The linear binary classification problem involves a ''linear boundary'', that is, a hyperplane. A hyperplane can be described via the equation $w^T x + b = 0$ for some $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. Such a hyperplane is said to correctly classify these two sets if all data points with $y_i = +1$ fall on one side (hence $w^T x_i + b \ge 0$) and all the others on the other side (hence $w^T x_i + b \le 0$). Hence, the affine inequalities on $(w, b)$:

$$y_i (w^T x_i + b) \ge 0, \quad i = 1, \ldots, m,$$

guarantee correct classification. The above constitute linear (in fact, affine) inequalities on the decision variables $(w, b)$. This fact is the basis on which we can build a linear programming solution to the classification problem.
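To make the LP connection concrete, here is a minimal numerical sketch of the feasibility problem above, using the cvxpy modeling library on synthetic data; the library choice, the data, and all names are illustrative assumptions, not part of the original notes. It searches for $(w, b)$ satisfying the scaled inequalities $y_i (w^T x_i + b) \ge 1$ (the scaling is justified in the next subsection).

```python
import cvxpy as cp
import numpy as np

# Illustrative synthetic data: two point clouds in R^2 with labels +/-1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),   # class +1
               rng.normal(-2.0, 1.0, size=(20, 2))])  # class -1
y = np.concatenate([np.ones(20), -np.ones(20)])

m, n = X.shape
w = cp.Variable(n)
b = cp.Variable()

# Feasibility LP: find (w, b) with y_i (w^T x_i + b) >= 1 for all i.
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0), constraints)
problem.solve()

if problem.status == cp.OPTIMAL:
    print("separating hyperplane found:", w.value, b.value)
else:
    print("data is not strictly separable (status: %s)" % problem.status)
```

Any LP solver could be used instead; cvxpy is only one convenient front end for writing the constraints as stated.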
Classification rule

Once a classifier $(w, b)$ is found, we can classify a new point $x$ by assigning to it the label

$$\hat{y} = \mathrm{sign}(w^T x + b).$$

The above constitutes the classification rule.

Strict separability

The data is strictly separable (meaning that the separating hyperplane does not contain any data points) if and only if the above inequalities can be made strict for some choice of $(w, b)$. If that is the case, then we can always scale the variables and obtain the inequalities

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, m.$$

Case of non-separable data sets

The previous constraints are feasible if and only if the data is strictly separable. If the data is not strictly separable, we can allow for errors in the inequalities, and minimize the sum of these errors. We obtain the problem

$$\min_{w, b, v} \; \sum_{i=1}^m v_i ~:~ y_i (w^T x_i + b) \ge 1 - v_i, \;\; v_i \ge 0, \;\; i = 1, \ldots, m.$$

The geometric interpretation of the above is that we are minimizing the sum of the distances from the wrongly classified points to the separating hyperplane. In the above problem, we can eliminate the variable $v$ and obtain a formulation involving the minimization of a polyhedral function:

$$\min_{w, b} \; \sum_{i=1}^m \left( 1 - y_i (w^T x_i + b) \right)_+,$$

where the notation $z_+ := \max(z, 0)$ denotes the positive part of a real number $z$. (A numerical sketch of this formulation appears at the end of this section.)

Feature selection

Motivation

In many cases, a sparse classifier, that is, a vector $w$ with many zeros, is desirable. Indeed, the classification rule involves the scalar product between the classifier vector $w$ and a feature vector $x$. If $w_j = 0$ for some $j$, then the rule ignores the value of the $j$-th feature to make a prediction about the label. In this sense, the $j$-th feature is ''not important'' in the classification task. Thus, the classifier allows us not only to make a prediction, but also to single out important features in the data. This is referred to as feature selection.

Feature selection via the $\ell_1$-norm

Assume that the data is separable. We can try to minimize the cardinality of (the number of non-zero elements in) the classifier vector $w$, under the constraint that the classifier makes no errors on the data set. This is a hard problem. Instead, we can use the $\ell_1$-norm heuristic, which consists of replacing the cardinality objective by the $\ell_1$-norm of $w$. In the separable case, we end up with the constrained polyhedral minimization problem

$$\min_{w, b} \; \|w\|_1 ~:~ y_i (w^T x_i + b) \ge 1, \;\; i = 1, \ldots, m.$$

The above can be formulated as an LP. If the data is not separable, we need to trade off our concern for sparsity against the error function we used above. This is done via the polyhedral minimization problem

$$\min_{w, b} \; \sum_{i=1}^m \left( 1 - y_i (w^T x_i + b) \right)_+ + \lambda \|w\|_1,$$

where $\lambda > 0$ is a penalty parameter that allows us to choose our trade-off.

Robustness

Against box uncertainty

In some cases, the data points are not known exactly. A typical uncertainty model is to assume that each data point $x_i$ is only known to belong to a ''box'' around a given point $\hat{x}_i$:

$$x_i \in \left\{ \hat{x}_i + u ~:~ \|u\|_\infty \le \rho \right\},$$

where $\rho \ge 0$ is a measure of the size of the uncertainty, and the $\hat{x}_i$'s represent the (known) ''nominal'' values of the data points. Let us assume that the nominal data points are strictly separable. A robust classifier corresponds to a hyperplane that separates not only the nominal points $\hat{x}_i$, but also the boxes around them. That is, a robust hyperplane satisfies, for every $i = 1, \ldots, m$:

$$y_i (w^T x_i + b) \ge 1 \;\; \text{for every } x_i \text{ with } \|x_i - \hat{x}_i\|_\infty \le \rho.$$

Since the worst-case value of $y_i w^T u$ over $\|u\|_\infty \le \rho$ is $-\rho \|w\|_1$, the above condition is equivalent to

$$y_i (w^T \hat{x}_i + b) \ge 1 + \rho \|w\|_1, \;\; i = 1, \ldots, m.$$

The above condition guarantees that the data points remain separated even if we perturb the original ones (staying in the boxes). This approach requires knowledge of the ''uncertainty size'' $\rho$. If $\rho$ is not known, we can simply find a classifier which maximizes the allowable perturbation level $\rho$. Using homogeneity of the constraints in $(w, b)$, we can always enforce $\rho \|w\|_1 = 1$. Then, maximizing $\rho = 1 / \|w\|_1$ is equivalent to minimizing the $\ell_1$-norm, that is:

$$\min_{w, b} \; \|w\|_1 ~:~ y_i (w^T \hat{x}_i + b) \ge 1, \;\; i = 1, \ldots, m.$$

This model is the same as the one we saw for feature selection based on the $\ell_1$-norm heuristic.
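The polyhedral error-minimization problem for non-separable data, together with the classification rule, translates directly into code. A minimal sketch under the same illustrative assumptions as before (cvxpy; the function names are placeholders):

```python
import cvxpy as cp
import numpy as np

def hinge_classifier(X, y):
    """Minimize sum_i (1 - y_i (w^T x_i + b))_+ over (w, b)."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    # cp.pos takes the elementwise positive part, z_+ = max(z, 0).
    errors = cp.pos(1 - cp.multiply(y, X @ w + b))
    cp.Problem(cp.Minimize(cp.sum(errors))).solve()
    return w.value, b.value

def predict(w, b, X_new):
    """Classification rule: label a new point by sign(w^T x + b)."""
    return np.sign(X_new @ w + b)
```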
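Similarly, the $\ell_1$-norm feature-selection problem (and, equivalently, the box-robust problem above) can be sketched in a few lines. The penalized, non-separable variant is shown; the default penalty value is an arbitrary placeholder:

```python
import cvxpy as cp

def sparse_classifier(X, y, lam=0.1):
    """Trade off classification errors against sparsity of w:
    minimize sum_i (1 - y_i (w^T x_i + b))_+ + lam * ||w||_1."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    errors = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
    cp.Problem(cp.Minimize(errors + lam * cp.norm1(w))).solve()
    return w.value, b.value
```

Entries of the returned $w$ that are (numerically) zero indicate features that the classification rule ignores.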
Against spherical uncertainty

We can consider another data uncertainty model, in which the data points are only known to belong to hyper-spheres:

$$x_i \in \left\{ \hat{x}_i + u ~:~ \|u\|_2 \le \rho \right\},$$

where $\rho \ge 0$ is a measure of the size of the uncertainty. Assuming again that the nominal data points are strictly separable, and maximizing the allowable perturbation level as before (now the worst-case value of $y_i w^T u$ over $\|u\|_2 \le \rho$ is $-\rho \|w\|_2$), leads to the problem

$$\min_{w, b} \; \|w\|_2 ~:~ y_i (w^T \hat{x}_i + b) \ge 1, \;\; i = 1, \ldots, m.$$

Since minimizing $\|w\|_2$ is the same as minimizing $\|w\|_2^2$, the above can be cast as a QP. If the data is not separable, we need to trade off our concern for robustness against the error function we used above. This is done via the QP

$$\min_{w, b} \; \sum_{i=1}^m \left( 1 - y_i (w^T \hat{x}_i + b) \right)_+ + \lambda \|w\|_2^2,$$

where $\lambda > 0$ is a penalty parameter that allows us to choose our trade-off.
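For the spherical model, here is a minimal sketch of the penalized QP, under the same illustrative assumptions as the earlier sketches (cvxpy; hypothetical function name and placeholder penalty value):

```python
import cvxpy as cp

def robust_sphere_classifier(X, y, lam=1.0):
    """Trade off classification errors against ||w||_2^2 (a QP):
    minimize sum_i (1 - y_i (w^T x_i + b))_+ + lam * ||w||_2^2."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    errors = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
    cp.Problem(cp.Minimize(errors + lam * cp.sum_squares(w))).solve()
    return w.value, b.value
```

Compared with the $\ell_1$-penalized sketch, only the penalty term changes; the spherical uncertainty model replaces the $\ell_1$-norm with the Euclidean norm of $w$.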