Linear classification

Consider the problem of finding a line that separates two data sets in two dimensions. These data sets may represent measurements on two kinds of populations (the first set might contain spam emails, while the second might contain legitimate ones). Each axis represents, say, the frequency of a certain keyword in the email at hand; in our case we look at two keywords, but in practice we might want to involve thousands of possible keywords.

We formalize this as follows. Each data point is given by its coordinates $(z_{i,1}, z_{i,2})$ and has a label $y_i \in \{-1, 1\}$ which determines whether it is spam or not. A (possibly vertical) line in $z$-space is parametrized by a vector $x = (x_1, x_2, x_3) \in \mathbf{R}^3$, via the equation $x_1 z_1 + x_2 z_2 + x_3 = 0$. Such a line is said to correctly classify these two sets if all data points with $y_i = +1$ fall on one side (hence $x_1 z_{i,1} + x_2 z_{i,2} + x_3 \ge 0$) and all the others on the other side (hence $x_1 z_{i,1} + x_2 z_{i,2} + x_3 \le 0$). Hence, the affine inequalities on $x$

$$
y_i (x_1 z_{i,1} + x_2 z_{i,2} + x_3) \ge 0, \;\; i = 1, \ldots, m,
$$

guarantee correct classification. These inequalities can be used as constraints in an optimization problem to derive an "optimal" separating line.
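As a concrete illustration, the sketch below sets up these inequalities as constraints in a small convex program. This is a minimal sketch assuming the cvxpy package is available; the toy data, the margin of $1$ (used in place of $0$ to rule out the trivial solution $x = 0$), and the choice of minimizing the norm of $(x_1, x_2)$ (which yields a maximum-margin line) are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
import cvxpy as cp

# Toy data (hypothetical): rows are keyword frequencies (z_{i,1}, z_{i,2});
# labels y_i in {-1, +1} mark spam (+1) vs. legitimate (-1) emails.
Z = np.array([[2.0, 1.5], [1.8, 2.2], [0.3, 0.2], [0.1, 0.5]])
y = np.array([1, 1, -1, -1])

x = cp.Variable(3)  # x = (x_1, x_2, x_3): line coefficients and offset

# y_i (x_1 z_{i,1} + x_2 z_{i,2} + x_3) >= 1 for every data point.
# The margin of 1 (instead of 0) excludes the trivial solution x = 0.
constraints = [cp.multiply(y, Z @ x[:2] + x[2]) >= 1]

# Among all correctly classifying lines, pick the one with the smallest
# coefficient norm (an assumed objective; it gives the maximum-margin line).
problem = cp.Problem(cp.Minimize(cp.norm(x[:2])), constraints)
problem.solve()

print("line parameters (x_1, x_2, x_3):", x.value)
```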

Once a line is found, a new point (for which we do not have a label) can be classified by checking on which side of the line it falls. This is further discussed here.
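For the classification step itself, a short sketch (a hypothetical helper, not from the original text) simply evaluates the sign of the affine expression at the new point:

```python
def classify(x, z_new):
    # Predicted label: the side of the line x_1 z_1 + x_2 z_2 + x_3 = 0
    # on which the new point falls.
    score = x[0] * z_new[0] + x[1] * z_new[1] + x[2]
    return 1 if score >= 0 else -1

# Example (illustrative): classify a new email with keyword frequencies
# (1.2, 0.9), using the line parameters found in the sketch above.
# label = classify(x.value, (1.2, 0.9))
```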


The image shows a line that separates two sets of points in $\mathbf{R}^2$. An optimization problem can be formed to find such a line, even in more than two dimensions.