Principal Component Analysis
Projection on a line via variance maximization

Consider a data set of points $x_i$, $i = 1, \ldots, m$, in $\mathbb{R}^n$. We can represent this data set as an $n \times m$ matrix $X = [x_1, \ldots, x_m]$, where each $x_i$ is an $n$-vector. The variance maximization problem is to find a direction $u \in \mathbb{R}^n$ such that the sample variance of the corresponding vector of projected values is maximal. Recall that when $u$ is normalized ($\|u\|_2 = 1$), the scalar $u^T x_i$ is the component of $x_i$ along $u$; that is, it corresponds to the projection of $x_i$ on the line passing through the origin with direction $u$. Here, we seek a (normalized) direction $u$ such that the empirical variance of the projected values $u^T x_i$, $i = 1, \ldots, m$, is large. If $\hat{x} = \frac{1}{m} \sum_{i=1}^m x_i$ is the vector of averages of the $x_i$'s, then the average of the projected values is $u^T \hat{x}$. Thus, the direction of maximal variance is one that solves the optimization problem

$$\max_{u \,:\, \|u\|_2 = 1} \ \frac{1}{m} \sum_{i=1}^m \left( u^T x_i - u^T \hat{x} \right)^2 .$$

The above problem can be formulated as

$$\max_{u \,:\, \|u\|_2 = 1} \ u^T \Sigma u ,$$

where $\Sigma = \frac{1}{m} \sum_{i=1}^m (x_i - \hat{x})(x_i - \hat{x})^T$ is the sample covariance matrix of the data. We have seen the above problem before, under the name of the Rayleigh quotient of a symmetric matrix. Solving the problem entails simply finding an eigenvector of the covariance matrix that corresponds to the largest eigenvalue.
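The eigenvector computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the text: the data set (200 points in three dimensions, stretched along the first axis) is an assumption chosen so that a clear direction of maximal variance exists.

```python
import numpy as np

# Illustrative data set (an assumption): m = 200 points in R^3,
# stretched along the first coordinate axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.2])

# Sample covariance matrix Sigma (rows of X are the points x_i)
xbar = X.mean(axis=0)
Sigma = (X - xbar).T @ (X - xbar) / X.shape[0]

# Direction of maximal variance = eigenvector of Sigma for the
# largest eigenvalue; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
u = eigvecs[:, -1]          # top eigenvector, already normalized

# The maximal sample variance equals the largest eigenvalue.
proj = (X - xbar) @ u
assert np.isclose(proj.var(), eigvals[-1])
```

Note that `proj.var()` divides by $m$ (NumPy's default), matching the $\frac{1}{m}$ normalization used for the covariance matrix above.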
Principal component analysis

Main idea

The main idea behind principal component analysis is to first find a direction that corresponds to maximal variance between the data points. The data is then projected on the hyperplane orthogonal to that direction. We obtain a new data set, and find a new direction of maximal variance. We may stop the process when we have collected enough directions (say, three if we want to visualize the data in 3D). It turns out that the directions found in this way are precisely the eigenvectors of the data's covariance matrix. The term principal components refers to the directions given by these eigenvectors. Mathematically, the process thus amounts to finding the eigenvalue decomposition of a positive semi-definite matrix: the covariance matrix of the data points.

Projection on a plane

The projection to use to obtain, say, a two-dimensional view with the largest variance is of the form $x \to P^T x$, where $P = [u_1, u_2] \in \mathbb{R}^{n \times 2}$ is a matrix that contains the eigenvectors corresponding to the first two (largest) eigenvalues.
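The two-dimensional projection can be sketched as follows. This is an illustrative example under assumed data (500 points in five dimensions), not a prescribed implementation; it simply stacks the two top eigenvectors into the matrix $P$ and applies $P^T$ to the centered points.

```python
import numpy as np

# Illustrative data set (an assumption): m = 500 points in R^5
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))

# Center the data and form the sample covariance matrix
xbar = X.mean(axis=0)
Sigma = (X - xbar).T @ (X - xbar) / X.shape[0]

# eigh returns eigenvalues in ascending order; take the last two
# columns and reverse them so column 0 is the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(Sigma)
P = eigvecs[:, -2:][:, ::-1]        # n x 2 matrix of eigenvectors

# Two-dimensional view of the data with the largest variance
Z = (X - xbar) @ P
print(Z.shape)  # (500, 2)
```

The columns of `Z` are uncorrelated, with variances equal to the two largest eigenvalues of the covariance matrix.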
Explained variance

The total variance in the data is defined as the sum of the variances of the individual components. This quantity is simply the trace of the covariance matrix, since the diagonal elements of the latter contain the variances. If $\Sigma$ has the EVD $\Sigma = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ contains the eigenvalues (ordered so that $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$) and $U$ is an orthogonal matrix of eigenvectors, then the total variance can be expressed as the sum of all the eigenvalues:

$$\mathrm{trace}(\Sigma) = \lambda_1 + \cdots + \lambda_n .$$

When we project the data on the two-dimensional plane corresponding to the eigenvectors $u_1, u_2$ associated with the two largest eigenvalues $\lambda_1, \lambda_2$, we get a new covariance matrix $P^T \Sigma P = \mathrm{diag}(\lambda_1, \lambda_2)$, where $P = [u_1, u_2]$, so the total variance of the projected data is $\lambda_1 + \lambda_2$. Hence, we can define the ratio of variance ‘‘explained’’ by the projected data as the ratio

$$\frac{\lambda_1 + \lambda_2}{\lambda_1 + \cdots + \lambda_n} .$$

If the ratio is high, we can say that much of the variation in the data can be observed on the projected plane.
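The explained-variance ratio reduces to a sum of eigenvalues, as a short sketch shows. The data set here (with one dominant coordinate) is an assumption chosen so that the top-2 plane captures most of the variance.

```python
import numpy as np

# Illustrative data set (an assumption): m = 300 points in R^4,
# with most variance in the first two coordinates.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

xbar = X.mean(axis=0)
Sigma = (X - xbar).T @ (X - xbar) / X.shape[0]

# Eigenvalues in descending order
eigvals = np.linalg.eigvalsh(Sigma)[::-1]

# Total variance = trace(Sigma) = sum of all eigenvalues
assert np.isclose(eigvals.sum(), np.trace(Sigma))

# Ratio of variance "explained" by the top-2 plane
explained = eigvals[:2].sum() / eigvals.sum()
print(f"explained variance ratio: {explained:.3f}")
```

For this data, the ratio is close to 1, indicating that a two-dimensional view loses little of the variation.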