Modified Principal Component Analysis:
PCA can be thought of as fitting an n-dimensional ellipsoid
to the data, where each axis of the ellipsoid represents a principal component.
If some axis of the ellipsoid is small, then the variance along that axis is also
small, and omitting that axis loses only a small amount of information.
To find the axes of the ellipsoid, we subtract the mean from
each variable to center the data around the origin. We then compute the covariance
matrix of the data and calculate its eigenvalues and the corresponding
eigenvectors. The eigenvectors are orthogonalized and normalized to unit
length (for a symmetric covariance matrix they are already mutually orthogonal
when the eigenvalues are distinct). The resulting eigenvectors become the axes
of the ellipsoid fitted to the data. The proportion of the variance that each
eigenvector represents is calculated by dividing the eigenvalue corresponding
to that eigenvector by the sum of all the eigenvalues. The eigenvectors are
ordered so that the projection of the data with the greatest variance lies on
the first coordinate, the second greatest variance on the second coordinate,
and so on.
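
As a concrete illustration, here is a minimal sketch of that procedure in Python with NumPy. The function and variable names are ours for illustration, not from any particular library, and the covariance is normalized by 1/m to match the formula used below rather than NumPy's default 1/(m-1):

    import numpy as np

    def pca(X):
        """X: (m, n) data matrix, one sample per row."""
        m = X.shape[0]
        mean = X.mean(axis=0)
        Xc = X - mean                       # center the data around the origin
        cov = (Xc.T @ Xc) / m               # (1/m) sum (x-mu)(x-mu)^T
        eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix: real eigenvalues,
                                                 # orthonormal eigenvector columns
        order = np.argsort(eigvals)[::-1]   # order by decreasing variance
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        explained = eigvals / eigvals.sum() # proportion of variance per axis
        return mean, eigvals, eigvecs, explained

    X = np.random.randn(500, 3) * np.array([5.0, 2.0, 0.2])
    mean, eigvals, eigvecs, explained = pca(X)
    print(explained)   # the first axis should carry most of the variance
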
When we parallelize this algorithm on chunks of data, each
mapper computes the covariance matrix, from which the principal eigenvectors
are extracted, in its summation form:

    Cov = (1/m) * sum_{i=1..m} x_i x_i^T  -  mu mu^T

where the x_i are the data vectors and mu is their mean. The first term is the
average of the outer products of the data vectors, and the second term is built
from the mean. The mean itself can also be expressed in summation form,
mu = (1/m) * sum_{i=1..m} x_i. Therefore the covariance matrix is all in the
summation form.
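
Under that summation form, a mapper only needs to emit three quantities for its chunk: the point count, the chunk mean, and the average outer product. A sketch, assuming each chunk arrives as a NumPy array with one sample per row (covariance_components is a hypothetical helper name, not from the original post):

    import numpy as np

    def covariance_components(chunk):
        """Per-chunk quantities a mapper would emit: the point count m,
        the chunk mean mu, and A = (1/m) sum x_i x_i^T."""
        m = chunk.shape[0]
        mu = chunk.mean(axis=0)
        A = (chunk.T @ chunk) / m   # average outer product of the data vectors
        return m, mu, A
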
The modified algorithm adjusts these summation forms the same
way the mean was adjusted in the earlier examples: by taking a weighted average
of the per-chunk summations, where the weights are the numbers of associated
data points.
In other words:
For the new, now-available data: compute the first and second components of the covariance matrix.
For the older data: adjust the covariance matrix by taking the weighted average of the components, as in the sketch below.
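
A sketch of that adjustment step, reusing the hypothetical covariance_components helper from above. The per-chunk components are combined by a weighted average and only then turned into a covariance matrix:

    import numpy as np

    def merge(old, new):
        """Weighted average of the components; weights are the point counts."""
        m1, mu1, A1 = old
        m2, mu2, A2 = new
        m = m1 + m2
        mu = (m1 * mu1 + m2 * mu2) / m
        A = (m1 * A1 + m2 * A2) / m
        return m, mu, A

    def covariance(stats):
        m, mu, A = stats
        return A - np.outer(mu, mu)   # (1/m) sum x_i x_i^T  -  mu mu^T

    old = covariance_components(np.random.randn(1000, 3))
    new = covariance_components(np.random.randn(200, 3))
    cov = covariance(merge(old, new))

Because the summation components combine exactly, the merged covariance is identical to the one computed over all of the data at once, so the eigenvectors extracted from it are unchanged.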