Modified Principal Component Analysis:
PCA can be thought of as fitting an n-dimensional ellipsoid
to the data, where each axis of the ellipsoid represents a principal component.
If some axis of the ellipsoid is small, then the variance along that axis is also
small, and omitting that axis loses only a small amount of information.
To find the axes of the ellipsoid, we subtract the mean from
each variable to center the data around the origin. We then compute the covariance
matrix of the data and calculate its eigenvalues and the corresponding
eigenvectors. The eigenvectors are orthogonalized and normalized to unit
length (for a symmetric covariance matrix they are already mutually orthogonal
when the eigenvalues are distinct). The resulting eigenvectors become the axes
of the ellipsoid fitted to the data. The proportion of the variance that each
eigenvector represents is calculated by dividing the eigenvalue corresponding
to that eigenvector by the sum of all the eigenvalues. The eigenvectors are
ordered so that the projection of the data with the greatest variance lies on
the first coordinate, the second greatest variance on the second coordinate,
and so on.
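
As a concrete illustration, here is a minimal sketch of that procedure in Python with NumPy. The function and variable names are ours for illustration, not from any particular library, and the covariance is normalized by 1/m to match the formula used below rather than NumPy's default 1/(m-1):

    import numpy as np

    def pca(X):
        """X: (m, n) data matrix, one sample per row."""
        m = X.shape[0]
        mean = X.mean(axis=0)
        Xc = X - mean                       # center the data around the origin
        cov = (Xc.T @ Xc) / m               # (1/m) sum (x-mu)(x-mu)^T
        eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix: real eigenvalues,
                                                 # orthonormal eigenvector columns
        order = np.argsort(eigvals)[::-1]   # order by decreasing variance
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        explained = eigvals / eigvals.sum() # proportion of variance per axis
        return mean, eigvals, eigvecs, explained

    X = np.random.randn(500, 3) * np.array([5.0, 2.0, 0.2])
    mean, eigvals, eigvecs, explained = pca(X)
    print(explained)   # the first axis should carry most of the variance
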
When we parallelize this algorithm on chunks of data, each
mapper computes the covariance matrix, from which the principal eigenvectors
are extracted, in its summation form:

    Cov = (1/m) * sum_{i=1..m} x_i x_i^T  -  mu mu^T

where the x_i are the data vectors and mu is their mean. The first term is the
average of the outer products of the data vectors, and the second term is built
from the mean. The mean itself can also be expressed in summation form,
mu = (1/m) * sum_{i=1..m} x_i. Therefore the covariance matrix is all in the
summation form.
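
Under that summation form, a mapper only needs to emit three quantities for its chunk: the point count, the chunk mean, and the average outer product. A sketch, assuming each chunk arrives as a NumPy array with one sample per row (covariance_components is a hypothetical helper name, not from the original post):

    import numpy as np

    def covariance_components(chunk):
        """Per-chunk quantities a mapper would emit: the point count m,
        the chunk mean mu, and A = (1/m) sum x_i x_i^T."""
        m = chunk.shape[0]
        mu = chunk.mean(axis=0)
        A = (chunk.T @ chunk) / m   # average outer product of the data vectors
        return m, mu, A
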
The modified algorithm adjusts these summation forms the same
way the mean was adjusted in the earlier examples: by taking a weighted average
of the per-chunk summations, where the weights are the numbers of associated
data points.
In other words:
For the new, now-available data: compute the first and second components of the covariance matrix.
For the older data: adjust the covariance matrix by taking the weighted average of the components, as in the sketch below.
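
A sketch of that adjustment step, reusing the hypothetical covariance_components helper from above. The per-chunk components are combined by a weighted average and only then turned into a covariance matrix:

    import numpy as np

    def merge(old, new):
        """Weighted average of the components; weights are the point counts."""
        m1, mu1, A1 = old
        m2, mu2, A2 = new
        m = m1 + m2
        mu = (m1 * mu1 + m2 * mu2) / m
        A = (m1 * A1 + m2 * A2) / m
        return m, mu, A

    def covariance(stats):
        m, mu, A = stats
        return A - np.outer(mu, mu)   # (1/m) sum x_i x_i^T  -  mu mu^T

    old = covariance_components(np.random.randn(1000, 3))
    new = covariance_components(np.random.randn(200, 3))
    cov = covariance(merge(old, new))

Because the summation components combine exactly, the merged covariance is identical to the one computed over all of the data at once, so the eigenvectors extracted from it are unchanged.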