Tuesday, November 10, 2015

Let us look at a few examples of expressing algorithms in summation form. The algorithms were chosen for their popularity at NIPS, and the summation form is the one that suits MapReduce jobs.
1) Locally Weighted Linear Regression (LWLR) This one is solved via the normal equations A * theta = b, where A = sum over i of w_i * x_i * x_i^T and b = sum over i of w_i * y_i * x_i, with x_i the training vector, y_i the training label and w_i the weight. Since there are two different kinds of summation, one set of mappers computes the partial sums for A and another set computes the partial sums for b. Two reducers respectively sum up the partial values for A and b, and the algorithm finally computes the solution theta = A^-1 * b. If the weights are ignored, this problem reduces to the case of ordinary least squares.
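A minimal sketch of this summation form is below, assuming each training example arrives as a tuple of feature vector x, label y and weight w; the class and method names are illustrative rather than taken from any particular MapReduce framework, and the final linear solve is left out.

using System.Collections.Generic;

// Illustrative summation-form LWLR: A = sum_i w_i * x_i * x_i^T, b = sum_i w_i * y_i * x_i.
static class LwlrMapReduce
{
    // Mapper: partial sums of A and b over one chunk of the training data.
    public static (double[,] A, double[] b) Map(IEnumerable<(double[] x, double y, double w)> chunk, int d)
    {
        var A = new double[d, d];
        var b = new double[d];
        foreach (var (x, y, w) in chunk)
        {
            for (int i = 0; i < d; i++)
            {
                b[i] += w * y * x[i];
                for (int j = 0; j < d; j++)
                    A[i, j] += w * x[i] * x[j];
            }
        }
        return (A, b);
    }

    // Reducer: element-wise sum of the partial A's and b's from all mappers.
    // theta = A^-1 * b is then obtained with any linear solver (not shown).
    public static (double[,] A, double[] b) Reduce(IList<(double[,] A, double[] b)> partials, int d)
    {
        var A = new double[d, d];
        var b = new double[d];
        foreach (var (pa, pb) in partials)
            for (int i = 0; i < d; i++)
            {
                b[i] += pb[i];
                for (int j = 0; j < d; j++)
                    A[i, j] += pa[i, j];
            }
        return (A, b);
    }
}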
2) Naive Bayes - we have to estimate conditional probabilities here, of the form P(x_j = k | y), i.e. the probability that feature j of the feature vector takes the value k given that the label occurs or does not occur. In order to do so on MapReduce, a summation is done over the training data of the number of times x_j = k appears with each value of the label. Therefore in this case we need different kinds of mappers - one to count the subgroup where x_j = k and the label does not occur, another to count the subgroup where x_j = k and the label occurs, another for the subgroup of all examples where the label does not occur, and another for the subgroup of all examples where the label does occur. Notice that these four counts form a two-by-two contingency table, much like the one underlying precision and recall. The reducer then sums up the intermediate counts to get the final values of the parameters.
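A small sketch of the four counts for a single feature j and feature value k with a binary label follows; the names are illustrative and Laplace smoothing is omitted.

using System.Collections.Generic;

// Illustrative Naive Bayes counts for one feature j and value k with a binary label y.
static class NaiveBayesMapReduce
{
    // Mapper: four partial counts over one chunk of (x, y) examples.
    public static int[] Map(IEnumerable<(int[] x, int y)> chunk, int j, int k)
    {
        var c = new int[4]; // [x_j==k & y==1, x_j==k & y==0, y==1, y==0]
        foreach (var (x, y) in chunk)
        {
            if (y == 1) { c[2]++; if (x[j] == k) c[0]++; }
            else        { c[3]++; if (x[j] == k) c[1]++; }
        }
        return c;
    }

    // Reducer: sum the partial counts and form the two conditional probabilities.
    public static (double pGiven1, double pGiven0) Reduce(IEnumerable<int[]> partials)
    {
        var t = new int[4];
        foreach (var c in partials)
            for (int i = 0; i < 4; i++) t[i] += c[i];
        return ((double)t[0] / t[2], (double)t[1] / t[3]);
    }
}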
3) Gaussian Discriminative Analysis - The classic GDA algorithm needs to learn four statistics - the prior probability that the label occurs, the mean of the examples with that label, the mean of the examples without it, and the covariance of the data around the mean of its label. This method models each class / label with a normal distribution, usually represented by the mean and covariance for the class. For each label we compute the probability that the data belongs to it, and the class with the highest probability is chosen. Consequently, the mappers are different for each of these subgroup sums. Finally one reducer aggregates the intermediate sums and calculates the final values of the parameters.
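The following sketch shows the prior and the two class means computed in summation form; the covariance is aggregated the same way once the means are known. The names are illustrative.

using System.Collections.Generic;

// Illustrative GDA sufficient statistics: class prior and per-class means in summation form.
static class GdaMapReduce
{
    // Mapper: per-class counts and per-class feature sums over one chunk.
    public static (long n1, long n0, double[] sum1, double[] sum0) Map(
        IEnumerable<(double[] x, int y)> chunk, int d)
    {
        long n1 = 0, n0 = 0;
        var s1 = new double[d];
        var s0 = new double[d];
        foreach (var (x, y) in chunk)
        {
            var s = y == 1 ? s1 : s0;
            if (y == 1) n1++; else n0++;
            for (int i = 0; i < d; i++) s[i] += x[i];
        }
        return (n1, n0, s1, s0);
    }

    // Reducer: aggregate counts and sums, then compute phi = P(y = 1) and the two means.
    // The shared covariance is aggregated the same way, using partial sums of
    // (x - mu_y)(x - mu_y)^T, once the means are available.
    public static (double phi, double[] mu1, double[] mu0) Reduce(
        IList<(long n1, long n0, double[] sum1, double[] sum0)> partials, int d)
    {
        long n1 = 0, n0 = 0;
        var mu1 = new double[d];
        var mu0 = new double[d];
        foreach (var p in partials)
        {
            n1 += p.n1; n0 += p.n0;
            for (int i = 0; i < d; i++) { mu1[i] += p.sum1[i]; mu0[i] += p.sum0[i]; }
        }
        for (int i = 0; i < d; i++) { mu1[i] /= n1; mu0[i] /= n0; }
        return ((double)n1 / (n1 + n0), mu1, mu0);
    }
}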
4) K-means Here we know in advance how many clusters we want, so the data can be divided into that many subgroups by assigning each point to its closest centroid. Each mapper runs through its local data, determines the closest centroid for each point, and accumulates the vector sum of the points closest to each centroid. These partial sums are emitted with the centroid index as the key and a tuple of partial vector sum and count as the value. The reducer aggregates the sums and counts from all mappers for each centroid and calculates the new centroid.
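A sketch of one K-means iteration in this style is below; each mapper emits, per centroid index, a partial vector sum together with a count so the reducer can divide. Names are illustrative.

using System.Collections.Generic;

// Illustrative K-means step: each mapper emits centroid index -> (partial sum, count),
// the reducer aggregates and recomputes the centroids.
static class KMeansMapReduce
{
    // Mapper: assign each local point to its closest centroid and accumulate partial sums.
    public static Dictionary<int, (double[] sum, long count)> Map(
        IEnumerable<double[]> points, double[][] centroids)
    {
        var acc = new Dictionary<int, (double[] sum, long count)>();
        foreach (var p in points)
        {
            int best = 0;
            double bestDist = double.MaxValue;
            for (int c = 0; c < centroids.Length; c++)
            {
                double dist = 0;
                for (int i = 0; i < p.Length; i++)
                    dist += (p[i] - centroids[c][i]) * (p[i] - centroids[c][i]);
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            if (!acc.TryGetValue(best, out var e)) e = (new double[p.Length], 0);
            for (int i = 0; i < p.Length; i++) e.sum[i] += p[i];
            acc[best] = (e.sum, e.count + 1);
        }
        return acc;
    }

    // Reducer: merge the partial sums for one centroid index and divide by the total count.
    public static double[] ReduceCentroid(IEnumerable<(double[] sum, long count)> partials, int d)
    {
        var total = new double[d];
        long n = 0;
        foreach (var (sum, count) in partials)
        {
            n += count;
            for (int i = 0; i < d; i++) total[i] += sum[i];
        }
        for (int i = 0; i < d; i++) total[i] /= n;
        return total;
    }
}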
5) Logistic Regression (LR). In this algorithm, the hypothesis that lends itself to MapReduce is h(x) = 1 / (1 + exp(-theta^T x)). Learning is done by fitting theta to the training data so that the log-likelihood is maximized, and theta is improved in each iteration. This is done with multiple mappers that sum up the gradient and Hessian contributions over subgroups of the partitioned data, using the theta computed in the previous iteration. The reducer sums up the partial values for the gradient and Hessian to perform the update of theta.
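Below is a sketch of one iteration, showing only the gradient of the log-likelihood; the Hessian needed for a full Newton-Raphson update is accumulated the same way, and a plain gradient ascent step is used here for brevity. Names are illustrative.

using System;
using System.Collections.Generic;

// Illustrative logistic regression iteration: mappers compute partial gradients of the
// log-likelihood on their data partitions; the reducer sums them and updates theta.
static class LogisticMapReduce
{
    static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

    // Mapper: partial gradient sum over one chunk, given the current theta.
    public static double[] Map(IEnumerable<(double[] x, int y)> chunk, double[] theta)
    {
        var grad = new double[theta.Length];
        foreach (var (x, y) in chunk)
        {
            double z = 0;
            for (int i = 0; i < theta.Length; i++) z += theta[i] * x[i];
            double err = y - Sigmoid(z);
            for (int i = 0; i < theta.Length; i++) grad[i] += err * x[i];
        }
        return grad;
    }

    // Reducer: sum the partial gradients and take one ascent step on theta.
    public static double[] Reduce(IEnumerable<double[]> partials, double[] theta, double rate)
    {
        var grad = new double[theta.Length];
        foreach (var g in partials)
            for (int i = 0; i < theta.Length; i++) grad[i] += g[i];
        var updated = (double[])theta.Clone();
        for (int i = 0; i < theta.Length; i++) updated[i] += rate * grad[i];
        return updated;
    }
}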
6) Neural Network: This uses a strategy called backpropagation. Each mapper propagates its subset of the data through the network. A three-layer network is used for this purpose, with two output neurons classifying the data into two categories. Each weight in the network is adjusted by backpropagating the error, which yields a partial gradient for each weight. The reducer then sums the partial gradients from the mappers and does a batch gradient descent update of the weights of the network.
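A sketch of the MapReduce wrapper is below. The per-example backpropagation routine itself is assumed to be supplied by the caller as a delegate operating on a flattened weight vector, so only the summation and the batch update are shown; names are illustrative.

using System;
using System.Collections.Generic;

// Illustrative MapReduce wrapper around backpropagation: mappers sum per-example gradients,
// the reducer adds them up and applies one batch gradient descent step to the weights.
static class NeuralNetMapReduce
{
    // Mapper: propagate each local example through the network and accumulate its gradient.
    public static double[] Map(
        IEnumerable<(double[] x, double[] target)> chunk,
        double[] weights,
        Func<(double[] x, double[] target), double[], double[]> backpropGradient)
    {
        var grad = new double[weights.Length];
        foreach (var example in chunk)
        {
            var g = backpropGradient(example, weights);
            for (int i = 0; i < grad.Length; i++) grad[i] += g[i];
        }
        return grad;
    }

    // Reducer: sum partial gradients from all mappers and update the weights in one batch step.
    public static double[] Reduce(IEnumerable<double[]> partials, double[] weights, double rate)
    {
        var grad = new double[weights.Length];
        foreach (var g in partials)
            for (int i = 0; i < grad.Length; i++) grad[i] += g[i];
        var updated = (double[])weights.Clone();
        for (int i = 0; i < weights.Length; i++) updated[i] -= rate * grad[i];
        return updated;
    }
}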
7) Principal Component Analysis. This algorithm computes the principal eigenvectors of the covariance matrix, which is expressed as (1/m) times the sum over i of x_i * x_i^T, minus the term mu * mu^T, where mu is the mean. The latter term can also be expressed in summation form, since mu = (1/m) times the sum of the x_i, so the covariance matrix is entirely in summation form. The partial sums can be mapped to different cores and the reducer aggregates them.
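The covariance computation in summation form might look like the sketch below; the eigen-decomposition that extracts the principal components is left out. Names are illustrative.

using System.Collections.Generic;

// Illustrative summation form of the covariance matrix for PCA:
// Cov = (1/m) * sum_i x_i x_i^T  -  mu mu^T,  where mu = (1/m) * sum_i x_i.
static class PcaMapReduce
{
    // Mapper: partial sums of x x^T and x over one chunk.
    public static (double[,] outer, double[] sum, long count) Map(IEnumerable<double[]> chunk, int d)
    {
        var outer = new double[d, d];
        var sum = new double[d];
        long m = 0;
        foreach (var x in chunk)
        {
            m++;
            for (int i = 0; i < d; i++)
            {
                sum[i] += x[i];
                for (int j = 0; j < d; j++) outer[i, j] += x[i] * x[j];
            }
        }
        return (outer, sum, m);
    }

    // Reducer: aggregate the partial sums and assemble the covariance matrix.
    // The principal eigenvectors of cov are then computed by any eigensolver (not shown).
    public static double[,] Reduce(IList<(double[,] outer, double[] sum, long count)> partials, int d)
    {
        var outer = new double[d, d];
        var mu = new double[d];
        long m = 0;
        foreach (var p in partials)
        {
            m += p.count;
            for (int i = 0; i < d; i++)
            {
                mu[i] += p.sum[i];
                for (int j = 0; j < d; j++) outer[i, j] += p.outer[i, j];
            }
        }
        for (int i = 0; i < d; i++) mu[i] /= m;
        var cov = new double[d, d];
        for (int i = 0; i < d; i++)
            for (int j = 0; j < d; j++) cov[i, j] = outer[i, j] / m - mu[i] * mu[j];
        return cov;
    }
}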
8) Independent component analysis - This is a method where the observed data is assumed to be a linear transformation of independent source signals, and it tries to identify those independent source vectors. The goal is to find the unmixing matrix W. Since the log-likelihood of W is a sum over the training examples, we can use batch gradient ascent to optimize it. Each mapper therefore computes the partial gradient of the likelihood over its subgroup, and the reducer aggregates the partial gradients and updates W.
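The sketch below assumes the usual sigmoid source prior, under which the data-dependent part of the gradient of the log-likelihood is the sum over examples of (1 - 2*g(W x)) x^T; the reducer returns this aggregate, and the caller adds the m * inverse(W^T) term and steps W. Names are illustrative.

using System;
using System.Collections.Generic;

// Illustrative summation form of the ICA gradient (sigmoid source prior).
static class IcaMapReduce
{
    static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

    // Mapper: partial sum of the data-dependent gradient term over one chunk.
    public static double[,] Map(IEnumerable<double[]> chunk, double[,] W, int d)
    {
        var grad = new double[d, d];
        foreach (var x in chunk)
        {
            for (int i = 0; i < d; i++)
            {
                double wx = 0;
                for (int j = 0; j < d; j++) wx += W[i, j] * x[j];
                double coeff = 1.0 - 2.0 * Sigmoid(wx);
                for (int j = 0; j < d; j++) grad[i, j] += coeff * x[j];
            }
        }
        return grad;
    }

    // Reducer: sum the partial gradients; the caller adds m * inverse(W^T) and steps W.
    public static double[,] Reduce(IEnumerable<double[,]> partials, int d)
    {
        var grad = new double[d, d];
        foreach (var g in partials)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++) grad[i, j] += g[i, j];
        return grad;
    }
}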
#coding exercise
How do we know which side of a sorted array has the steeper gradient?
void PrintSideWithSteeperGradient(int[] numbers)
{
   if (numbers == null || numbers.Length == 0) { Console.WriteLine("None"); return; }
   if (numbers.Length == 1) { Console.WriteLine("Both"); return; }
   int leftStart = 0;
   int leftEnd = numbers.Length / 2 - 1;
   if (numbers.Length % 2 != 0) { leftEnd = numbers.Length / 2; } // odd length: the halves share the middle element
   int rightStart = numbers.Length / 2;
   int rightEnd = numbers.Length - 1;
   double leftGradient = 0;
   double rightGradient = 0;
   // by convention, a single-element half contributes its value as the gradient
   if (leftStart == leftEnd) leftGradient = numbers[leftStart];
   if (rightStart == rightEnd) rightGradient = numbers[rightEnd];
   // slope between the endpoints of each half; cast to double to avoid integer division
   if (leftStart < leftEnd) leftGradient = (double)(numbers[leftEnd] - numbers[leftStart]) / (leftEnd - leftStart);
   if (rightStart < rightEnd) rightGradient = (double)(numbers[rightEnd] - numbers[rightStart]) / (rightEnd - rightStart);
   if (leftGradient == rightGradient) Console.WriteLine("Both");
   else if (leftGradient > rightGradient) Console.WriteLine("Left");
   else Console.WriteLine("Right");
}


We return to the discussion of applying machine learning algorithms with MapReduce.

In Expectation Maximization, the expected pseudo count (responsibility) of each example is calculated and this is used to update the mixture prior, the mean, and the covariance from the training data. The prior is computed on subgroups by having each mapper compute the sum of expected pseudo counts for its subgroup and the reducer aggregate them and divide by m. To calculate the mean, each mapper computes the partial sum of the expected pseudo count times the feature vector, as well as the partial sum of the pseudo counts themselves; the reducer then aggregates the partial sums and divides. For the covariance, each mapper calculates the subgroup sum of the expected pseudo count times the difference of the feature vector from the mean, times that difference transposed; as for the mean, the sum of expected pseudo counts for the local subgroup is also emitted by the mapper. The reducer then sums up the partial results and divides them.
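A sketch of the M-step sums for a single mixture component is below, assuming the responsibilities (expected pseudo counts) have already been computed in the E-step; the covariance sum follows the same pattern once the mean is available. Names are illustrative.

using System.Collections.Generic;

// Illustrative M-step sums for one mixture component in EM: the mapper emits partial sums,
// the reducer divides to obtain the updated prior and mean. The covariance is aggregated the
// same way using w * (x - mu)(x - mu)^T once the mean is available.
static class EmMapReduce
{
    // Mapper: partial sums of responsibilities and responsibility-weighted features.
    public static (double wSum, double[] wxSum, long count) Map(
        IEnumerable<(double[] x, double w)> chunk, int d)
    {
        double wSum = 0;
        var wxSum = new double[d];
        long m = 0;
        foreach (var (x, w) in chunk)
        {
            m++;
            wSum += w;
            for (int i = 0; i < d; i++) wxSum[i] += w * x[i];
        }
        return (wSum, wxSum, m);
    }

    // Reducer: aggregate and divide to get the component's prior and mean.
    public static (double phi, double[] mu) Reduce(
        IList<(double wSum, double[] wxSum, long count)> partials, int d)
    {
        double wSum = 0;
        long m = 0;
        var mu = new double[d];
        foreach (var p in partials)
        {
            wSum += p.wSum; m += p.count;
            for (int i = 0; i < d; i++) mu[i] += p.wxSum[i];
        }
        for (int i = 0; i < d; i++) mu[i] /= wSum;
        return (wSum / m, mu);
    }
}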

In Support Vector Machines, the linear SVM's primary goal is to minimize the squared norm of the weight vector plus a constant times the sum of the slack variables (under either hinge loss or quadratic loss), subject to the constraint that each label times the score of its training vector is at least 1 minus the slack variable. The gradient (and, for the smooth losses, the Hessian) of this objective are written in summation form over the training data. The mappers calculate the partial gradients and the reducer sums up the partial results to update the weight vector.
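A sketch using the hinge-loss subgradient is below; the quadratic-loss variant with its Hessian follows the same summation pattern. Names are illustrative.

using System.Collections.Generic;

// Illustrative linear SVM step in summation form (hinge loss, subgradient): the objective is
// (1/2)||w||^2 + C * sum_i max(0, 1 - y_i w.x_i). Each mapper sums y_i x_i over the margin
// violators in its subgroup; the reducer combines the partial sums with the w term and updates w.
static class SvmMapReduce
{
    // Mapper: partial sum of y * x over local examples that violate the margin (y * w.x < 1).
    public static double[] Map(IEnumerable<(double[] x, int y)> chunk, double[] w)
    {
        var sum = new double[w.Length];
        foreach (var (x, y) in chunk)
        {
            double score = 0;
            for (int i = 0; i < w.Length; i++) score += w[i] * x[i];
            if (y * score < 1)
                for (int i = 0; i < w.Length; i++) sum[i] += y * x[i];
        }
        return sum;
    }

    // Reducer: subgradient = w - C * (aggregated sum); take one descent step on w.
    public static double[] Reduce(IEnumerable<double[]> partials, double[] w, double C, double rate)
    {
        var sum = new double[w.Length];
        foreach (var p in partials)
            for (int i = 0; i < w.Length; i++) sum[i] += p[i];
        var updated = new double[w.Length];
        for (int i = 0; i < w.Length; i++) updated[i] = w[i] - rate * (w[i] - C * sum[i]);
        return updated;
    }
}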
