Thursday, November 26, 2015

We bring out the similarities between the batch gradient descent and conjugate gradient methods, both of which we have discussed earlier.
Essentially, both try to solve a linear equation of the form Ax = b.
Both calculate gradients for descent, and the gradient is a vector field that, at a given point, points in the direction of steepest change of the objective.
Steepest descent finds successive orthogonal directions by taking each descent direction as far as possible before choosing the next.
The steepest descent algorithm proceeds this way:
set i to 0
set r to b - Ax
set delta to r-transposed.r
set delta-0 to delta
while i < i-max and delta > epsilon^2.delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50
        r = b - Ax
    else
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1

The reason for the use of the number 50 is that after a certain number of iterations, the exact residual is recalculated to remove accumulated floating point error. Of course, the number 50 is arbitrary; for large n, square_root(n) may be more appropriate instead.
If the tolerance is large, the residual need not be corrected at all. If the tolerance is close to the limits of the floating point precision of the machine, a test should be added after delta is evaluated to check whether delta is less than epsilon^2.delta-0, and if this test holds true, the exact residual should be recomputed and delta reevaluated. This prevents the procedure from terminating early due to floating point roundoff error.
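As a sketch of the procedure above, here is a plain-Python version for small dense systems (the helper names and the example matrix are illustrative, not from the original):

```python
# Steepest descent for Ax = b (A symmetric positive definite),
# following the pseudocode above, in pure Python for small systems.

def matvec(A, x):
    # matrix-vector product A.x
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steepest_descent(A, b, x, i_max=1000, eps=1e-10):
    i = 0
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # r = b - Ax
    delta = dot(r, r)
    delta0 = delta
    while i < i_max and delta > eps ** 2 * delta0:
        q = matvec(A, r)
        alpha = delta / dot(r, q)
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
        if i % 50 == 0:
            # periodically recompute the exact residual to shed round-off
            r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        else:
            r = [ri - alpha * qi for ri, qi in zip(r, q)]
        delta = dot(r, r)
        i += 1
    return x
```

For example, with A = [[4, 1], [1, 3]] and b = [1, 2], the iteration converges to roughly (1/11, 7/11).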
courtesy: An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, by Jonathan Richard Shewchuk

Tuesday, November 24, 2015

#codingexercise
double GetStandardDeviation(List<int> numbers)
{
    if (numbers == null || numbers.Count == 0) return 0;
    double count = numbers.Count;
    double mean = numbers.Sum() / count;
    double variance = 0;
    foreach (var number in numbers)
        variance += (number - mean) * (number - mean);
    // population standard deviation; divide by (count - 1) for the sample form
    return Math.Sqrt(variance / count);
}

Parallel and Concurrent programming of Independent Component Analysis:

A famous example of Independent Component Analysis is given by the “cocktail party problem”, where a recording of people talking simultaneously in a room is studied to separate out the underlying speech signals from the sample data. This is based on two assumptions: one, that the source signals are independent of each other, and two, that the values of each source signal have non-Gaussian distributions.

When the data is represented by a random vector x = (x1, x2, …, xm)-transposed and the components by the random vector s = (s1, s2, …, sn)-transposed, the task is to transform the observed data x, using a linear static transformation W, as s = Wx, into maximally independent components s, measured by some objective function of independence on (s1, s2, …, sn).

The data represented by x above is considered a sum of the independent components sk weighted by the mixing weights ai,k for each xi. This is written as x = A.s, and therefore W = A-inverse.

In ICA, the main goal is to compute the unmixing matrix W. The linear equation s = Wx can be solved by batch gradient ascent to optimize W’s likelihood. Recall that likelihood is another term for the conditional probability of the training vector, from the Bayes definition mentioned earlier. The update to the components for each training vector xi can then be determined from the expression

the column vector [ 1 - 2g(wj-transposed . xi) ], stacked over the rows wj of W, multiplied by xi-transposed

Here wj-transposed . xi is the projection of the data onto the j-th unmixing direction, and g is a sigmoid squashing function that links the expression to the log likelihood.

The components are updated by adding a small learning-rate multiple of this expression (together with a term involving the inverse of W-transposed, which keeps W invertible). This batch gradient ascent is performed till the components stabilize, at which point they can be used as predictors.
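A minimal sketch of one such batch update, assuming the standard maximum-likelihood form with a sigmoid g and a 2x2 unmixing matrix (the learning rate, helper names, and data are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def inv_transpose(W):
    # (W^T)^{-1} for a 2x2 matrix
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    return [[ W[1][1] / det, -W[1][0] / det],
            [-W[0][1] / det,  W[0][0] / det]]

def ica_batch_step(W, data, lr=0.01):
    # one batch gradient-ascent step on the log likelihood of W
    m = len(data)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x in data:
        s = matvec(W, x)                       # projections w_j^T x
        col = [1.0 - 2.0 * sigmoid(sj) for sj in s]
        for j in range(2):
            for k in range(2):
                grad[j][k] += col[j] * x[k]    # [1 - 2g(Wx)] . x^T
    wt_inv = inv_transpose(W)                  # keeps W invertible
    return [[W[j][k] + lr * (grad[j][k] / m + wt_inv[j][k])
             for k in range(2)] for j in range(2)]
```

Repeating `ica_batch_step` over the full training set until W stops changing mirrors the batch procedure described above.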

The original algorithm calls for the processing of the full training set before each update of the components.

In parallelization with subgroups of data, this expression is computed by each mapper and then aggregated by the reducer. The expression above can be calculated on each dataset separately, while the reducer performs the updates to the components by aggregating the sum.
In modified parallelization, we retain the same procedure for every new batch of data as it becomes available. For the already processed old data, where we have both the components determined so far and the summary (the expression above), we can use the summaries to calculate the components anew. There is no notion of overlapping and non-overlapping data here, since all the data participates in the iterative update of the components; we simply use the summary instead of recomputing the sum. This is a trivial rearrangement of the subgroups of data into old and new as data arrives, applying the same parallelization as with partitioned subgroups. An alternative could be to merge the components by weights against the summary, because each component is linearly transformed into the next update.
Note that the components are batch updated, so the merging will be across all components.

Modified Principal Component Analysis:

PCA can be considered as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis the loss can be tolerated.

To find the axes of the ellipsoid, we subtract the mean from each variable to center the data around the origin. Then we compute the covariance matrix of the data and calculate the eigenvalues and the corresponding eigenvectors of this covariance matrix. Then we orthogonalize the set of eigenvectors and normalize each to a unit vector. The resulting eigenvectors become the axes of the ellipsoid fitted to the data. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. The eigenvectors are ordered such that the greatest variance by some projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.

When we parallelize this algorithm on chunks of data, each mapper computes its contribution to the principal eigenvectors of the covariance matrix as 1/m times the summation, from i = 1 to m, of the support vector times its transpose, minus the second term comprising the mean times the mean transposed. The latter term, consisting only of means, can also be expressed in summation form. Therefore the covariance matrix is entirely in summation form.

The modified algorithm adjusts the summation forms the same way as in the earlier examples of adjusting the mean: by taking the weighted average of the summations, where the weights are the numbers of associated data points.

In other words:

For new and now available data:

   Compute the first component and the second component of the covariance matrix.

For older data:

  Adjust the covariance matrix by taking the weighted average of the components.
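A sketch of this summation-form combination, with hypothetical helper names, for small dense data:

```python
# Each chunk (mapper) contributes its count, its vector sum, and its sum
# of outer products x.x^T; the reducer combines them into the covariance
# matrix, since every term is in summation form.

def chunk_summary(points):
    n = len(points)
    d = len(points[0])
    s = [sum(p[j] for p in points) for j in range(d)]
    ss = [[sum(p[j] * p[k] for p in points) for k in range(d)]
          for j in range(d)]
    return n, s, ss

def combine(summaries):
    n = sum(c[0] for c in summaries)
    d = len(summaries[0][1])
    s = [sum(c[1][j] for c in summaries) for j in range(d)]
    ss = [[sum(c[2][j][k] for c in summaries) for k in range(d)]
          for j in range(d)]
    mean = [sj / n for sj in s]
    # cov = (1/n) sum x.x^T - mean.mean^T
    return [[ss[j][k] / n - mean[j] * mean[k] for k in range(d)]
            for j in range(d)]
```

Combining summaries from two chunks gives exactly the covariance matrix of the pooled data, which is what makes the old/new adjustment above possible.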

Monday, November 23, 2015

In Naïve Bayes, the classic algorithm is as follows:
Estimate the probability of a training vector given that a condition exists (aka likelihood).
Estimate the probability of a training vector given that the condition does not exist.
We also calculate the probability that the condition exists (aka prior) and the probability that it doesn't exist. These are referred to as weights to the above.
This is applied to the entire data.

Now we modify the previous algorithm for incremental data as follows: 

For new data: 
Calculate as above.

For old data: 
Sum the previously calculated target directly with the new target.

This is because the denominator remains the same and the weights remain the same. 
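A sketch of the count-based bookkeeping this implies, assuming discrete features (the class and helper names are illustrative):

```python
# Incremental Naive Bayes in summation (count) form: keep raw counts,
# add each new batch's counts to the old, and derive priors and
# likelihoods on demand, so old batches never need reprocessing.

from collections import defaultdict

class CountNB:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def update(self, samples):
        # samples: list of (feature_tuple, label) pairs
        for features, label in samples:
            self.class_counts[label] += 1
            for i, v in enumerate(features):
                self.feature_counts[label][(i, v)] += 1

    def prior(self, label):
        total = sum(self.class_counts.values())
        return self.class_counts[label] / total

    def likelihood(self, label, i, v):
        return self.feature_counts[label][(i, v)] / self.class_counts[label]
```

Updating with two batches in sequence yields the same counts, and hence the same priors and likelihoods, as updating with all the data at once.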
We now discuss discriminant analysis, which uses concepts similar to Bayes. In Bayes, we expressed the conditional probabilities. In DA we use the conditional probability and the prior for each target variable. This method applies to normally distributed data only, so we calculate the mean and the standard deviation. For computing the mean incrementally as and when data becomes available, we can find the mean of the new data and take the weighted average of the previous and current means. For computing the standard deviation as and when data becomes available, we use the variance from the previous data and the variance from the new data and take their weighted mean (strictly, this is exact only when recombined through the sums of squares, since the old and new means generally differ), then take the square root of the result to find the new standard deviation. In other words, we do the same as what we did for finding the mean incrementally, because the support variables or their squares are both in summation form. We argued earlier that the summation form lends itself to parallelization as well as to incremental updates.
Let us now find out how to build discriminant analysis incrementally. Thus far we have said that the mean, the standard deviation, and the probability can each be calculated incrementally. And DA can be expressed in a form that has two components: the first is based on the squared difference from the mean as numerator and the standard deviation as denominator, and the second is based on the natural logarithm of the probability. Both components clearly rely on parameters that can be found incrementally. Consequently, the DA can be calculated at every stage.
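A sketch of the incremental mean and variance update in summation form (the variances are recombined through E[x^2], so the result stays exact even when the old and new means differ; names are illustrative):

```python
# Combine summaries (count, mean, variance) of old and new data into the
# summary of the pooled data, without revisiting the old observations.

def combine_stats(n_old, mean_old, var_old, n_new, mean_new, var_new):
    n = n_old + n_new
    mean = (n_old * mean_old + n_new * mean_new) / n
    # recover E[x^2] for each part, recombine, then subtract the new mean^2
    ex2_old = var_old + mean_old ** 2
    ex2_new = var_new + mean_new ** 2
    ex2 = (n_old * ex2_old + n_new * ex2_new) / n
    return n, mean, ex2 - mean ** 2
```

For example, combining the summaries of [1, 2, 3] and [4, 5, 6] reproduces the mean and variance of [1, 2, 3, 4, 5, 6]; the standard deviation is then the square root of the returned variance.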







Sunday, November 22, 2015

Today we review a book called "Strategy Rules" by Yoffie and Cusumano.
Yoffie is a Harvard Business School professor and Cusumano is an MIT professor.
They offer a new take on three giants of entrepreneurship and technology - Bill Gates, Andy Grove and Steve Jobs - by bringing them together into one how-to guide on strategy.
According to the authors, the three men differed widely in their personalities but followed the same five rules for strategy and execution.
1. Look forward, reason back. The first rule was to look forward into the future and reason back to the actions required today. They imagined what the world could be and made a tremendous impact on it. All three had the ability to determine in detail what needed to happen immediately to turn vision into reality.
2. Make big bets, without betting the company. Gates, Grove and Jobs were all bold, but not reckless. They knew how to time or diversify their big bets so that even a huge strategic bet would not sink the company.
3. Build platforms AND ecosystems. Another important rule, the authors write, was to build platforms and ecosystems as opposed to pursuing a product strategy. This, we know from our review of "The Responsible Entrepreneur", is not just game changing but self-sustaining. As an example, Gates chose not to sell his product DOS outright to IBM, which requested it, but retained the right to license the system to other companies. The rest is history.
4. Exploit leverage and power. According to the authors, all three men could play both judo and sumo. Judo means using an opponent's strength and turning it into a weakness. As an example, Jobs successfully negotiated with the music companies for a license to their music: Apple was tiny, with a market share of only about 2%, yet the music companies made the agreement with Apple. This laid the foundation of the iTunes revolution. At the same time, the three did not hesitate to use their power.
5. Shape the company around your personal anchor. Each had his own tastes, strengths and interests - Gates the coding genius, Grove the precise engineer, and Jobs the wizard at design. The companies they built reflected these strengths. At their peak, Microsoft, Apple and Intel were together worth $1.5 trillion. And these companies are the talk of the world.




