Cluster computing

Sunday, December 27, 2015

Q: Given a circular linked list. Find the longest sequence of numbers, where a sequence starts with any number, for example 3, and ends when you see that number again, another 3,
Imagine the circular linked list
3 8 9 7 2 1 3 4 6 [3] same as the first element i.e 3
The longest sequence would be 3 8 9 7 2 1, the other candidate being 3, 4, 6
Finding for instance, starting at 8 and getting to the same 8 wouldn't count as a valid sequence.
Courtesy : CareerCup
A:
We build a hashtable of the last encountered index of the elements in the list. If the element is seen again, we calculate the difference in the new and the previous index and as long as it is less than the total count of elements in the list and more than the previously encountered difference, we update the difference and note down the start. After all the updates are made, we have a start index and the highest difference between two occurrences of same element.

void PrintLongestSequence(List<int> numbers)
{
int start = 0;
int length = 0;
var h = new Hashtable<int,int>();
for (int i = 0; i < numbers.Count; i++)
{
    if (h.Contains(numbers[i]))
    {
         var old = h[numbers[i]];
         var new = i;
         var difference = new - old;
         if (difference > length)
         {
               start = old;
               length = difference;
         }
     }
     h[numbers[i]] = i;
}
Console.WriteLine();
for (int i = start; i < start+length; i++)
    Console.Write("{0} ", numbers[start]);
Console.WriteLine();
}

Saturday, December 26, 2015

Given an 8 x 8 matrix, find all possible paths, moving one cell downwards or one cell to the right (one cell per movement) from cell (0,0) to cell (7,7)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace MatrixWalk
{
    class Program
    {
        const int rows = 8;
        const int cols = 8;
        static void Main(string[] args)
        {
            var path = new List<Tuple<int,int>>();
            int totalPaths = 0;
            GetPaths(0, 0, ref path, ref totalPaths);
        }
        public static void GetPaths(int x, int y, ref List<Tuple<int, int>> path, ref int totalPaths)
        {
            if (x < rows && y < cols)
                path.Add(Tuple.Create(x, y));
            if (x == rows-1 && y == cols-1)
            {
                totalPaths += 1;
                PrintPath(path, totalPaths);
            }
            if (x + 1 < rows)
                GetPaths(x + 1, y, ref path, ref totalPaths);
            if (y + 1 < cols)
                GetPaths(x, y + 1, ref path, ref totalPaths);
            path.RemoveAt(path.Count - 1);
        }
        public static void PrintPath( List<Tuple<int, int>> path, int totalPaths)
        {
            Console.WriteLine("-------------------------------");
            Console.WriteLine("Path number: {0}", totalPaths);
            Console.WriteLine("Path Size: {0}", path.Count);
            System.Diagnostics.Debug.Assert(path.Count == rows+cols-1);
            for (int i = 0; i < path.Count; i++)
            {
                Console.Write("[{0},{1}]", path[i].Item1, path[i].Item2);
            }
            Console.WriteLine();
            Console.WriteLine("-------------------------------");
        }
    }
}

Friday, December 25, 2015

Today we continue discussing the DMB algorithm from the paper titled optimal distributed online prediction using mini-batches by Dekel, Gilad-BachRach, Shamir and Xiao. One of the theorems they propose is one that bounds the expected regret over m samples.
The theorem is stated this way : Let f(w,z) be an L-smooth convex loss function in w for each input in the stream and assume that the stochastic gradient (Big-Delta) computed at the input zi has a std deviation squared variance for all predictions. If the update rule phi has the serial regret bound Psi as a function of this variance and m, then the expected regret of the algorithm over m samples is at most
(b + mu) times Psi as a function of ( std dev squared / b and upper ceiling of m / (b+mu)).

Specifically if Psi is a function as 2 D -squared L + 2 D std dev square-root(m) which was established earlier as the bound in the ideal serial solution and we set the batch size as b = m power rho for any rho in the range 0 to half, the expected regret bound is 2 D sigma sq.rt(m) + sigma sq.rt(m)
This bound is asymptotically equal to the ideal serial bound and even the constants in the dominant terms are identical. The bound scales nicely with the network latency and the cluster size k, because mu which actually scales logarithmically with k does not appear in the dominant sq.rt(m) term. On the other hand, the naïve no communication theorem is worse than this bound by a factor of square-root(k)

#codingexercise
Question : Find if two binary trees have common subtrees
answer : We find the inorder traversals of the trees as strings and then find common substrings in the two representations.
        static void FindCommonSubTrees(Node tree1, Node tree2)
        {
            var substrs = GetSubstrings(GetInOrder(tree1));
            foreach (var str in substrs)
                Console.WriteLine(str);
            var otherstrs = GetSubstrings(GetInOrder(tree2));
            foreach (var str in otherstrs)
                Console.WriteLine(str);
            var intersection = substrs.Intersect(otherstrs);
            foreach (var str in intersection)
                Console.WriteLine(str);
        }
        static List<string> GetSubstrings(string input)
        {
            var ret = new List<string>();
            ret.Add(input);
            for (int i = 0; i < input.Length; i++)
            {
                for (int pos = 0; pos <= input.Length; pos++)
                {
                    if (pos + i < input.Length)
                    {
                        var substr = input.Substring(pos, i);
                        ret.Add(substr);
                    }
                }
            }
            return ret;
        }

Thursday, December 24, 2015

int GetFirstEvenCountOccurence(List<int> numbers)
{
var countable = new Hashtable();
for (int I = 0; I < numbers.Count; I++)
{
if ( countable.contains(numbers[I])
    countable[numbers[I}] += 1;
else
    countable.Add(numbers[I], 1);
}
for (int I = 0; I < numbers.Count; I++)
{
if (countable[numbers[I]] %2 == 0)
     return numbers[I];
}
return -1;
}

Today we continue discussing the DMB algorithm where a deterministic update rule is applied in a distributed environment. This technique resembles the serial mini-batch technique described earlier and is therefore called the distributed mini-batch algorithm (DMB).
The algorithm processes the input stream in batches j = 1,2, where each batch contains b + mu consecutive inputs where b is the batch size and mu is the additional inputs. In each batch, all of the nodes use a common predictor wj. While observing the first b inputs in a batch, the nodes calculate and accumulate the stochastic gradients of the loss function f at wj. Once the nodes have accumulated b gradients altogether, they start a distributed vector sum operation to calculate the sum of these b gradients. While the vector sum operation completes in the background, mu additional inputs arrive and the system keeps processing them using the same predictor wj. The gradients for these additional mu inputs are discarded. Once the vector sum completes, each node holds the sum of the b gradients collected during batch j. Each node divides this sum by b and obtains the average gradient, which is denoted by gj-bar. Each node feeds this average gradient to the update rule phi, which returns a new synchronized prediction wj+1. While only the first b/k gradients are used to compute the next predictor. all b+mu inputs are counted in the regret calculation. If the DMB algorithm is applied to a spanning tree in the network, the aggregation is performed at the root and broadcasted.

Wednesday, December 23, 2015

Today we continue with the paper optimal distributed online prediction using mini-batches by Dekel, Gilad-BachRach, Shamir and Xiao. We now review the Distributed Mini-Batch for Stochastic online prediction where the authors show that the mini-batch idea can be exploited to obtain nearly optimal regret bounds. They begin with the assumption that any communication over the network incurs a latency. The network is an undirected graph G over the set of nodes, where each edge represents a bi-directional network link. The latency incurred between u and v is therefore proportional to the graph distance between them.and is maximum at the diameter The network is also assumed to have limited bandwidth but instead of quantifying it, they require that the communication load at each node remains constant, and does not grow with the number of nodes k or with the rate at which the incoming functions arrive.
The communication model used with this network is one that allows a distributed vector sum operation. This operation begins with each node holding a vector vj, and ends with each node holding the sum of the vectors vj from 1 to k. The messages are transmitted along a rooted minimum depth spanning tree where the leaves send their vectors to their parents who aggregate and in turn send the result to their parents until it reaches root. The root computes the sum from all and broadcasts it back to all the nodes.
There are two advantages with this model : one that each up-link and each down-link in the minimum spanning tree is used only once The vector sum operations can be started back to back. These vector sum operations will run concurrently without creating network congestion on any edge of the tree and these network operations are non-blocking where each node can continue processing inputs while the vector sum operations take place in the background. This lets us overcome the restriction imposed by network latency.
The DMB algorithm is a distributed mini-batch algorithm. where it runs on parallel on each node in the network again utilizing the summation form we discussed with the communication model above.
#codingquestion
Given an integer array, find the first element with an even number of occurrences.
First of all this problem is not the same as finding the first duplicate because the total number of times that duplicate can occur can still be odd. Hence we must process all the numbers in the array into a count table with elements as keys and their count of occurrences as values. Therefore a hashtable with constant time lookup will help. Given that the count of occurrences can be arbitrary, we must safeguard against no even number count of occurrences or other edge conditions. the first pass over the integer array populates the count table and the second pass over the integer array checks for the first element with a parity. To conserve space, we can use a bitmap where the bit at the index given by the number or element of the integer array is tested for parity but we would need another data structure to say whether an element was ever encountered.

Tuesday, December 22, 2015

Workflow for setting up amazon metrics to be collected:

1) Edit Billing preferences as shown:

Notice that the AWS billing preferences provide three options:

The first option is the one that we will be customizing for our customers hence this option is merely for validity check.

The second option is the one that we are interested in because we can subscribe to CloudWatch then.

The third option can also be used with automation and hence we set it up with a S3 bucket.

Notice that the policy to be applied for a bucket named raja0034 S3 billing is as follows:

{

"Version": "2008-10-17",

"Id": "Policy1335892530063",

"Statement": [

{

"Sid": "Stmt1335892150622",

"Effect": "Allow",

"Principal": {

"AWS": "arn:aws:iam::386209384616:root"

"Action": [

"s3:GetBucketAcl",

"s3:GetBucketPolicy"

"Resource": "arn:aws:s3:::raja0034billing"

{

"Sid": "Stmt1335892526596",

"Effect": "Allow",

"Principal": {

"AWS": "arn:aws:iam::386209384616:root"

"Action": [

"s3:PutObject"

"Resource": "arn:aws:s3:::raja0034billing/*"

}

]

}

For option 2) of selecting alarms and metrics, we have to set it up on CloudWatch which displays both ECS and EBS Metrics:

Selected Metrics could then include:

Note that the metrics are categorized by service. If you wanted EC2 and EBS specifically, you would change the region to the one where your instances are :

Notice we have changed from East region to West.

Then you would see the metrics that we want to grab statistics for which we will do periodically and save it in a database as time series data for historical information and query.
#codingexercise
Void sizeOf1or0onlyRectangle (int [,] matrix, int row, int col, int startx, int starty, int x, int y, ref int size, dir)
{
If (x+1 < row && y +1 < col &&
Matrix [x+1,y+1] = matrix [x, y] &&
Matrix [x, y+1] = matrix [x,y] &&

Matrix[x+1, y] = matrix [x, y] && dir == diagonal && all_elements_in_new_row_are_same(matrix, startx, starty, matrix[x,y]) &&

all_elements_in_new_col_are_same(matrix, startx, starty, matrix[x,y]))

sizeOf1or0onlyRectangle(matrix,row, col, x+1,y+1, ref size +(x+1)-startx + (y+1)-starty + 1, dir );

If ( y +1 < col &&

Matrix [x,y+1] = matrix [x, y] && dir == horizontal && all_elements_in_new_row_are_same(matrix, startx, starty, matrix[x,y]))

sizeOf1or0onlyRectangle(matrix,row, col, x,y+1, ref size +(y+1)-starty, dir );

If (x+1 < row &&

Matrix[x+1, y] = matrix [x, y] && dir==vertical && all_elements_in_new_col_are_same(matrix, startx, starty, matrix[x,y]))

sizeOf1or0onlyRectangle(matrix,row, col, x+1,y, ref size +(x+1)-startx, dir);

}