Cluster computing

Thursday, July 9, 2015

Today we discuss MapReduce algorithm in Hadoop with reference to their documentation.
MapReduce processes vast amounts of data in parallel on large clusters(thousands of nodes) in a reliable fault-tolerant manner.
A MapReduce job usually splits the input data set into independent chunks which are then processed in a shared nothing model by map tasks. The outputs of the maps are then input to the reduce tasks. The framework stores the input and output of a job in a filesystem, schedules the processing, monitors the activity and re-executes failed tasks. The framework employs a master slave model for the execution of the map and reduce.
While the jobs are monitored and completed by the framework, their configuration including the input or output locations are specified by the applications.
At each stage of the processing - the map, combine, reduce steps take the <key, value> pairs as the input and output. Both key and value implement the writable interface and the key classes have to implement the writablecomparable interface.

For eg:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>{
public void map (LongWritable key, Text Value, OutputCollector<Text, IntWritable> output, Reporter reporter)
:// eg tokenizes a sentence into words
}
}

public static class Reduce MapReduceBase implements Reducer<Text, IntWriteable, Text, IntWriteable> {
public void reduce (Text key, Iterator<IntWriteable> values, OutputCollector<Text, IntWriteable> output, Reporter) {
:// counts the number of words encountered
}
}
The Mapper splits the line into tokens which the tokenizer emits as a keyvalue pair of the words and the counts.
The Reducer sums up the values which represent occurrences for each key.

Together the mapper and the reducer form the payload to the framework and are specified by the applications.
The Mapper decides on the number of maps based on the input data and blocksize. The Reducer has three primary phases - shuffle, sort and reduce. Shuffle takes the sorted output of all the mappers as input. Reducer inputs are grouped by keys during sort. Both shuffle and sort occur simultaneously as map outputs are fetched. The output of the reducer is not sorted. A partitioner partitions the key space. It controls the partitioning of the keys of the intermediate map-outputs. By default, a hash function is used to derive the partitions. A reporter reports progress and relevant application-level status messages. An OutputCollector collects data output by the Mapper or the reducer.

Wednesday, July 8, 2015

1) Given two strings find the longest common substring. We can use dynamic programming for this.

We define the optimal substructure as follows:

Let S be a string of length m and T be a string of length n. Recall that the pattern matching of strings involved computing a prefix. Here too we find all possible prefixes and taking any two prefixes, we choose the one with the longest suffix.

This can be written as

The longest common suffix =

LCSuffix(S1p-1, T1..q-1)+1 if S[p] = T[q]
0 otherwise

Then we can choose the longest common substring as the max of the longest common suffixes for i ranging from 1 to m and j ranging from 1 to n.

LCSubstr(S,T) = max 1<= i <=m and 1<= j <=n LCSuffix(S1..i, T1..j)

We will investigate suffix tree for solving this. A suffix tree helps arrange the strings as a path from the root to the leaves with the strings denoted by numbers and appearing at the leaves of the suffix tree. The problem then can be translated to one which finds out which strings appear below each node. The tree is traversed bottom up ( which means level order traversal with the results reversed level wise) and a bit vector is maintained for all possible strings. Alternatively the lowest common ancestor of two strings can be found out.

Suffix tree reminds us of a trie. A prefix tree is also a trie. Both have special terminators on the nodes to signify the end of a string or word.

Another interview question is as follows:

Given a csv of name and age of persons, sort it by age. For this we could use a list data structure that already has a method sort and can take an IComparer for duplicates. And duplicate elements can also be stored as lists within the overall list. Of course, .Net also provides a tuple collection.

Another interview question is based on levels of friendship:
Given the input as
A: B C D
D: B E F
E: C F G
This produces the output as
Level 1 : B, C, D
Level 2 : E, F
Level 3 : G

The given example has no cycles in the graph. So starting at the root as the chosen node, we can form a tree and then complete a Level wise traversal to find the next levels of friendship.

However, the problem may not be restricted to acyclic graph even if it is directed. Hence we maintain a queue for each level

Monday, July 6, 2015

We review another example of moving on a checkerboard . We are given an n by n checkerboard it's valid to move ftom any of the boxes on the bottom to goto any of the boxes in the top board. A move can only be made to a square on the top left, directly above and a top right as long as they are within the square checkout each move results in a positive or negative amount of money . We have to find the optimal set of moves that maximizes profit.
We apply dynamic programming by defining the opt optimal substructure.
The current position i,j can be moved to from one of three positions in the row below. Because we make an opt choice in this move, the previous set of Moves upto and inclusive of the previous position is also optimal.
Hence we define as recurrence relationship as

Next move = p [(i-1,j-1), (i,j)] + d (i-1,j-1)
= p [(i-1,j), (i,j)] +d (i-1,j)
=p[i-1, j+1 , (i,j)] + d (i-1, j+1)
And repeat for n steps.
The bottom up method would be like this:
let d[1..n, 1..n] and s[1 .. n-1, 2..n] be new tables for cost and index of optimal

for i = 1 to n

d[i,i] = - infinity

for l = 2 to n // chain length

for i = 2 to n-l+1

j = i+l-1

d[i,j] = - infinity

q = max ( p [(i-1,j-1), (i,j)] + d (i-1,j-1)

, p [(i-1,j), (i,j)] +d (i-1,j)

, p[i-1, j+1 , (i,j)] + d (i-1, j+1))

if q > d[i,j]

d[i,j] = q

s[i,j] = k

return d and s

The top down strategy using recursion is as follows:

RecursiveStepping (i, j, d)

If I == n-1
Return d(i,j)
Q = max (p [(i-1,j-1), (i,j)]+ RecursiveStepping (i-1,j-1,d)

, p [(i-1,j), (i,j)] +RecursiveStepping(i-1,j,d)

, p[i-1, j+1 , (i,j)] + RecursiveStepping (i-1, j+1, d))
If q > d (i,j)
D (i,j) = k
Return d(i,j)

We review a coding question to determine the count of numbers that have 4 as a digit from 1 to n.

int getCount(int n)
{
int count = 0;
for (int i = 0; i < n; i++)
if (has4(i)) count ++;
return count;
}

bool has4(int n)
{
while (x != 0)
{
if (x%10 == 4)
return true;
x = x / 10;
}
return false;
}

We review webhooks today. We discussed that they were useful for notifications in REST API implementation.

In addition, Web hooks do the following :
1) keep Data in sync
2) give other applications a chance to do something
3) provide necessary params on certain events

Web hooks take a URL to send the notification to. A "postbin" web service will receive and log the notifications so that it can monitor all POST HTTP requests it receives. This URL is taken from a registration form hosted by the publisher. When this notification arrives, the subscriber can choose to respond by making additional API calls based on the information in the notification.

If the registration is to be avoided, APIs can optionally take a callback parameter that takes in a URI. This way during the execution of this specific API, a POST method will be called on the URI provided with a predetermined template of information. This lets the caller subscribe and unsubscribe dynamically.

Saturday, July 4, 2015

#codingexercise
Write an efficient program to reverse the bits of a number.
We do this by looping through the number and testing each bit. If the bit is set at position i then we set the bit at total_number_of_possible_bits -1 - i in the reverse number.
unsigned int reverseBits(unsigned int num)
{
unsigned int number_of_possible_bits = sizeof(num) * 8;
unsigned int reverse_num = 0;
for (int i = 0; i < number_of_possible_bits; i++)
{
int temp = (num & (1 << i );
if (temp)
reverse_num |= (1 << ((number_of_possible_bits - 1) - i));
}
return reverse_num;
}

Another coding interview question is how to count the number of binary search trees that are within a numerical range?.
This can be found by performing a traversal
If the subtrees are in range the root is also in range.

bool getCount(Node root, ref count)
{
if (root == NULL) return true;
Bool l = false;
Bool r = false;
if (root.left) l = getCount(root.left, count);
if (root.right) r = getCount(root.right, count);
if ( l && r && inRange(root.left, root.right))
count++
return false;
}

Another interview problem is schedule a meeting
Given an employee array of a large number of employees, find a window of time with maximum duration for a meeting that suits all the employees in the input.
Since each employee has different availabilities we can first find the availability ranges for each employee. This may result in a number of time Slots greater than the number of employees. Then taking time as increments of time slices such as a minute, we can write them down as sequences. After that we can find the longest common subsequence. This approach can scale to any length and number of time range of employees when the time slice increments are more granular. It also finds a maximum number of employees who can make it to a meeting when all cannot.

This problem can be attempted in many ways but we focus on one based on max flow algorithm. The reason we choose maximum flow is because we want to find the window that is maximum. We consider Ford Fulkerson algorithm.
Before using Ford Fulkerson, we have to transform the problem from multisource multisink to single source single sink by adding a consolidated source and a consolidated sink. Each flow is considered as a viability for an employee and the max flow is when we can augment this flow for all the employees. We consider granular time slots for augmenting this maximum flow. Then we add increment timeslots for maximum duration of the meeting.

Matrix-chain-order problem.

A matrix chain order determines the optimal number of scalar multiplications needed to compute a matrix chain product. Even though matrix multiplication is associative, the order of parenthesizing the product impacts the cost of the product. This is because the order determines the number of scalar multiplications to be performed because a matrix multiplication cost is based on the columns of the preceding matrix to be multiplied with the number of rows from the succeeding matrix. Therefore, by fully parenthesizing the matrix product of A1, A2, … An matrices we can compute the total cost and consequently the optimal solution will have an order of multiplication corresponding to the lowest cost.

To solve this problem, we apply dynamic programming by following the four-step sequence:

1. Characterize the structure of an optimal solution

2. Recursively define the value of an optimal solution

3. Compute the value of an optimal solution

4. Construct an optimal solution from computed information

The first step can be performed by arguing that finding the optimal solution of the product of A1, A2 … An is the same as finding the optimal solution of two subproblems obtained by splitting the range into two involving A1…Ak and Ak+1 …An and then combining these optimal subproblem solutions.

Next the recurrence is described with a termination condition as zero cost multiplication when the chain consists of only one matrix. The progress condition can be described as the minimum cost for the range i to j as the minimum of the sum of the cost associated with the sub problems plus the scalar multiplication cost of the results where this sum is calculated one for each chosen k before taking the minimum.

Step 3 can be performed by using a bottom up approach. Given that each matrix Ai has pi-1 x pi dimensions, a bottom up approach solves incremental dimensions so that the solution of the matrix involving previous dimensions can be readily looked up. There are two tables maintained – one that stores the cost of computing a matrix chain product and another auxiliary table that records which index k achieved the optimal cost in computing the cost stored.

Finally we can compute the optimal solution using the auxiliary index table above. We can find the k from the auxiliary table that stores it. We can determine the earlier matrix multiplications recursively since s[1,s[1,n]] determines the last matrix multiplication when computing the preceding component and the s[s[1,n]+1,n] determines the last matrix multiplication when computing the succeeding component of the product of the two subproblems. Again the termination condition is the single element chain.

Moreover, as with dynamic programming problems with bottom-up approach, there is also a corresponding recursive solution using a top down strategy and optional memorization to improve efficiency . That top down stragey is based directly on the recurrence mentioned above. We still maintain a table of costs but the order of filling the table is more like the order of recursion.

The bottom up strategy:
Matrix-Chain-Order(p)
n = p.length - 1
let m[1..n, 1..n] and s[1 .. n-1, 2..n] be new tables for cost and index of optimal
for i = 1 to n
m[i,i] = 0
for l = 2 to n // chain length
for i = 1 to n-l+1
j = i+l-1
m[i,j] = infinity
for k = i to j-1
q = m[i,k] + m[k+1,j] + pi-1pkpj
if q < m[i,j]
m[i,j] = q
s[i,j] = k
return m and s

The top down strategy:
Recursive-Matrix-Chain(p, i, j)
if i == j
return 0
m[i,j] = infinity
for k = i to j - 1
q = recursive-matrix-chain(p,i,k) + recursive-matrix-chain(p, k+1,j) + pi-1pkpj
if q < m[i,j]
m[i,j] = q
return m[i,j]

Friday, July 3, 2015

#codingexercise

/***

Add 2 numbers represented as linked list and return the resulting sum as a linked list.

How are numbers represented as linked list?

Suppose we have 459, it is represented as:

4->5->9->null

So 4 is the head and 9 is the tail.

Say we are given 2 numbers like: 459 and 891, their sum is 1350. That will be represented as:

1->3->5->0.

Implement the function that takes head pointer of 2 numbers as linked list and

returns a head pointer to a new linked list, which is basically the sum of both the numbers.

***/

Node GetSum(Node root1, Node root2)

{

// read the value from root 1

int val1 = GetValue(root1);

// read the value from root 2

int val2 = GetValue(root2);

// write the value of root1 + root2

if ( val1 + val2 < val1 || val1 + val2 < val2) // overflow

return NULL; // or raise exception

return WriteValue(val1 + val2);

}

Node WriteValue(int val)

{

Node root = NULL;

int v = val;

int len = 0;

while (v != 0 ) {

len = len + 1;

v = v / 10;

}

v = val;

Node Prev = null;

for ( i = 0; i < len ; i++)

{

var n = new Node();

n.Data = (Math.floor(v / Power(10, len - i - 1))) % 10;

n.Next = Null;

if (Prev) {

prev.next = n;

prev = n;

}

else

root = n;

}

return root;

}

int GetValue(Node root1)

{

Node cur = root1;

int val1 = 0;

int len = 0;

while(cur)

{

len = len + 1;

cur = cur.Next;

}

// len = 2

cur = root1;

for ( i = 0; i < len; i++)

{

val1 += cur.Data * Power(10, len - i - 1);

cur = cur.Next;

}

// val1 = 29

}

NULL list2

list1 NULL

UnitValue List2

list1 UnitValue

ArbitraryValue List2

List1 ...

Overflow & Underflow