Wednesday, August 31, 2016

Persistent connections versus pooling - PHP framework
When the traffic to one or more PHP applications is high, the application may open several connections to the database server using the same credentials. The database server may reject these, saying that there are too many connections open, and the application throws an exception saying CDbConnection could not open a connection. In such cases, the database server expects the application not to open new connections but to use some form of connection pooling. Pooling helps conserve the resources that would otherwise be spent on every connection. It also makes the application run faster, as connections don't have to be opened on every request. In addition to the max_connections parameter, another parameter that plays an important role in pooling is thread_cache_size. Requests for threads are satisfied by reusing threads taken from the cache whenever possible, and a new thread is created only when the cache is empty. Typically, this variable is set so that the fraction of requests for which threads are newly created is rather small (in MySQL it is auto-sized as 8 + max_connections/100, capped at 100).
Consequently, on a high-traffic server this may be set to an arbitrarily high value.
However, a PHP framework operates on a per-request basis and does not support connection pooling. The alternative is to use something the framework itself provides, since we cannot use anything outside the framework. One such technique is persistent connections: we set this property to true in the application config file under the db section.
A caveat with this approach is that it does not address issues originating from the SQL used on a connection by the user or application. If a bug in the SQL leads to long-running queries, persistent connections won't mitigate that. Still, lowering the number of connections opened is the first step, before analysing any locks held and orphaned by SQL. It is therefore prudent to use register_shutdown_function to clean up after each SQL execution. A configuration sketch follows.
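A minimal configuration sketch, assuming a Yii 1.x-style CDbConnection config (the exact key names depend on the framework version); persistence itself is a PDO feature exposed through PDO::ATTR_PERSISTENT:

// hypothetical protected/config/main.php fragment; key names are illustrative
return array(
    'components' => array(
        'db' => array(
            'connectionString' => 'mysql:host=localhost;dbname=app',
            'username' => 'app',
            'password' => 'secret',
            // hand the persistence flag down to the underlying PDO connection
            'attributes' => array(PDO::ATTR_PERSISTENT => true),
        ),
    ),
);

Outside any framework, raw PDO achieves the same with new PDO($dsn, $user, $pass, array(PDO::ATTR_PERSISTENT => true)), and the cleanup mentioned above can be registered with register_shutdown_function.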

# a microservice example
https://drive.google.com/file/d/0B-H0-bJJQAq5eFhuYjIyQXNFR3c/view?usp=sharing
https://drive.google.com/file/d/0B-H0-bJJQAq5aFlQLS0wZ3pPc28/view?usp=sharing
#codingexercise
Find if there exists a pair in an array that adds up to a sum divisible by k
We gave a few solutions yesterday.

Another approach for small range of numbers in the list is as follows:
bool hasPairDivisibleByK(List<int> nums, int k)
{
Debug.Assert(nums.All(x => x > 0));
Debug.Assert(k > 0);
var min = nums.Min();
var max = nums.Max();
int low = (min + min) / k;   // smallest multiple of k a pair sum could reach
int high = (max + max) / k;  // largest multiple of k a pair sum could reach
nums.Sort();
for (int i = low; i <= high + 1; i++)
      if (hasPairSum(nums, i * k))
           return true;
return false;
}
bool hasPairSum(List<int> nums, int sum)
{
int left = 0;
int right = nums.Count -1;
if (nums.Count < 2 ) return false;
// nums.Sort();
while (left < right)
{
         if(nums[left] + nums[right] == sum)
              return true;
         else if (nums[left] + nums[right] < sum)
              left++;
         else
              right--;
}
return false;
}

Find minimum distance between two numbers in an array
int getMinDistance(List<int> nums, int X, int Y)
{
int min = int.MaxValue;
for (int i = 0; i < nums.Count; i++)
   for (int j = i + 1; j < nums.Count; j++)
   {
     if (((nums[i] == X && nums[j] == Y) ||
          (nums[i] == Y && nums[j] == X)) &&
          Math.Abs(i - j) < min)
           min = Math.Abs(i - j);
   }
return min;
}

Tuesday, August 30, 2016

#algorithm discussion continued
Today we continue reading about flow networks. We were looking at the max-flow min-cut theorem and the properties associated with it. We saw how it applies to the Ford-Fulkerson method. Each iteration of the method augments the flow by considering a path in the residual network until there are no more. For each path considered, the residual capacity of that path is found and added to the flow.
How the path is chosen is not specified by the method; it is usually done with breadth-first search (BFS).
We saw that with BFS, the shortest-path distance to every vertex in the residual network increases monotonically with each flow augmentation. In the forward direction, when we assumed the distance to decrease, we contradicted ourselves: the next vertex v is one hop away from the current vertex u, and since v had not already been considered, its distance is the shortest distance to u plus one hop. This means the distance to v before and after the supposed decrease would be the same, which is a contradiction. Therefore this distance increases monotonically.
We considered the case where (u,v) is not part of the flow before the augmentation; now suppose it is part of the flow afterwards. This means the augmentation increased the flow from v to u. The algorithm always augments flow along shortest paths, so the shortest path from s to u has (v,u) as its last edge. In that case, the shortest distance to v is one hop less than the shortest distance to u, which in turn is at most the shortest distance to u after augmentation, because we maintained from the start that the shortest distance to u does not decrease. We also know that after augmentation, u and v are only one hop apart now that they are connected. So the shortest-path distance from the source to v before the augmentation works out to be two hops less than the shortest-path distance from the source to v after the augmentation, which contradicts our assumption that augmentation decreased the shortest distance. Therefore no such vertex v exists, and the shortest distance increases monotonically.
#codingexercise
Determine the floor of an input value in a binary search tree
The floor is the largest value in the tree that is not greater than the given one

int floor(Node root, int val)
{
if (root == null) return -1; // sentinel: no floor exists (assumes non-negative data)
if (root.data == val) return val;
if (root.data > val) return floor(root.left, val);
int floorRight = floor(root.right, val);
if (floorRight == -1)
   return root.data;
else
   return floorRight;
}

Determine if there are two elements in a list that add up to a number divisible by k
bool hasPairDivisibleByK(List<int> nums, int k)
{
 for (int i = 0; i < nums.Count; i++)
    for (int j = i + 1; j < nums.Count; j++)
      if ((nums[i] + nums[j]) % k == 0)
         return true;
 return false;
}


bool hasPairDivisibleByK(List<int> nums, int k)
{
var h = new Dictionary<int, int>();
for (int i = 0; i < nums.Count; i++)
{
     int r = nums[i] % k;
     if (!h.ContainsKey(r)) h[r] = 0;
     h[r] += 1;
}
for (int i = 0; i < nums.Count; i++)
{
     int rmndr = nums[i] % k;
     if (rmndr == 0)
     {
          if (h[rmndr] > 1) return true; // need two elements with remainder 0
     }
     else if (h.ContainsKey(k - rmndr) && h[k - rmndr] > 0)
          return true;
}
return false;
}

If we want to save on space while keeping the time complexity below O(n^2), we can sort the numbers and, for each multiple of k that lies between the minimum and maximum pair sums, look for a pair that adds up to it.


Monday, August 29, 2016

#algorithm discussion continued
Today we continue reading about flow networks. We were looking at the max-flow min-cut theorem and the properties associated with it. We saw how it applies to the Ford-Fulkerson method. Each iteration of the method augments the flow by considering a path in the residual network until there are no more. For each path considered, the residual capacity of that path is found and added to the flow.
How the path is chosen is not specified by the method; it is usually done with breadth-first search (BFS).
When BFS is used to choose the augmenting path as a shortest path from s to t in the residual network, where each edge has unit distance, the algorithm is called the Edmonds-Karp algorithm.
It can be shown that with this algorithm for all vertices, the shortest path distance in the residual network increases monotonically with each flow augmentation.
This is explained by contradiction. Assume that the distance from the source to some vertex v that has not already been included decreases, and derive a contradiction. Let delta be the shortest-path distance just before the decrease and delta' the shortest-path distance immediately after. We know the change is one hop at a time, because BFS traversal proceeds by neighbors, and the distance of the preceding vertex u does not decrease.
Moreover, we know that the edge (u,v) was not part of the shortest path earlier, because if it were, the shortest distance from the source to v would have been at most the shortest distance from s to u plus one hop. That in turn would be smaller than the distance from s to u after the assumed decrease plus the one hop to v, which equals the shortest distance from the source to v after the decrease. This means the distance from the source to v before and after the decrease is the same, which is a contradiction.
The same reasoning by contradiction works in the reverse direction. Together, the arguments in both directions show that no such vertex v exists. Hence the shortest-path distance in the residual network increases monotonically.

Question: Why does streaming work with aggregates in processing graphs and tables?
Answer: Because an aggregate can be projected as and when the results arrive; the same aggregations were previously the ones being done in batch.
User-defined functions and user-defined operators can be used similarly.
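As an illustration (the class and names here are ours, not from any particular streaming product), a minimal sketch of an aggregate maintained incrementally as events arrive:

using System;

class RunningAverage
{
    private long count;
    private double sum;

    // fold one incoming event into the aggregate in O(1)
    public void Add(double value) { count++; sum += value; }

    // the aggregate can be projected to consumers after every event
    public double Average { get { return count == 0 ? 0 : sum / count; } }
}

class Demo
{
    static void Main()
    {
        var agg = new RunningAverage();
        foreach (var reading in new double[] { 3, 5, 7 })
        {
            agg.Add(reading);
            Console.WriteLine("running average = " + agg.Average);
        }
    }
}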

The Microsoft StreamInsight queries follow a five-step procedure:
1)     define events in terms of the payload (the data values of the event) and the shape (the lifetime of the event along the time axis)
2)     define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval
3)     based on the event definitions and the input stream, determine the output stream and express it as a query. In a way this describes a flow chart for the query
4)     bind the query to a consumer. This could be to a console, for example:
a.     var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5)     run the query and evaluate it based on time (a plain-LINQ analogue of the windowed count is sketched below)
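This is not the actual StreamInsight API; the following is a rough plain-LINQ analogue of the windowed count in the example above, assuming in-memory timestamped events with illustrative names:

using System;
using System.Collections.Generic;
using System.Linq;

class Event
{
    public DateTime Timestamp;
    public double Payload;
}

class WindowDemo
{
    static void Main()
    {
        var t0 = new DateTime(2016, 8, 29, 10, 0, 0);
        var input = new List<Event>
        {
            new Event { Timestamp = t0.AddMinutes(1), Payload = 1.0 },
            new Event { Timestamp = t0.AddMinutes(2), Payload = 2.0 },
            new Event { Timestamp = t0.AddMinutes(4), Payload = 3.0 },
        };
        var window = TimeSpan.FromMinutes(3);
        // assign each event to the tumbling window containing its timestamp,
        // then count per window -- the moral equivalent of TumblingWindow(...).Count()
        var counts = input
            .GroupBy(e => new DateTime((e.Timestamp.Ticks / window.Ticks) * window.Ticks))
            .Select(g => new { WindowStart = g.Key, Count = g.Count() });
        foreach (var c in counts)
            Console.WriteLine("{0}: {1}", c.WindowStart, c.Count);
    }
}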

#codingexercise
Determine the ceiling of an input value in a binary search tree
Ceiling is the smallest value in the tree that is not less than the given one

int ceil(Node root, int val)
{
if (root == null) return -1; // sentinel: no ceiling exists (assumes non-negative data)
if (root.data == val) return val;
if (root.data < val) return ceil(root.right, val);
int ceilLeft = ceil(root.left, val);
if (ceilLeft == -1)
   return root.data;
else
   return ceilLeft;
}

Similarly, the floor of a value can be found in a BST.

Sunday, August 28, 2016

We talked about graph extensions for relational data here.  We now explore whether such an extension is possible when the entire relational database is in memory.
The challenge with a graph abstraction of relational data is that it increases the size of the data with persisted predicates in the RDF description of the framework. Unless the entire RDF model is rendered, the graph operations cannot be executed.
There are two challenges with this:
1)      The entire RDF model is expensive to create both in terms of time and space
2)      The graphs have to be warm and available in memory for the duration the graph extension is available.
Graph databases that worked with NoSQL option could make use of the following two attributes:
1)      Batch processing of the graph with parallel and offline analytics
2)      Partitioning of graphs

Instead, if we use a streaming model of the graph and provide graph queries that can operate on a stream of graph edges, then we can provide approximate answers even when operating entirely in memory; one possible sketch follows.
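As one possible sketch (the estimator and all names are ours), assuming edges arrive as a stream too large to hold in memory, a fixed-size reservoir sample of edges can answer a degree query approximately:

using System;

class EdgeStreamSampler
{
    private readonly int[] us, vs;   // fixed-size reservoir of sampled edges
    private readonly Random rng = new Random();
    private long seen;

    public EdgeStreamSampler(int capacity) { us = new int[capacity]; vs = new int[capacity]; }

    // classic reservoir sampling: each edge survives with probability capacity/seen
    public void Observe(int u, int v)
    {
        seen++;
        if (seen <= us.Length) { us[seen - 1] = u; vs[seen - 1] = v; }
        else
        {
            long j = (long)(rng.NextDouble() * seen);
            if (j < us.Length) { us[j] = u; vs[j] = v; }
        }
    }

    // estimate degree(x) by scaling its frequency in the sample
    public double EstimateDegree(int x)
    {
        long inSample = Math.Min(seen, (long)us.Length);
        int hits = 0;
        for (int i = 0; i < inSample; i++)
            if (us[i] == x || vs[i] == x) hits++;
        return inSample == 0 ? 0 : hits * (double)seen / inSample;
    }
}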

Some graph databases do not have a query language that supports streaming operations; they merely stream results. For example, the Cypher query language does not seem to support streams.

Microsoft StreamInsight supports stream processing.

#algorithm discussion continued
Today we also continue reading about flow networks. We were looking at the max-flow min-cut theorem and the properties associated with it. We saw how it applies to the Ford-Fulkerson method. Each iteration of the method augments the flow by considering a path in the residual network until there are no more. For each path considered, the residual capacity of that path is found and added to the flow.
How the path is chosen is not specified by the method; it is usually done with breadth-first search (BFS).
When BFS is used to choose the augmenting path as a shortest path from s to t in the residual network, where each edge has unit distance, the algorithm is called the Edmonds-Karp algorithm.
It can be shown that with this algorithm for all vertices, the shortest path distance in the residual network increases monotonically with each flow augmentation.
The use of BFS also bounds the number of iterations of the Edmonds-Karp algorithm. This bound is O(VE) for a graph with V vertices and E edges, and it is explained in terms of critical edges. An edge (u,v) is critical on an augmenting path p if the residual capacity of (u,v) equals the residual capacity of the path; at least one edge on any augmenting path must be critical. When flow is augmented along a critical edge, that edge disappears from the residual network because its capacity is used up. It can reappear only after flow along the reverse edge (v,u) is decreased, which requires (v,u) to lie on some later augmenting path. Because the algorithm always augments along shortest paths, each time (u,v) becomes critical again the distance from the source to u has grown by at least two hops. The distance can grow from zero to at most V, so an edge can become critical at most V/2 times. Since there are at most E edges, the total number of augmentations is O(VE). A sketch of the algorithm follows.
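A compact sketch of the algorithm under discussion, assuming an adjacency-matrix capacity representation: BFS finds the shortest augmenting path, and the smallest residual capacity along it is the critical edge that gets saturated.

using System;
using System.Collections.Generic;

class MaxFlow
{
    public static int EdmondsKarp(int[,] capacity, int s, int t)
    {
        int n = capacity.GetLength(0);
        var residual = (int[,])capacity.Clone();
        int maxFlow = 0;
        while (true)
        {
            // BFS for the shortest s-to-t path in the residual network
            var parent = new int[n];
            for (int i = 0; i < n; i++) parent[i] = -1;
            parent[s] = s;
            var q = new Queue<int>();
            q.Enqueue(s);
            while (q.Count > 0 && parent[t] == -1)
            {
                int u = q.Dequeue();
                for (int v = 0; v < n; v++)
                    if (parent[v] == -1 && residual[u, v] > 0)
                    {
                        parent[v] = u;
                        q.Enqueue(v);
                    }
            }
            if (parent[t] == -1) break; // no augmenting path remains

            // residual capacity of the path = capacity of its critical edge
            int bottleneck = int.MaxValue;
            for (int v = t; v != s; v = parent[v])
                bottleneck = Math.Min(bottleneck, residual[parent[v], v]);

            // augment: shrink forward residual capacities, grow reverse edges
            for (int v = t; v != s; v = parent[v])
            {
                residual[parent[v], v] -= bottleneck;
                residual[v, parent[v]] += bottleneck;
            }
            maxFlow += bottleneck;
        }
        return maxFlow;
    }
}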

#codingexercise
Find minimum difference between any two elements of an array

int GetMinDiff(List<int> nums)
{
int min = int.MaxValue;
for (int i = 0; i < nums.Count; i++)
  for (int j = i + 1; j < nums.Count; j++)
     if (Math.Abs(nums[i] - nums[j]) < min)
         min = Math.Abs(nums[i] - nums[j]);
return min;
}

Find the next greater element to the right of all elements in an array
void prntNextGreater(List<int> nums)
{
for (int i = 0; i < nums.Count; i++)
{
int next = int.MaxValue; // stays int.MaxValue when no greater element exists
  for (int j = i + 1; j < nums.Count; j++)
  {
   if (nums[i] < nums[j])
   {
    next = nums[j];
    break;
   }
  }
Console.Write("{0}-{1} ", nums[i], next);
}
}
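A well-known O(n) alternative uses a stack of indices so that each element is pushed and popped at most once; a sketch, keeping the same convention of printing int.MaxValue when no greater element exists:

using System;
using System.Collections.Generic;

class NextGreater
{
    static void PrintNextGreater(List<int> nums)
    {
        var next = new int[nums.Count];
        for (int i = 0; i < nums.Count; i++) next[i] = int.MaxValue;
        var pending = new Stack<int>(); // indices still waiting for a greater element
        for (int i = 0; i < nums.Count; i++)
        {
            // nums[i] is the next greater element for every smaller value on the stack
            while (pending.Count > 0 && nums[pending.Peek()] < nums[i])
                next[pending.Pop()] = nums[i];
            pending.Push(i);
        }
        for (int i = 0; i < nums.Count; i++)
            Console.Write("{0}-{1} ", nums[i], next[i]);
    }
}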
#puzzle
There are 25 horses among which you need to find the fastest 3. You can race at most 5 horses at a time to learn their relative order, and you can never measure a horse's actual speed. How many races are required to find the top 3 horses?
The answer is 7. The solver uses a process of elimination. The first five races give the best in each group, and a sixth race among the five group winners gives the overall number 1. The seventh race is run among the horses that can still place: the second and third finishers from the overall winner's group, the second and third finishers of race six, and the runner-up from the group of the horse that came second in race six. The top two of that race are the overall numbers 2 and 3. Note that the selection of horses spans groups after the first round, excluding the horses that can no longer be in the top three.

Saturday, August 27, 2016

#codingexercise
Get the maximum height of a common subtree between two binary trees.
There are a few ways to solve this problem, such as evaluating each node as the root of a candidate common subtree in both binary trees.
However, we know that when a subtree is common to both binary trees, all the elements in the subtree serialize into the same sequence for both trees. At that point, the problem decomposes to finding the longest common substring between the string representations of the tree serializations.
int GetMaxHeightCommonSubTree(Node tree1, Node tree2)
{
var str1 = tree1.serialize();
var str2 = tree2.serialize();
var pattern = GetLCS(str1, str2);
var subtree = pattern.deserialize();
return GetMaxHeight(subtree);
}
string GetLCS(string str1, string str2)
{
   int end = 0;
   int length = 0;
   var lcs = new int[str1.Length + 1, str2.Length + 1];
    for (int i = 0; i <= str1.Length; i++)
       for (int j = 0; j <= str2.Length; j++)
        {
            if (i == 0 || j == 0) lcs[i, j] = 0;
            else if (str1[i - 1] == str2[j - 1]) {
                    lcs[i, j] = lcs[i - 1, j - 1] + 1;
                    if (lcs[i, j] > length) {
                        length = lcs[i, j];
                        end = i - 1; // index in str1 where the match ends
                    }
            }
            else lcs[i, j] = 0;
         }
    if (length == 0) return string.Empty;
    return str1.Substring(end - length + 1, length);
}

2) Find if there is a path from root to leaf in a binary tree with a given sum

bool hasPathSum(Node root, int sum)
{
if (root == null) return sum == 0;
if (sum - root.data == 0 && root.left == null && root.right == null) return true;
bool ret = false;
if (root.left != null)
    ret = ret || hasPathSum(root.left, sum - root.data);
if (root.right != null)
   ret = ret || hasPathSum(root.right, sum - root.data);
return ret;
}

3) The serialization of a binary tree can be a level-order traversal; note, however, that for the substring matching above to find whole subtrees, the serialization should map every subtree to a contiguous span, which a preorder traversal with explicit null markers guarantees (a sketch follows).
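One possible serialize() for the approach above (a sketch, assuming a Node with data, left and right): preorder with explicit null markers and delimiters, so that every subtree maps to a contiguous span of the output.

using System.Text;

class Node
{
    public int data;
    public Node left, right;
}

static class TreeSerializer
{
    public static string Serialize(Node root)
    {
        var sb = new StringBuilder();
        SerializeInto(root, sb);
        return sb.ToString();
    }

    private static void SerializeInto(Node node, StringBuilder sb)
    {
        if (node == null) { sb.Append("#,"); return; } // explicit null marker
        sb.Append(node.data).Append(',');              // root first: preorder
        SerializeInto(node.left, sb);
        SerializeInto(node.right, sb);
    }
}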

Friday, August 26, 2016

Graph plugin for databases
Many companies view graphs as an abstraction rather than an implementation of the underlying database. There are two reasons for this :
1)      Key-value stores suffice to capture the same information in a graph and can provide flexibility and speed for operations that can be translated as queries on these stores. Then these can be specialized for the top graph features that an application needs
2)      Different organizations within the company require different stacks built on the same logical data for reasons such as business impact, modular design and ease of maintenance.
Even large companies such as Twitter and Facebook don't necessarily use graph databases for their social network data. They use graph database concepts, but their implementations are usually highly specialized and optimized Redis, MySQL or other such databases.
These companies use a graph query layer that enables them to do things such as graph traversals which are often needed in graph operations but they execute against traditional Relational database management systems.
This abstraction is facilitated by a Resource Description Framework data model which is intrinsically a labeled, directed multi-graph. This kind of modeling is similar to classical data modeling and it makes statements about resources in the form of subject-predicate-object expressions. The subject denotes the resource, and the predicate denotes traits or relationships between the subject and the objects. These expressions are known as triples in RDF terminology and can even be persisted in relational database.
Plugins for conventional relational databases exploit RDF and other data models such as Network data models or a collection of spatial features in their offerings for supporting graph layer as an overlay. A network data model is a property graph model used to model and analyze physical and logical networks. A geospatial feature consists of a geometric representation of the data in some co-ordinate space.
Many companies embrace key-value stores over relational data, saying that relational data does not scale to arbitrary size without performance degradation. This is not necessarily the fault of the relational database but rather of how it is used. In fact, large relational databases can support recursive queries, allowing a native and direct translation of graph operations; a toy illustration follows.
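As a toy illustration (the types and names are ours), RDF-style triples kept in a flat, relational-looking collection, with a graph traversal (the transitive closure over one predicate) expressed as iterated lookups, much as a recursive SQL query would do natively in the database:

using System;
using System.Collections.Generic;
using System.Linq;

class Triple
{
    public string Subject, Predicate, Obj;
}

class TripleStore
{
    // everything reachable from start by following the given predicate
    public static HashSet<string> Reachable(List<Triple> triples, string start, string predicate)
    {
        var reached = new HashSet<string> { start };
        var frontier = new Queue<string>();
        frontier.Enqueue(start);
        while (frontier.Count > 0)
        {
            var s = frontier.Dequeue();
            // the analogue of one self-join step in a recursive query
            foreach (var edge in triples.Where(t => t.Subject == s && t.Predicate == predicate))
                if (reached.Add(edge.Obj))
                    frontier.Enqueue(edge.Obj);
        }
        return reached;
    }
}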

Oracle did a major study on its spatial and graph plugin for both relational and NoSQL data. In its approach, it consolidated several data sources or data types such as medical devices, lab information systems, subscription services and legacy records from different data servers such as Lab/Clinical, Research, Content Management, Billings/Claims and Reporting/BI into a central domain vocabulary. This central repository was none other than an integrated graph metadata store implemented as an RDF triple store. They showed that it scaled to billions of triples with RAC and Exadata.
Courtesy: Oracle studies
#codingexercise
Find a subarray with sum k in a list of positive integers
(The recursive version below enumerates subsequences that sum to k; the second method afterwards handles the contiguous case.)

void Sum(List<int> num, int k, ref List<int> candidate, int start)
{
if (candidate.Sum() == k) { Console.Write(candidate.ToString()); return; }
if (candidate.Sum() > k) return; // positive integers: no point extending further
if (start == num.Count) return;
// with num[start]
candidate.Add(num[start]);
Sum(num, k, ref candidate, start + 1);
candidate.RemoveAt(candidate.Count - 1);
// without num[start]
Sum(num, k, ref candidate, start + 1);
}

Another method, to find a contiguous sum, is as follows:
int GetSumK(List<int> num, int k)
{
if (num.Count == 0) return 0;
int curr = num[0];
int start = 0;
for (int i = 1; i <= num.Count; i++)
{
// shrink the window from the left while it overshoots k
while (curr > k && start < i - 1)
{
 curr = curr - num[start];
 start++;
}
if (curr == k)
{
Console.Write("sum k between {0} and {1}", start, i - 1);
return 1;
}
if (i < num.Count)
   curr = curr + num[i];
}
return 0;
}

Thursday, August 25, 2016

Today we continue reading about flow networks. We were looking at the max-flow min-cut theorem and the properties associated with it. We saw how it applies to the Ford-Fulkerson method. Each iteration of the method augments the flow by considering a path in the residual network until there are no more. For each path considered, the residual capacity of that path is found and added to the flow.
How the path is chosen is not specified by the method; it is usually done with breadth-first search (BFS).
When BFS is used to choose the augmenting path as a shortest path from s to t in the residual network, where each edge has unit distance, the algorithm is called the Edmonds-Karp algorithm.
It can be shown that with this algorithm for all vertices, the shortest path distance in the residual network increases monotonically with each flow augmentation.
#puzzle
There are 3 ants sitting on three corners of a triangle. All ants randomly pick a direction and start moving along edge of the triangle. What is the probability that any two ants collide?
There are two choices for each ant, which makes the total number of possibilities 2^3 = 8.
Out of these, only two (all moving clockwise and all moving anticlockwise) do not result in a collision. So the answer is 6/8 = 3/4.
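A brute-force check of this answer, enumerating all 2^3 direction choices:

using System;

class Ants
{
    static void Main()
    {
        int collisions = 0, total = 8;
        for (int mask = 0; mask < total; mask++)
        {
            // bit i is the direction of ant i (0 = clockwise, 1 = anticlockwise);
            // on a triangle a collision is avoided only when all ants move the same way
            bool allSame = (mask == 0 || mask == 7);
            if (!allSame) collisions++;
        }
        Console.WriteLine("{0}/{1}", collisions, total); // prints 6/8
    }
}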
#coding interview
How do we sort an array consisting of the three elements 0, 1, 2 without counting?
We use the Dutch National Flag algorithm.
This algorithm divides the array into four regions:
1 to Low-1: all zeros
Low to Mid-1: all ones
Mid to High: unknown
High+1 to N: all twos
Initially Low is set to 1, Mid is set to 1 and High to N, and we shrink the unknown region by swapping the mid element into the corresponding region:
while mid <= high
    if A[mid] == 0: Swap(A, low, mid); low++; mid++
    if A[mid] == 1: mid++
    if A[mid] == 2: Swap(A, mid, high); high--

void SortThrees(List<int> nums)
{
Debug.Assert(nums.All(x => x == 0 || x == 1 || x == 2));
int N = nums.Count;
int low = 1;
int mid = 1;
int high = N;
while (mid <= high)
{
switch (nums[mid - 1])
{
case 0: { Swap(ref nums, low - 1, mid - 1); low++; mid++; break; }
case 1: { mid++; break; }
case 2: { Swap(ref nums, mid - 1, high - 1); high--; break; }
default:
       throw new Exception("Invalid element");
}
}
}

Wednesday, August 24, 2016

There are other possible approaches to merging two binary search trees:
1) Add the elements of one tree to the other, one at a time, by repeatedly deleting the root until there are no more nodes in the source tree
2) Flatten one of the trees into an inorder list and insert its items into the other tree
3) Jointly traverse both trees to find the minimum in either tree, delete it, and add it to a new BST
  
Deleting a node from a tree 
Node TreeDelete(ref Node root, Node z)
{
if (z == null) return null;
if (root == null) return null;
Node y = null;
// y is the node to splice out: z itself, or z's successor when z has two children
if (z.left == null || z.right == null)
       y = z;
else
      y = TreeSuccessor(z);
Node x = null;
if (y.left != null)
                x = y.left;
else
                x = y.right;
if (x != null)
                x.parent = y.parent;
if (y.parent == null)
                root = x;
else if (y == y.parent.left)
             y.parent.left = x;
else
             y.parent.right = x;
if (y != z)
    z.data = y.data; // copy the successor's data into z before discarding y
return y;
}
  
void TreeInsert(ref Node tree1, Node z)
{
Node y = null;
Node x = tree1;
// walk down to the leaf position where z belongs
while (x != null)
{
      y = x;
      if (z.data < x.data)
          x = x.left;
      else
          x = x.right;
}
z.parent = y;
if (y == null)
        tree1 = z;
else if (z.data < y.data)
        y.left = z;
else
        y.right = z;
}
Node Merge(Node tree1, Node tree2)
{
while (tree2 != null)
{
// capture the root's value before deleting: TreeDelete may splice out the
// successor and copy its data into the root, so the returned node's data
// is not always the value that was removed
int val = tree2.data;
TreeDelete(ref tree2, tree2);
TreeInsert(ref tree1, new Node(val)); // assumes a Node(int) constructor
}
return tree1;
}
Node TreeSuccessor(Node x)
{
if (x == null) return null;
if (x.right != null) return TreeMinimum(x.right);
Node y = x.parent;
// climb until we come up from a left child
while (y != null && y.right == x)
{
    x = y;
    y = y.parent;
}
return y;
}
Node TreePredecessor(Node x)
{
if (x == null) return null;
if (x.left != null) return TreeMaximum(x.left);
Node y = x.parent;
// climb until we come up from a right child
while (y != null && y.left == x)
{
    x = y;
    y = y.parent;
}
return y;
}

Node TreeMinimum(Node root)
{
if (root == null) return null;
Node current = root;
while (current.left != null)
       current = current.left;
return current;
}
Node TreeMaximum(Node root)
{
if (root == null) return null;
Node current = root;
while (current.right != null)
       current = current.right;
return current;
}