Cluster computing

Wednesday, March 9, 2016

Today we continue reading the paper "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs that don't fit on a single server. The paper introduces Horton a graph query execution engine. We were reviewing its architecture.
It has four components - the graph client library, the graph coordinator, graph partitions and the graph manager. We discussed the first three components. We now look at the Graph Manager. This provides an administrative interface to manage the graph such as loading a partitioned graph or adding and removing servers. It also assigns partitions to the servers by taking the output of partitioning algorithm and placing nodes in the right partitions. Partitioning algorithms are external because they are not online. Any algorithm can be used such as a simple hashing that's based on node and edge attributes. If the partitions don't suffice in keeping graph access local, then the partitioning algorithm can be changed.
Horton is written in C# using the .Net framework and the asynchronous communication is implemented using sockets and task parallel library.Horton is built on top of Orleans which is a distributed runtime for cloud applications.
Some sample queries on the graph for Codebook using Horton include the following:
Who wrote this function ?
What libraries depend on this library ?
Whom should I contact to fix this bug ?
Adhoc queries can also be executed showing the flexibility of the query language and the effectiveness of the execution engine.
#codejam problem
We discussed compareTo between ominoes in the previous post that takes different orientations and then compares. Tanslations and different positions in the matrix can also make the ominoes look different. Implement compare for ominoes with same orientation

int compare(int[,] matrix1, int[,] matrix2, int rows, int columns)
{
int k = find_top_left_corner(matrix1);
int tr1 = k /columns;
int tc1 = k % columns;
int h = find_bottom_right_corner(matrix1);
int br1 = h/columns;
int bc1 = h%columns;

k = find_top_left_corner(matrix1);
int tr2 = k /columns;
int tc2 = k % columns;
h = find_bottom_right_corner(matrix1);
int br2 = h/columns;
int bc2 = h%columns;

for (int i = 0; i <= br1-tr1; i++)
for (int j = 0; j <= bc1-tc1; j++)
if (matrix1[br1+i,bc1+j] != matrix [br2+i, bc2+j])
return matrix1[br1+i,bc1+j] - matrix [br2+i, bc2+j];

return 0;
}

In comparing the sequences of 1 and 0 for the portion containing the arrangementsame in the two matrices, we pick the diagonal corners of the bounding rectangle. To determine this we pick the
Min pair (r, c) and max pair (r,c) from all those having value 1
Int r = k;
Int c = k;
For (int I = 0; I < k; i++)
For (int j =0; j < k; j++)
If ( matrix1[i, j] == 1 ){
If (I < r) r = i;
If (j < c) c = j;
}

Tuesday, March 8, 2016

Today we continue reading the paper "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs that don't fit on a single server. The paper introduces Horton a graph query execution engine. We were reviewing its architecture.
It has four components - the graph client library, the graph coordinator, graph partitions and the graph manager. The graph client library sends queries to the graph coordinator which prepares an execution plan for the query. The graph partition manages a set of graph nodes and edges. Horton is able to scale out mainly because of graph partitions. The graph manager provides an administrative interface to manage the graph with chores like loading and adding and removing servers.
We review these in detail.
The graph coordinator receives the query from the client library, validates it and parses it, optimizes a query plan and translates it into a finite state machine which is sent to the partitions for parallel execution. The validation involves checks against the nodes and edge predicates. The optimization involves taking a query in a parsed regular expression form, enumerating the different execution plans and choosing the one with the lowest cost. It uses a dynamic programming framework to enumerate the execution plans and to choose the one with the lowest cost. One example of optimization is to order the predicates. A good choice of predicate order can save costs that are orders of magnitude different from each other. If n is the number of predicates in the query, this method finds a suitable choice in O(n^3). After the query is optimized, the query plan is translated into a finite state machine The state machine not only expresses the execution efficiently but also packages it in a way that can be easily sent to local graph partitions for execution by their local engines. The communication to send state machine and get results is done in an asynchronous manner which allows for the results to be directly streamed from a graph partition machine to the client without involving the graph co-ordinator.
#codingquestion
In the previous examples we used NChooseK as a recursive function to find K from N. If we were to partition the data, we could rewrite this

void NChooseKPartitioned(int N, int K, ref List<int> array, int start, int count)
{
m = N/2;
n = K/2;
return NChooseK(m, n, ref List<int> array, start, count) + NChooseK(N-m, K-n, ref List<int> array, start, count);
}

To compare if two ominous are the same by change of orientation or position
Int compareTo (int [,] matrix1, int [,] matrix2)
{
For (int i=0; I < 360; I += 90)
{
var rotated = rotate (matrix1, i);
If (compare (matrix2, rotated) == 0)
Return 0;
}
Return count-of-1 (matrix1) - count-of-1 (matrix2);
}

Monday, March 7, 2016

Today we continue reading the paper "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs. These graphs don't fit on a single server. Map-Reduce and Pregel are not good enough for processing interactive adhoc queries against large graphs. Horton is different. Let us take a look at its architecture. It has four components - the graph client library, the graph coordinator, graph partitions and the graph manager. The graph client library sends queries to the graph coordinator. It uses an asynchronous messaging system to receive results. The graph coordinator prepares an execution plan for the query. converts it to an automata and decides which partition to run. Each partition runs the executor and returns the results back to the client. The partitions are managed by the graph manager. The relevant partition is transfered to main memory and back to persistent storage.
The graph client library - sends queries to the graph coordinator in the form of regular expressions, and receives the query results directly from the graph partitions.
The graph coordinator provides the query interface for Horton. It receives the query from the client library, validates and parses the query. The coordinator optimizes the query plan and translates the query into a finite state machine. This is then sent for parallel processing over the partitions.
Every graph partition manages a set of graph nodes and edges.Horton is able to scale out only because of partitions. The automata received from the coordinator is executed on the local partition using a local query execution engine.
#codejam continued
Yesterday we were filling a linear array of size n with k elements of value 1. We called it NChooseK. Today we would like those arrangements of NChooseK where K are contiguous.

void NChooseKContiguous(int N, int K, ref List<List<int>> arrays)
{
debug.assert(N>K && N > 0 && K > 0 && arrays!= NULL);
for (int i = 0; i < N-K+1; i++)
{
var array = new List<int>(N);
array.AddRange(Enumerable.Repeat(0,N));
for (int j = i; j < K; j++)
array[i] = 1;
arrays.add(array);
}
}

Sunday, March 6, 2016

Today we continue reading the paper "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs. These graphs don't fit on a single server. Map-Reduce and Pregel are not good enough for processing interactive adhoc queries against large graphs.Horton is a distributed interactive query execution engine for large graphs. We were reviewing that large graphs pose a different challenge from a number of small graphs. The latter has examples in bioinfomatics and chemoinformatics. Moreover, the conventional solution for querying large graphs based on offline batch processing is different from the corresponding online interactive query requirements. To manage large graphs online while supporting querying with small latency, a new system was required to be built. Horton provides a graph query execution engine and a graph query optimizer that is able to efficiently process online queries on large graphs. Further it uses a design decision to partition the graph among several servers in order to facilitate online processing. Horton stores the graph partitions in main memory to offer fast query response time.
We now review Horton data model and query language. A graph database stores not only the nodes or entities but also the relationships or connections and these connections can have rich attributes.
Horton too supports rich graph data. Nodes represent entities and have types as well as a set of key value pairs representing the data associated with this entity. Edges have type and data as well. There is no restriction on the kind or amount of data associated with nodes and edges. Horton manages both directed and undirected graph.If the graph is directed, each node stores both inbound and outbound edges
Queries are in the form of regular language expressions. A query is a sequence of node predicates and edge predicates. Each predicates can contain conjunctions and disjunctions on node and edge attributes as well as closures. For example, a query to retrieve a list of all the photos in which John is tagged is expressed as Photos-tagged-John. In this case all paths in the graph that satisfy the query expression are returned. To get a specific part of the path such as a node, a SELECT statement is used.
Select photo from Photo-Tagged-John. Here the node photo is the only one but there could have been other nodes and edges in the query
The syntax is helpfully closer to SQL unlike Cypher. This was also a design decision In general, graph databases perform better than relational in highly interconnected data where a nearly online data warehouse is required.
#codejam continued
Fill a linear array of size n with k elements of value 1

void NChooseK(int N, int K, ref List<int> array, int start, int count)
{
if (count ==k) { // print array and return; }
if (start == N) return;
for (int i=0; i < 2; i++)
{
if (i == 0) NChooseK(N,K, ref array, start+1, count);
else{
array[i] = 1;
NChooseK(N,K, ref array, start+1, count+1);
array[i] = 0;
}
}
}

Saturday, March 5, 2016

Today we start reading the paper "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs. These graphs don't fit on a single server. Map-Reduce and Pregel are not good enough for processing interactive adhoc queries against large graphs.Horton is a distributed interactive query execution engine for large graphs. It includes a query execution engine and a query optimizer that allows interactive execution of queries on large distributed graphs in parallel. In this paper the graph for study came from a social networking application called Codebook which uses a graph to model data on software components, developers, development artifacts and their interactions in large software projects. This kind of information about software engineers, software components, and their interactions in large software projects such as the Windows or Linux Operating Systems results in large graphs. Online querying of large graphs is very different from conventional database querying. Conventional databases store highly structured data and there is never a join performed without a join table that holds the foreign keys of both participating tables. Moreover in relational databases, references to other rows are indicated by referring to their primary key attributes via foreign columns. This is enforceable with constraints but only when the reference is never optional. Joins are computed at query time that is both compute and memory intensive. Graph databases are an improvement where the relationships are the first class citizens of the graph data model. Nodes and relationships are assembled into connected structures. Each node in the graph database model directly and physically contains a list of relationship-records that represents its relationship to other nodes.Although a graph database stores connections as first class citizens readily available thereby avoiding relationship computations at query times, it explores the larger neighborhood around the initial starting point leaving billions outside the search perimeter. Neo4j implements such a NoSQL graph database. The Map-Reduce framework can help with such queries but they are meant for batch processing not online queries. Even though their throughput can be increased to be at par with a low latency online query, they are fundamentally not the same. The paper describing Pregel states that graph algorithms often exhibit poor locality of memory access, very little work per access, and a changing degree of parallelism over the course of execution. Distribution over many machines and their faults makes the locality issue even worse. Arguably such data can be stored as distributed database clusters but even then there is no easy way to state reachability queries in the relational world. Besides large graphs pose a different challenge than hundreds of smaller graphs. Horton partitions the graph in among several servers and stores the graph partitions in main memory to offer fast query response time. The partitioning is horizontal to scale with the machines in the data centers.

#codejam problems
We have seen what ominoes are in the exercise described earlier. K-ominoes are some arrangement of K unit squares so that the unit squares remain joined on one or more edges. Determine how many different ominoes there are for a size k.
int GetOminoes (int k)
{
Int count = 0;
Var matrix = new int [k,k];
var ominoes = new List<int[,]>();
GetCombinations ( k, ref matrix, 0, ref count, ref ominoes);
Return count;
}

Void GetCombinations ( int k, int [,] matrix, int level, int ref count, List<int[,]> ominoes){
if (k==0) {
if (ominoes.Any(x => x.compareTo(matrix) == 0) == False)
{
count += 1;
ominoes.Add(matrix);
}
return;
}
for (int base = k; base > 0; base--)
{
if ( level == 0) { // contiguous occupancy k choose base }
else{ k choose base }
if (any-islands) continue;
GetCombinations(k - base, matrix, level + 1, count);
// reset occupancy
}
}

One way would be to
// fill the matrix with k 1s
// Eliminate those that are disjoint
// Ignore duplicates from rotations and translations or comparing two distributions rowcounts and columncounts sequences with all of four orientations

Another way would be to sequentially reduce the base from k to 1.
At each chosen base m increase levels with recursion for the remaining by choosing base from k to 1. For each chosen base and level, find different positions or arrangements linearly

Friday, March 4, 2016

A graph is a convenient data structure to represent such things as social networks, word relations etc.
Operations on directed graphs are easy to perform if there are no cycles.
To determine if a directed graph is acyclic, we discussed a method but didn’t review the proof.
Let’s take a look at it now:
DFS ( V, E)
For each vertex v in V
V.color=white
V.d = nil
Time = 0
For each vertex v in V:
If v.color == white:
DFS-Visit (V, E)

DFS-VISIT (V,E, u)
time = time + 1
u.d = time
u.color = gray
foreach vertex v adjacent to u
If v.color == white
DFS-VISIT(V,E,v)
Else
If v.d <= u.d < u.f <= v.f throw back edge exception.
u.color = black
time = time + 1
u.f = time

This method is called topological sorting. Essentially this method says that
there exists an order in which the
vertices can be listed such that if u is before v in the
list then there is no edge from u to v.This is called
"topological order". It’s easy to prove this. Consider the vertices listed in the order v1, v2, v3 … vk, … vn
This means that each successive call is to the next neighbor.
Let us traversed vk. Let w be the one adjacent to vk. Then there are two cases
W.f != nil in this case DFS-VISIT (w) has already completed do w must have already been listed.
W.f == nil in this case the last thing that happens is vk is printed. Therefore w would be printed before then.
Therefore when an element is printed, each element adjacent to the original element has already been printed. This is topological sorting.
We now discuss connected components . This is a way to test the bipartiteness of a graph. A bipartite graph is one that does not have an odd cycle. Here we label each vertex with a component number such that two vertices are connected if and only if they have the same component number.

The DFS for this looks like this :
for all v in V do visited(v) = 0
c = 0
for all v in V do {
if (visited(v)==0) {
DFS_VISIT(v)
c = c+1
}
}

DFS_VISIT (v) {
visited(v) = 1
comp(v) = c
for all w in adj(v) do if (visited(w)==0) DFS_VISIT (w)
}

Now we determine the bipartiteness this way
for all v in V do visited(v) = 0
for all v in V do if (visited(v)==0) DFS_VISIT (v,0)

DFS_VISIT (v,p) {
visited(v) = 1
part(v) = p
for all w in adj(v) do {
if (visited(w)==0) {
DFS_VISIT (w,1-p)
} else {
if (part(v) == part(w)) print "graph not bipartite"
}
}
}

Courtesy: Sleator notes CMU

Thursday, March 3, 2016

Gradle versus Jenkins

Gradle is a build automation system that enables a continuous pipeline for software as and when it is written. It has become a favorite for open source projects and comes with the following advantages
1) Applicable to nontrivial projects
2) Enables complex build to be portable
3) Seen as a just right balance between ant and maven
4) Allows for convention over configuration
It helps take the software from a developers machine to the production system after eliminating the delays and errors that plagued this process.

Jenkins attempts to build and push code to staging systems.attempts it comes with a few features such as
1) integration with source control systems
2) polling mechanism for changes
3) cron for scheduling
4) integration with cloudfoundry for pushing code
The advantages of Gradle are :
1) It can handle build dependencies as a directed acyclic graph
2) It can exclude certain tasks
3) It has powerful language constructs to determine build order
4) When tasks are excluded, their dependencies are also excluded.
5) Tasks can be used to finalize another task
6) Builds are easier to maintain and more robust because tasks and their output are known
7) Gradle enables multiple tasks to be executed
8) It can continue execution after failure
9) It can permit dry run
10) It creates meaningful output while allowing rerouting to external tools

Gradle essentially guarantees that the dependency graph is acyclic and it performs a lazy evaluation.
To determine if a directed graph is acyclic:
DFS ( V, E)
For each vertex v in V
V.color=white
V.d = nil
Time = 0
For each vertex v in V:
If v.color == white:
DFS-Visit (V, E)

DFS-VISIT (V,E, u)
time = time + 1
u.d = time
u.color = gray
foreach vertex v adjacent to u
If v.color == white
DFS-VISIT(V,E,v)
If v.d <= u.d < u.f <= v.f throw back edge exception.
u.color = black
time = time + 1
u.f = time

In the ominoes problem described in the previous post, we observe the following:
if (X > 6) then Richard wins
if (X < 3) then Gabrielle wins
if (min(R,C) < (X+1)/2) then Richard wins
if (X==3) then Gabrielle wins
if (X == 4) then Gabrielle wins if min(R,C) > 2 else Richard
if (X == 5) then there must be at least three pieces to place for Gabrielle to win.

Another way to look at this problem is to a pick out the conditions based on Board size. This simplifies the problem.
If the board cannot be occupied all by x square units then Richard wins.
If the board is linear and it has size more than two square units then Richard wins.
If the board is square and there is only one piece possible then Richard wins. Everything else Gabrielle wins.