Monday, March 7, 2016

Today we continue reading the paper  "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about graphs so large that they don't fit on a single server. Map-Reduce and Pregel are not good enough for processing interactive ad hoc queries against large graphs; Horton is. Let us take a look at its architecture. It has four components - the graph client library, the graph coordinator, graph partitions and the graph manager. The graph client library sends queries to the graph coordinator and uses an asynchronous messaging system to receive results. The graph coordinator prepares an execution plan for the query, converts it into an automaton and decides which partitions to run it on. Each partition runs the executor and returns the results back to the client. The partitions are managed by the graph manager, which transfers the relevant partition to main memory and back to persistent storage.
The graph client library - sends queries to the graph coordinator in the form of regular expressions, and receives the query results directly from the graph partitions.
The graph coordinator provides the query interface for Horton. It receives the query from the client library, validates and parses the query. The coordinator optimizes the query plan and translates the query into a finite state machine. This is then sent for parallel processing over the partitions.
Every graph partition manages a set of graph nodes and edges. Partitioning is what lets Horton scale out. The automaton received from the coordinator is executed against the local partition by a local query execution engine.
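The flow above can be sketched in miniature. This is an illustrative Python sketch, not Horton's API: the partition layout, the node names and the execute function are all invented, and a query is modeled as a simple chain of edge predicates, one automaton state per hop.

```python
# Illustrative sketch of a path query executed over a partitioned in-memory
# graph. All names here are invented; Horton's real interfaces differ.

# Graph: node -> list of (edge_type, neighbor) pairs, partitioned by node.
partitions = [
    {"alice": [("manages", "bob")]},
    {"bob": [("authored", "checkin1")], "checkin1": []},
]

def find_partition(node):
    # stand-in for the graph manager's node-to-partition mapping
    for part in partitions:
        if node in part:
            return part
    raise KeyError(node)

def execute(start, edge_types):
    """Advance a chain-of-edge-predicates automaton from start."""
    paths = [[start]]
    for etype in edge_types:              # one automaton state per predicate
        nxt = []
        for path in paths:
            part = find_partition(path[-1])   # executor on the owning partition
            for t, nbr in part[path[-1]]:
                if t == etype:
                    nxt.append(path + [nbr])  # forward the partial path
        paths = nxt
    return paths
```

For instance, `execute("alice", ["manages", "authored"])` follows the two-hop query across both partitions and returns the single matching path.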
#codejam continued
Yesterday we were filling a linear array of size N with K elements of value 1. We called it NChooseK. Today we would like those arrangements of NChooseK where the K ones are contiguous.

void NChooseKContiguous(int N, int K, ref List<List<int>> arrays)
{
    Debug.Assert(N > K && N > 0 && K > 0 && arrays != null);
    // There are N-K+1 positions where a run of K contiguous 1s can start.
    for (int i = 0; i < N - K + 1; i++)
    {
        var array = new List<int>(N);
        array.AddRange(Enumerable.Repeat(0, N));
        for (int j = i; j < i + K; j++)
            array[j] = 1;
        arrays.Add(array);
    }
}
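For reference, here is the same idea as a runnable Python sketch (the post's code is C#; the function name is invented):

```python
# Place a run of K consecutive 1s at each of the N-K+1 possible start
# positions in an array of N zeros.
def n_choose_k_contiguous(n, k):
    assert n > k > 0
    results = []
    for i in range(n - k + 1):
        row = [0] * n
        for j in range(i, i + k):
            row[j] = 1
        results.append(row)
    return results
```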

Sunday, March 6, 2016

Today we continue reading the paper  "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs that don't fit on a single server. Map-Reduce and Pregel are not good enough for processing interactive ad hoc queries against large graphs; Horton is a distributed interactive query execution engine for large graphs. We were reviewing how large graphs pose a different challenge from a large number of small graphs, the latter having examples in bioinformatics and chemoinformatics. Moreover, the conventional solution for querying large graphs, based on offline batch processing, is different from the corresponding online interactive query requirements. To manage large graphs online while supporting querying with small latency, a new system had to be built. Horton provides a graph query execution engine and a graph query optimizer that can efficiently process online queries on large graphs. A key design decision is to partition the graph among several servers in order to facilitate online processing, and Horton stores the graph partitions in main memory to offer fast query response time.
We now review Horton's data model and query language. A graph database stores not only the nodes or entities but also the relationships or connections, and these connections can have rich attributes.
Horton too supports rich graph data. Nodes represent entities and have types as well as a set of key-value pairs representing the data associated with the entity. Edges have types and data as well. There is no restriction on the kind or amount of data associated with nodes and edges. Horton manages both directed and undirected graphs. If the graph is directed, each node stores both inbound and outbound edges.
Queries are in the form of regular language expressions. A query is a sequence of node predicates and edge predicates. Each predicate can contain conjunctions and disjunctions on node and edge attributes, as well as closures. For example, a query to retrieve a list of all the photos in which John is tagged is expressed as Photo-Tagged-John. In this case all paths in the graph that satisfy the query expression are returned. To get a specific part of the path, such as a node, a SELECT statement is used.
SELECT photo FROM Photo-Tagged-John. Here the node photo is the only one selected, but there could have been other nodes and edges in the query.
The syntax is helpfully closer to SQL than Cypher's; this was also a design decision. In general, graph databases perform better than relational databases on highly interconnected data where a nearly online data warehouse is required.
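The SELECT semantics can be illustrated with a toy matcher. This is a hedged Python sketch, not Horton's implementation; the edge-triple representation, the node-type table and the select function are all invented for illustration.

```python
# Toy illustration (not Horton's code): edges as (source, edge_type, target)
# triples, plus a separate node-type table; all names here are invented.
edges = [
    ("photo1", "Tagged", "john"),
    ("photo2", "Tagged", "mary"),
]
node_type = {"photo1": "Photo", "photo2": "Photo", "john": "John", "mary": "Mary"}

def select(src_type, edge_type, dst_type):
    """Match Photo-Tagged-John style paths, projecting only the first node."""
    return [s for (s, e, d) in edges
            if node_type[s] == src_type and e == edge_type and node_type[d] == dst_type]
```

Here `select("Photo", "Tagged", "John")` plays the role of SELECT photo FROM Photo-Tagged-John: the whole path is matched, but only the selected node is returned.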
#codejam continued
Fill a linear array of size n with k elements of value 1

void NChooseK(int N, int K, ref List<int> array, int start, int count)
{
    if (count == K) { /* print array */ return; }
    if (start == N) return;
    // At each position, either leave a 0 or place a 1, then recurse.
    array[start] = 0;
    NChooseK(N, K, ref array, start + 1, count);
    array[start] = 1;
    NChooseK(N, K, ref array, start + 1, count + 1);
    array[start] = 0; // undo before backtracking
}
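A runnable Python sketch of the same backtracking (names invented; where the C# version prints, this one collects the arrangements):

```python
# At each index either skip it or set it to 1, collecting each arrangement
# once K ones have been placed.
def n_choose_k(n, k):
    results = []
    array = [0] * n
    def recurse(start, count):
        if count == k:
            results.append(array[:])     # record a finished arrangement
            return
        if start == n:
            return
        recurse(start + 1, count)        # leave a 0 at this position
        array[start] = 1                 # place a 1 and recurse
        recurse(start + 1, count + 1)
        array[start] = 0                 # undo before backtracking
    recurse(0, 0)
    return results
```

For N=4, K=2 this yields the C(4,2) = 6 arrangements.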

Saturday, March 5, 2016

Today we start reading the paper  "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs that don't fit on a single server. Map-Reduce and Pregel are not good enough for processing interactive ad hoc queries against large graphs; Horton is a distributed interactive query execution engine for large graphs. It includes a query execution engine and a query optimizer that allow interactive execution of queries on large distributed graphs in parallel. The graph studied in this paper came from a social networking application called Codebook, which uses a graph to model data on software components, developers, development artifacts and their interactions in large software projects. This kind of information about software engineers, software components, and their interactions in projects such as the Windows or Linux operating systems results in large graphs. Online querying of large graphs is very different from conventional database querying. Relational databases store highly structured data, and references to other rows are indicated by foreign key columns referring to primary key attributes; a many-to-many relationship additionally requires a join table that holds the foreign keys of both participating tables. This is enforceable with constraints, but only when the reference is never optional. Joins are computed at query time, which is both compute and memory intensive. Graph databases are an improvement: relationships are first class citizens of the graph data model. Nodes and relationships are assembled into connected structures, and each node directly and physically contains a list of relationship records representing its relationships to other nodes. Because a graph database stores connections as first class citizens, readily available, it avoids relationship computations at query time; a query explores only the neighborhood around its starting points, leaving the billions of nodes outside that search perimeter untouched.
Neo4j implements such a NoSQL graph database. The Map-Reduce framework can help with such queries, but it is meant for batch processing, not online queries. Even though its throughput can be increased to be on par with a low-latency online query system, they are fundamentally not the same. The paper describing Pregel states that graph algorithms often exhibit poor locality of memory access, very little work per access, and a changing degree of parallelism over the course of execution. Distribution over many machines and their faults makes the locality issue even worse. Arguably such data can be stored in distributed database clusters, but even then there is no easy way to state reachability queries in the relational world. Besides, large graphs pose a different challenge than hundreds of smaller graphs. Horton partitions the graph among several servers and stores the graph partitions in main memory to offer fast query response time. The partitioning is horizontal to scale with the machines in the data center.

#codejam problems
We have seen what ominoes are in the exercise described earlier. A K-omino is an arrangement of K unit squares such that the unit squares remain joined along one or more edges. Determine how many different ominoes there are for a given size k.
int GetOminoes(int k)
{
    int count = 0;
    var matrix = new int[k, k];
    var ominoes = new List<int[,]>();
    GetCombinations(k, ref matrix, 0, ref count, ref ominoes);
    return count;
}

void GetCombinations(int k, ref int[,] matrix, int level, ref int count, ref List<int[,]> ominoes)
{
    if (k == 0)
    {
        // SameShape is a placeholder for a duplicate check that compares two
        // occupancy matrices under translation, rotation and reflection
        if (ominoes.Any(x => SameShape(x, matrix)) == false)
        {
            count += 1;
            ominoes.Add((int[,])matrix.Clone());
        }
        return;
    }
    for (int width = k; width > 0; width--)
    {
        // level 0: occupy 'width' cells contiguously in this row;
        // later levels: try each k-choose-width arrangement of the row
        // skip any arrangement that leaves disconnected islands
        GetCombinations(k - width, ref matrix, level + 1, ref count, ref ominoes);
        // reset this row's occupancy before trying the next width
    }
}

One way would be to:
// fill the matrix with k 1s
// eliminate those that are disjoint
// ignore duplicates from rotations and translations, or compare the row-count and column-count sequences of two distributions in all four orientations

Another way would be to sequentially reduce the base from k to 1.
At each chosen base m, increase levels with recursion for the remaining squares by again choosing a base from k down to 1. For each chosen base and level, find the different positions or arrangements linearly.
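The enumeration sketched above can be made concrete. Below is a hedged, runnable Python sketch that counts free k-ominoes directly: grow shapes one unit square at a time and de-duplicate by a canonical form under translation, rotation and reflection. It is an alternative to the row-by-row scheme in the post, not a transcription of it, and all names are my own.

```python
def canonical(cells):
    # Normalize a set of (x, y) cells under translation plus the 8 symmetries
    # of the square; return the lexicographically smallest form.
    best = None
    for flip in (False, True):
        pts = [(x, -y) if flip else (x, y) for x, y in cells]
        for _ in range(4):
            pts = [(y, -x) for x, y in pts]      # rotate 90 degrees
            mx = min(x for x, y in pts)
            my = min(y for x, y in pts)
            norm = tuple(sorted((x - mx, y - my) for x, y in pts))
            if best is None or norm < best:
                best = norm
    return best

def count_ominoes(k):
    # Grow connected shapes one square at a time, starting from a single cell;
    # connectivity is guaranteed because we only ever add edge-adjacent cells.
    shapes = {canonical([(0, 0)])}
    for _ in range(k - 1):
        nxt = set()
        for shape in shapes:
            occupied = set(shape)
            for x, y in shape:
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    c = (x + dx, y + dy)
                    if c not in occupied:
                        nxt.add(canonical(list(occupied | {c})))
        shapes = nxt
    return len(shapes)
```

The counts for k = 1..5 come out to 1, 1, 2, 5, 12 - the familiar monomino through pentomino counts.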



Friday, March 4, 2016

A graph is a convenient data structure to represent such things as social networks, word relations etc.
Operations on directed graphs are easy to perform if there are no cycles.
To determine if a directed graph is acyclic, we discussed a method but didn’t review the proof.
Let’s take a look at it now:
DFS (V, E)
    for each vertex v in V
        v.color = white
        v.d = nil
    time = 0
    for each vertex v in V
        if v.color == white
            DFS-VISIT(V, E, v)

DFS-VISIT (V, E, u)
    time = time + 1
    u.d = time
    u.color = gray
    for each vertex v adjacent to u
        if v.color == white
            DFS-VISIT(V, E, v)
        else if v.color == gray
            throw back edge exception   // v is an ancestor of u still in progress, so (u, v) closes a cycle
    u.color = black
    time = time + 1
    u.f = time

This method yields a topological sort. Essentially it says that there exists an order in which the vertices can be listed such that if u is before v in the list then there is no edge from u to v. This is called "topological order". It is easy to prove. Consider the vertices listed in the order v1, v2, v3, ..., vk, ..., vn in which they finish, each printed as it finishes. Suppose vk has just finished, and let w be a vertex adjacent to vk. There are two cases:
w.f != nil: in this case DFS-VISIT(w) has already completed, so w must have already been printed.
w.f == nil: in this case DFS-VISIT(w) is called during DFS-VISIT(vk) and finishes before vk does, so w is printed before vk.
Therefore when an element is printed, each element adjacent to it has already been printed. This is topological sorting.
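The pseudocode and proof above can be condensed into a runnable sketch. This is a hedged Python version using the same white/gray/black coloring: reaching a gray vertex is the back edge case, and reversing the finish order yields a topological order. The function name and adjacency-dict representation are my own.

```python
WHITE, GRAY, BLACK = 0, 1, 2

def topological_order(graph):
    """graph: vertex -> list of successors. Returns the vertices ordered so
    that every edge goes from earlier to later; raises ValueError on a cycle."""
    color = {v: WHITE for v in graph}
    finished = []
    def visit(u):
        color[u] = GRAY                  # u is on the current DFS path
        for v in graph[u]:
            if color[v] == GRAY:
                # gray neighbor = ancestor still in progress: a back edge
                raise ValueError("back edge to %s: graph has a cycle" % v)
            if color[v] == WHITE:
                visit(v)
        color[u] = BLACK
        finished.append(u)               # "printed" on finish, as in the proof
    for u in graph:
        if color[u] == WHITE:
            visit(u)
    finished.reverse()                   # reverse finish order = topological order
    return finished
```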
We now discuss connected components, and then use a similar DFS to test the bipartiteness of a graph. A bipartite graph is one that does not have an odd cycle. For components, we label each vertex with a component number such that two vertices are connected if and only if they have the same component number.

The DFS for this looks like this :
for all v in V do visited(v) = 0
   c = 0
   for all v in V do {
       if (visited(v)==0) {
           DFS_VISIT(v)
           c = c+1
       }
   }

   DFS_VISIT (v) {
      visited(v) = 1
      comp(v) = c
      for all w in adj(v) do if (visited(w)==0) DFS_VISIT (w)
   }


Now we determine the bipartiteness this way
for all v in V do visited(v) = 0
   for all v in V do if (visited(v)==0) DFS_VISIT (v,0)

   DFS_VISIT (v,p) {
      visited(v) = 1
      part(v) = p
      for all w in adj(v) do {
          if (visited(w)==0) {
              DFS_VISIT (w,1-p)
          } else {
              if (part(v) == part(w)) print "graph not bipartite"
          }
      }
   }
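The two-coloring DFS above, as a hedged runnable Python sketch (the function name and adjacency-dict representation are mine, not from the notes):

```python
# Alternate part labels 0/1 along edges; if a neighbor already carries the
# same label as the current vertex, we have found an odd cycle.
def is_bipartite(graph):
    part = {}
    def visit(v, p):
        part[v] = p
        for w in graph[v]:
            if w not in part:
                if not visit(w, 1 - p):
                    return False
            elif part[w] == part[v]:
                return False             # same part on both ends: odd cycle
        return True
    return all(v in part or visit(v, 0) for v in graph)
```

An even cycle two-colors cleanly; an odd cycle forces some edge to have the same label at both ends.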
Courtesy: Sleator notes CMU

Thursday, March 3, 2016

Gradle versus Jenkins

Gradle is a build automation system that enables a continuous pipeline for software as and when it is written. It has become a favorite for open source projects and comes with the following advantages:
1) Applicable to nontrivial projects
2) Enables complex builds to be portable
3) Seen as a just-right balance between Ant and Maven
4) Allows for convention over configuration
It helps take the software from a developer's machine to the production system, eliminating the delays and errors that plagued this process.

Jenkins attempts to build and push code to staging systems. It comes with a few features such as
1) integration with source control systems
2) polling mechanism for changes
3) cron for scheduling
4) integration with cloudfoundry for pushing code
The advantages of Gradle are :
1) It can handle build dependencies as a directed acyclic graph
2) It can exclude certain tasks
3) It has powerful language constructs to determine build order
4) When tasks are excluded, their dependencies are also excluded.
5) Tasks can be used to finalize another task
6) Builds are easier to maintain and more robust because tasks and their output are known
7) Gradle enables multiple tasks to be executed
8) It can continue execution after failure
9) It can permit dry run
10) It creates meaningful output while allowing rerouting to external tools

Gradle essentially guarantees that the dependency graph is acyclic and it performs a lazy evaluation.
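The claim that build dependencies form a DAG evaluated lazily can be illustrated with a toy task runner. This is an illustrative Python sketch, not Gradle's implementation: only the requested tasks and their transitive dependencies run, each at most once, and a dependency cycle is rejected.

```python
def run_tasks(tasks, requested):
    # tasks: name -> (list of dependency names, zero-argument action)
    done, in_progress = set(), set()
    order = []
    def visit(name):
        if name in done:
            return                        # already executed: run each task once
        if name in in_progress:
            raise ValueError("cycle involving " + name)
        in_progress.add(name)
        deps, action = tasks[name]
        for d in deps:
            visit(d)                      # dependencies execute first
        in_progress.discard(name)
        done.add(name)
        order.append(name)
        action()
    for name in requested:                # lazy: untouched tasks never run
        visit(name)
    return order
```

Requesting only one task runs just that task and its dependency chain, mirroring Gradle's task-exclusion behavior.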
To determine if a directed graph is acyclic:
DFS (V, E)
    for each vertex v in V
        v.color = white
        v.d = nil
    time = 0
    for each vertex v in V
        if v.color == white
            DFS-VISIT(V, E, v)

DFS-VISIT (V, E, u)
    time = time + 1
    u.d = time
    u.color = gray
    for each vertex v adjacent to u
        if v.color == white
            DFS-VISIT(V, E, v)
        else if v.color == gray
            throw back edge exception   // v is an ancestor of u still in progress, so (u, v) closes a cycle
    u.color = black
    time = time + 1
    u.f = time

In the ominoes problem described in the previous post, we observe the following:
if (X  > 6) then Richard wins
if (X  < 3) then Gabrielle wins
if (min(R,C) < (X+1)/2) then Richard wins
if (X==3) then Gabrielle wins
if (X == 4) then Gabrielle wins if min(R,C) > 2 else Richard
if (X == 5) then there must be at least three pieces to place for Gabrielle to win.

Another way to look at this problem is to pick out the conditions based on board size. This simplifies the problem.
If the board cannot be fully covered by X-square-unit pieces then Richard wins.
If the board is linear and X is more than two square units then Richard wins.
If the board is square and only one piece is possible then Richard wins. In everything else Gabrielle wins.

Wednesday, March 2, 2016

#code jam question.
There is a house of pancakes with a limited number of pancakes that are initially distributed to D diners, with diner i having d_i pancakes on their plate. Each diner takes one minute to eat one pancake. There are infinitely many other diners who have empty plates. The server can halt all diners from eating by declaring a minute special. During this minute, he can take away some pancakes from one diner and give them to another. If this is the only move he can make, what is the minimum time to finish the set of pancakes?
Solution: since the number of empty diners is infinite, we can increase the degree of parallelism (DOP) by moving pancakes onto empty plates. There is no speedup possible without increasing the DOP, so we exhaust the search space of target plate sizes x, incrementing by one at a time. For a target maximum of x pancakes per plate, a non-empty diner with d_i pancakes needs ceil(d_i / x) - 1 special minutes to split down to plates of at most x, after which everyone finishes in x minutes of eating. The answer is the minimum over x of the sum of these special minutes plus x.

int GetMinTime(List<int> pancakes)
{
    int min = int.MaxValue;
    for (int x = 1; x < 1001; x++)
    {
        int sum = 0;
        for (int i = 0; i < pancakes.Count; i++)
        {
            // ceil(pancakes[i] / x) - 1 special minutes to split this plate down to x
            sum += (pancakes[i] + x - 1) / x - 1;
        }
        sum += x; // then every plate finishes in x minutes of eating
        if (sum < min)
            min = sum;
    }
    return min;
}
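The same search as a runnable Python sketch (the function name is mine, and the bound on x is taken from the largest plate rather than a fixed 1001):

```python
# For each target plate size x: splitting costs ceil(d/x) - 1 special minutes
# per diner, then everyone eats for x minutes; take the minimum total over x.
def min_time(pancakes):
    best = float("inf")
    for x in range(1, max(pancakes) + 1):
        specials = sum((d + x - 1) // x - 1 for d in pancakes)
        best = min(best, specials + x)
    return best
```

For a single plate of 4, one special minute splits it into 2 and 2, for a total of 3 minutes; for plates of 1, 2, 1, 2 no special minute helps and everyone finishes in 2.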

Today we conclude reading the paper titled "Jigsaw: Efficient, Low-Effort Mashup Isolation" by James Mickens and Matthew Finifter. Mashups are web applications that contain code from different principals. Jigsaw lets mashups be isolated with its framework and lets them selectively expose private state. At the same time, Jigsaw is developer friendly, requiring very few changes as compared to the other related work with their cumbersome policies. The use of public/private keywords is also natural to programming languages such as Java. Related work also uses frames for isolation; Jigsaw instead uses boxes that can share the same origin. Jigsaw also uses proxies for principal objects that have to be shared across domains. Moreover, Jigsaw facilitates pass-by-reference surrogate objects for cross-domain access. Jigsaw also has similar or better performance than other related work.


Tuesday, March 1, 2016

Today we continue to discuss "Jigsaw: Efficient, Low-Effort Mashup Isolation" by James Mickens and Matthew Finifter. Mashups are web applications that contain code from different principals.

Jigsaw lets mashups be isolated with its framework and lets them selectively expose private state. We were discussing the related work. Two techniques come pretty close to Jigsaw. First, there was ObjectView with its flexible security policies; however it seems clunky and an overkill compared to Jigsaw's declarative public and private keywords. Second was IFC, which required developers to annotate code and the engine to apply security tags to variables. Again, the public/private syntax is much easier. Moreover, tagging variables is not suited for managing access to visual fields; Jigsaw handles this with full flexibility.
Conscript is another system that modifies the browser engine to apply security. Integrators specify policy code, and these policies are attached at specific execution points such as the start of a function or the load of a module. These policies are similar to ObjectView policies and have the same advantages, but Jigsaw's syntax is simpler.
OMash comes closer to Jigsaw in that it allows public functions to be declared for sharing. However, it does not enforce private-only access by default. Moreover, the return value from a public function can divulge sensitive data. On the other hand, Jigsaw allows both public and private and supports proxies.
MashupOS provides a new model where service instances are similar to the processes in an operating system and get partitioned physical resources. They communicate with each other using pass-by-value messages. This is elaborate in itself and still has shortcomings that Jigsaw does not. Jigsaw avoids such communication altogether with its pass-by-reference semantics.
CommonJS introduces another technique that gives each library its own namespace and exposes functions. However, it suffers from the same limitations with regard to property manipulation and return-value exploits as OMash. This is because the namespaces are implemented using closures, which allow prototypes to be transformed, as opposed to Jigsaw's boxes.

#Google code jam
Ominoes are jigsaw puzzle pieces made by joining unit squares along one or more of their edges. An X-omino has X such unit squares. Richard picks one of the X-ominoes, and Gabrielle uses that piece and any other pieces (or copies of them) to fill an RxC grid. If Gabrielle fills the grid she wins; otherwise Richard wins. Given arbitrary X, R, C determine who wins.
void GetWinner(int X, int R, int C)
{
    string answer = "Richard";
    if ((R * C) % X == 0)
    {
        if ((R % X == 0 || C % X == 0) && C > X - 1 && R > X - 1)
        {
            answer = "Gabrielle";
        }
    }
    Console.WriteLine(answer);
}
The problem can also be solved recursively, with the assumption that a square of size greater than X with edge equal to min(R, C) can be filled by X-ominoes, accounting for the different orientations and positions of the remaining cells.

Another way to solve the problem is to eliminate R-X and C-X and see if it can be solved for an X by X board, as long as R-X and C-X are divisible by X.
I am going to get a caprese sandwich and squash soup from yellow dot cafe.
Another way to look at this problem is to check whether X + 1 is greater than double the minimum of R and C, in which case Richard wins because essentially you are using a vertical line after X.