Monday, May 21, 2018

Today I wanted to take a brief break to discuss the relevance of incremental and hybrid graph algorithms. BFS was just one instance; we don't use BFS itself in NLP processing of text, but we do use graph algorithms.
Many of the documents we process for text analysis are very small compared to the corpus on which our model is trained. The graph algorithms in our models therefore operate over a large number of keywords drawn from the corpus, while the keywords in any single document are far fewer. Now consider that we are no longer limited to a fixed number of dimensions for a word vector: because the relationships are persisted, we can use all the edges for every keyword. By making these same NLP graph algorithms incremental, we gain the ability to perform operations as updates occur. The store-and-static-compute model worked because updates were batched and graph processing was then applied to static snapshots from different points in time; it sufficed so long as graph modifications were less frequent than the static processing. Moreover, NLP graph algorithms are known to be computationally expensive. By making these special-case algorithms incremental and parallel, we expand the horizon of commercial practicality for many theoretical graph algorithms. This is a significant win not only for existing algorithms but also for new ones we come up with.
Graph algorithms in NLP have a tremendous history. Semantic word embeddings are popular in NLP graph algorithms, and bipartite graphs are yet another well-known graph usage in NLP. Graphs are directly applicable to syntactic and semantic relations. WordNet is a great example: a semantic network that is heavily used for word-sense disambiguation, semantic similarity, question answering, and more. There have been efforts to expand or improve this ontology by connecting different domains, such as in linkeddata.org, which uses the web to connect related data that wasn't previously linked, joining data on the Semantic Web with URIs and RDF. Graphs have also grown beyond the traditional nodes and edges with the introduction of heterogeneous graphs, graphs with multilayered edges, and so on.
#codingexercise:
Canonicalize a JavaScript object against a list of expected keys:
var items = { "1" : null, "2": 0, "3": "", "4":0, "5": 55, "6": 0, "7":0, "8" : 0, "9": 0  }
var keys = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10" ];
var present = [];
// Normalize invalid values to 0 and record which keys are present.
Object.entries(items).forEach(([key, value]) => {
    present.push(key);
    if (value === "" || value === null){
        items[key] = 0;
    }
});
// Add any expected keys that are missing, defaulting them to 0.
var missing = keys.filter(x => !present.includes(x));
missing.forEach(x => items[x] = 0);
document.write(JSON.stringify(items));

{"1":0,"2":0,"3":0,"4":0,"5":55,"6":0,"7":0,"8":0,"9":0,"10":0}




Sunday, May 20, 2018

We were discussing the parallel Breadth-First-Search using the 2D partitioning approach.
There were two approaches discussed:
The first approach was about partitioning.
The second approach was about mitigating communication overhead. It made use of:
1) 2D decomposition and
2) a parent update
In the second approach, the idea is that each BFS iteration is equivalent to a sparse matrix-sparse vector multiplication (SpMSV). The matrix is the adjacency matrix, which is decomposed equally among a number of processors. The vector is the frontier, with integer variables. The algorithm does not store the previous frontiers explicitly as multiple sparse vectors. Instead it computes the breadth-first spanning tree by returning a dense parents array.
The parent update is peculiar to this algorithm: the parents array is used in an element-wise multiplication with the result of the SpMSV.
The parent is always chosen to be the vertex with the highest label.
The inner loop performs a single-level traversal.
All vectors and the input matrix are distributed 2D-wise. For example, f, the initially sparse frontier vector, is split so that each n/p-sized piece corresponds to a progressive section of the adjacency matrix.
The 2D decomposition of the adjacency matrix means that the matrix is viewed as a Pr x Pc grid of blocks, with block (i, j) assigned to processor P(i, j).
The vector operations are also parallelized using loop-level parallelism.
In the 2D processor assignment, the subvectors are accumulated at all the processors on the jth column, and each processor, after performing its SpMSV, scatters the corresponding piece of its intermediate vector to its owner along the ith processor row.
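To make this concrete, here is a minimal single-process JavaScript sketch of one BFS level as SpMSV with the parent update. The function name bfsLevel and the adjacency-list representation are my own simplifications; the real algorithm distributes both the matrix and the vectors across the processor grid.

// One BFS level as SpMSV with the parent update (single process).
// adj is an adjacency list; frontier's keys are the frontier vertices;
// parents uses -1 for undiscovered vertices. Illustrative only.
function bfsLevel(adj, frontier, parents) {
    var next = {};
    Object.keys(frontier).forEach(v => {
        adj[v].forEach(u => {
            if (parents[u] !== -1) return;          // element-wise multiply with not(parents)
            var candidate = Number(v);
            if (next[u] === undefined || candidate > next[u]) {
                next[u] = candidate;                // keep the highest label as the parent
            }
        });
    });
    Object.entries(next).forEach(([u, p]) => { parents[u] = p; }); // parent update
    return next;                                    // the new frontier
}

var adj = { 0: [1, 2], 1: [3], 2: [3], 3: [] };
var parents = [0, -1, -1, -1];                      // the source is its own parent
var frontier = { 0: 0 };
while (Object.keys(frontier).length > 0) {
    frontier = bfsLevel(adj, frontier, parents);
}
document.write(JSON.stringify(parents));            // [0,0,0,2]: label 2 beats 1 as parent of 3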
#codingexercise
Write a function to set the keys with invalid values to a default value in a JavaScript object:
var items = { "1" : null, "2": 0, "3": "", "4":0, "5": 55, "6": 0, "7":0, "8" : 0, "9": 0  }
// Replace empty-string and null values with the default 0.
Object.entries(items).forEach(([key, value]) => {
    if (value === "" || value === null){
        items[key] = 0;
    }
});
document.write(JSON.stringify(items));
 { "1" : 0, "2": 0, "3": 0, "4":0, "5": 55, "6": 0, "7":0, "8" : 0, "9": 0  }

Saturday, May 19, 2018

We were discussing the parallel Breadth-First-Search using the 2D partitioning approach.
This is the latter of two highly tuned parallel approaches for BFS on large parallel systems:
1) a level synchronous strategy that relies on a simple vertex based graph partitioning and
2) a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead.
The first approach was about partitioning; the second was about mitigating communication overhead.
In the second approach, the idea is that each BFS iteration is equivalent to a sparse matrix-sparse vector multiplication (SpMSV). The matrix is the adjacency matrix, which is decomposed equally among a number of processors. The vector is the frontier, with integer variables. The algorithm does not store the previous frontiers explicitly as multiple sparse vectors. Instead it computes the breadth-first spanning tree by returning a dense parents array. The parent is always chosen to be the vertex with the highest label. The parents array is used in an element-wise multiplication with the result of the SpMSV. The inner loop performs a single-level traversal.
All vectors and the input matrix are distributed 2D-wise. For example, f, the initially sparse frontier vector, is split so that each n/p-sized piece corresponds to a progressive section of the adjacency matrix. The vector operations are also parallelized using loop-level parallelism.
In the 2D processor assignment, the subvectors are accumulated at all the processors on the jth column, and each processor, after performing its SpMSV, scatters the corresponding piece of its intermediate vector to its owner along the ith processor row.
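As a rough illustration of the 2D layout, here is a toy owner computation in JavaScript for mapping a matrix entry onto the Pr x Pc processor grid. The block sizing is my own simplified convention, not the paper's exact distribution.

// Toy 2D owner computation for an n x n adjacency matrix on a
// Pr x Pc processor grid. Hypothetical helper, for intuition only.
function ownerOfEntry(row, col, n, Pr, Pc) {
    var blockRows = Math.ceil(n / Pr);   // matrix rows per processor row
    var blockCols = Math.ceil(n / Pc);   // matrix cols per processor column
    var i = Math.floor(row / blockRows); // processor row index
    var j = Math.floor(col / blockCols); // processor column index
    return { i: i, j: j };               // the entry lives on P(i, j)
}

// Each processor also holds an n/p piece of the frontier f; the pieces
// along a processor column are gathered before the local SpMSV.
document.write(JSON.stringify(ownerOfEntry(5, 9, 12, 3, 4))); // {"i":1,"j":3}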
#codingexercise
Write a function to remove the keys with 0 values from a JavaScript object:
var items = { "1" : 0, "2": 0, "3": 0, "4":0, "5": 55, "6": 0, "7":0, "8" : 0, "9": 0  }
// Delete keys whose values are empty, null, or 0.
Object.entries(items).forEach(([key, value]) => {
    if (value === "" || value === null || value === 0){
        delete items[key];
    }
});
document.write(JSON.stringify(items));
{"5":55}

Friday, May 18, 2018

We were discussing the parallel Breadth-First-Search using the 2D partitioning approach.
This is the latter of two highly tuned parallel approaches for BFS on large parallel systems:
1) a level synchronous strategy that relies on a simple vertex based graph partitioning and
2) a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead.
In the second approach, the idea is that each BFS iteration is equivalent to a sparse matrix-sparse vector multiplication. A vertex can have at most N adjacencies, so the vector has at most N dimensions, and not all of them are used.
The algorithm computes the breadth-first spanning tree by returning a dense parents array. The inner loop performs a single-level traversal.
The exploration of level k in BFS is equivalent to transposing the adjacency matrix, multiplying it by the frontier vector, and element-wise multiplying the result with the complement of the dense parents array. Since the parents array acts as a boolean mask, the complement operation inverts it element-wise, filtering out already-discovered vertices. The algorithm always picks the vertex with the highest label as the parent.
All vectors and the input matrix are distributed 2D-wise. For example, f, the initially sparse frontier vector, is split so that each n/p-sized piece corresponds to a progressive section of the adjacency matrix. Notice the algorithm is written in a gather-apply-scatter strategy. Since the matrix and the vectors are distributed, the algorithm proceeds in parallel.
There are two distinct communication phases in a BFS algorithm: an expand phase over the processor column and a fold phase over the processor row.
In any parallelization, we have a partitioning of tasks and a communication overhead. If the partitions do not overlap throughout the algorithm, we get the least communication; this works for the batch-processing MapReduce method in Big Data. For the BFS algorithm, a tradeoff needs to be struck between the partitioning and the communication. The approach cited here makes the communication efficient, but there are other ways to tune a hybrid approach.
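To make the two communication phases concrete, here is a hedged JavaScript sketch that simulates them with plain nested arrays standing in for the processor grid. A real implementation would use MPI collectives such as allgatherv and alltoallv; the names expand and fold below are just descriptive.

// Expand: allgather the frontier pieces held along processor column j.
function expand(frontierPieces, j) {
    return frontierPieces.map(row => row[j])
                         .reduce((acc, piece) => acc.concat(piece), []);
}

// Fold: element-wise sum of the partial results across processor row i.
function fold(partialResults, i) {
    return partialResults[i].reduce((acc, piece) => acc.map((v, k) => v + piece[k]));
}

var frontierPieces = [[[1], [0]], [[0], [2]]];             // a 2 x 2 grid, one entry per piece
document.write(JSON.stringify(expand(frontierPieces, 0))); // [1,0]
var partials = [[[1, 0, 2], [0, 3, 0]], [[0, 0, 0], [1, 1, 1]]];
document.write(JSON.stringify(fold(partials, 0)));         // [1,3,2]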

Thursday, May 17, 2018

We were discussing the parallel Breadth-First-Search.
Graph-based computations are pervasive in modern-day applications where important entities need to be identified and ranked, anomalous patterns need to be detected, sudden changes in networks need to be caught, or tightly interconnected clusters of entities need to be found. In addition, they may be highly data-intensive. For example, an undirected graph may have upwards of 4 billion vertices and 40 billion edges, and a breadth-first traversal that can cover billions of edges per second is then required. This is possible only if we break away from time-blocking sequential access.
There are two highly tuned parallel approaches for BFS on large parallel systems:  
1) a level synchronous strategy that relies on a simple vertex based graph partitioning and
2) a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead.
We discuss the second approach today, which involves 2D partitioning. Here the idea is that each BFS iteration is equivalent to a sparse matrix-sparse vector multiplication. A vertex can have at most N adjacencies, so the vector has at most N dimensions, and not all of them are used.
The exploration of level k in BFS is equivalent to transposing the adjacency matrix, multiplying it by the frontier vector, and element-wise multiplying the result with the complement of the dense parents array. Since the parents array acts as a boolean mask, the complement operation inverts it element-wise, filtering out already-discovered vertices. The algorithm always picks the vertex with the highest label as the parent.
The algorithm is defined as:
Procedure BFS_2D(A, s)
      f(s) <- s                                   // seed the frontier with the source vertex
      for all processors P(i,j) in parallel do
           while f != empty do
                 TransposeVector(f_ij)
                 f_i <- ALLGATHERV(f_ij, P(:,j))  // the expand phase, along the processor column
                 t_i <- A_ij x f_i                // the local SpMSV
                 t_ij <- ALLTOALLV(t_i, P(i,:))   // the fold phase, along the processor row
                 t_ij <- t_ij .* NOT(pi_ij)       // mask out already-discovered vertices
                 pi_ij <- pi_ij + t_ij            // the parent update
                 f_ij <- t_ij                     // the new frontier
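For intuition, here is a minimal JavaScript rendering of one iteration of the loop body collapsed onto a single processor, where the allgather and all-to-all steps become no-ops and only the SpMSV, the mask, and the parent update remain. This is my own simplification, not the authors' code.

// One BFS_2D loop iteration on one processor. A is a dense 0/1
// adjacency matrix, f's keys are the frontier vertices, and pi is the
// parents array (-1 = undiscovered). Illustrative only.
function bfs2dIteration(A, f, pi) {
    var n = A.length;
    var t = {};
    Object.keys(f).forEach(vKey => {          // t <- transpose(A) x f (the SpMSV)
        var v = Number(vKey);
        for (var u = 0; u < n; u++) {
            if (A[v][u] === 1 && (t[u] === undefined || v > t[u])) {
                t[u] = v;                     // candidate parent: the highest label wins
            }
        }
    });
    Object.keys(t).forEach(u => {             // t <- t .* NOT(pi)
        if (pi[u] !== -1) delete t[u];
    });
    Object.entries(t).forEach(([u, p]) => { pi[u] = p; }); // pi <- pi + t
    return t;                                 // f <- t, the next frontier
}

// The same toy graph as in the earlier adjacency-list sketch.
var A = [[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]];
var pi = [0, -1, -1, -1];
var f = { 0: 0 };
while (Object.keys(f).length > 0) { f = bfs2dIteration(A, f, pi); }
document.write(JSON.stringify(pi));           // [0,0,0,2]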

Full discussion : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs 
This section is courtesy of Buluc, Madduri et al.

Wednesday, May 16, 2018

We now proceed to the discussion of parallel Breadth-First-Search.
Graph-based computations are pervasive in modern-day applications where important entities need to be identified and ranked, anomalous patterns need to be detected, sudden changes in networks need to be caught, or tightly interconnected clusters of entities need to be found. In addition, they may be highly data-intensive. For example, an undirected graph may have upwards of 4 billion vertices and 40 billion edges, and a breadth-first traversal that can cover billions of edges per second is then required. This is possible only if we break away from time-blocking sequential access.
There are two highly tuned parallel approaches for BFS on large parallel systems:  
1) a level synchronous strategy that relies on a simple vertex based graph partitioning and
2) a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead.
These approaches are also used to benchmark supercomputers. For example, the Graph 500 list ranks them on their BFS performance. Distributed-memory systems are a hallmark of supercomputers.
BFS on a distributed-memory architecture involves communication between the processors and a distribution of graph partitions among them. These graphs are generally sparse: the number of edges is usually a constant factor times the number of vertices, and the average path length is a small value bounded by the logarithm of the number of vertices.
Approach 1 is straightforward: the iterations over each vertex and its adjacencies are parallelized.
The distance update and the enqueueing are atomic, and there is a barrier synchronization step for each level; a minimal sketch follows the list below. The optimizations involved include:
1) ensuring that the edge visit steps are load-balanced,
2) reducing the synchronization costs, and
3) improving the locality of memory references.
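Here is that sketch: a sequential JavaScript rendering of the level-synchronous strategy, my own illustration rather than any reference code. In the parallel variant, the loop over the current level is split among processors, with atomic distance updates and a barrier between levels.

// Level-synchronous BFS. The loop over the current level is the part
// that gets parallelized; the comments mark the atomic steps and the
// barrier. Sequential and illustrative only.
function levelSynchronousBFS(adj, source) {
    var distance = {};
    distance[source] = 0;
    var current = [source];
    var level = 0;
    while (current.length > 0) {
        var next = [];
        current.forEach(v => {               // parallelized across processors
            adj[v].forEach(u => {
                if (distance[u] === undefined) {
                    distance[u] = level + 1; // atomic distance update
                    next.push(u);            // atomic enqueue
                }
            });
        });
        current = next;                      // barrier synchronization per level
        level++;
    }
    return distance;
}

var adj = { 0: [1, 2], 1: [3], 2: [3], 3: [] };
document.write(JSON.stringify(levelSynchronousBFS(adj, 0))); // {"0":0,"1":1,"2":1,"3":2}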

Full discussion : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs 

Tuesday, May 15, 2018

We return to the discussion on incremental Breadth-First-Search.
In large social networking graphs, there are updates of about 86,400 operations per second. The store-and-static-compute model worked because updates were batched and graph processing was then applied on static snapshots from different points in time. It worked so long as the graph modifications were less frequent than the static processing. With dynamic graph processing, we need a new framework. One such framework proposed is GraphIn, which introduces a new programming model called Incremental-Gather-Apply-Scatter (I-GAS).
Incremental-Gather-Apply-Scatter can be applied to sub-graphs; a conceptual sketch follows.
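GraphIn's actual API is not reproduced here; the following is only a conceptual JavaScript sketch of the incremental gather-apply-scatter idea, maintaining BFS distances while a batch of edge insertions activates just the affected subgraph. All names are illustrative assumptions.

// Conceptual I-GAS sketch: a batch of edge updates activates only the
// touched vertices, and gather/apply/scatter iterate on that active
// set instead of recomputing over the whole graph.
function iGasInsertEdges(adj, dist, edgeBatch) {
    var active = new Set();
    edgeBatch.forEach(([u, v]) => {                  // ingest the update batch
        (adj[u] = adj[u] || []).push(v);
        if (dist[u] !== undefined) active.add(v);    // activate affected endpoints
    });
    while (active.size > 0) {                        // iterate on the active subgraph
        var next = new Set();
        active.forEach(v => {
            var best = Infinity;                     // gather: best offer from in-neighbors
            Object.keys(adj).forEach(u => {
                if ((adj[u] || []).includes(v) && dist[u] !== undefined) {
                    best = Math.min(best, dist[u] + 1);
                }
            });
            var current = dist[v] === undefined ? Infinity : dist[v];
            if (best < current) {                    // apply: keep the improvement
                dist[v] = best;
                (adj[v] || []).forEach(w => next.add(w)); // scatter: wake out-neighbors
            }
        });
        active = next;
    }
    return dist;
}

var adj = { 0: [1], 1: [] };
var dist = { 0: 0, 1: 1 };                           // converged BFS from vertex 0
iGasInsertEdges(adj, dist, [[1, 2], [0, 2]]);
document.write(JSON.stringify(dist));                // {"0":0,"1":1,"2":1}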

Full discussion : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs 

#codingexercise

// Returns the length of the longest strictly increasing subsequence.
// Requires: using System.Linq;
int GetLongestIncreasingSubsequenceCount(List<int> A)
{
    int n = A.Count;
    if (n == 0) return 0;
    var best = new List<int>(Enumerable.Repeat(1, n)); // every element alone is a subsequence of length 1
    for (int i = 1; i < n; i++)
       for (int j = 0; j < i; j++)
       {
            if (A[i] > A[j] && best[j] + 1 > best[i]){
                best[i] = best[j] + 1;
            }
       }
    return best.Max();
}