Sunday, June 22, 2014

In tonight's blog post, we return to the discussion of graphs and matrix operations. We talked about matrix multiplication, which for a row vector (a1, a2, ..., an) and a column vector (x1, x2, ..., xn) gives the product a1x1 + a2x2 + ... + anxn. Note that a matrix product can also be written column by column as A(b1, b2, ..., bn) = (Ab1, Ab2, ..., Abn); that is, the matrix A acts separately on each column of B.
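As a quick check of that column-by-column view, here is a minimal numpy sketch (the 2x2 matrices are arbitrary examples): multiplying A against each column of B separately gives the same result as the full product AB.

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    B = np.array([[5, 6],
                  [7, 8]])

    # A acts separately on each column of B: AB = (Ab1, Ab2, ..., Abn)
    column_by_column = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

    assert np.array_equal(A @ B, column_by_column)
    print(column_by_column)     # [[19 22]
                                #  [43 50]]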
Row reduction is another operation.
This can be done in one of three ways:
1) Multiplying a row by a non-zero scalar.
2) Adding a multiple of one row to another.
3) Swapping the position of two rows.
Each of these steps is reversible, so if you move from one state to another you can always undo the change. Matrices related by such operations are said to be row-equivalent.
A matrix is said to be in row-echelon form if any rows made entirely of zeroes lie at the bottom of the matrix and the first non-zero entries of the remaining rows form a staircase pattern: the first non-zero entry of the (k+1)th row lies strictly to the right of that of the kth row.
If the matrix is in row-echelon form, then the first non-zero entry of each row is called a pivot, and the columns in which pivots appear are called pivot columns. It is always possible to convert a matrix to row-echelon form; the standard algorithm is called Gaussian elimination or row reduction.
 The rank of a matrix is the number of pivots in its reduced row-echelon form.
The solution to Ax = 0 gives the pivot variables in terms of the free variables.
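To make the above concrete, here is a minimal sketch of Gaussian elimination in numpy; the partial-pivoting choice, the tolerance, and the example matrix are my own illustrative assumptions, not anything prescribed by the post. It reduces a matrix to row-echelon form using only the three row operations above and counts the pivots to report the rank.

    import numpy as np

    def row_echelon_and_rank(A, tol=1e-12):
        """Reduce a copy of A to row-echelon form and return (R, rank)."""
        R = A.astype(float).copy()
        rows, cols = R.shape
        pivot_row = 0
        for col in range(cols):
            if pivot_row >= rows:
                break
            # swap up the row with the largest entry in this column (partial pivoting)
            best = pivot_row + int(np.argmax(np.abs(R[pivot_row:, col])))
            if abs(R[best, col]) < tol:
                continue                              # no pivot here: a free column
            R[[pivot_row, best]] = R[[best, pivot_row]]
            # add multiples of the pivot row to zero out everything below the pivot
            for r in range(pivot_row + 1, rows):
                R[r] -= (R[r, col] / R[pivot_row, col]) * R[pivot_row]
            pivot_row += 1
        return R, pivot_row                           # number of pivots = rank

    R, rank = row_echelon_and_rank(np.array([[1, 2, 3],
                                             [2, 4, 6],
                                             [1, 0, 1]]))
    print(R)
    print("rank:", rank)                              # rank: 2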

Saturday, June 21, 2014


In today's post we take a short break to discuss spatial and temporal control in storage management. First, spatial control. In terms of access to and from disk, sequential access is ten to a hundred times faster than random access, or more. Disk density has been doubling, and bandwidth increases roughly as the square root of density. As a result, storage managers often organize large data so that it can be accessed sequentially. In fact, database management systems historically exercised full control over how the database is partitioned on disk.
The best way for a storage manager such as a DBMS to do that is to lay the data out on the raw disk itself and avoid the file system. This works especially well when raw disk device addresses correspond to physical proximity. But avoiding the file system has drawbacks. First, it requires attention from a DBA and dedicates resources such as an entire disk partition. Second, raw disk access is often OS specific and introduces portability concerns. Third, logical file managers such as RAID and SAN have become popular, and because these interceptors virtualize raw disk addresses, alternatives have to be considered. For example, we can create a single very large file on disk and manage offsets within it. This is facilitated by the file system and also by virtualized storage managers, and it performs well: the degradation from using a single file on a commercially large system was found to be about 6%.
We now look at temporal buffering, which is about when the data gets written rather than where it gets written. If the DBMS wants to control when writes happen, this becomes harder when the OS implements its own buffering. One challenge is guaranteeing the ACID transaction promises: the transaction manager cannot guarantee atomic recovery after software and hardware failures without explicitly controlling the timing and ordering of disk writes. For example, writes to the log device must precede the corresponding writes to the database device, as in the write-ahead logging protocol. The second set of concerns with OS buffering is about performance rather than correctness. The file system wants to read ahead speculatively and write behind with delayed batch writes, which are poorly suited to a storage manager. A DBMS, for instance, can predict its own I/O based on future read requests, such as when reading ahead to scan the leaves of a B+ tree. Similarly, when writing, it can decide when to flush the log tail based on locks held or I/O throughput. Finally, there may be double buffering and the CPU overhead of memory copies. Copies degrade performance by contributing to latency, consuming CPU cycles, and potentially flooding the CPU cache.
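As an illustration of the ordering constraint in write-ahead logging, here is a minimal sketch in plain Python; the file names, the record format, and the use of fsync are hypothetical simplifications, since a real storage manager would control buffering far more carefully than this.

    import os

    def apply_update(log_path, db_path, record):
        """Write-ahead logging: force the log record to stable storage
        before the corresponding database write is issued."""
        with open(log_path, "ab") as log:
            log.write(record + b"\n")
            log.flush()
            os.fsync(log.fileno())          # the log write must hit the disk first
        mode = "r+b" if os.path.exists(db_path) else "wb"
        with open(db_path, mode) as db:
            db.write(record)                # only now update the database file
            db.flush()
            os.fsync(db.fileno())

    apply_update("wal.log", "table.dat", b"UPDATE account 42 SET balance = 100")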
Courtesy: Paper by Hamilton, Krishnamoorthy et al.

Friday, June 20, 2014

In this post I will review matrix operations on graphs. When we represent graphs as matrices, we can use mathematics and software tools to find patterns, which is easier than visually analyzing a complicated graph. The matrix representations are usually square, but they can also be three dimensional, consisting of rows, columns, and levels or slices. Matrices can also be symmetric or asymmetric. In a symmetric matrix the value in row i, column j is the same as that in row j, column i. As an example of an asymmetric matrix, consider social relations where Jack may have a liking for Jill but Jill may not have a liking for Jack.
Operations permitted on a matrix include the following:
Permutation: the rows and columns are rearranged, for example in a symmetric matrix, to reveal patterns; these reorderings do not affect the pairwise relationships (a small sketch follows this list).
Partitioning: the matrix is split into blocks or images.
Embedding: an operation that helps find groups of nodes that participate in different patterns.
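Here is a minimal numpy sketch of the permutation operation; the adjacency matrix and the reordering are hypothetical, chosen only to show that permuting rows and columns together exposes block structure without changing any pairwise relationship.

    import numpy as np

    # adjacency matrix of a small undirected graph with two natural groups {0, 2} and {1, 3}
    A = np.array([[0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]])

    order = [0, 2, 1, 3]                    # bring each group's members next to each other
    P = A[np.ix_(order, order)]             # permute rows and columns with the same ordering

    print(P)                                # two 2x2 blocks on the diagonal now show the groups
    # every tie (i, j) in A is still a tie between the same pair of nodes in P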

Thursday, June 19, 2014

In the previous post, we described the Chinese Whispers clustering algorithm. We now study the algorithm with two different approaches. The first approach is what we discussed earlier, using small-world edge weights. The outcome is similar to that of the min-cut algorithm in that dense regions are considered clusters, but unlike min-cut there is no hierarchical clustering; it gives a flat partition instead. It does not require any threshold as an input parameter and is more efficient. The second approach is based on matrix operations. This algorithm can be considered a special case of Markov Chain Clustering (MCL). MCL is the parallel simulation of all possible random walks up to a finite length on a graph G. The observation behind MCL is that random walks usually terminate in the same cluster they started from. MCL simulates flow in the graph by repeatedly updating transition probabilities between all nodes, eventually converging to a transition matrix after k steps. The update is done in two steps, which are then iterated. The first is the expansion step, which is the matrix multiplication of the graph matrix with the current transition matrix. The second is the inflation step, in which the contrast between small and large transition probabilities is increased and the result is normalized to unity, thus providing a way to segment the matrix. The inflation step operates column-wise.
The k matrix multiplications of the expansion step carry a significant cost.
In the early stages of the iterations, the matrices operated on are dense. But as the iterations proceed, and with the help of a strong inflation operator, matrices in the later steps tend to be sparse. As an optimization, the cost of the inflation step can be lowered by keeping only a few entries per column, which can even be reduced to the single largest value.
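Putting the two steps together, here is a minimal numpy sketch of the expansion and inflation iteration; the toy graph (two triangles joined by one bridge edge), the added self-loops, the inflation exponent r = 2, and the fixed iteration count are my own illustrative choices.

    import numpy as np

    def mcl_step(T, M, r=2.0):
        """One expansion + inflation step of Markov Chain Clustering (MCL)."""
        T = M @ T                       # expansion: multiply the graph matrix with the current transition matrix
        T = T ** r                      # inflation: raise entries to the power r to boost large probabilities ...
        return T / T.sum(axis=0)        # ... and renormalize every column to sum to 1

    # two triangles joined by a single bridge edge, with self-loops added
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float) + np.eye(6)

    M = A / A.sum(axis=0)               # column-normalized transition matrix of the graph
    T = M.copy()
    for _ in range(20):                 # iterate expansion and inflation until T stops changing much
        T = mcl_step(T, M)
    print(np.round(T, 2))               # each column's remaining mass now sits inside its own triangle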
The Chinese Whispers algorithm can now be described in matrix operations as follows:
Let's start with the identity matrix of dimension n.
For each of the t iterations:
   apply a maxrow operator that sets all entries of a row to zero except the largest entry, which is set to 1;
    then multiply the resulting matrix with the adjacency matrix of the graph and use the result as the input to the next iteration.
Since the time complexity depends on the number of non-zero entries in the result of the maxrow operation, in the worst case, which is a fully connected graph, it could equal the time complexity of MCL.
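Here is a minimal numpy sketch of this matrix formulation on the same two-triangle toy graph as in the MCL sketch above; multiplying the hardened matrix by the adjacency matrix on the left, so that each row accumulates the class weights of its neighbors, is my reading of the update rather than a literal transcription.

    import numpy as np

    def maxrow(M):
        """Set all entries of each row to zero except the largest, which becomes 1."""
        out = np.zeros_like(M)
        out[np.arange(M.shape[0]), M.argmax(axis=1)] = 1
        return out

    # two triangles joined by a single bridge edge
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)

    D = np.eye(A.shape[0])              # start with the identity matrix of dimension n
    for _ in range(10):                 # t iterations
        D = A @ maxrow(D)               # harden each node's class, then sum neighbor classes per node

    print(D.argmax(axis=1))             # -> [0 0 0 2 2 2]: each triangle ends up sharing one label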
We continue our discussion of the Chinese Whispers graph clustering algorithm. It is a bottom-up clustering algorithm in which the nodes start out with different class labels and coalesce to the strongest class label in their neighborhood, where the strongest class is the one with the maximal sum of edge weights to the node. The clustering usually proceeds for a small number of iterations. The clustering here is different from clustering vectors because there is no distance metric and consequently no centroid or average; similarity is measured in terms of edges. Chinese Whispers is often used with small-world graphs. A small-world graph is one in which most nodes are not neighbors of each other, yet most nodes can be reached from every other by a small number of hops. As an example of a small-world graph, consider a set of nodes on the circumference of a circle.
In clustering small-world graphs, a random mutation rate is sometimes introduced that assigns new classes, with decreasing probability as the iterations proceed. This greatly improves convergence in the early iterations. If we take a graph with eleven nodes arranged in two connected sub-graphs, with some edges between them forming a small-world graph, this technique converges the class labeling in just two iterations because the mutation acts as a perturbation that then facilitates the proper labeling.
The algorithm does not cross component boundaries because there are no edges between nodes belonging to different components. This means that nodes that are not connected by any edges are discarded by the algorithm, which leaves a portion of the nodes unclustered.
We also see that the algorithm does not always converge, as in the case of a node with a tie: it could be assigned either of two class labels with the same summed edge weights. This would not happen if the edges of the graph were weighted. In any case, the classes generally change little after the first few iterations.
Note that this simple technique does not take any parameters and the class labels are hard. If soft labels are required, we can assign a class distribution. This can be done as a final step by taking a weighted distribution of the hard labels in the neighborhood.
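That final soft-labeling step can be sketched as follows in plain Python; the toy graph and the hard labels are hypothetical, and each node's class distribution is simply the edge-weight-normalized mix of its neighbors' hard labels.

    from collections import defaultdict

    def soft_labels(edges, hard_label):
        """edges: {node: {neighbor: weight}}; hard_label: {node: class}."""
        soft = {}
        for node, nbrs in edges.items():
            dist = defaultdict(float)
            for nbr, w in nbrs.items():
                dist[hard_label[nbr]] += w            # weight of each class in the neighborhood
            total = sum(dist.values()) or 1.0
            soft[node] = {c: w / total for c, w in dist.items()}
        return soft

    edges = {"a": {"b": 2.0, "c": 1.0}, "b": {"a": 2.0}, "c": {"a": 1.0}}
    print(soft_labels(edges, {"a": 0, "b": 0, "c": 1}))
    # {'a': {0: 0.67, 1: 0.33}, 'b': {0: 1.0}, 'c': {0: 1.0}} (values rounded)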

Tuesday, June 17, 2014

In the JoBim Text project, Gliozzo et al. introduced an interactive visualization component. JoBim is an open source platform for large-scale distributional semantics based on graph representations. A distributional thesaurus is computed from bipartite graphs of words and context features. For every sentence, a context is generated for semantically similar words. The capabilities of the conceptualized text are then exposed in an interactive visualization.
The visualization can be used as a semantic parser as well as for disambiguation of word senses induced by graph clustering.
The paper starts from the view that the meaning of a text can be fully defined by semantic oppositions and relations between words. To obtain this knowledge, co-occurrences with syntactic contexts are extracted from very large corpora. This approach does not compute the word similarities with a quadratic all-pairs comparison; instead it uses a MapReduce algorithm. The advantage is that the MapReduce algorithm works well on sparse contexts and scales to large corpora.
The result is a graph that connects the most discriminative contexts to terms, with explicit links between the most similar terms. This graph represents a local model of semantic relations for each term, in contrast with a model of semantic relations with fixed dimensions; in other words, it is an example of ego networks. The paper describes how to compute a distributional thesaurus and how to contextualize distributional similarity.
The JoBim framework is named after the observed pairs of terms (Jos) and context features (Bims) that form the edges of the graph. The operation of splitting observations into JoBim pairs is referred to as the holing operation.
The significance of each pair (t, c) is computed, and then only the p most significant pairs are kept per term t, resulting in the first-order graph. The second-order graph is extracted as the similarity between two Jos, based on the number of salient features the two Jos share. The similarity over the Jos defines a distributional thesaurus, and the paper says this can be computed efficiently in a few MapReduce steps and compares favorably with other measures. This step can be replaced by any other mechanism as well. The paper then proceeds to contextualization, which, as we know, depends a lot on smoothing: many term-context pairs are valid yet may or may not be represented in the corpus. To find similar contexts, they expand the term arguments with similar terms. But the similarity of these terms depends on context, so the paper leverages joint inference to expand terms in context, using marginal inference in a conditional random field (CRF). The CRF works something like this: a particular assignment x is defined as a single definite sequence of either original or expanded words. Its weight depends on the degree to which the term-context associations present in this sentence are found in the corpus, as well as on the out-of-context similarity of each expanded term to the corresponding term in the sentence. The proportion of the latter to the former is specified by a tuning parameter.
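A single-machine sketch of the first-order and second-order graph construction might look like the following; the toy sentences, the use of adjacent words as context features, and raw counts as the significance score are my own simplifications of the holing operation and ranking described above.

    from collections import Counter
    from itertools import combinations

    # toy holing operation: each observation becomes a (term, context) pair,
    # where the context is simply a neighboring word
    sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
    pair_counts = Counter()
    for s in sentences:
        for i, term in enumerate(s):
            for j in (i - 1, i + 1):
                if 0 <= j < len(s):
                    pair_counts[(term, ("neighbor", s[j]))] += 1

    # first-order graph: keep the p most significant contexts per term
    # (here significance is just the raw count)
    p = 2
    by_term = {}
    for (term, ctx), count in pair_counts.items():
        by_term.setdefault(term, []).append((count, ctx))
    salient = {t: {ctx for _, ctx in sorted(cs, reverse=True)[:p]} for t, cs in by_term.items()}

    # second-order graph: similarity of two terms = number of salient contexts they share
    similarity = {(a, b): len(salient[a] & salient[b])
                  for a, b in combinations(sorted(salient), 2)
                  if salient[a] & salient[b]}
    print(similarity)     # ('cat', 'dog') share the contexts built from "the" and "sat"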

We will now look at word sense induction, disambiguation and cluster labeling. With the ranking based on contextually similar terms, there is some implicit word sense disambiguation. But this paper addresses it explicitly with word sense induction.

OK so we will cover some more of this in tonight's post.
The authors mention a WSI technique and use information extracted by IS-A patterns to label clusters of terms that belong to the same taxonomy or domain. The aggregated context features of the clusters help attribute the terms in the distributional thesaurus with the word cluster senses, and these are assigned in the context part of the entry.
The clustering algorithm they use is the Chinese Whispers graph clustering algorithm, which finds the number of clusters automatically. The IS-A relationships between terms and their frequencies are found from part-of-speech patterns. This gives us a list of IS-A relationships between terms and their frequencies. We then find clusters of words that share the same word sense. The aggregate for the IS-A relationships of each cluster is found by summing the frequency of each hypernym and multiplying this sum by the number of words in the cluster that elicited that hypernym. This results in labels for the clusters that are taxonomic and provide an abstraction layer over the terms. For example, jaguar can be clustered into the cat sense or the car sense, and the highest scoring hypernyms provide a concise description of these senses. Occurrences of ambiguous words in context can now be disambiguated to these cluster senses.
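That scoring rule can be sketched in a few lines of Python; the cluster and the IS-A counts below are invented purely for illustration.

    from collections import defaultdict

    def label_cluster(cluster, isa_freq):
        """Score each hypernym: (sum of its frequencies over cluster words) *
        (number of cluster words that elicited it)."""
        freq_sum = defaultdict(int)
        elicited_by = defaultdict(set)
        for word in cluster:
            for hypernym, freq in isa_freq.get(word, {}).items():
                freq_sum[hypernym] += freq
                elicited_by[hypernym].add(word)
        scores = {h: freq_sum[h] * len(elicited_by[h]) for h in freq_sum}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # hypothetical IS-A counts extracted from patterns such as "X is a Y"
    isa_freq = {
        "jaguar":  {"cat": 12, "animal": 9, "car": 8},
        "leopard": {"cat": 15, "animal": 11},
        "tiger":   {"cat": 20, "animal": 14},
    }
    print(label_cluster(["jaguar", "leopard", "tiger"], isa_freq))
    # [('cat', 141), ('animal', 102), ('car', 8)]: the top hypernyms label the "cat" sense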
The visualization for the expansion of term-context pairs uses the Open Domain Model, which is trained on newspaper corpora.

We will next talk about the interactive visualization features, but before we do that, let us first talk about the Chinese Whispers clustering algorithm.

Chinese Whispers is a randomized graph clustering algorithm (Biemann). The edges are added in increasing numbers over time. The algorithm partitions the nodes of weighted undirected graphs. The name is derived from a children's game in which they whisper words to each other; the game's goal is to arrive at some funny derivative of the original message by passing it through several noisy channels.
The CW algorithm aims at finding groups of nodes that broadcast the same message to their neighbors.
The algorithm proceeds something like this:
First assign all the vertices to different classes
while there are changes,
     for all the vertices taken in randomized order:
       class(v) = highest ranked class in neighborhood of v;
Then the nodes are processed for a small number of iterations and inherit the strongest class in the local neighborhood. This is the class whose sum of edge weights to the current node is maximal.
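A minimal runnable sketch of this procedure in plain Python follows; the toy graph with two dense groups joined by one weak edge is hypothetical, and ties between classes are broken arbitrarily.

    import random
    from collections import defaultdict

    def chinese_whispers(edges, iterations=10):
        """edges: {node: {neighbor: weight}} for an undirected weighted graph."""
        labels = {node: node for node in edges}          # every vertex starts in its own class
        for _ in range(iterations):
            nodes = list(edges)
            random.shuffle(nodes)                        # process vertices in randomized order
            for v in nodes:
                weights = defaultdict(float)
                for nbr, w in edges[v].items():
                    weights[labels[nbr]] += w            # rank classes by summed edge weight
                if weights:
                    labels[v] = max(weights, key=weights.get)
        return labels

    # two dense groups joined by one weak edge
    edges = {1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1, 4: 0.1},
             4: {3: 0.1, 5: 1, 6: 1}, 5: {4: 1, 6: 1}, 6: {4: 1, 5: 1}}
    print(chinese_whispers(edges))                       # nodes 1-3 and 4-6 end up sharing labels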


Having discussed centrality measures in social graphs, let us look at some more social graph techniques. I'm going to discuss ego networks, which put the individual at the focus. But before we delve into that, I will review the idea of embedding in a network. Embedding is the extent to which individuals find themselves in dense, strong ties. These ties are reciprocated and suggest some measure of constraint on individuals from the way they are connected to others. This gives us an idea of the population and some of its subtleties, but not so much about the positives and negatives facing the individual. If we look at the context of an individual, then we are looking at a local scope, and this is the study of ego networks. Ego means an individual focal node of the graph, so there are as many egos as there are nodes. An ego can be a person, group, or organization. The neighborhood of an ego is the collection of all nodes around the individual to which there is a connection or a path; the connections need not be one step, but they usually are. The boundaries of an ego network are defined in terms of this neighborhood. Ego networks are generally undirected because the connections are symmetric. If the connections differ by direction, then it is possible to define an in-network and an out-network: the ties in an in-network point inwards and those in the out-network point outwards. The strengths and weaknesses of the ties can be expressed as probabilities, and using these we can define thresholds for the ones we want to study. Finding these ties is a technique referred to as data extraction; a second technique is subgraph extraction. The density of an ego network can be represented by a density index computed for each ego in a dataset. We will review more graph techniques shortly.
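As a small illustration of extracting an ego neighborhood and computing its density, here is a minimal Python sketch; the toy graph is hypothetical, the neighborhood is the usual one-step version, and the density counts ties among the neighbors only, excluding the ego's own ties, which is one common convention.

    from itertools import combinations

    def ego_density(graph, ego):
        """graph: {node: set(neighbors)} (undirected). Density of the ego's
        one-step neighborhood: present ties / possible ties among the neighbors."""
        nbrs = graph[ego]
        if len(nbrs) < 2:
            return 0.0
        possible = len(nbrs) * (len(nbrs) - 1) / 2
        present = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        return present / possible

    graph = {"amy": {"bob", "cal", "dee"}, "bob": {"amy", "cal"},
             "cal": {"amy", "bob"}, "dee": {"amy"}}
    print(ego_density(graph, "amy"))   # 1 of 3 possible ties among {bob, cal, dee} -> 0.333...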
Courtesy: Hanneman online text.