Wednesday, June 25, 2014

A graph of the Knoke information data shows strong symmetric ties. It also answers questions such as: how separate are the sub-graphs? How large are the connected sub-graphs? Are there particular actors that appear to play network roles?
We now look at ways to find sub-graphs. One approach works in a bottom-up manner. A clique extends the dyad by adding members that are tied to all the members in a group. The strict definition can be relaxed to include nodes that are not quite so tied, as we will see shortly with n-cliques, n-clans and k-plexes. The notion, however, is to build outwards from small groups to construct the network. The whole network can then be put together by joining cliques and clique-like groupings.
Formally, a clique is the maximum number of actors who have all possible ties among themselves; it can be considered a maximal complete sub-graph. Cliques can be viewed in conjunction with the Knoke information matrix mentioned earlier. We might be interested in the extent to which these sub-structures overlap, and in which actors are more central to or more isolated from cliques.
We can examine these by evaluating the actor-by-actor clique co-memberships. Then we can do hierarchical clustering of the overlap matrix, which gives us an idea of the closeness of the cliques. Visually, we can see the co-memberships and the hierarchical clustering as matrices formed from actors and cliques and from levels and cliques respectively.
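To make the co-membership idea concrete, here is a minimal sketch using the networkx and numpy libraries; the graph is a small made-up example, not the actual Knoke data.

import networkx as nx
import numpy as np

# A small illustrative graph of strong ties (not the Knoke data)
G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)])

cliques = [c for c in nx.find_cliques(G) if len(c) >= 3]  # maximal cliques
nodes = sorted(G.nodes())
index = {n: i for i, n in enumerate(nodes)}

# co[i][j] counts the cliques containing both actor i and actor j
co = np.zeros((len(nodes), len(nodes)), dtype=int)
for clique in cliques:
    for u in clique:
        for v in clique:
            co[index[u], index[v]] += 1

print(cliques)
print(co)

The co-membership matrix can then be fed to any hierarchical clustering routine to see how closely the cliques overlap.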
Two major approaches to relaxing the definition of cliques are the N-clique and N-clan approaches.
In the N-clique approach, we say that an actor is a member of a clique if it is connected to every other member of the clique at a distance no greater than N; in this case we choose N = 2. A path distance of two corresponds to the actor being a friend of a friend.
The cliques that we saw before have been made more inclusive by this relaxed definition of group membership. 
The N-clique approach tends to find long, stringy groupings rather than the tight, discrete ones given by the original definition. It is also possible for actors to be connected through others who are themselves not part of the clique.
To overcome this problem, an additional restriction is imposed on the total span or path distance between any two members. This is the N-clan approach, where all ties are forced to occur by way of other members of the n-clique.
If we are not comfortable with the idea of counting a friend of a clique member as a member of the clique, we can use the K-plex approach. In this approach we say that a node is a member of a clique of size n if it has direct ties to n-k members of that clique. This tends to find a relatively large number of small groupings, and it shifts the focus to overlaps and centralization rather than solidarity and reach.
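Here is a minimal sketch of the k-plex test, under the common convention that every member must have ties to at least n - k of the n members; the graph and numbers are illustrative.

import networkx as nx

def is_kplex(G, members, k):
    # every member may be short of at most k ties within the group
    n = len(members)
    sub = G.subgraph(members)
    return all(sub.degree(v) >= n - k for v in members)

G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)])
print(is_kplex(G, {1, 2, 3, 4}, k=2))  # True: each member reaches at least 2 others
print(is_kplex(G, {1, 2, 3, 4}, k=1))  # False: the tie (1, 4) is missing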
Courtesy: Hanneman notes

Monday, June 23, 2014

In this post, we will cover an interesting topic: cliques and sub-groups.
In graphs and networks, one common interest is the presence of "sub-structures" within a network. Neighborhoods and groups of nodes fall into this category. When small, compact sub-networks are joined in a bottom-up manner to form large networks, the building blocks are known as cliques. In cliques there is generally more interaction among the members than with outsiders.
A clique can be considered a closed elite. We can also look for this kind of sub-structure from the top down. The idea that some regions of a graph may be less connected than the whole may lead to insights into cleavage and division. Weaker parts in the social fabric also create opportunities for brokerage and less constrained action. Most computer algorithms for locating sub-structures operate on binary symmetric data. We can use the Knoke information exchange data to illustrate these sub-networks with strong ties; it can be viewed as binary connectedness values on the adjacency matrix of a directed graph.
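One common way to obtain binary symmetric data from a directed relation is to keep only the reciprocated ties. A minimal sketch with a made-up matrix, not the actual Knoke data:

import numpy as np

# A made-up directed binary adjacency matrix (not the Knoke data)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

strong = np.minimum(A, A.T)  # a strong tie exists only if it runs both ways
print(strong)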

I'm taking a short break today as I'm spending my time on another task from work.
The matrix operations used to describe social relations are as follows:
The Transform -> Block operation allows us to select a matrix to be blocked, a row and/or column partition, and a method for calculating the entries in the resulting blocks. We first supply the row and column partitions; these are just data sets which group the rows and columns into partitioned sets. The operation requires a method for summarizing the information within each block, and it outputs two new matrices. The PreImage data set contains the original scores, but permuted; the reduced image data set contains a new block matrix holding the block densities.
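A minimal sketch of computing such a reduced image of block densities from a row/column partition; the matrix and partition labels are illustrative.

import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]])

part = [0, 0, 0, 1, 1]          # actors 0-2 in group 0, actors 3-4 in group 1
groups = sorted(set(part))
image = np.zeros((len(groups), len(groups)))

for gi in groups:
    for gj in groups:
        rows = [i for i, p in enumerate(part) if p == gi]
        cols = [j for j, p in enumerate(part) if p == gj]
        image[gi, gj] = A[np.ix_(rows, cols)].mean()  # block density

print(image)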
The Transform -> Collapse method allows us to combine rows and/or columns by specifying which elements are to be combined and how; the combination can be a maximum, minimum or sum. The result is a new matrix with the specified operation performed.
The Data -> Permute operation allows us to re-arrange the rows and/or columns and/or matrices. This operation requires us to list the new order we need.
The Data -> Sort operation re-arranges the rows, columns or both of the matrix according to a criterion we select.
The Data -> Transpose operation re-arranges the data in a way that is very common in matrix algebra: it switches the columns with the rows.
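A minimal sketch of the sort and transpose operations expressed with numpy indexing; the matrix is illustrative.

import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

order = np.argsort(-A.sum(axis=1))   # sort criterion: descending row sums
print(A[order][:, order])            # rows and columns re-arranged alike
print(A.T)                           # transpose: columns switched with rows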

Sunday, June 22, 2014

In tonight's blog post, we revert to the discussion on graphs and matrix operations. We talked about matrix multiplication, where a row (a1, a2, ..., an) times a column (x1, x2, ..., xn) results in a1x1 + a2x2 + ... + anxn. Note that the matrix product can also be represented as A(b1, b2, ..., bn) = (Ab1, Ab2, ..., Abn); that is, the matrix A acts separately on each column of B.
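Both facts are easy to check numerically; a small sketch with numpy:

import numpy as np

a = np.array([1, 2, 3])              # a row (a1, a2, a3)
x = np.array([4, 5, 6])              # a column (x1, x2, x3)
print(a @ x)                         # a1x1 + a2x2 + a3x3 = 32

A = np.array([[1, 0],
              [2, 1]])
B = np.array([[3, 4],
              [5, 6]])
# column j of the product A @ B equals A applied to column j of B
print(np.allclose((A @ B)[:, 1], A @ B[:, 1]))   # True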
Row reduction is another operation.
This can be done in one of three ways:
1) multiplying a row by a non-zero scalar,
2) adding a multiple of one row to another,
3) swapping the positions of two rows.
Each of these steps is reversible, so if you go from one state to another you can undo the change. Matrices related by such steps are called row-equivalent.
A matrix is said to be in row-echelon form if any rows made completely of zeroes lie at the bottom of the matrix and the first non-zero entries form a staircase pattern: the first non-zero entry of the (k+1)-th row is strictly to the right of the first non-zero entry of the k-th row.
If the matrix is in row-echelon form, then the first non-zero entry of each row is called a pivot, and the columns in which pivots appear are called pivot columns. It is always possible to convert a matrix to row-echelon form; the standard algorithm is called Gaussian elimination or row reduction.
 The rank of a matrix is the number of pivots in its reduced row-echelon form.
The solution to Ax = 0 gives the pivot variables in terms of the free variables.
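A minimal sketch of Gaussian elimination to row-echelon form, with the rank read off as the number of pivots; the function name and example matrix are illustrative.

import numpy as np

def row_echelon(A, tol=1e-12):
    A = A.astype(float)
    rows, cols = A.shape
    pivot_cols = []
    r = 0
    for c in range(cols):
        if r >= rows:
            break
        # swap up the row with the largest entry in this column (partial pivoting)
        p = r + np.argmax(np.abs(A[r:, c]))
        if abs(A[p, c]) < tol:
            continue                  # no pivot in this column
        A[[r, p]] = A[[p, r]]
        # add multiples of the pivot row to zero out the entries below it
        A[r + 1:] -= np.outer(A[r + 1:, c] / A[r, c], A[r])
        pivot_cols.append(c)
        r += 1
    return A, pivot_cols

A = np.array([[1, 2, 3], [2, 4, 6], [1, 0, 1]])
R, pivots = row_echelon(A)
print(R)
print("rank =", len(pivots))  # 2: the second row is twice the first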

Saturday, June 21, 2014


In today's post we take a short break to discuss spatial and temporal control in storage management. First we discuss spatial control. In terms of access to and from disk, sequential access is ten to a hundred times faster than random access, or more. Disk density has been doubling, and bandwidth increases as the square root of density. As a result, storage managers often organize large data so that it can be accessed sequentially. In fact, database management systems have traditionally exercised full control over how the database is partitioned on disk.
The best way for a storage manager such as a DBMS to do that is to lay the data out on the disk itself and avoid the file system. This works especially well when raw disk device addresses correspond to physical proximity. But avoiding the file system has the following drawbacks. First, it requires attention from a DBA and dedicates resources such as an entire disk partition. Second, raw disk access is often OS-specific and introduces portability concerns. Third, logical file managers such as RAID and SAN have become popular; because these interceptors virtualize raw disk addresses, alternatives have to be considered. For example, we can create a very large file on disk and manage offsets within it. This is facilitated by the file system and also by virtualized storage managers, and it performs well: the degradation from using a single large file on a commercially large system was found to be about 6%.
We now look at temporal buffering, which is about when the data gets written rather than where it gets written. If the DBMS wants to control when to write, that can be hard with the OS implementing its own buffering mechanisms. One challenge is the difficulty of guaranteeing the ACID transaction promise: the transaction manager cannot guarantee atomic recovery from software and hardware failures without explicitly controlling the timing and ordering of disk writes. For example, the write to the log device must precede the corresponding write to the database device, as in the write-ahead logging protocol. The second set of concerns with OS buffering is about performance rather than correctness. The file system wants to read ahead speculatively and write behind with delayed batch writes, which are poorly suited to a storage manager. A DBMS, for instance, can predict I/O decisions based on future read requests, such as when reading ahead to scan the leaves of a B+ tree. Similarly, when writing, it can control when to flush the log tail by making decisions about locks or I/O throughput. Finally, there may be double buffering and the CPU overhead of memory copies; copies degrade performance by contributing to latency, consuming CPU cycles and potentially flooding the CPU cache.
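The write-ahead ordering constraint can be illustrated with a small sketch; the function and file names are hypothetical, and the database file is assumed to already exist.

import os

def wal_write(log_path, db_path, log_record, page_offset, page_bytes):
    # the log record must reach stable storage before the data page
    with open(log_path, "ab") as log:
        log.write(log_record)
        log.flush()
        os.fsync(log.fileno())      # force the log write to disk first
    with open(db_path, "r+b") as db:
        db.seek(page_offset)
        db.write(page_bytes)        # only now is the database page written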
Courtesy: Paper by Hamilton, Krishnamoorthy et al.

Friday, June 20, 2014

In this post I'll review matrix operations on graphs. When we represent graphs as matrices, we can use mathematics and software tools to find patterns, which is easier than visually analyzing a complicated graph. The matrix representations are usually square, but they can also be three-dimensional, consisting of rows, columns and levels or slices. Matrices can also be symmetric or unsymmetric. In a symmetric matrix the value in row i, column j is the same as in row j, column i. As an example of an unsymmetric matrix, consider social relations where Jack may have a liking for Jill but Jill may not have a liking for Jack.
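The Jack and Jill example as a quick numpy check:

import numpy as np

# Rows and columns are (Jack, Jill); Jack likes Jill, but not vice versa
likes = np.array([[0, 1],
                  [0, 0]])
print(np.array_equal(likes, likes.T))  # False: the relation is unsymmetric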
Operations permitted on a matrix include the following:
Permutation: the rows and columns are rearranged, such as in a symmetric matrix, to find patterns. With these reorderings, the pairwise relationships are not affected (see the sketch after this list).
Partitioning: the matrix is split into blocks or images.
Embedding: an operation that helps find groups of nodes that participate in different patterns.
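A minimal sketch showing that permuting rows and columns together only relabels positions, leaving each pairwise relationship intact; the matrix and ordering are illustrative.

import numpy as np

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
order = [2, 0, 1]
P = A[np.ix_(order, order)]

# the tie from actor order[i] to actor order[j] is unchanged by the reordering
print(all(P[i, j] == A[order[i], order[j]]
          for i in range(3) for j in range(3)))  # True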

Thursday, June 19, 2014

In the previous post, we described the Chinese Whispers clustering algorithm. We now study the algorithm with two different approaches. The first approach is what we discussed earlier, using small-world edge weights. The outcome is similar to that of the min-cut algorithm, where dense regions are considered clusters, but unlike min-cut there is no hierarchical clustering; instead it gives a flat partition. It does not require any threshold as an input parameter and is more efficient. The second approach is based on matrix operations. The algorithm can be considered a special case of Markov Chain Clustering (MCL). MCL is the parallel simulation of all possible random walks up to a finite length on a graph G. Walks will usually terminate in the same cluster that they started from; this is the observation that motivates MCL. MCL simulates flow in the graph by repeatedly updating transition probabilities between all nodes, eventually converging to a transition matrix after k steps. The update is done in two steps, which are then iterated. The first is the expansion step, which is the matrix multiplication of the graph matrix with the current transition matrix. The second is the inflation step, where the contrast between small and large transition probabilities is increased and the probabilities are normalized to unity, thus providing a way to segment the matrix. The inflation step operates column-wise.
The k matrix multiplications of the expansion step have a significant cost.
In the early stages of the iterations, the matrices operated on are dense. But as the iterations proceed and with the help of a strong inflation operator, matrices in the later steps tend to be sparse. As an optimization the inflation step cost can be lowered by using only a few entries per column and can even be reduced to a single largest value.
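A minimal sketch of one MCL iteration, the two steps applied to a column-stochastic transition matrix; the inflation exponent r is a free parameter (r = 2 is a common choice) and the adjacency matrix is illustrative.

import numpy as np

def mcl_iteration(T, r=2.0):
    T = T @ T                        # expansion: simulate longer random walks
    T = T ** r                       # inflation: boost large probabilities
    return T / T.sum(axis=0)         # renormalize each column to unity

A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])         # adjacency with self-loops
T = A / A.sum(axis=0)                # column-stochastic transition matrix
for _ in range(10):
    T = mcl_iteration(T)
print(T.round(2))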
The Chinese Whispers algorithm can now be described in matrix operations as follows (a code sketch follows the steps):
Let us start with the identity matrix of dimension n, so that each node is in its own class.
For each of the t iterations:
   we run a maxrow operator that sets all entries of a row to zero except the largest entry, which is set to 1;
    then we multiply the resulting matrix with the adjacency matrix of the graph and use the product as the input to the next iteration.
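A minimal sketch of this matrix form. Two assumptions to flag: I multiply with the adjacency matrix on the left so that each node tallies its neighbors' classes, and ties are broken deterministically by argmax, whereas the original algorithm breaks them randomly.

import numpy as np

def maxrow(D):
    # zero every row except its largest entry, which is set to 1
    out = np.zeros_like(D)
    out[np.arange(D.shape[0]), D.argmax(axis=1)] = 1
    return out

def chinese_whispers(A, iterations=20):
    D = np.eye(A.shape[0])           # start: every node in its own class
    for _ in range(iterations):
        D = A @ maxrow(D)            # each node tallies its neighbors' classes
    return D.argmax(axis=1)          # cluster label per node

# two triangles joined by a single edge; expect two clusters
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(chinese_whispers(A))           # labels split the two triangles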
Since the time complexity depends on the number of non-zero entries in the result of the maxrow operation, in the worst case, a fully connected graph, it equals the time complexity of MCL.