Thursday, May 29, 2014

In their paper on discovering interesting messages spread across Twitter using link analysis, Yang, Lee and Rim leverage retweets as implicit relationships between Twitter users. However, they look at more than just the sheer number of retweets.
The retweet count has been used by Hong et al as a measure of popularity and as a basis for classifiers that predict whether and how often new tweets will be retweeted in the future. Alonso et al used the presence of a URL link as a single, highly effective feature for distinguishing interesting tweets, with more than eighty percent accuracy. This paper attempts to go beyond that by modeling Twitter as a graph of user and tweet nodes implicitly connected by retweet links, which arise when one user retweets what another user published. The authors use a variation of the HITS algorithm that exploits the retweet link structure as an indicator of how interesting an individual tweet is. What is particularly interesting is that the paper does not treat all retweet links as equal: some carry more importance than others. They demonstrate their approach on real Twitter data and produce a ranking of tweets by interestingness.
They model the Twitter structure as a directed graph G with nodes N and directional edges E.
The graph G has two subgraphs: one based only on the user nodes and another based only on the tweet nodes. Instead of running HITS on the tweet subgraph, they run it on the user subgraph and let tweets inherit the scores of their publishers.
They first take the weighted sum of the hub scores of the users who retweet a given user, and this sum becomes that user's authority score. Then they update each user's hub score as the weighted sum of the authority scores of the users that user retweets. They repeat these two steps iteratively, and the scores tend to converge. After each iteration, the scores are normalized between 0 and 1 by dividing each of them by the square root of the sum of squares of all authority (respectively hub) values. At the end, each user has an authority score and a hub score. The weighting dampens the influence of users who devote most of their retweet activity to very few other users and increases the weight of users who retweet many others: a user's weight is the ratio of the number of distinct users they retweeted to the total number of retweet outlinks from that user.
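A minimal sketch of this weighted iteration in Python. The exact weighting in the paper may differ; here I assume a user's outgoing weight is the distinct-targets-to-total-retweets ratio described above, and both score vectors are L2-normalized each round:

```python
from math import sqrt

def weighted_hits(retweets, iterations=50):
    """Weighted HITS over a retweet graph.

    retweets: dict mapping user -> {retweeted_user: retweet_count}.
    A user's weight is the ratio of distinct users they retweeted to
    their total retweets, dampening users who concentrate on few others.
    """
    users = set(retweets)
    for targets in retweets.values():
        users |= set(targets)

    def weight(u):
        targets = retweets.get(u, {})
        total = sum(targets.values())
        return len(targets) / total if total else 0.0

    w = {u: weight(u) for u in users}
    hub = {u: 1.0 for u in users}
    auth = {u: 1.0 for u in users}

    for _ in range(iterations):
        # authority: weighted sum of hub scores of users who retweet you
        new_auth = {v: 0.0 for v in users}
        for u, targets in retweets.items():
            for v in targets:
                new_auth[v] += w[u] * hub[u]
        norm = sqrt(sum(s * s for s in new_auth.values())) or 1.0
        auth = {v: s / norm for v, s in new_auth.items()}

        # hub: weighted sum of authority scores of users you retweet
        new_hub = {u: 0.0 for u in users}
        for u, targets in retweets.items():
            for v in targets:
                new_hub[u] += w[u] * auth[v]
        norm = sqrt(sum(s * s for s in new_hub.values())) or 1.0
        hub = {u: s / norm for u, s in new_hub.items()}

    return auth, hub
```

A user who is retweeted by many distinct, well-spread retweeters accumulates the highest authority, which is then inherited by their tweets.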

Wednesday, May 28, 2014

Today I will write about a few more pieces of graph theory trivia. A graph always has an even number of vertices of odd degree. A large number of operations can be defined on collections of graphs. For example, graph sums, differences, powers and even graph eigenvalues can be calculated on these collections. We wanted to use these collections in text mining.
We referred to finding the Laplacian and the eigenvalues. The eigenvalues of a graph are the eigenvalues of its adjacency matrix. The set of eigenvalues of a graph is called the graph spectrum. Eigenvalues are a special set of scalars, associated with a matrix equation, that decompose a square matrix into an equivalent set of factors. Each eigenvalue is associated with a so-called eigenvector.
If we have a vector X, then a square matrix A satisfies
AX = λX
or more succinctly
(A - λI)X = 0
and the scalar λ is called an eigenvalue of A for the corresponding right eigenvector X.
Given a square matrix A with a full set of eigenvectors, it can be represented as A = PD(P^-1)
where P is the matrix whose columns are the eigenvectors of A,
D is the diagonal matrix constructed from the eigenvalues,
and (P^-1) is the inverse of P.
This form of decomposition is called eigendecomposition.

What this helps us with is the canonicalization of a system to its simplest form where we reduce the number of parameters from n×n to n for the diagonal matrix.
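The decomposition above is easy to verify numerically with numpy; here is a quick check on a small symmetric matrix (chosen so it is guaranteed to be diagonalizable):

```python
import numpy as np

# a small symmetric matrix, so a full set of eigenvectors exists
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigenvalues, P = np.linalg.eig(A)   # columns of P are the eigenvectors
D = np.diag(eigenvalues)            # diagonal matrix of the eigenvalues

# reconstruct A = P D P^-1
reconstructed = P @ D @ np.linalg.inv(P)
```

Each eigenpair also satisfies AX = λX directly, which is the matrix equation we started from.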

Tuesday, May 27, 2014

This post covers some more interesting trivia about graph theory.
A graph diameter is the longest shortest path between any two graph vertices.
If we take a connected graph or network with a high graph diameter  and we add a very small number of edges randomly, the diameter tends to drop drastically. This is known as the small world phenomenon. It is also referred to as the six degrees of separation since any person in a social network of the world turns out to be linked to any other person by roughly six connections.
An acyclic digraph is a directed graph containing no directed cycles, also known as a directed acyclic graph or DAG. The numbers of DAGs on n = 1, 2, 3, 4, ... vertices, counted up to isomorphism, are 1, 2, 6, 31, ...
In the use of DAGs for communication networks, we frequently encounter problems of information dissemination described for a group of individuals. Gossiping and broadcasting are two such common problems. In gossiping, every person in the network knows a unique item of information and needs to communicate it to everyone else. In broadcasting, one individual has an item of information which needs to be communicated to everyone else.
To illustrate gossiping, consider that there are n people, each of whom knows a specific scandal that is not known to the others. They communicate by telephone, and whenever two people place a call, they pass on as many scandals as they know. The question is how many calls are needed before everyone knows every scandal. We could try this with n = 4, where the participants are labeled A, B, C and D. Then the scandals can spread in this manner: {A,B}, {C,D}, {A,C}, {B,D}. The n = 4 solution can then be generalized by adding another player E who calls A at the beginning and end of the previous solution, resulting in {A,E}, {A,B}, ..., {B,D}, {A,E}.
We thus see that gossiping needs only two additional calls for each added participant. Gossiping is also called total exchange and all-to-all communication. It has applications in communications and distributed systems. For example, a large class of parallel computing problems are addressed with gossiping, including Fourier transforms and sorting. If f(n) denotes the minimum number of calls necessary to complete gossiping among n people, where any pair of people can call each other, we see that for
n = 1, f(n) = 0
n = 2, f(n) = 1
n = 3, f(n) = 3
n = 4, f(n) = 4
n = 5, f(n) = 6
and in general f(n) = 2n - 4 for n >= 4
For one-way communication, where each call passes information in only one direction and the call graph is therefore a DAG, the minimum number of calls is f(n) = 2n - 2 for n >= 4.
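The recursive schedule described above (bracket the smaller solution with two calls from the newcomer) can be sketched and checked in a few lines of Python, with participants labeled 0..n-1 instead of A, B, C, ...:

```python
def gossip_schedule(n):
    """A call schedule achieving the 2n - 4 bound for n >= 4."""
    if n < 4:
        raise ValueError("construction applies for n >= 4")
    if n == 4:
        return [(0, 1), (2, 3), (0, 2), (1, 3)]
    # newcomer n-1 calls person 0 before and after the smaller schedule
    inner = gossip_schedule(n - 1)
    return [(0, n - 1)] + inner + [(0, n - 1)]

def everyone_knows_all(n, schedule):
    """Simulate the calls: each call merges the two callers' knowledge."""
    knows = [{i} for i in range(n)]
    for a, b in schedule:
        merged = knows[a] | knows[b]
        knows[a] = knows[b] = merged
    return all(len(k) == n for k in knows)
```

Running the simulation for several n confirms that the schedule has exactly 2n - 4 calls and leaves everyone knowing every scandal.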

Monday, May 26, 2014

In this post, I want to talk about a coloring problem of graph theory. Let us take the problem of coloring a grid. No grid cell may touch another cell of the same color. Each color, when used, has a cost. The goal is to color the map as cheaply as possible. If we take the cells to be represented by vertices, no adjacent vertices can have the same color. Such a coloring is known as a minimum vertex coloring. A vertex coloring that uses k or fewer colors is called a k-coloring, and a graph that admits one is said to be k-colorable. The chromatic number is the smallest number of colors needed. The four color theorem states that any planar graph has a chromatic number of at most 4. This was first conjectured in 1852, but it is difficult to prove manually. Microsoft Research folks announced in 2004 that they had verified a proof by Robertson et al by formulating the problem in equational logic and confirming the validity of each step.
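The grid coloring problem described above can be sketched greedily: visit cells in row-major order and give each the cheapest color not used by an already-colored neighbor. Greedy coloring is not guaranteed to be cheapest on arbitrary graphs, but on a grid this order recovers the two-color checkerboard, which is optimal when costs are sorted cheapest first:

```python
def color_grid(rows, cols, costs):
    """Greedy grid coloring.

    costs: per-use cost of each color, cheapest first. Each cell gets
    the cheapest color not used by an orthogonal neighbor that has
    already been colored. Returns the coloring and its total cost.
    """
    color = {}
    for i in range(rows):
        for j in range(cols):
            used = {color.get((i - 1, j)), color.get((i, j - 1))}
            color[(i, j)] = next(c for c in range(len(costs)) if c not in used)
    total = sum(costs[c] for c in color.values())
    return color, total
```

On a 3x3 grid with costs [1, 2, 3] the greedy pass uses only the two cheapest colors in a checkerboard pattern.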
Some more interesting facts about a simple planar graph:
The number of edges is at most 3V - 6 (for V >= 3).
There is always at least one vertex of degree at most 5.
The set E of the edges of a graph without any loops represents the adjacency relation on V. By definition such a relation is irreflexive (no vertex is adjacent to itself) and symmetric.
Edges can be traversed in paths, cycles, trails and circuits.
A walk is an alternating sequence of vertices and connecting edges.
A path is a walk that does not include any vertex twice, except that the first might also be the last.
A cycle is a path that begins and ends on the same vertex.
A trail is a walk that does not pass over the same edge twice though it can cross the same vertex again.
A circuit is a trail that begins and ends on the same vertex.
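These definitions translate directly into a small classifier. A sketch, assuming an undirected simple graph given as a set of edges and a walk given as a vertex sequence:

```python
def classify_walk(edges, walk):
    """Classify a vertex sequence per the definitions above.

    edges: set of frozenset vertex pairs of an undirected simple graph.
    Returns "cycle", "path", "circuit", "trail", "walk", or "not a walk".
    """
    steps = list(zip(walk, walk[1:]))
    if not all(frozenset(s) in edges for s in steps):
        return "not a walk"
    closed = walk[0] == walk[-1]
    # trail: no edge repeated
    trail = len({frozenset(s) for s in steps}) == len(steps)
    # path: no vertex repeated (first == last allowed when closed)
    inner = walk[:-1] if closed else walk
    path = len(set(inner)) == len(inner)
    if path and closed:
        return "cycle"
    if path:
        return "path"
    if trail and closed:
        return "circuit"
    if trail:
        return "trail"
    return "walk"
```

On a square with one diagonal, [0, 1, 2, 3] is a path, [0, 1, 2, 0] a cycle, and a closed-free sequence that revisits a vertex but no edge comes out as a trail.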
Along with spanning trees, other graph traversal methods include Eulerian and Hamiltonian cycles, network flows from source to sink, and graph colorings where edges or vertices are colored. The network flow problem can be solved in O(n^3) time.
An Eulerian cycle or circuit is a closed trail that starts and ends at the same vertex and uses each graph edge exactly once.
Euler showed that a connected graph has an Eulerian cycle iff it has no graph vertices of odd degree.
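Euler's criterion is easy to check in code: verify connectivity over the vertices that carry edges, then test every degree for parity. A sketch for a simple undirected graph given as an edge list:

```python
from collections import defaultdict

def has_eulerian_cycle(edges):
    """Euler's theorem: a connected graph has an Eulerian cycle
    iff every vertex has even degree."""
    adj = defaultdict(set)
    degree = defaultdict(int)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
        degree[u] += 1
        degree[v] += 1
    # connectivity check over vertices that have at least one edge
    start = next(iter(adj))
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(adj[u])
    if seen != set(adj):
        return False
    return all(d % 2 == 0 for d in degree.values())
```

A 4-cycle passes; adding one diagonal creates two odd-degree vertices and the test fails, as the theorem predicts.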
A Hamiltonian cycle or circuit is a graph cycle that visits each node exactly once. A connected graph with two vertices does not have a Hamiltonian cycle. The complete graph on three vertices has one Hamiltonian cycle, and the complete graph on four vertices has three.
The problem of finding a Hamiltonian cycle is NP-complete. For a general graph, an exhaustive search is needed to find one. Rubin's method describes an efficient search procedure that reduces backtracking and guesswork in finding some or all Hamiltonian paths. All Platonic solids, also called regular solids, which are composed of congruent convex regular polygonal faces, have Hamiltonian paths.
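The exhaustive search mentioned above is naturally written as backtracking, exponential in the worst case as expected for an NP-complete problem. A sketch over an adjacency-set representation:

```python
def hamiltonian_cycle(adj):
    """Backtracking search for a Hamiltonian cycle.

    adj: dict vertex -> set of neighbors (undirected graph).
    Returns the cycle as a vertex list ending at the start, or None.
    """
    vertices = list(adj)
    n = len(vertices)
    start = vertices[0]
    path = [start]
    visited = {start}

    def extend():
        if len(path) == n:
            return start in adj[path[-1]]  # can we close the cycle?
        for v in adj[path[-1]]:
            if v not in visited:
                visited.add(v)
                path.append(v)
                if extend():
                    return True
                path.pop()          # backtrack
                visited.remove(v)
        return False

    return path + [start] if extend() else None
```

The complete graph K4 yields a cycle immediately, while a simple path graph has none.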
With the above description of graphs, the coloring of vertices and the different ways to traverse them, we wanted to find the keywords in a text and to list them such that we can summarize the text.
However, the traversal of vertices seems less useful than finding the keywords themselves, whether with coloring or by some other way of costing. When we used an ontology of words to find similarity measures, it considered only pairwise words along their semantic relationship, but it did not consider the co-occurrences available in the given text. If we find that words are similar because they appear in a thesaurus or can be categorized in a hierarchical classification, that is less useful than when they appear with other words as the author intended. If the text is not short, then these co-occurrences matter more and could be considered more reliable than the graph method. So at the very least, we should not treat the graph-based traversal of words as a substitute for the higher-dimension clustering methods. That said, graph-based traversals need not be with ontologies only.

Saturday, May 24, 2014

In this post, we will cover some graph theory problems from the book by Douglas B. West and another by Reinhard Diestel.
A graph consists of a vertex set V(G), an Edge set E(G) and a relation that associates with each edge two vertices that are its endpoints.
If we have a vertex set {1, 2, 3, 4, 5}, then there can be many edges associated with the vertices, some redundant and some forming loops. A graph is considered complete when every pair of vertices has an edge. A simple graph is one that contains no loops and has no multiple edges. A simple graph may be either connected or disconnected. A connected graph is one in which there is a path from any one point to another point. A graph that is not connected is a disconnected graph.
Graphs have an interesting problem set where vertices can be colored. Together these sets of problems that involve associating a color to the vertex, are called coloring problems.
If a graph can be properly colored with only two colors, its chromatic number is 2. Coloring is important because it lets us discern some vertices from others, and we can then arrange or treat those vertices differently. For example, a bipartite graph, also called a bigraph, is one whose vertices can be decomposed into two disjoint sets such that no two vertices within the same set are adjacent. By adjacent, we mean joined by an edge.
In any case, returning to our discussion of why we wanted to visit graph theory for keyword extraction or text summarization: we started with the assumption that every keyword contributes a meaning to the text while holding some relation with the others. This relation can take the form of competition, weighted competition or cooperation, or the keywords may be isolated and independent from each other. In other words, there are linkages between the keywords that we want to exploit with the help of graphs. Moreover, the graph in the given text can be partitioned based on the eigenvalues and eigenvectors of its adjacency matrix. To compute the eigenvalues we start with the weighted Laplacian matrix. A Laplacian matrix is one where each diagonal element is the degree of the corresponding node. For example, if node 1 has four edges, then the degree of node 1 is four. When node i is adjacent to node j, the off-diagonal value is -1; otherwise the values are zero. We could also normalize the Laplacian matrix. This matrix can serve as an input to an algorithm that partitions the graph. These ideas come from similar studies in biodiversity.
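A sketch of this with numpy: build the Laplacian L = D - A as just described, then split the nodes by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue), which is one common spectral partitioning recipe:

```python
import numpy as np

def laplacian(adj):
    """Graph Laplacian L = D - A: degrees on the diagonal,
    -1 wherever node i is adjacent to node j."""
    A = np.array(adj, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def spectral_partition(adj):
    """Split nodes by the sign of the Fiedler vector."""
    L = laplacian(adj)
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]             # second-smallest eigenvalue's vector
    pos = [i for i in range(len(adj)) if fiedler[i] >= 0]
    neg = [i for i in range(len(adj)) if fiedler[i] < 0]
    return pos, neg
```

On two triangles joined by a single edge, the sign split recovers the two triangles, which is exactly the sparse cut we would want.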
We will also attempt to explore directed graphs in the text.
For the undirected graphs,  what I wanted to do was to establish relations between terms in a text using similarity or distance measures such as with a thesaurus or WordNet.  These become the edges in the graph and the higher the degree of the vertices the higher their candidature as a keyword.
Once we establish the set of keywords we want to work with, we could then work with that graph or partition it further.
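As a toy sketch of the degree-based candidature above: build a word graph and rank words by their degree. A thesaurus or WordNet similarity measure would supply the edges in the approach described; here a simple co-occurrence window stands in for that relation so the example stays self-contained:

```python
from collections import defaultdict

def keyword_candidates(tokens, window=3, top=5):
    """Rank words by degree in a word graph.

    Edges come from a co-occurrence window, standing in for a
    thesaurus/WordNet similarity relation; the higher the degree of
    a word's vertex, the stronger its candidature as a keyword.
    """
    neighbours = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                neighbours[tokens[i]].add(tokens[j])
                neighbours[tokens[j]].add(tokens[i])
    return sorted(neighbours, key=lambda w: len(neighbours[w]), reverse=True)[:top]
```

Repeated words accumulate neighbors across their occurrences, so they rise to the top, which matches the intuition that well-connected terms are keyword candidates.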





I'm taking a break for this long weekend.

Friday, May 23, 2014

We explore pairwise relationships between words in text indexing and summarization using graph theory. Documents are often treated as a bag of words and represented by term vectors with limited dimensions. They are then clustered to extract the co-occurrences or the semantic relationships between terms in a document, thus helping with keyword extraction, summarization or classification. Such techniques have relied on latent and probabilistic latent semantic analysis. However, words in a text not only represent the content of a document but also bear some ranking and relationship with each other that could be represented with graphs. Graphs enable the discovery of connected salient words in a bag. While similar in nature to clustering, and occasionally used together with it, graphs can model both syntax and semantics, such as typed features and hierarchical ontologies. However, the focus of this article is the use of graphs to connect words within a document and to retrieve the salient words based on edge weights and minimum spanning trees. The assignment of weights enables keyword extraction, and the spanning of those keywords enables summarization.
First we lay some premises on which we will model our approach. We state that text can be either verbose or terse but carry the same content in terms of its import. The same text can have more than one nuance, thereby requiring disambiguation either with context or with categorization. Text need not have meaning and can be a clutter of jargon with no equivalent synonyms. That is why, no matter how we represent the semantics of the text, the model should be anchored to the terms in the text. In other words, we take the keywords from within the documents even if we represent their semantic model with alternate words. We focus exclusively on a given language and leave the variations that arise in syntax or expressions out of the scope of the samples we investigate. We take simple samples at first, say blog posts regarding a technical topic. We are keen about the format of the text, that is, we look at some syntax if it helps our model, but we don't rely on the distribution or clustering. We continue to treat the document as a collection of terms without emphasis on where the text is tiled, segmented or paragraphed. We ignore the layout of the text and focus instead on the weights of the terms, the densities and the divergence of topics. Initially we begin with a single graph for our document, while recognizing the possibility of several sub-graphs in a document. We will attempt to find a model and we will have metrics for our graph. For example, if we have two sets of keywords constituting two different graphs for the same document, we will keep the one that has the smaller minimum spanning tree. We know there are usages for our graph, but we first establish the criteria to come up with one.
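The minimum spanning tree comparison above can be sketched with Kruskal's algorithm. This is a generic MST weight routine, not the eventual keyword-graph model; the vertices would be keywords and the weights whatever distance measure we settle on:

```python
def mst_weight(n, edges):
    """Total weight of a minimum spanning tree via Kruskal's algorithm.

    n: number of vertices, labeled 0..n-1.
    edges: list of (weight, u, v) tuples.
    Returns the MST weight, or None if the graph is disconnected.
    """
    parent = list(range(n))

    def find(x):
        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = picked = 0
    for w, u, v in sorted(edges):        # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                     # joins two components
            parent[ru] = rv
            total += w
            picked += 1
    return total if picked == n - 1 else None
```

Given two candidate keyword graphs over the same document, we would compute both MST weights and keep the graph whose keywords span more cheaply.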