Saturday, May 31, 2014

We read a paper on Social text normalization using Contextual Graph Random Walks by Hassan and Menezes. This paper describes text normalization, which is also useful for keyword extraction and summarization. Text normalization does away with noise: typos, abbreviations, phonetic substitutions and slang. This means the cleaned data will be uniform and more meaningful. Their approach uses random walks on a contextual similarity bipartite graph constructed from n-gram sequences over a large unlabeled text corpus.
Natural language processing applications often have to deal with noisy text. Dirty text is problematic in several ways: it can break the machine learning algorithms, skew the results, and may not be representative of the content. Text results are often measured by parameters such as precision and recall. We discussed these in earlier posts; they are generally meant to determine the relevancy of the results. We saw that these measures are directly affected by the count of successful matches. If matches are missed because dirty text gets discounted or skipped, or, even worse, included where it should not be, then the results are no longer accurate.
Furthermore, we saw that we had some elementary steps to mitigate this. For example, we considered stemming and lemmatization to discover root words and conform different representations to the same word. Such techniques, however, can be overly general. If the modifications are also supposed to differentiate between words, then such a technique alone does not help.
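As a quick illustration of the difference, here is a minimal sketch using NLTK (assuming it is installed and its WordNet data has been downloaded); the sample words are arbitrary:

```python
# Minimal sketch: stemming vs. lemmatization with NLTK
# (assumes nltk is installed and the 'wordnet' corpus has been downloaded)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "ran", "studies", "better"]
for w in words:
    # The stemmer chops suffixes heuristically; the lemmatizer maps to a dictionary form.
    print(w, "->", stemmer.stem(w), "/", lemmatizer.lemmatize(w, pos="v"))
```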
There are other techniques, such as a lookup table relevant to the domain of the subject. Such lookup tables simplify the language a great deal in that we now have fewer and more consistent words.
Lookup tables are an interesting idea: we can associate one or more variations of a word with the same normalized term. This lets us add new variations, such as slang, into the table for lookup. Every time we see a new term, we can include it in the table for later use.
While lookup tables are great for associations between terms and their normalized forms, the normalized form is neither the lemma nor necessarily the meaning of the word.
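A minimal sketch of such a table-driven normalizer is shown below; the table entries and function names are made up for illustration:

```python
# Illustrative lookup-table normalizer; the entries are invented for this example.
NORMALIZATION_TABLE = {
    "u": "you",
    "gr8": "great",
    "thx": "thanks",
    "2moro": "tomorrow",
}

def normalize(tokens, table=NORMALIZATION_TABLE):
    # Replace each token with its normalized form if we have seen it before.
    return [table.get(tok.lower(), tok) for tok in tokens]

def learn(term, normalized, table=NORMALIZATION_TABLE):
    # Every time we see a new variation, record it for later lookups.
    table[term.lower()] = normalized

print(normalize("thx c u 2moro".split()))
```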
To improve on that, we could extend the idea to use a thesaurus or an ontology. These organize words by their related semantics. An ontology goes further in describing a hierarchy or even categorizing the terms. As you can see, these refinements are meant to bring the dirty text into the mainstream so that it counts towards the measures we have for language processing applications.
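A thesaurus-style lookup can be sketched with NLTK's WordNet interface (assuming the WordNet corpus has been downloaded); the query word here is just an example:

```python
# Sketch of a thesaurus-style lookup using WordNet via NLTK
# (assumes nltk is installed and the 'wordnet' corpus has been downloaded)
from nltk.corpus import wordnet

for synset in wordnet.synsets("quick"):
    # Each synset groups words sharing a sense; the lemma names act as related terms.
    print(synset.name(), "->", [lemma.name() for lemma in synset.lemmas()])
```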


I will continue my post on the text graphs paper later today.

Friday, May 30, 2014

In today's post we look at the paper on a graph-based approach to skill extraction from text by Kivimaki, Panchenko, Dessy and Verdegem. This paper presents a system that, given a text, outputs a list of professional skills as found on the LinkedIn social network. The text is compared for similarity with the texts of Wikipedia pages, and a Spreading Activation algorithm is then run on the Wikipedia graph in order to associate the input document with skills. We look at just this algorithm. The method consists of two phases: first, the query document is associated with Wikipedia articles using a vector space model, in what is described as the text2wiki phase. Then, using these pages and the hyperlink graph of Wikipedia, the articles corresponding to skills are found and ranked in what is described as the wiki2skill phase. One caveat with this method is that it has to avoid overly general topics or skills, and this is done by biasing against hubs.
The text2wiki module relies on the Gensim library for text similarity functions based on traditional vector space models. Each text is represented as a vector in a space of the 300,000 most frequent terms in the corpus. This is then input to the wiki2skill module, where the vector of initial activations a(0) is iterated upon and the activation spreads into the neighbouring nodes. The initial activations mark the nodes of interest identified in the first phase. At the end of each iteration, we propagate the activations using the formula:
a(t) = d · a(t-1) + λ · W^pulses · a(t-1) + c(t)
where d is the decay factor, which controls the conservation of activation over time,
λ is the friction factor, which controls the amount of activation that nodes can spread to their neighbours, and
W is the weighted adjacency matrix.
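A minimal sketch of this iteration using NumPy is given below; the toy graph, the decay and friction values, and the single pulse per iteration are all assumptions for illustration:

```python
import numpy as np

# Sketch of the spreading-activation iteration described above.
# The graph, decay factor and friction factor are illustrative assumptions.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])   # weighted adjacency matrix of a tiny graph
a = np.array([1.0, 0.0, 0.0])     # a(0): initial activation on the nodes of interest
decay, friction = 0.8, 0.5        # d and lambda in the formula
c = np.zeros(3)                   # external input per iteration (none here)

for t in range(10):
    # a(t) = d * a(t-1) + lambda * W @ a(t-1) + c(t), with a single pulse per iteration
    a = decay * a + friction * (W @ a) + c

print(a)  # nodes with high activation become candidate skills
```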

Thursday, May 29, 2014

In today's post I will talk about assigning weights to keywords in a text. We will use the same strategy as in the paper described earlier. Specifically, we will use the number of categories together with the words as subgraphs.
In the paper on discovering interesting messages spread across Twitter using link analysis, by Yang, Lee and Rim, retweets of messages are leveraged as implicit relationships between Twitter users. However, the authors look at more than just the sheer number of retweets and use link analysis.
The retweet count has been used by Hong et al. as a measure of popularity and to build classifiers for predicting whether and how often new tweets will be retweeted in the future. Alonso et al. used the presence of a URL link as a single, highly effective feature for distinguishing interesting tweets with more than eighty percent accuracy. This paper attempts to go beyond that by modeling Twitter as a graph consisting of user and tweet nodes implicitly connected by retweet links, where one user retweets what another user retweeted. The authors use a variation of the HITS algorithm that exploits the retweet link structure as an indicator of how interesting an individual tweet is. What is particularly interesting is that this paper does not treat all retweet links as equal; some have more importance than others. They demonstrate their study on real Twitter data and score the interesting tweets with a ranking.
They model the Twitter structure as a directed graph G with nodes N and directional edges E.
The graph G has two subgraphs, one based only on the user nodes and another based only on the tweet nodes. Instead of running HITS on the tweet subgraph, they run it on the user subgraph and let tweets inherit the scores of their publishers.
They first take the sum of the weighted hub scores corresponding to each user, with the weights based on user counts, and treat this sum as the authority score. Then they update the hub scores, with weights applied in the same way, from the authority scores of the users a given user retweeted. They repeat this iteratively, and with each iteration the scores tend to converge. After each iteration, the scores are normalized between 0 and 1 by dividing each of them by the square root of the sum of squares of all authority/hub values. At the end we have an authority score and a hub score for each user. This mechanism dampens the influence of users who devote most of their retweet activity to very few other users and increases the weights of users who retweet many others. The weights are the ratio of all other users that a user retweeted to all retweet outlinks from the user.
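A rough sketch of a weighted HITS-style iteration on the user subgraph is shown below; the toy retweet weights and the fixed iteration count are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

# Rough sketch of a weighted HITS-style iteration on a user retweet graph.
# R[i, j] is the weight on the retweet link from user i to user j; values are illustrative.
R = np.array([[0.0, 0.6, 0.4],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

n = R.shape[0]
hub = np.ones(n)
auth = np.ones(n)

for _ in range(50):
    # Authority of a user: weighted sum of hub scores of the users who retweeted them.
    auth = R.T @ hub
    # Hub score of a user: weighted sum of authority scores of the users they retweeted.
    hub = R @ auth
    # Normalize by the square root of the sum of squares, as described above.
    auth /= np.linalg.norm(auth)
    hub /= np.linalg.norm(hub)

print("authority:", auth)
print("hub:", hub)
```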

Wednesday, May 28, 2014

Today I will write about a few more trivia on graph theory. There is always an even number of odd-degree vertices in a simple graph. A large number of operations can be defined on collections of graphs. For example, graph sums, differences, powers and even graph eigenvalues can be calculated on these collections. We wanted to use these collections in text mining.
We referred to finding the Laplacian and the eigenvalues. The eigenvalues of a graph are the eigenvalues of its adjacency matrix. The set of eigenvalues of a graph is called the graph spectrum. The eigenvalues are a special set of scalars associated with a matrix equation that decomposes a square matrix into an equivalent set. Each eigenvalue is associated with a so-called eigenvector.
If we have a vector X, then for a square matrix A we can write
A·X = λ·X
or more succinctly
(A - λ·I)·X = 0
and the scalar λ is called an eigenvalue of A for the corresponding right eigenvector X.
Given a square matrix A, it can now be represented as A = P·D·P^-1
where P is the matrix whose columns are the eigenvectors of A,
D is the diagonal matrix constructed from the eigenvalues, and
P^-1 is the inverse of P.
This form of decomposition is called the eigenvalue decomposition (also known as eigendecomposition).

What this helps us with is the canonicalization of a system to its simplest form, where we reduce the number of parameters from n×n to the n entries of the diagonal matrix.
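As a small worked example of this decomposition, the sketch below uses NumPy on the adjacency matrix of a toy graph (a triangle); the graph itself is an assumption chosen for illustration:

```python
import numpy as np

# Eigendecomposition of the adjacency matrix of a small illustrative graph (a triangle).
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

eigenvalues, P = np.linalg.eig(A)   # columns of P are the eigenvectors
D = np.diag(eigenvalues)            # diagonal matrix of eigenvalues

print("graph spectrum:", eigenvalues)                 # for the triangle: 2, -1, -1
# Check that the factorization A = P D P^-1 holds up to floating-point error.
print(np.allclose(A, P @ D @ np.linalg.inv(P)))
```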

Tuesday, May 27, 2014

This post covers some more interesting trivia about graph theory.
A graph diameter is the longest shortest path between any two graph vertices.
If we take a connected graph or network with a high graph diameter and we add a very small number of edges randomly, the diameter tends to drop drastically. This is known as the small-world phenomenon. It is also referred to as six degrees of separation, since any person in a social network of the world turns out to be linked to any other person by roughly six connections.
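To make the definition concrete, here is a minimal sketch that computes the diameter of a small undirected graph with a plain BFS; the ring graph and the two shortcut edges are assumed examples:

```python
from collections import deque

# Sketch: compute the diameter (longest shortest path) of a small undirected graph via BFS.
def bfs_eccentricity(graph, source):
    # Longest shortest-path distance from `source` to any reachable vertex.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    return max(bfs_eccentricity(graph, v) for v in graph)

# A ring of 12 vertices has diameter 6; two shortcut edges bring it down.
ring = {i: [(i - 1) % 12, (i + 1) % 12] for i in range(12)}
print(diameter(ring))                       # 6 for the plain 12-cycle
ring[0].append(6); ring[6].append(0)
ring[3].append(9); ring[9].append(3)
print(diameter(ring))                       # 4 once the two shortcut edges are added
```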
An acyclic digraph is a directed graph containing no directed cycles, also known as a directed acyclic graph or DAG. The numbers of DAGs on n = 1, 2, 3, ... vertices are 1, 2, 6, 31, ...
In the use of DAGs for communication networks, we frequently encounter problems of information dissemination described for a group of individuals. Gossiping and broadcasting are two such common problems. In gossiping, every person in the network knows a unique item of information and needs to communicate it to everyone else. In broadcasting, one individual has an item of information which needs to be communicated to everyone else.
To illustrate gossiping, consider that there are n people, each of whom knows a specific scandal that is not known to the others. They communicate by telephone, and whenever two people place a call, they pass on as many scandals as they know. The question is how many calls are needed before everyone knows every scandal. We could try this with n = 4, where the participants are labeled A, B, C and D. Then the scandals could spread in this manner: {A,B}, {C,D}, {A,C}, {B,D}. The n = 4 solution can then be generalized by adding another player E who calls A at the beginning and at the end of the previous schedule, resulting in {A,E}, {A,B}, {C,D}, {A,C}, {B,D}, {A,E}.
We thus see that the number of calls grows incrementally with the number of participants. Gossiping is also called total exchange or all-to-all communication. It has applications in communications and distributed systems; for example, a large class of parallel computing problems, including Fourier transforms and sorting, are addressed with gossiping. If f(n) denotes the minimum number of calls necessary to complete gossiping among n people, where any pair of people can call each other, we see that for
n = 1, f(n) = 0
n = 2, f(n) = 1
n = 3, f(n) = 3
n = 4, f(n) = 4
n = 5, f(n) = 6
and in general f(n) = 2n - 4 for n >= 4
For one-way communication, where the graph is considered a DAG, the minimum number of calls is f(n) = 2n - 2 for n >= 4.
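As a sanity check on the recursive schedule described above, here is a small simulation sketch; the schedule construction and the helper names are assumptions made for illustration, not part of any library:

```python
# Sketch: simulate the telephone gossip schedule and count the calls.
def gossip_schedule(n):
    # Base schedule for 4 people, generalized by wrapping each extra person's
    # call to person 0 around the previous schedule (as described above).
    calls = [(0, 1), (2, 3), (0, 2), (1, 3)]
    for extra in range(4, n):
        calls = [(0, extra)] + calls + [(0, extra)]
    return calls

def everyone_knows_everything(n, calls):
    knowledge = [{i} for i in range(n)]       # person i starts with scandal i
    for a, b in calls:
        shared = knowledge[a] | knowledge[b]  # a call exchanges all known scandals
        knowledge[a] = knowledge[b] = set(shared)
    return all(len(k) == n for k in knowledge)

for n in range(4, 9):
    calls = gossip_schedule(n)
    # Prints n, the number of calls used, the 2n - 4 bound, and whether gossip completed.
    print(n, len(calls), 2 * n - 4, everyone_knows_everything(n, calls))
```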