Cluster computing

Friday, September 5, 2014

In today's post I will discuss applying centrality to keyword extraction. In this article, I try to explain how to extract keywords using eigenvector centrality. This method gives us a way to measure the importance of the keywords. First, the words are represented as nodes in a graph. Then words that are connected to each other are given an edge in the graph. The edges are undirected because we only want the adjacency information. The connection between words is given by mutual information. Mutual information is based on pairwise grouping of words and this fits right in with our connections. The mutual information is calculated as the probability of the co-occurrence times the logarithm of the ratio of the probability of co-occurrence to the probabilities of the independent occurrences of the pair of words. With a threshold for the mutual information as a percentage of the range that it falls in, we can select the edges for the graph. This gives us a set of nodes and edges which we represent with an adjacency matrix.

Using the adjacency matrix for a given graph with vertex V and edges E, we find the eigenvector centrality score of the vertex v as 1/ lambda Sum (Xt) which can be written in vector notation as Ax = Lambda X. Lambda is the eigenvalue for which many different eigenvector exist. The highest one gives the maximum importance of the node in the network and is said to be found by power iteration.

Power iteration can work for very large matrix and finds the highest eigenvector for Ax = Lambdax

By iteration we mean we start with an approximate eigenvector b0

In each iteration, we multiply the eigenvector with the adjacency matrix and normalize it. For each iteration, the following is suggested:

Initialize vector b, deviation and n

While ( deviation is greater than tolerance)

Temporary_vector = A * b

Deviation = abs(norm(b)) – n

n = norm(x)

b = Temporary_vector / n

return vector b

Cluster computing

Friday, September 5, 2014

No comments:

Post a Comment