In today's post we look at the paper on a graph-based approach to skill extraction from text by Kivimaki, Panchenko, Dessy and Verdegem. The paper presents a system that takes a text as input and outputs a list of professional skills, where the skills are those obtained from the LinkedIn social network. The input text is compared for similarity with the texts of Wikipedia pages, and a Spreading Activation algorithm is then run on the Wikipedia graph in order to associate the input document with skills. We look at just this algorithm. The method consists of two phases. First, in what is described as the text2wiki phase, the query document is associated with Wikipedia articles using a vector space model. Then, in what is described as the wiki2skill phase, these pages and the hyperlink graph of Wikipedia are used by the algorithm to find and rank the articles that correspond to skills. One caveat is that the method has to avoid overly general topics or skills, and this is done by biasing against hubs.
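To make the text2wiki idea concrete, here is a minimal sketch of matching a query against a handful of pages with a TF-IDF vector space model and cosine similarity. It is written in plain Python rather than with the library the paper uses, and everything in it is an assumption made for illustration: the function names (`tfidf_vectors`, `cosine`, `text2wiki`), the smoothed IDF weighting, the whitespace tokenisation and the toy `PAGES` dictionary are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenised documents.

    The smoothed IDF used here is one common choice, assumed for illustration.
    """
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0
           for term, count in doc_freq.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(doc).items()}
               for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text2wiki(query, pages, top_n=2):
    """Return up to top_n (title, similarity) pairs, most similar first.

    `pages` maps a page title to its text (hypothetical toy input).
    """
    titles = list(pages)
    docs = [pages[title].lower().split() for title in titles]
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus are ignored.
    query_counts = Counter(query.lower().split())
    query_vec = {term: tf * idf[term]
                 for term, tf in query_counts.items() if term in idf}
    scored = [(title, cosine(query_vec, vec))
              for title, vec in zip(titles, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-in for Wikipedia pages (invented for this sketch).
PAGES = {
    "Machine learning": "machine learning algorithms learn models from data",
    "Cooking": "cooking recipes use heat to prepare food",
    "Statistics": "statistics is the analysis of data using models",
}

matches = text2wiki("learning models from data", PAGES)
```

The similarity scores of the best-matching pages are what the second phase would start from, so `matches` plays the role of the seed set handed to wiki2skill.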
The text2wiki module relies on the Gensim library for text similarity functions based on traditional vector space models. Each text is represented as a vector in a space of the 300,000 most frequent terms in the corpus. The result is the input to the wiki2skill module, where it forms the vector of initial activations a(0). This vector is iterated upon so that the activation spreads into the neighbouring nodes. The initial activations come from the work already done to identify the nodes of interest. At each iteration, we propagate the activations using the formula:
a(t) = decay_factor * a(t-1) + Lambda * W^pulses * a(t-1) + c(t)
where the decay factor controls how much activation a node conserves over time,
Lambda is the friction factor, which controls the amount of activation that nodes can spread to their neighbours, and
W is the weighted adjacency matrix.
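The update rule above can be sketched in a few lines of plain Python. This is only one reading of the formula, under several assumptions that do not come from the paper: `W[i][j]` is taken to be the weight of the edge from node j to node i, the exponent on W is read as the number of times W is applied within a step (the hypothetical `pulses` parameter, defaulting to 1), and c(t) is treated as an optional extra source vector that is zero unless supplied. The function names and the toy graph are invented for the example.

```python
def mat_vec(W, a):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]


def spread_step(a_prev, W, decay, lam, c, pulses=1):
    """One update: a(t) = decay * a(t-1) + lam * W^pulses * a(t-1) + c(t).

    Assumption: W[i][j] is the weight of the edge from node j to node i,
    so multiplying by W moves activation along the edges.
    """
    spread = list(a_prev)
    for _ in range(pulses):
        spread = mat_vec(W, spread)
    return [decay * kept + lam * received + extra
            for kept, received, extra in zip(a_prev, spread, c)]


def spreading_activation(a0, W, decay=0.5, lam=0.5, steps=3, source=None):
    """Iterate the update rule `steps` times starting from a(0).

    `source` is an optional function t -> c(t); when omitted, c(t) is
    assumed to be the zero vector.
    """
    a = list(a0)
    for t in range(1, steps + 1):
        c = source(t) if source is not None else [0.0] * len(a)
        a = spread_step(a, W, decay, lam, c)
    return a


# Toy chain graph 0 -> 1 -> 2 with unit weights (invented for this sketch).
TOY_W = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

# Only node 0 is active initially, as if text2wiki had matched a single page.
activations = spreading_activation([1.0, 0.0, 0.0], TOY_W, steps=2)
```

With these toy values, node 0 keeps half of its activation at each step and passes the other half down the chain, which shows how the decay factor and Lambda trade off retention against spreading.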