Tuesday, June 17, 2014

In the JoBimText project, Gliozzo et al. introduce an interactive visualization component. JoBimText is an open-source platform for large-scale distributional semantics based on a graph representation. A distributional thesaurus is computed from a bipartite graph of words and context features. For every sentence, contexts are generated and terms are expanded with semantically similar words, and the capabilities of this conceptualized text are then demonstrated in an interactive visualization.
The visualization can be used as a semantic parser as well as for disambiguation of word senses, where the senses are induced by graph clustering.
The paper starts from the view that the meaning in a text can largely be defined by the semantic oppositions and relations between its words. To obtain this knowledge, co-occurrences with syntactic contexts are extracted from a very large corpus. The approach does not use a quadratic all-pairs comparison to compute word similarities; instead, it uses a MapReduce implementation. The advantage is that this works well on sparse contexts and scales to large corpora.
The result is a graph that connects the most discriminative contexts to terms, with explicit links between the most similar terms. This graph represents a local model of semantic relations for each term, in contrast to a global model with fixed dimensions; in other words, it is a collection of ego networks. The paper describes both how to compute a distributional thesaurus and how to contextualize distributional similarity.
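To make the scaling argument concrete, here is a minimal sketch in Python of the shared-feature idea (my own toy illustration, not the project's actual Hadoop code): grouping term-context pairs by their context feature means only terms that share at least one feature are ever compared, which is what keeps the computation tractable on sparse data.

from collections import defaultdict
from itertools import combinations

# Toy term-context pairs (jo, bim); in JoBimText these come from the holing operation.
pairs = [
    ("car", "drive::@"), ("jaguar", "drive::@"),
    ("jaguar", "hunt::@"), ("cat", "hunt::@"),
    ("car", "park::@"), ("jaguar", "park::@"),
]

# "Map"-like phase: group terms by the context feature they share.
by_feature = defaultdict(set)
for term, feature in pairs:
    by_feature[feature].add(term)

# "Reduce"-like phase: emit term pairs per feature and sum shared-feature counts.
shared = defaultdict(int)
for terms in by_feature.values():
    for a, b in combinations(sorted(terms), 2):
        shared[(a, b)] += 1

print(sorted(shared.items(), key=lambda kv: -kv[1]))
# [(('car', 'jaguar'), 2), (('cat', 'jaguar'), 1)]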
The JoBim framework is named after the observed pairs of terms (Jos) and contexts (Bims), which form the edges of the graph. The operation of splitting observations into JoBim pairs is referred to as the holing operation.
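As a toy illustration of the idea (my own simplification; the paper's holing operations work on richer syntactic contexts, not plain neighbors), a trivial holing operation could take each word's left and right neighbors as its context feature, replacing the word itself with a hole symbol '@':

def simple_holing(tokens):
    """Toy holing operation: split each observation into a (jo, bim) pair,
    where the bim is the neighboring words with a hole '@' in the term's slot."""
    padded = ["<s>"] + tokens + ["</s>"]
    pairs = []
    for i, term in enumerate(tokens, start=1):
        bim = f"{padded[i - 1]}_@_{padded[i + 1]}"
        pairs.append((term, bim))
    return pairs

print(simple_holing("I caught a jaguar".split()))
# [('I', '<s>_@_caught'), ('caught', 'I_@_a'), ('a', 'caught_@_jaguar'), ('jaguar', 'a_@_</s>')]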
The significance of each pair (t, c) is computed, and only the p most significant pairs are kept per term t, resulting in the first-order graph. The second-order graph is then built from the similarity between two Jos, which is based on the number of salient features the two Jos share. This similarity over the Jos defines the distributional thesaurus; the paper says it can be computed efficiently in a few MapReduce steps and reports that it compares favorably to other measures. The similarity mechanism can also be replaced by any other.

The paper then proceeds to contextualization, which, as we know, depends a lot on smoothing: many term-context pairs are valid but may or may not be represented in the corpus. To find similar contexts, the term arguments are expanded with similar terms, but the similarity of these terms depends on context. The paper therefore uses joint inference to expand terms in context, performing marginal inference in a conditional random field (CRF). This works roughly as follows: a labeling x is a single definite sequence of either original or expanded words. Its weight depends on the degree to which the term-context associations present in this sentence are present in the corpus, as well as on the out-of-context similarity of each expanded term to the corresponding term in the sentence. The proportion of the latter to the former is set by a tuning parameter.
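Here is a rough sketch (entirely my own simplification, with invented names and a hypothetical tuning parameter lam, not the paper's CRF) of how the weight of one expansion candidate could combine those two components: how well the candidate fits the contexts observed in the sentence, and its out-of-context thesaurus similarity to the original term.

def expansion_score(candidate, original, sentence_bims,
                    corpus_assoc, dt_similarity, lam=0.5):
    """Toy scoring of one expansion candidate for one slot in a sentence.

    corpus_assoc[(jo, bim)] -- association strength of a term-context pair in the corpus
    dt_similarity[(a, b)]   -- out-of-context similarity from the distributional thesaurus
    lam                     -- hypothetical tuning parameter mixing the two components
    """
    in_context = sum(corpus_assoc.get((candidate, bim), 0.0) for bim in sentence_bims)
    out_of_context = dt_similarity.get((candidate, original), 0.0)
    return (1 - lam) * in_context + lam * out_of_context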

We will now look at word sense induction, disambiguation and cluster labeling. With the ranking based on contextually similar terms, there is some implicit word sense disambiguation. But this paper addresses it explicitly with word sense induction.

OK so we will cover some more of this in tonight's post.
The authors describe a WSI technique and use information extracted with IS-A patterns to label clusters of terms that pertain to the same taxonomy or domain. The aggregated context features of the clusters are used to attribute the terms in the distributional thesaurus with the word-cluster senses, and these are assigned in the context part of the entry.
The clustering algorithm they use is the Chinese Whispers graph clustering algorithm, which finds the number of clusters automatically. IS-A relationships between terms are extracted with part-of-speech patterns, giving a list of IS-A relationships between terms along with their frequencies. Clusters of words that share the same word sense are then formed. The IS-A relationships for each cluster are aggregated by summing the frequency of each hypernym and multiplying this sum by the number of words in the cluster that elicited it. This yields taxonomic labels for the clusters and provides an abstraction layer over the terms. For example, jaguar can be clustered into a cat sense and a car sense, and the highest-scoring hypernyms provide a concise description of these senses. Occurrences of ambiguous words in context can now be disambiguated to these cluster senses.
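A small sketch of that aggregation step, with invented toy data (my own illustration, not code from the paper):

from collections import defaultdict

def label_cluster(cluster_words, isa_freq):
    """Score hypernym labels for one sense cluster.

    isa_freq[(term, hypernym)] -- frequency of the extracted IS-A relation
    score(hypernym) = (sum of frequencies over the cluster)
                      * (number of cluster words that elicited the hypernym)
    """
    freq_sum = defaultdict(int)
    support = defaultdict(int)
    for term in cluster_words:
        for (t, hyper), freq in isa_freq.items():
            if t == term:
                freq_sum[hyper] += freq
                support[hyper] += 1
    return sorted(((h, freq_sum[h] * support[h]) for h in freq_sum),
                  key=lambda kv: -kv[1])

# Toy example for the "cat sense" cluster of jaguar.
isa_freq = {("jaguar", "animal"): 5, ("jaguar", "cat"): 3,
            ("leopard", "animal"): 4, ("leopard", "cat"): 6,
            ("jaguar", "car"): 7}
print(label_cluster(["jaguar", "leopard"], isa_freq))
# [('animal', 18), ('cat', 18), ('car', 7)]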
The visualization for the expansion of the term-context pairs uses the Open Domain Model, which is trained on newspaper corpora.

We will next talk about Interactive Visualization Features, but before we do that, let us first talk about the Chinese Whispers clustering algorithm.

Chinese Whispers (Biemann) is a randomized graph clustering algorithm whose runtime grows linearly with the number of edges. The algorithm partitions the nodes of a weighted, undirected graph. The name is derived from a children's game in which players whisper words to each other; the game's goal is to arrive at some funny derivative of the original message by passing it through several noisy channels.
The CW algorithm aims at finding groups of nodes that broadcast the same message to their neighbors.
The algorithm proceeds something like this:
First, assign each vertex to its own class.
while there are changes:
     for all vertices, taken in randomized order:
        class(v) = highest-ranked class in the neighborhood of v
The nodes are thus processed for a small number of iterations and inherit the strongest class in their local neighborhood, that is, the class whose sum of edge weights to the current node is maximal.
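And here is a minimal runnable sketch of that procedure in Python (my own simplification: the graph is a plain dict of weighted adjacency lists, ties are broken arbitrarily, and the number of iterations is capped):

import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """graph: {node: {neighbor: edge_weight}}; returns {node: class_label}."""
    rng = random.Random(seed)
    # Start with every vertex in its own class.
    labels = {node: node for node in graph}
    for _ in range(iterations):
        changed = False
        nodes = list(graph)
        rng.shuffle(nodes)  # process vertices in randomized order
        for node in nodes:
            # Pick the class whose summed edge weight to this node is maximal.
            weights = defaultdict(float)
            for neighbor, w in graph[node].items():
                weights[labels[neighbor]] += w
            if weights:
                best = max(weights, key=weights.get)
                if best != labels[node]:
                    labels[node] = best
                    changed = True
        if not changed:
            break
    return labels

# Toy graph: two triangles joined by one weak edge.
g = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
print(chinese_whispers(g))  # typically two clusters: {a, b, c} and {d, e, f}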

