Thursday, October 6, 2016

We were discussing the use of graphs in natural language processing. One of the frequently used techniques is label propagation. It is a semi-supervised method that allows the labels of a small set of annotated data to spread in a consistent fashion to unlabeled data.
The graph is seeded with known positive and negative samples, and then each node is labeled in proportion to the percentage of times a random walk on the graph ends at that node. Only those nodes get a high score that are both similar to the seed nodes and central to the document set. The graph consists of both sentences and features and is bipartite, as sentences only connect to features and vice versa.
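A minimal Monte-Carlo sketch of that random-walk scoring, assuming a dense weight matrix w where w[i][j] is the edge weight between nodes i and j; the walk length, walk count, and seed handling here are illustrative choices, not details from the survey:

static double[] RandomWalkScores(double[][] w, int[] seeds, int walks = 10000, int steps = 10)
{
    int n = w.Length;
    var hits = new double[n];
    var rng = new Random(42);
    foreach (var seed in seeds)
    {
        for (int t = 0; t < walks; t++)
        {
            int node = seed;
            for (int step = 0; step < steps; step++)
                node = SampleNeighbor(w[node], rng); // move to a neighbor proportionally to edge weight
            hits[node] += 1.0; // record where this walk ended
        }
    }
    for (int i = 0; i < n; i++)
        hits[i] /= (double)walks * seeds.Length; // fraction of walks ending at node i
    return hits;
}

static int SampleNeighbor(double[] weights, Random rng)
{
    double total = 0;
    foreach (var x in weights) total += x;
    double r = rng.NextDouble() * total;
    for (int j = 0; j < weights.Length; j++)
    {
        r -= weights[j];
        if (r <= 0) return j;
    }
    return weights.Length - 1;
}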

Label propagation has several uses. One application is finding translations for out-of-vocabulary words. Initially a graph is constructed from the available text and the phrase table. Each vertex is a phrase type or some information extracted from the text. It is connected to other vertices by edges formed from a similarity measure between the two profiles and filtered by a threshold value. There are three types of vertices: labeled, unlabeled, and out-of-vocabulary. Nodes for which translations are available are annotated with target-side translations and their feature values. The algorithm propagates the translations from labeled nodes to unlabeled nodes, which handles different types of out-of-vocabulary words. The graphs constructed this way are very large, so the methods must be scalable.
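As a sketch of the graph-construction step, we can connect phrase vertices whose profiles are similar enough. The cosine measure and the 0.5 cutoff below are illustrative assumptions; the actual similarity measure and threshold depend on the system:

static List<(int, int, double)> BuildEdges(double[][] profiles, double threshold = 0.5)
{
    var edges = new List<(int, int, double)>();
    for (int i = 0; i < profiles.Length; i++)
        for (int j = i + 1; j < profiles.Length; j++)
        {
            double sim = CosineSimilarity(profiles[i], profiles[j]);
            if (sim >= threshold)
                edges.Add((i, j, sim)); // keep only sufficiently similar pairs
        }
    return edges;
}

static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int k = 0; k < a.Length; k++)
    {
        dot += a[k] * b[k];
        na += a[k] * a[k];
        nb += b[k] * b[k];
    }
    return (na > 0 && nb > 0) ? dot / (Math.Sqrt(na) * Math.Sqrt(nb)) : 0.0;
}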

Label propagation makes the assumption that each edge weight encodes the degree of belief that the connected vertices should share the same soft labeling. Therefore the soft labels for the unlabeled vertices can be computed from the labeled vertices.
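Concretely, a standard propagation step sets each unlabeled vertex's soft label to the weighted average of its neighbors' labels, keeping the seed labels clamped. A minimal sketch, assuming a dense symmetric weight matrix and a fixed iteration count in place of a convergence test:

static double[] Propagate(double[,] w, double[] y, bool[] isSeed, int iterations = 50)
{
    int n = y.Length;
    var scores = (double[])y.Clone(); // soft labels, seeded from y
    for (int it = 0; it < iterations; it++)
    {
        var next = new double[n];
        for (int i = 0; i < n; i++)
        {
            if (isSeed[i]) { next[i] = y[i]; continue; } // clamp labeled vertices
            double sum = 0.0, norm = 0.0;
            for (int j = 0; j < n; j++)
            {
                sum += w[i, j] * scores[j];
                norm += w[i, j];
            }
            next[i] = norm > 0 ? sum / norm : 0.0; // weighted average of neighbor labels
        }
        scores = next;
    }
    return scores;
}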

The graph used with the label propagation method is essentially a node-similarity graph, with the assumption that connected vertices have edges based on some similarity. If we take a thesaurus and apply it directly to this graph, it is rather restrictive. However, if we take the PMI (pointwise mutual information) matrix and apply it to the graph, we get better information. The PMI is based on both positive and negative samples and can even include terms from the thesaurus as negative samples. Then, with the similarity graph, we can use a CRF (conditional random field) that we build with heterogeneous features of interest, including a feature for the thesaurus. When we tune the CRF with the similarity graph, we can then use it on new text as it becomes available.
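For concreteness, here is one way to build a PMI matrix from raw co-occurrence counts, which can then serve as edge weights in the similarity graph. The input layout (counts[i][j] = number of co-occurrences of term i with context j) is an assumption for the sketch:

static double[,] PmiMatrix(int[][] counts)
{
    int rows = counts.Length, cols = counts[0].Length;
    double total = 0;
    var rowSum = new double[rows];
    var colSum = new double[cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
        {
            rowSum[i] += counts[i][j];
            colSum[j] += counts[i][j];
            total += counts[i][j];
        }
    var pmi = new double[rows, cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
        {
            if (counts[i][j] == 0) continue; // leave unseen pairs at 0
            double pxy = counts[i][j] / total;
            pmi[i, j] = Math.Log(pxy / ((rowSum[i] / total) * (colSum[j] / total)));
        }
    return pmi;
}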

Courtesy: Survey of graphs in NLP by Nastase and Mihalcea

 #codingexercise
Print the second non-repeating character in a stream of characters.
static void PrintSecondNonRepeatingChar(IEnumerable<char> s)
{
    const int MaxChar = 256;
    // Characters seen exactly once so far, in order of first occurrence.
    var nonRepeating = new LinkedList<char>();
    // present[c] holds the list node for c while c is still non-repeating.
    var present = new LinkedListNode<char>[MaxChar];
    // dup[c] becomes true once c has been seen more than once.
    var dup = new bool[MaxChar];
    foreach (var current in s)
    {
        if (!dup[current])
        {
            if (present[current] == null)
            {
                // First occurrence: append to the list and remember its node.
                present[current] = nonRepeating.AddLast(current);
            }
            else
            {
                // Repeat occurrence: it is no longer a candidate.
                nonRepeating.Remove(present[current]);
                present[current] = null;
                dup[current] = true;
            }
        }
        if (nonRepeating.Count > 1)
            Console.WriteLine("Second non-repeating char is {0}", nonRepeating.First.Next.Value);
    }
}
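For example, a string is an IEnumerable<char> in C#, so it can stand in for the stream:

PrintSecondNonRepeatingChar("statistics");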
