Monday, October 3, 2016

We were discussing graphs and text analysis.
The main difference between the thesaurus model and the bag-of-words model is that the former groups words by their semantic relationships, while the latter groups them by their collocation in context.
We can consider this in two steps:
1) we select the words from the thesaurus just as we would for a lexical chain
2) we substitute these words in the negative sampling of Goldberg's PMI formulation, as discussed here.
Subramanya, Petrov and Pereira (2010) showed a way to use PMI directly in graphs. We start with a word similarity graph, which is used during the training of a CRF on the target domain. Local sequence contexts in the form of n-grams are represented as graph vertices, reflecting the observation that the part of speech (POS) of a word occurrence is largely determined by its local context. For each n-gram, a set of context features is extracted whose values are the pointwise mutual information (PMI) between the n-gram and those features. Each vertex is thus represented by a PMI vector, and the similarity between two vertices is the cosine of their PMI vectors. The neighbors of a vertex are used as features for the CRF, which results in a "word embedding" of the larger contextual information in the model. The graph is then used to discover new features, to propagate adjustments to the weights of known features, and to train the CRF in a semi-supervised manner.
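As a rough illustration (not the paper's implementation), the PMI feature vectors and the cosine similarity between graph vertices could be sketched in Python as follows; the names pmi_vectors and cosine are my own:

```python
import math
from collections import Counter

def pmi_vectors(ngram_contexts):
    """Build a sparse PMI vector for each n-gram from (n-gram, feature) co-occurrences.

    ngram_contexts: dict mapping each n-gram to the list of context features
    observed with it. Returns dict: n-gram -> {feature: PMI score}.
    """
    pair_counts = Counter()
    ngram_counts = Counter()
    feat_counts = Counter()
    total = 0
    for ngram, feats in ngram_contexts.items():
        for f in feats:
            pair_counts[(ngram, f)] += 1
            ngram_counts[ngram] += 1
            feat_counts[f] += 1
            total += 1
    vectors = {}
    for (ngram, f), c in pair_counts.items():
        # PMI = log P(ngram, feature) / (P(ngram) * P(feature))
        p_joint = c / total
        p_ngram = ngram_counts[ngram] / total
        p_feat = feat_counts[f] / total
        vectors.setdefault(ngram, {})[f] = math.log(p_joint / (p_ngram * p_feat))
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two n-grams that appear with the same distinctive context features end up with parallel PMI vectors and hence cosine similarity near 1, which is what makes them neighbors in the similarity graph.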
#codingexercise
Given a string and a pattern, find all anagrams of the pattern in the string and print the index of each location where a match was found.

List<int> Match(string text, string pattern)
{
    var h = new Dictionary<char, int>();
    for (int i = 0; i < pattern.Length; i++)
    {
        if (!h.ContainsKey(pattern[i])) h[pattern[i]] = 0;
        h[pattern[i]]++;
    }
    var ret = new List<int>();
    for (int i = 0; i + pattern.Length <= text.Length; i++)
    {
        var m = new Dictionary<char, int>(h); // fresh copy of the pattern counts per window
        for (int j = i; j < i + pattern.Length; j++)
            if (m.ContainsKey(text[j]))
                m[text[j]]--;
        if (m.Values.All(x => x == 0))
            ret.Add(i);
    }
    return ret;
}

We could also do this in linear time:
List<int> GetIndexes(string pattern, string text)
{
    var ret = new List<int>();
    if (pattern.Length == 0 || pattern.Length > text.Length) return ret;
    var h = new Dictionary<char, int>();
    var window = new Dictionary<char, int>();
    for (int i = 0; i < pattern.Length; i++)
    {
        if (h.ContainsKey(pattern[i]))
            h[pattern[i]]++;
        else
            h[pattern[i]] = 1;
        if (window.ContainsKey(text[i]))
            window[text[i]]++;
        else
            window[text[i]] = 1;
    }
    for (int i = pattern.Length; i < text.Length; i++)
    {
        if (Compare(h, window))
            ret.Add(i - pattern.Length);
        // slide the window: include text[i], drop text[i - pattern.Length]
        if (window.ContainsKey(text[i]))
            window[text[i]]++;
        else
            window[text[i]] = 1;
        int k = i - pattern.Length;
        window[text[k]]--;
        if (window[text[k]] == 0) window.Remove(text[k]);
    }
    // Check the last window in text
    if (Compare(h, window))
        ret.Add(text.Length - pattern.Length);
    return ret;
}

bool Compare(Dictionary<char, int> a, Dictionary<char, int> b)
{
    return a.Count == b.Count &&
           a.All(kv => b.TryGetValue(kv.Key, out var v) && v == kv.Value);
}
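For cross-checking, the same sliding-window idea can be sketched in Python, using a Counter in place of the hash table (a minimal sketch, not tuned for a fixed alphabet):

```python
from collections import Counter

def anagram_indexes(pattern, text):
    """Return the start indexes in text where an anagram of pattern begins.

    Maintains character counts over a window of len(pattern) characters and
    compares them against the pattern's counts at each step, so there are
    O(n) windows with each comparison bounded by the alphabet size.
    """
    p = len(pattern)
    if p == 0 or p > len(text):
        return []
    target = Counter(pattern)
    window = Counter(text[:p])
    result = []
    for i in range(p, len(text)):
        if window == target:
            result.append(i - p)
        # slide the window: include text[i], drop text[i - p]
        window[text[i]] += 1
        window[text[i - p]] -= 1
        if window[text[i - p]] == 0:
            del window[text[i - p]]  # keep the counter free of zero entries
    if window == target:  # the last window
        result.append(len(text) - p)
    return result

# anagram_indexes("ab", "abxaba") -> [0, 3, 4]
```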

Find the minimum distance between integers x and y in an array of integers
int GetMinDist(List<int> A, int x, int y)
{
   int min = int.MaxValue;
   int prev = A.FindIndex(t => t == x || t == y);
   for (int i = 0; i < A.Count; i++)
   {
      if (A[i] == x || A[i] == y)
      {
          if (prev != -1 && A[prev] != A[i] && (i - prev) < min)
          {
             min = i - prev;
          }
          prev = i;
      }
   }
   return min;
}
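A quick Python sketch of the same single-pass idea, for verification (returning None when one of the values is absent):

```python
def min_dist(a, x, y):
    """Minimum index distance between any occurrence of x and any occurrence of y in a."""
    best = None
    prev = None  # index of the most recent occurrence of x or y
    for i, v in enumerate(a):
        if v == x or v == y:
            # only pairs of *different* values count toward the distance
            if prev is not None and a[prev] != v and (best is None or i - prev < best):
                best = i - prev
            prev = i
    return best

# min_dist([3, 5, 4, 2, 6, 5, 6, 6, 5, 4, 8, 3], 3, 6) -> 4
```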

Find the second highest score from student performances
select max(score) from students where score not in (select max(score) from students)


The same pattern extends to the nth highest score by excluding the top n-1 maxima in the subquery, though for larger n a LIMIT/OFFSET or ranking function is more practical.


