Sunday, September 21, 2014

We will start reviewing a book titled Programming Collective Intelligence from O'Reilly Media in the subsequent blog posts. This book includes such things as making recommendations, discovering groups, searching and ranking, optimization, document filtering, modeling with decision trees, building price models, advanced classification - Kernel methods and SVMs, finding independent features, evolving intelligence and a summary of algorithms. The sample code is presented in Python for readability.
As an example, a Bayesian classifier can help classify a sentence as language for a use of the word python or as a snake for the use of the same word. It works based on the probability P(category | Document) = P(Document | Category) * P(category)
where P(Document | Category) = P(word1 | category) * P(word2 | category) * ....
All we need is a feature extraction that turns the data we're using for training or classification into a list of features basically something that takes an object and returns a list such as a sentence converted into words. This classifier is trained on the strings. Then it can be used to predict the test data.
 docclass.getwords('python is a dynamic language')
{'python' : 1, 'dynamic':1,  'language':1}
c1 = docclass.naivebayes(docclass.getwords)
c1.setdb('test.db')
c1.train('python has dynamic types', 'language')
c1.train('pythons are constrictors', 'snake')
c1.classify(''boa constrictors')
u'snake'
c1.classify('dynamic programming')
u'language'

I have to take a short break for a coding problem and including it in this post before continuing.
How many different distinct subsequences exist for a string T in string S and what is the most efficient way to find it. For example rabbit appears in rabbbit in three different ways. For every match until the length of the given pattern, we explore the remaining of the given input to see if there's a full match by skipping non-matching and repetititive characters. When the full match is found, we increment a counter. Another way to do this would be to use an automata or Rabin Karp method or KMP method.


void KMP(string pattern, string text, vector<int> *positions) {
    int patternLength = pattern.length();
    int textLength = text.length();
    int* next =  PreProcess(pattern);
 if (next == 0) return;
    int i = 0;
    int j = 0;
    while ( j < textLength )
 {
  while(true)
   if (text[j] == pattern[i]) //matches
   {
    i++;   // yes, move on to the next state
    if (i == patternLength)  // maybe that was the last state
    {
     // found a match;
     positions->push_back(j-(i-1));
     i = next[i];
    }
    break;
   }
   else if (i == 0) break; // no match in state j = 0, give up
   else i = next[i];
  j++;
 }
}

int* PreProcess( string pattern) {
 int patternLength = pattern.length();
 if (patternLength == 0) return 0;
    int * next = new int[patternLength + 1];
    if (next == 0) return 0;
    next[0] = -1;  // set up for loop below; unused by KMP

    int i = 0;
    int j = -1;
    // next[0] = -1;
    // int len = pattern.length();
    while (i < patternLength) {
  next[i + 1] = next[i] + 1;
  while ( next[i+1] > 0 &&
    pattern[i] != pattern[next[i + 1] - 1])
   next[i + 1] = next[next[i + 1] - 1] + 1;
  i++;
 }
    return next;

}
 

No comments:

Post a Comment