Friday, October 3, 2014


Programming Collective Intelligence Book review:

This is a book that I have found useful for understanding and applying machine learning to improve how I design and develop web applications and services. It covers topics such as making recommendations, discovering groups, searching and ranking, optimization, document filtering, modeling with decision trees, building price models, advanced classification with kernel methods and SVMs, finding independent features, evolving intelligence, and a summary of algorithms. The sample code is presented in Python for readability. Below is a review of the chapters corresponding to these topics:

Recommendations: Recommendations include suggestions for online shopping, interesting web sites, or items such as movies or music. Generally, to make a recommendation, we first find a group of people who share similar tastes and then use the group's preferences to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of people and their preferences is a nested dictionary, which could use a quantitative ranking, say on a scale of 1 to 5, to denote each person's preferences. To find similar people to form a group, we use some form of similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart; people who are close together on the chart form a group. This is called the Euclidean distance score. Another way to calculate the similarity score is the Pearson correlation coefficient, a measure of how well two sets of data fit on a straight line. By finding a best linear fit, it adjusts for what is called grade inflation, the tendency of some people to consistently grade higher than others while sharing the same trends. For this reason the latter scoring technique often gives better results than the former.
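
As a sketch of the idea, the nested dictionary of preferences and a Euclidean-distance-based similarity score might look like this (the names and ratings here are illustrative):

from math import sqrt

# A minimal sketch, assuming ratings on a 1-to-5 scale.
prefs = {
    'Alice': {'Movie A': 4.5, 'Movie B': 2.0, 'Movie C': 3.5},
    'Bob':   {'Movie A': 4.0, 'Movie B': 3.0},
}

def sim_distance(prefs, p1, p2):
    # Items both people have rated in common.
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[p1][item] - prefs[p2][item]) ** 2 for item in shared)
    # Return a similarity in (0, 1]; identical ratings give 1.
    return 1 / (1 + sqrt(sum_of_squares))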



Discovering Groups:

Data clustering is introduced in this chapter as a method for discovering and visualizing groups of things, people, or ideas that are all closely related. Neural networks, decision trees, support vector machines, and Bayesian filtering, covered elsewhere in the book, are supervised learning methods, where known inputs and outputs are used to make predictions. Clustering, on the other hand, is an unsupervised learning method. Unlike a neural network or a decision tree, unsupervised learning algorithms are not trained with examples of correct answers; instead, they try to find structure in the data and make use of it.



As an example, blogs can be clustered based on word frequencies, which could be useful for searching, cataloging, and discovering the huge number of blogs that are currently online. The feedparser module in Python makes it easy to get the titles, links, and entries from any RSS or Atom feed. With this module, it's possible to extract a list of words from the title and the description and count them. With the word counts (wc) and the number of blogs each word appears in (apcount), we proceed to create a matrix with a row for each blog and a column for each word count. These rows are vectors.
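
A rough sketch of that extraction step with feedparser (get_word_counts is an illustrative name; the book's own version differs in details):

import re
import feedparser

def get_word_counts(url):
    # Parse the RSS/Atom feed and count word occurrences across its entries.
    parsed = feedparser.parse(url)
    word_counts = {}
    for entry in parsed.entries:
        summary = entry.get('summary', entry.get('description', ''))
        text = '%s %s' % (entry.title, summary)
        # Strip markup and split on non-alphabetic characters.
        text = re.sub(r'<[^>]+>', '', text)
        for word in re.split(r'[^A-Za-z]+', text):
            word = word.lower()
            if word:
                word_counts[word] = word_counts.get(word, 0) + 1
    return parsed.feed.get('title', url), word_counts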



With this sample data, two clustering methods are discussed in the book: hierarchical clustering and K-means clustering. The first yields a dendrogram and the second yields k partitions. Hierarchical clustering works by starting with many single-item clusters and building up a hierarchy by repeatedly merging the two closest clusters. The order in which clusters are merged shows which items are closer together and hence how tight each cluster is. To determine closeness, a couple of mathematical formulas help; here we could use the Euclidean distance or the Pearson coefficient. The Euclidean distance between two points in a multidimensional space is the square root of the sum of the squared differences between their coordinates. The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is generally a value between -1 and 1, where -1 means a perfect inverse correlation, 1 means a perfect correlation, and 0 means no correlation. In computational form, the numerator is the sum of the products of the paired values minus the product of their individual sums divided by n, and the denominator is the square root of the product of the corresponding terms for each variable taken with itself (its sum of squares minus its squared sum divided by n). As an aside, another way to measure similarity between two overlapping sets is the Tanimoto coefficient, defined as the number of items in the intersection divided by the number of items in the union, that is, the size of one set plus the size of the other minus the size of the intersection.
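
In code, the Pearson and Tanimoto calculations described above come out roughly as follows (a sketch; v1 and v2 are equal-length lists of counts, set1 and set2 are Python sets):

from math import sqrt

def pearson(v1, v2):
    # Pearson correlation of two equal-length vectors of counts.
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))
    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    return 0 if den == 0 else num / den

def tanimoto(set1, set2):
    # Items in the intersection divided by items in the union.
    shared = len(set1 & set2)
    return shared / (len(set1) + len(set2) - shared)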



K-means clustering, on the other hand, breaks the data into a fixed number of distinct groups rather than revealing structure the way hierarchical clustering does. The number of clusters to generate is specified in advance, and that many randomly placed centroids are chosen. Each item is assigned to its nearest centroid, the centroids are then moved to the average location of all the items assigned to them, and the assignments are recomputed; the iterations continue until the assignments stop changing.
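
A compact sketch of K-means over rows of word-count vectors, where distance is any function over two vectors, such as the Euclidean distance:

import random

def kcluster(rows, distance, k=4, iterations=100):
    # Start with k randomly placed centroids within the range of each column.
    ranges = [(min(row[i] for row in rows), max(row[i] for row in rows))
              for i in range(len(rows[0]))]
    centroids = [[random.uniform(lo, hi) for lo, hi in ranges] for _ in range(k)]

    last_matches = None
    for _ in range(iterations):
        # Assign every row to its nearest centroid.
        matches = [[] for _ in range(k)]
        for j, row in enumerate(rows):
            best = min(range(k), key=lambda i: distance(centroids[i], row))
            matches[best].append(j)
        if matches == last_matches:
            break  # assignments stopped changing
        last_matches = matches
        # Move each centroid to the average of the rows assigned to it.
        for i in range(k):
            if matches[i]:
                centroids[i] = [sum(rows[j][col] for j in matches[i]) / len(matches[i])
                                for col in range(len(rows[0]))]
    return matches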



To visualize the clustered data, we use a technique called multidimensional scaling, which finds a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and tries to produce a chart in which the distances between the items match those differences.
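
A gradient-descent sketch of that scaling: start the items at random 2-D positions and repeatedly nudge each one in proportion to how far its current distances are from the real ones (the rate and iteration count are arbitrary choices here):

import random
from math import sqrt

def scale_down(distances, rate=0.01, iterations=1000):
    n = len(distances)
    # Random starting positions in two dimensions.
    loc = [[random.random(), random.random()] for _ in range(n)]
    for _ in range(iterations):
        # Current pairwise distances of the 2-D layout.
        fake = [[sqrt(sum((loc[i][d] - loc[j][d]) ** 2 for d in range(2)))
                 for j in range(n)] for i in range(n)]
        # Move each point toward or away from every other point depending on
        # whether it is currently too close or too far in the layout.
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j or distances[i][j] == 0 or fake[i][j] == 0:
                    continue
                error = (fake[i][j] - distances[i][j]) / distances[i][j]
                for d in range(2):
                    grad[i][d] += (loc[i][d] - loc[j][d]) / fake[i][j] * error
        for i in range(n):
            for d in range(2):
                loc[i][d] -= rate * grad[i][d]
    return loc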



Searching and Ranking:

Searching and ranking are another set of operations ubiquitous in collective intelligence programming. Searching usually begins with crawling, whether over a fixed set of documents or a corporate intranet. After the documents are collected, they are indexed.

Crawling or spidering involves starting with a small set of pages called the seed, following the links found on them, and indexing each page visited while discovering more links to follow.

Adding a page to the index means separating the words in its text and associating each entry with the URL. We also remove stop words from the list, check whether the page has already been indexed, and get an id for each entry. Finally, we keep track of the word locations.
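
A sketch of that indexing step, using plain in-memory dictionaries instead of the database tables the book builds (the stop-word list here is a tiny illustrative one):

import re

STOP_WORDS = {'the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'}
index = {}           # word -> list of (url, position) pairs
indexed_urls = set() # pages we have already processed

def add_to_index(url, text):
    if url in indexed_urls:
        return  # page already indexed
    indexed_urls.add(url)
    words = [w.lower() for w in re.split(r'[^A-Za-z]+', text) if w]
    for position, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        index.setdefault(word, []).append((url, position))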

Having looked at searching, we now look at ranking. We retrieve the pages that match the queries.

The pages are not scored at crawl time; instead, we compute the score at query time. Scoring can be based on word frequency, document location, and word distance. As we encounter the words, we put them in a hash table to update the frequency and the location. Then, as we walk the hash table, we can build a sorted dictionary in which pages are scored based on frequency.
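
For example, a word-frequency score over candidate pages could be computed like this (a sketch reusing the in-memory index above and normalizing so the best page scores 1.0):

def frequency_score(query_words, pages):
    # Count how often the query words occur on each candidate page.
    counts = {page: 0 for page in pages}
    for word in query_words:
        for url, _position in index.get(word, []):
            if url in counts:
                counts[url] += 1
    # Normalize so the highest-frequency page gets a score of 1.0.
    top = max(counts.values(), default=0) or 1
    return {page: counts[page] / top for page in counts}

def ranked(query_words, pages):
    # Pages sorted best-first by their frequency score.
    scores = frequency_score(query_words, pages)
    return sorted(pages, key=lambda p: scores[p], reverse=True)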

If we want to find all link paths that start from a given page and lead back to a target page, a recursive crawl along the following lines (C#-style) works:

// Depth-first traversal that collects every path of links from the current page back to the target page.
public static void GetAllPaths(Uri current, Uri target, List<Uri> candidate, List<List<Uri>> results, int depth)
{
    if (depth > MaxDepth) return;   // MaxDepth is a crawl-depth limit defined elsewhere

    var links = GetAllTheLinksOnTheCurrentPage(current);

    if (links.Contains(target))
    {
        candidate.Add(target);                      // add it to the candidate path
        results.Add(new List<Uri>(candidate));      // record a completed path
        candidate.RemoveAt(candidate.Count - 1);    // remove it from the candidate path
    }

    foreach (var link in links)
    {
        if (candidate.Contains(link)) continue;     // avoid revisiting pages already on this path
        candidate.Add(link);                        // add it to the candidate path
        GetAllPaths(link, target, candidate, results, depth + 1);
        candidate.RemoveAt(candidate.Count - 1);    // remove it from the candidate path
    }
}



Document Filtering:

Classifying documents using machine intelligence is fast becoming a common approach in applications such as eliminating spam in e-mails, blogs, RSS feeds, or message boards. Other applications include dividing your inbox into categories such as social or work-related e-mails. Early attempts were all rule-based classifiers, with rules either provided out of the box or authored by users.



Machine-learning classifiers, by contrast, operate on features that are used to classify different items. Features can simply be a set of words or phrases in a document. Determining which features to use is tricky and important: the entire text, or even a single character, could be a feature. Learning the features is what training the classifier accomplishes. When classifiers are trained, they keep track of the counts of feature/category pairs, the number of times a feature has appeared in a category, the total number of items, and the list of all categories. By passing a feature-extraction function to the classifier, such as one that returns the words in a text, and by training the classifier with associations of features with categories, the classifier can start accumulating these counts and probabilities. The probabilities are usually calculated as conditional on the category.



The classifier is able to make predictions based only on the training data it has seen so far. If the training data is not sufficient, we get around this by starting from an assumed probability, for example 0.5, which can be weighted. The weighted probability returns a weighted average of the calculated probability and the assumed probability.
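
In code, a weighted probability along those lines might look like this (the weight of 1.0 and assumed probability of 0.5 are the conventional starting points):

def weighted_prob(basic_prob, count, weight=1.0, assumed=0.5):
    # Blend the observed probability with an assumed prior,
    # trusting the observation more as the feature count grows.
    return (weight * assumed + count * basic_prob) / (weight + count)

For a feature seen once with a calculated probability of 1.0, this returns 0.75, and the value approaches the calculated probability as the count grows.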



Now that we have the probability of a document in a category for a given word/feature, we combine the probabilities of all the features that make up the document to get the probability of the document belonging to that category. If the probabilities of the different features are treated as independent, the result is called a naïve Bayes classifier. Once the classifier is constructed, the final step in classifying a document is to calculate the probability of that document belonging to each category and then choose the best.
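
Under the independence assumption, combining the feature probabilities reduces to a product; a sketch, where feature_prob and category_prob stand in for the trained lookups:

def document_prob(features, category, feature_prob, category_prob):
    # Naive Bayes: Pr(document | category) is the product of the individual
    # feature probabilities, then scaled by Pr(category).
    p = category_prob(category)
    for feature in features:
        p *= feature_prob(feature, category)
    return p

def classify(features, categories, feature_prob, category_prob):
    # Pick the category with the highest combined probability.
    return max(categories,
               key=lambda c: document_prob(features, c, feature_prob, category_prob))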



Spam filtering does not have to use the naïve Bayes classifier. An alternative is the Fisher method, implemented by SpamBayes, an Outlook plug-in. The difference is that this method calculates the probability of a category for each feature in the document and then combines those probabilities to see whether they look like a random set or whether the document is more or less likely to be in a category. In the Bayesian filter, we combined all the Pr(feature | category) results, but there might be many more documents in one category than the other, and this was ignored. Here we take advantage of the features that distinguish the categories. The Fisher method calculates:

 clf = Pr(feature | category) for this category,

freqsum = sum of Pr(feature | category) for all the categories

and then the cprob = clf / freqsum.

By this method a mail containing the word 'casino' will be treated as spam with a better probability estimate than the Bayes classifier gives, which is useful when deciding thresholds and cutoffs. When the probabilities are independent and random, the chi-square measure can be used to judge the goodness of the classification. Since an item belonging to a category will have many features of that category with high probability, the Fisher method calculates the inverse chi-square function to get the probability that a random set of probabilities would return such a high number.
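
Sketched in code, the Fisher combination multiplies the per-feature category probabilities, feeds -2 times the natural log of that product into the inverse chi-square function, and treats the result as the category probability (cprob stands in for the clf/freqsum value above and is assumed to be strictly positive):

from math import exp, log

def inv_chi2(chi, df):
    # Probability that a chi-square value this large arises by chance
    # (df is even here because we use 2 * the number of features).
    m = chi / 2.0
    total = term = exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(features, category, cprob):
    p = 1.0
    for feature in features:
        p *= cprob(feature, category)
    fscore = -2 * log(p)
    return inv_chi2(fscore, len(features) * 2)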



When there are many users training the classifier simultaneously, it is probably better to store the data and the counts in a database such as SQLite.



Modeling decision trees:

As an example, a Bayesian classifier can help decide whether a sentence that uses the word python refers to the programming language or to the snake. It works based on the probability P(category | Document), which is proportional to P(Document | Category) * P(category)



where P(Document | Category) = P(word1 | category) * P(word2 | category) * ....



All we need is a feature-extraction function that turns the data we're using for training or classification into a list of features: something that takes an object and returns a list, such as a sentence converted into words. The classifier is trained on the strings and can then be used to predict the test data.



docclass.getwords('python is a dynamic language')



{'python' : 1, 'dynamic':1, 'language':1}



c1 = docclass.naivebayes(docclass.getwords)



c1.setdb('test.db')



c1.train('python has dynamic types', 'language')



c1.train('pythons are constrictors', 'snake')



c1.classify('boa constrictors')



u'snake'



c1.classify('dynamic programming')



u'language'



A decision tree classifier builds a model that predicts the classification through a series of yes/no branches based on each node's criteria. It checks each item against the model and predicts a category. Decision trees are notable for being easy to read and interpret. Classifying with a decision tree is generally easy, but training one is usually trickier. The divisions are usually based on entropy, which is the amount of disorder in a set and is measured as follows:



P(i) = frequency(outcome) = count(outcome) / count(total rows)



Entropy = -1 * sum of p(i) * log2(p(i)) over all outcomes



Low entropy tells us that the set is homogeneous. To determine the dividing variable, we find the information gain based on the entropy which is determined as follows:



Weight1 = size of subset1 / size of original set



Weight2 = size of subset2 / size of original set



Gain = entropy(original) - weight1*entropy(set1) - weight2*entropy(set2)

Based on the dividing variable that gives the highest gain, the first node can be created. This can then be further divided to form the decision tree.
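
Sketched in code, with each row being a list whose last element is the outcome:

from math import log2
from collections import Counter

def entropy(rows):
    # Shannon entropy of the outcome column (the last value in each row).
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, set1, set2):
    # How much the split into set1 and set2 reduces the entropy.
    w1 = len(set1) / len(rows)
    w2 = len(set2) / len(rows)
    return entropy(rows) - w1 * entropy(set1) - w2 * entropy(set2)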



Neural networks can also be used as classifiers. For example, a simple neural network can be used to alter the ranking of search results based on which links users have clicked in the past. It works by giving a number to every link, predicting that the link with the highest number is the one the user will click; the numbers are then used to change the rankings. There are many different kinds of neural networks. One example is the multilayer perceptron network, so named because it has a layer of input neurons that feed into one or more layers of hidden neurons, with the output of one layer fed as input to the next. If, for a given input, the middle of three hidden nodes responds most strongly while the first responds negatively and the last positively, the middle node will dominate the classification. Usually a combination of the different layers determines the end result.



Support vector machines are sophisticated classifiers. They build a predictive model by finding the dividing line between two categories: the line from which the data points of both categories are most distant is chosen as the best. The points that are closest to the line are the ones that determine it and are called support vectors. Once the line is found, classifying new items is just a matter of which side of the line they fall on.



A vector space model can also be used for categorization. In this model, we represent the documents of a collection as vectors whose weights correspond to the terms appearing in each document. The similarity between two documents d1 and d2 with respect to a query q is measured via the cosine of the angle between each document vector and the query vector; a document is closer to the query when that cosine is larger. Each query is modeled as a vector using the same attribute space as the documents. Groups of documents that are similar with respect to the users' information needs can then be obtained by clustering, the documents being a collection of points in a multidimensional space whose dimensions are the selected features. There are many clustering algorithms, and their fitness or quality can be measured. Sometimes it is better to reduce the number of dimensions without significant loss of information; this transformation to a subspace helps with clustering and reduces the computation. Another technique uses a matrix for principal component analysis, in which the document and query vectors are projected onto a subspace spanned by the k principal components; a matrix such as the covariance matrix defines the principal values. Notice that simply reducing the number of dimensions does not shift the origin, only the number of dimensions, so the vectors stay spaced apart the same way. Principal component analysis, on the other hand, shifts the origin to the center of the basis vectors in the subspace, so the documents are better differentiated.
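
The cosine measure itself is short; a sketch over two equal-length term-weight vectors:

from math import sqrt

def cosine_similarity(doc, query):
    # Cosine of the angle between two term-weight vectors;
    # 1.0 means the same direction, 0.0 means no terms in common.
    dot = sum(d * q for d, q in zip(doc, query))
    norm = sqrt(sum(d * d for d in doc)) * sqrt(sum(q * q for q in query))
    return 0.0 if norm == 0 else dot / norm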



Evolving Intelligence:

An example of evolving intelligence is genetic programming, a machine learning technique inspired by the theory of biological evolution. It starts with a large set of programs that are somewhat good solutions and makes them compete on a user-defined task. The programs are ranked from best to worst; some parts of a program are altered slightly (mutation), and some parts are replaced by parts from other programs (crossover or breeding). The new population is called a generation, the quality of its programs is measured with a fitness function, and the procedure is iterated.



The worst programs are eliminated at each round. The iterations terminate when a perfect solution has been reached, a good enough solution has been found, the solution has not improved for several generations, or the number of generations has reached a specified limit.



These programs are represented as trees. Most programming languages are compiled into a parse tree, and these programs are very similar to that; LISP programs are essentially a way of expressing a parse tree directly. As an example, consider:



def func(x,y):



   if (x > 3):



         return y + 5



   else:



         return y + 3
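
A minimal sketch of how such a program could be held as a tree of nodes that can later be mutated or crossed over (these node classes are illustrative, not the book's exact ones):

class ParamNode:
    # Leaf that returns one of the program's inputs.
    def __init__(self, index):
        self.index = index
    def evaluate(self, inputs):
        return inputs[self.index]

class ConstNode:
    # Leaf that returns a constant.
    def __init__(self, value):
        self.value = value
    def evaluate(self, inputs):
        return self.value

class OpNode:
    # Interior node that applies a function to its children's values.
    def __init__(self, fn, children):
        self.fn = fn
        self.children = children
    def evaluate(self, inputs):
        return self.fn(*[c.evaluate(inputs) for c in self.children])

# func(x, y): if x > 3 then y + 5 else y + 3
tree = OpNode(lambda cond, a, b: a if cond else b, [
    OpNode(lambda x, y: x > y, [ParamNode(0), ConstNode(3)]),
    OpNode(lambda x, y: x + y, [ParamNode(1), ConstNode(5)]),
    OpNode(lambda x, y: x + y, [ParamNode(1), ConstNode(3)]),
])
print(tree.evaluate([5, 2]))   # 7
print(tree.evaluate([1, 2]))   # 5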



In .NET, the equivalent of such a parse tree is an expression tree built with System.Linq.Expressions; for example, the comparison x > 3 becomes:



Expression<Func<int, bool>> lambda = x => x > 3;



or, building it up explicitly:



ParameterExpression left = Expression.Parameter(typeof(int), "x");

ConstantExpression right = Expression.Constant(3, typeof(int));

Expression e2 = Expression.GreaterThan(left, right);



This concludes the set of sections that I've reviewed from this book. We leave with the note that most of the methods mentioned above are as applicable to the web as they are to a database or a personal computer. As an example, data can be queried from search engines or web APIs just as effectively as from a corpus.

#coding exercise
Solve a set of linear equations using matrices:
We follow what is known as LUP decomposition:
L is a unit lower triangular matrix
U is an upper triangular matrix
P is a permutation matrix
LUP decomposition finds three matrices such that PA = LU. Forward substitution then solves Ly = Pb, and backward substitution solves Ux = y.
We solve Ax = b this way:
LUP-Solve(L, U, P, b)
{
    int n = rows[L];
    // forward substitution: solve L*y = P*b
    for (int i = 1; i <= n; i++)
        y[i] = b[P[i]] - Sum(j = 1 to i-1) L[i][j] * y[j];
    // backward substitution: solve U*x = y
    for (int i = n; i >= 1; i--)
        x[i] = (y[i] - Sum(j = i+1 to n) U[i][j] * x[j]) / U[i][i];
    return x;
}
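
As a quick check in Python, SciPy's LU routines implement the same idea: lu_factor computes the factors and pivot indices, and lu_solve performs the forward and backward substitutions.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1.0, 2.0, 0.0],
              [3.0, 4.0, 4.0],
              [5.0, 6.0, 3.0]])
b = np.array([3.0, 7.0, 8.0])

lu, piv = lu_factor(A)      # PA = LU, stored compactly
x = lu_solve((lu, piv), b)  # forward then backward substitution
print(x)                    # solution of Ax = b, here [-1.4, 2.2, 0.6]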


using System.Collections.Generic;

// Returns the smallest positive integer that does not appear in the array.
static int GetSmallestPositiveNumberNotInArray(int[] arr)
{
    if (arr == null || arr.Length == 0) return 1;

    // Keep only the positive values so membership checks are cheap.
    var present = new HashSet<int>();
    foreach (var value in arr)
    {
        if (value > 0) present.Add(value);
    }

    // Scan upward from 1; the first value that never appears is the answer.
    int smallest = 1;
    while (present.Contains(smallest) && smallest < int.MaxValue)
    {
        smallest++;
    }
    return smallest;
}

