Friday, September 26, 2014


In the book “Programming Collective Intelligence” from O'Reilly Media, having covered collaborative filtering, we review the chapter on discovering groups. Data clustering is introduced in this chapter as a method for discovering and visualizing groups of things, people, or ideas that are all closely related. The neural networks, decision trees, support vector machines, and Bayesian filtering reviewed earlier are supervised learning methods, where inputs and their known outputs are used to make predictions. Clustering, on the other hand, is an unsupervised learning method. Unlike a neural network or a decision tree, unsupervised learning algorithms are not trained with examples of correct answers. Instead, they find structure within the data and make use of it.

As an example, blogs can be clustered based on word frequencies, which could be useful in searching, cataloging, and discovering the huge number of blogs that are currently online. The feedparser module in Python makes it easy to get the titles, links, and entries from any RSS or Atom feed. With this module, it's possible to extract a list of words from the title and the description of each entry and count them. With the word counts (wc) and the number of blogs each word appears in (apcount), we proceed to create a matrix that has a row for each blog and a column for each word; each row is that blog's word-count vector.
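Roughly, the per-blog word counting could look like this in Python (getwordcounts is my name for the helper; feedparser.parse and the entry attributes are the module's real API):

import re
import feedparser

def getwordcounts(url):
    # Parse the RSS/Atom feed and tally how often each word appears
    d = feedparser.parse(url)
    wc = {}
    for e in d.entries:
        summary = e.summary if 'summary' in e else e.description
        # Split the title and description into lowercase words
        text = e.title + ' ' + summary
        words = [w.lower() for w in re.split(r'[^A-Za-z]+', text) if w]
        for word in words:
            wc[word] = wc.get(word, 0) + 1
    return d.feed.title, wc

Running this over a list of feed URLs and lining up the per-blog dictionaries against a common word list gives the rows of the matrix.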

With this sample data, there are two clustering methods discussed in the book: hierarchical clustering and K-means clustering. The first yields a dendrogram and the second yields k partitions. Hierarchical clustering works by starting with as many single-item clusters as there are items and then building up a hierarchy by repeatedly merging the two closest clusters. The order in which clusters are merged shows which items are closer together, and hence how tight each cluster is.

To determine closeness, a couple of formulas help. In this case, we could use the Euclidean distance or the Pearson correlation coefficient. The Euclidean distance between two points in a multidimensional space is found by taking the sum of the squares of the differences between the coordinates of the points and then calculating the square root of the result. The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is generally a value between -1 and 1, where 1 means a perfect correlation, -1 a perfect inverse correlation, and 0 no correlation. Its numerator is the sum of the products of the two variables minus the product of their individual sums divided by the number of items; its denominator is the square root of the product of, for each variable on its own, the sum of its squares minus the square of its sum divided by the number of items. As an aside, another way to measure similarity between two overlapping sets is the Tanimoto coefficient, defined as the number of items in the intersection divided by the number of items in one set plus the number of items in the other set minus the number of items in the intersection.
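These measures translate directly into Python. Here is a minimal sketch of the three measures and of the merging loop itself (the names euclidean, pearson, tanimoto, Bicluster, and hcluster are mine; the book's own versions differ in detail):

from math import sqrt

def euclidean(v1, v2):
    # Straight-line distance between two points in n-dimensional space
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def pearson(v1, v2):
    # Correlation coefficient; 1 - pearson(v1, v2) can serve as a
    # distance, so that highly correlated vectors count as "close"
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(a * a for a in v1)
    sum2sq = sum(b * b for b in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - sum1 * sum2 / n
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    return num / den if den != 0 else 0

def tanimoto(s1, s2):
    # Set overlap: size of the intersection divided by size of the union
    inter = len(s1 & s2)
    return inter / (len(s1) + len(s2) - inter)

class Bicluster:
    def __init__(self, vec, left=None, right=None, dist=0.0):
        self.vec, self.left, self.right, self.dist = vec, left, right, dist

def hcluster(rows, distance=euclidean):
    # Start with one cluster per row; merge the closest pair until a
    # single tree (the dendrogram) remains
    clusters = [Bicluster(row) for row in rows]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i].vec, clusters[j].vec)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # The merged cluster's vector is the average of the pair
        merged = [(a + b) / 2 for a, b in zip(clusters[i].vec, clusters[j].vec)]
        node = Bicluster(merged, left=clusters[i], right=clusters[j], dist=d)
        del clusters[j]  # delete the higher index first
        del clusters[i]
        clusters.append(node)
    return clusters[0]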

K-means clustering, on the other hand, breaks the data into a fixed number of distinct groups rather than uncovering structure the way hierarchical clustering does. Here the number of clusters k is specified in advance, and the algorithm begins with k randomly placed centroids. Each item is assigned to its nearest centroid, the centroids are then moved to the average location of all the items assigned to them, and the assignments are recomputed; the iterations continue until the assignments stop changing.
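A minimal sketch of that loop (kmeans and its parameters are my names; pass in one of the measures from the earlier sketch, such as euclidean, as the distance):

import random

def kmeans(rows, distance, k=4, maxiter=100):
    ncols = len(rows[0])
    # Range of each column, used to place the initial random centroids
    ranges = [(min(r[c] for r in rows), max(r[c] for r in rows))
              for c in range(ncols)]
    centroids = [[random.uniform(lo, hi) for lo, hi in ranges] for _ in range(k)]
    lastmatches = None
    for _ in range(maxiter):
        # Assign every row to its closest centroid
        matches = [[] for _ in range(k)]
        for j, row in enumerate(rows):
            best = min(range(k), key=lambda c: distance(centroids[c], row))
            matches[best].append(j)
        if matches == lastmatches:
            break  # assignments stopped changing
        lastmatches = matches
        # Move each centroid to the average location of its assigned rows
        for c in range(k):
            if matches[c]:
                centroids[c] = [sum(rows[j][d] for j in matches[c]) / len(matches[c])
                                for d in range(ncols)]
    return matches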

To visualize the clustered data, a technique called multidimensional scaling can be used to find a two-dimensional representation of the dataset. The algorithm takes the distance between every pair of items and produces a chart in which the distances between the plotted items match those pairwise distances as closely as possible.
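One way to do this is by gradient descent, nudging each point to reduce the mismatch between its chart distances and its real distances. A sketch of that idea (scaledown is my name for it; rate and iterations are illustrative parameters):

import random
from math import sqrt

def scaledown(rows, distance, rate=0.01, iterations=1000):
    n = len(rows)
    # The real distances between every pair of items
    realdist = [[distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    # Start every item at a random location on the 2-D chart
    loc = [[random.random(), random.random()] for _ in range(n)]
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j or realdist[i][j] == 0:
                    continue
                fake = sqrt(sum((loc[i][d] - loc[j][d]) ** 2 for d in range(2)))
                if fake == 0:
                    continue
                # Proportional error between the chart and real distances
                error = (fake - realdist[i][j]) / realdist[i][j]
                # Move point i toward or away from point j to shrink the error
                for d in range(2):
                    grad[i][d] += (loc[i][d] - loc[j][d]) / fake * error
        for i in range(n):
            for d in range(2):
                loc[i][d] -= rate * grad[i][d]
    return loc

Plotting each returned (x, y) pair with its item's label gives the two-dimensional map.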
As I was walking out of a building today and stood at the curb waiting for a cab I had requested, a man pulled up and said I should not be waiting at the curb near the crosswalk because it delayed him. Really? Just then a Platt driver crossed the road ...
Anyway, it reminded me to take a short break to do a coding exercise:
Given a sequence of integers, find the max product of a contiguous subsequence.
static int maxProduct(int[] input) {
    if (input == null || input.length == 0) throw new IllegalArgumentException();
    // Track both the largest and the smallest product ending at the
    // current index; a negative number can turn the smallest into the largest.
    int max = input[0];
    int curMax = input[0];
    int curMin = input[0];
    for (int i = 1; i < input.length; i++) {
        int x = input[i];
        int bestHere = Math.max(x, Math.max(curMax * x, curMin * x));
        int worstHere = Math.min(x, Math.min(curMax * x, curMin * x));
        curMax = bestHere;
        curMin = worstHere;
        max = Math.max(max, curMax);
    }
    return max;
}

