In the book “Programming Collective Intelligence” from O’Reilly Media, after “Collaborative Filtering” we review the chapter on discovering groups. Data clustering is introduced in this chapter as a method for discovering and visualizing groups of things, people, or ideas that are all closely related. The neural networks, decision trees, support vector machines, and Bayesian filtering reviewed earlier are supervised learning methods, where inputs and outputs are used to make predictions. Clustering, on the other hand, is an unsupervised learning method. Unlike a neural network or a decision tree, unsupervised learning algorithms are not trained with examples of correct answers. Instead, they try to find structure in the data and make use of it.
As an example, blogs can be clustered based on word frequencies, which could be useful in searching, cataloging, and discovering the huge number of blogs that are currently online. The feedparser module in Python makes it easy to get the titles, links, and entries from any RSS or Atom feed. With this module, it's possible to extract a list of words from the title and the description and use those to count the words. With the word counts per blog (wc) and the number of blogs each word appears in (apcount), we proceed to create a matrix with a column for each word's count and a row for each blog. These rows are vectors.
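A minimal sketch of that pipeline might look like this; the feed URLs and the getwords helper here are illustrative assumptions, not the book's exact code:

    import re
    import feedparser

    def getwords(html):
        # Strip any markup, then split on non-letter characters
        text = re.sub(r'<[^>]+>', '', html)
        return [w.lower() for w in re.split(r'[^A-Za-z]+', text) if w]

    feedlist = ['http://example.com/feed1.xml']  # hypothetical feed URLs

    apcount = {}     # number of blogs each word appears in
    wordcounts = {}  # per-blog word counts
    for url in feedlist:
        d = feedparser.parse(url)
        wc = {}
        for e in d.entries:
            summary = e.summary if 'summary' in e else e.description
            for word in getwords(e.title + ' ' + summary):
                wc[word] = wc.get(word, 0) + 1
        wordcounts[d.feed.title] = wc
        for word in wc:
            apcount[word] = apcount.get(word, 0) + 1

    # Build the matrix: one row (vector) per blog, one column per word
    wordlist = sorted(apcount)
    data = [[wordcounts[blog].get(w, 0) for w in wordlist] for blog in wordcounts]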
With this sample data, there are two clustering methods discussed in the book: the first is hierarchical clustering and the second is K-means clustering. The first one yields a dendrogram and the second yields k partitions. The former works by starting with many single-item clusters and then building up a hierarchy of clusters by repeatedly merging the two closest ones (a sketch of this appears below, after the distance measures). The distance at which clusters are merged shows which items are closer together and hence how tight each cluster is. To determine closeness, a couple of mathematical formulas help. In this case, we could use the Euclidean distance or the Pearson correlation coefficient. The Euclidean distance finds the distance between two points in a multidimensional space by taking the sum of the squares of the differences between the coordinates of the points and then calculating the square root of the result.
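In code, that is just a few lines over two word-count vectors (a sketch using plain Python lists):

    from math import sqrt

    def euclidean(v1, v2):
        # Sum the squared coordinate differences, then take the square root
        return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

For example, euclidean([0, 0], [3, 4]) returns 5.0.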
The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is generally a value between -1 and 1, where -1 means a perfect inverse correlation, 1 means a perfect correlation, and 0 means no correlation. In the computational form, the numerator is the sum of the products of the two variables minus the product of their individual sums divided by the number of items, and this is divided by the square root of the product of two analogous terms, one per variable: the sum of its squares minus the square of its sum divided by the number of items.
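In code, this is a sketch along the lines of the book's pearson function; it returns 1.0 minus the coefficient so that it behaves like a distance, where smaller means more similar:

    from math import sqrt

    def pearson(v1, v2):
        n = len(v1)
        sum1, sum2 = sum(v1), sum(v2)
        sum1sq = sum(x ** 2 for x in v1)
        sum2sq = sum(x ** 2 for x in v2)
        psum = sum(a * b for a, b in zip(v1, v2))
        # Numerator: sum of products minus the product of the sums over n
        num = psum - (sum1 * sum2 / n)
        # Denominator: square root of the product of the per-variable terms
        den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
        if den == 0:
            return 0
        return 1.0 - num / den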
As an aside, another way to measure similarity between two overlapping sets is the Tanimoto coefficient, defined as the number of items in the intersection divided by the number of items in one set plus the number of items in the other set minus the number of items in the intersection.
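That works out to intersection over union; a tiny sketch using Python sets:

    def tanimoto(a, b):
        # |A ∩ B| / (|A| + |B| - |A ∩ B|)
        shared = len(set(a) & set(b))
        return shared / (len(set(a)) + len(set(b)) - shared)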
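With a distance measure in hand, hierarchical clustering itself can be sketched as promised above: repeatedly find the closest pair of clusters and merge them into a new cluster whose vector is the average of the pair. This is a simplified O(n³) sketch in the spirit of the book's hcluster, reusing the pearson function above; it is not the book's exact code:

    class Cluster:
        def __init__(self, vec, left=None, right=None, distance=0.0):
            self.vec = vec
            self.left = left          # children; None for a leaf (a single blog)
            self.right = right
            self.distance = distance  # distance at which the children merged

    def hcluster(rows, distance=pearson):
        clusters = [Cluster(row) for row in rows]
        while len(clusters) > 1:
            # Find the closest pair of clusters
            lowest, pair = None, (0, 1)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = distance(clusters[i].vec, clusters[j].vec)
                    if lowest is None or d < lowest:
                        lowest, pair = d, (i, j)
            i, j = pair
            # Merge: the new cluster's vector is the pairwise average
            merged = [(a + b) / 2 for a, b in zip(clusters[i].vec, clusters[j].vec)]
            clusters.append(Cluster(merged, clusters[i], clusters[j], lowest))
            del clusters[j]  # remove the higher index first
            del clusters[i]
        return clusters[0]  # the root of the dendrogram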
K-means clustering, on the other hand, breaks the data into a preset number of distinct groups instead of revealing the structure the way hierarchical clustering does. The number of clusters to be generated is specified in advance, and that many randomly placed centroids are chosen as a starting point. In each iteration, every item is assigned to its nearest centroid, and each centroid is then moved to the average location of all the items assigned to it; the iterations continue until the assignments no longer change.
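A sketch of K-means over the same rows, reusing the euclidean function above (randomly placed initial centroids, iterating until the assignments stop changing):

    import random

    def kcluster(rows, k=4, distance=euclidean, maxiter=100):
        cols = len(rows[0])
        # Random initial centroids, each within the observed range of every column
        lo = [min(r[c] for r in rows) for c in range(cols)]
        hi = [max(r[c] for r in rows) for c in range(cols)]
        centroids = [[random.uniform(lo[c], hi[c]) for c in range(cols)]
                     for _ in range(k)]
        last = None
        for _ in range(maxiter):
            # Assign every row to its nearest centroid
            assignments = [[] for _ in range(k)]
            for idx, row in enumerate(rows):
                best = min(range(k), key=lambda c: distance(centroids[c], row))
                assignments[best].append(idx)
            if assignments == last:  # nothing moved: converged
                break
            last = assignments
            # Move each centroid to the average location of its assigned rows
            for c in range(k):
                if assignments[c]:
                    centroids[c] = [
                        sum(rows[i][col] for i in assignments[c]) / len(assignments[c])
                        for col in range(cols)]
        return assignments  # k lists of row indices, one per partition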
To visualize the clustered data, we use a technique called multidimensional scaling, which finds a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and makes a chart in which the distances between the items match those differences as closely as possible.
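A sketch in the spirit of the book's scaledown function: start the items at random 2-D positions, then use gradient descent to nudge them until the on-chart distances approximate the real, high-dimensional distances:

    import random
    from math import sqrt

    def scaledown(rows, distance=pearson, rate=0.01, iterations=1000):
        n = len(rows)
        # Real distances between every pair of items
        realdist = [[distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
        # Random initial 2-D positions
        loc = [[random.random(), random.random()] for _ in range(n)]
        for _ in range(iterations):
            grad = [[0.0, 0.0] for _ in range(n)]
            for i in range(n):
                for j in range(n):
                    if i == j or realdist[i][j] == 0:
                        continue
                    # Current on-chart distance between items i and j
                    d = sqrt(sum((loc[i][x] - loc[j][x]) ** 2 for x in range(2)))
                    if d == 0:
                        continue
                    # Fractional error relative to the real distance
                    err = (d - realdist[i][j]) / realdist[i][j]
                    for x in range(2):
                        grad[i][x] += ((loc[i][x] - loc[j][x]) / d) * err
            # Move each point against its gradient
            for i in range(n):
                for x in range(2):
                    loc[i][x] -= rate * grad[i][x]
        return loc  # 2-D coordinates, one pair per item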
As I was walking out of a building today and stood at the curb waiting for a cab I had requested, a man pulled up and said I should not be waiting at the curb near the crosswalk because it delayed him. Really? Just then a Platt driver crossed the road ...
Anyway, that reminded me to take a short break to do a coding exercise: given a sequence of integers, find the max product of a contiguous subsequence. The trick is to track both the largest and the smallest products ending at each position, since multiplying by a negative number can flip the smallest into the largest.
    static int maxProduct(int[] input) {
        if (input == null || input.length <= 0) throw new IllegalArgumentException();
        int max = input[0];     // best product seen so far
        int curMax = input[0];  // largest product of a subsequence ending here
        int curMin = input[0];  // smallest product ending here (may be negative)
        for (int i = 1; i < input.length; i++) {
            int a = curMax * input[i];
            int b = curMin * input[i];
            curMax = Math.max(input[i], Math.max(a, b));
            curMin = Math.min(input[i], Math.min(a, b));
            max = Math.max(max, curMax);
        }
        return max;
    }