Thursday, September 25, 2014

Making recommendations is yet another application of collective intelligence as described in the book mentioned in the previous posts. Recommendations include suggestions for online shopping, interesting web sites, or items such as movies or music. Generally, to make a recommendation, we first find a group of people sharing similar tastes and then use the preferences of that group to build a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group.

To find people similar enough to form a group, we use some form of similarity score. One way to calculate this score is to take the items that two people have ranked in common, use those rankings as axes in a chart, and measure how close together the two people fall; this is called the Euclidean distance score. Another way to calculate the similarity score is to find a correlation coefficient, which is a measure of how well two sets of data fit on a straight line. By finding a best linear fit, we adjust for what is called grade inflation: some people may consistently grade higher than others while sharing the same trends. Hence the latter scoring technique (the Pearson correlation score) tends to give better results than the former.
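As a rough sketch of the two scores above in Python (the sample data and function names here are illustrative, not from the post), working over the nested preference dictionary:

```python
from math import sqrt

# Illustrative nested dictionary: person -> {item: rating on a 1-to-5 scale}
prefs = {
    'Alice': {'MovieA': 4, 'MovieB': 3, 'MovieC': 5},
    'Bob':   {'MovieA': 5, 'MovieB': 2, 'MovieC': 4},
    'Carol': {'MovieA': 2, 'MovieB': 5},
}

def euclidean_score(prefs, p1, p2):
    """Similarity based on Euclidean distance over commonly rated items."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[p1][it] - prefs[p2][it]) ** 2 for it in shared)
    return 1.0 / (1.0 + sqrt(sum_sq))   # 1.0 means identical ratings

def pearson_score(prefs, p1, p2):
    """Correlation coefficient over shared items; corrects for grade inflation."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0.0
    x = [prefs[p1][it] for it in shared]
    y = [prefs[p2][it] for it in shared]
    sum_x, sum_y = sum(x), sum(y)
    num = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    den = sqrt((sum(a * a for a in x) - sum_x ** 2 / n) *
               (sum(b * b for b in y) - sum_y ** 2 / n))
    return num / den if den else 0.0
```

A person compared with themselves scores 1.0 on the Euclidean measure; two people who grade on different absolute scales but with the same trend still score highly on the Pearson measure, which is why it handles grade inflation better.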
Searching and ranking are another set of operations ubiquitous in collective intelligence programming. Searching usually begins with crawling, whether over a fixed set of documents or the corporate intranet. After the documents are collected, they are indexed.
Crawling or spidering involves starting with a small set of pages called the seed, from which links are followed; each page visited is indexed while more links to follow are discovered.
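The crawl loop can be sketched as a breadth-first traversal. In this sketch, `get_links` and `index_page` are assumed caller-supplied callbacks (not named in the post), and `max_pages` is an illustrative safety cap:

```python
from collections import deque

def crawl(seed_pages, get_links, index_page, max_pages=100):
    """Breadth-first crawl: start from the seed, index each page visited,
    and queue newly discovered links to follow."""
    frontier = deque(seed_pages)
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:          # skip pages we have already indexed
            continue
        visited.add(url)
        index_page(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```

With an in-memory link graph standing in for the web, the crawl visits every page reachable from the seed exactly once.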
Adding to the index means that we separate the words in the text and associate each entry with the url. We also remove the stop words from the list. We check whether the page has already been indexed, get an id for each entry, and finally keep track of the word locations.
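Those indexing steps can be sketched as follows. The stop-word list and the index layout (word -> url -> list of locations) are illustrative choices, not prescribed by the post:

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is'}  # illustrative

index = {}       # word -> {url: [locations of the word in that page]}
page_ids = {}    # url -> id, so each entry gets an id

def add_to_index(url, text):
    """Separate the words, drop stop words, and record each word's
    location in the page against the url."""
    if url in page_ids:              # page has already been indexed
        return page_ids[url]
    page_ids[url] = len(page_ids)    # assign an id to this entry
    words = re.findall(r'[a-z0-9]+', text.lower())
    for location, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        index.setdefault(word, {}).setdefault(url, []).append(location)
    return page_ids[url]
```

Keeping the word locations (rather than just counts) is what later makes location- and distance-based scoring possible.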
Having looked at searching, we now look at ranking. We retrieve the pages that match the queries.
The pages are not scored during crawling; instead, we compute the score at query time. Scoring can be based on word frequency, document location, and word distance. As we encounter the query words, we put them in a hash table to update the frequency and the locations. Then, as we walk the hash table, we can build a sorted dictionary where pages are scored by frequency.
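A minimal frequency-based scorer over the index sketched earlier might look like this (the function name and the all-words-must-match rule are my assumptions):

```python
def frequency_score(index, query):
    """Score pages containing every query word by total word frequency.
    `index` maps word -> {url: [locations]}, as built while crawling."""
    words = query.lower().split()
    counts = {}                      # hash table: url -> running frequency
    for word in words:
        for url, locations in index.get(word, {}).items():
            counts[url] = counts.get(url, 0) + len(locations)
    # keep only pages matching all query words, then sort highest first
    matching = {url: c for url, c in counts.items()
                if all(url in index.get(w, {}) for w in words)}
    return sorted(matching.items(), key=lambda kv: kv[1], reverse=True)
```

Swapping the counting step for one based on the stored locations gives the document-location and word-distance variants mentioned above.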

If we want to crawl all the url links starting from a given page and back to the same page:

public static void GetAllPaths(Link current, Link target, List<Link> candidate, List<List<Link>> results, int depth)
{
    if (depth > Threshold) return;               // guard against unbounded recursion
    var links = GetAllTheLinksOnTheCurrentPage(current);
    if (links.Contains(target))
    {
        candidate.Add(target);                   // complete the cycle back to the start
        results.Add(new List<Link>(candidate));  // record a copy of this path
        candidate.Remove(target);
    }

    foreach (var link in links)
    {
        if (candidate.Contains(link)) continue;  // don't revisit a page on this path
        candidate.Add(link);
        GetAllPaths(link, target, candidate, results, depth + 1);
        candidate.Remove(link);
    }
}
