Cluster computing

Saturday, September 27, 2014

Another problem and solution for LeetCode challenge :
Given n, generate all structurally unique BSTs that store value 1 .. n
The answer is said to be Catalan number = (2n)! / (n+1)! (n)!
How do we arrive at it ? Let's take a look:
Consider all possible Binary Search trees with each number as the root. For each of the n nodes as root, there are n-1 non-root nodes. We can partition these non-root nodes into the lower values than the root and the higher values than the root for the left and the right subtree. If we choose say i in a sorted list as the root node, then there are i-1 nodes in the left subtree and n-i in the right subtree.
Thus if we denote the number of possible BSTs with the function t(n), then we can write it as the
sum of i = 1 to n of the product t(i-1) t(n-1). We multiply because the arrangement of the left subtree is independent of the right subtree and for each such arrangement of the two sub-trees, we get one BST with the root as i. Also note that regardless of the values of the elements in the sorted list, the number of unique BSTs depends on the number of elements n.
Therefore we write the code as follows:
int numUniqueBSTs(int n)
{
int count = 0;
if (n == 0) return 1;
for (int i = 1; i <= n i++)
count += numUniqueBSTs(i-1) * numUniqueBSTs(n-i)
return count;
}
The iterative solution is also equally useful:
int numUniqueBSTs(int n)
{
int[] count = new int[n+1];
count[0] = 1;
for (I = 1; I <= n; I++)
for (int j = 0; j < I; j++)
count[I] += count[j] *count(I-j-1)
return count[n];
}

Okay one more coding exercise:
Given an array of positive integers with every element appearing twice except for one, find that one:
int OccursOnce(int[] num)
{
if (num == null || num.length <= 0) return -1;
int candidate = A[0];
for (int i = 1; i < num.length; i++)
{
candidate = candidate ^ num[i];
}
return candidate.
}

My next few posts will be on managing hybrid clouds, PAAS and OpenStack.

Here's another coding exercise:
Convert a sorted list to binary search tree.
Note that the in order traversal of a binary search tree is a sorted list
The caveat is that we can only interpret a balanced binary search tree from the given list.
123456789

     5
  3      8
2  4    7 9
1       6

Node SortedListToBST(List<int> num, int start, int end, Node root)
{

int mid;

if ( (start + end) %2 == 0)
     mid = (start + end ) / 2 + 1;
else
    mid = (start + end) / 2

var node = new Node();
node.data = num[mid];

if (root != null)
if (num[mid] < root.data)
     root.left = node;
else
    root.right = node

node.left = SortedListToBST(num, start, mid-1, node);

node.right = SortedListToBST(num, mid + 1, end, node);

return node;

};

Today we cover one more programming exercise question from the interview problems list:
Given a binary tree, print the zigzag level order.
The binary tree is given as 3, 9, 20, -1, -1, 15, 7 where -1 means null
and the tree is
3
9 20
15 7

We have to return the output as
[ (3),
(20, 9),
(7, 15),
]
public static List<List<int>> GetZigZag(Node root)
{
if (root == null) return null;
var output = new List<List<int>>();
var level = new List<int>();
var Q = new Queue();
Q.Enqueue(root);
Q.Enqueue(null);
while (Q.Count > 0)
{
var item = Q.Dequeue();
if (item == null)
{
output.Add(level);
if (Q.Count == 0) break;
level = new List<int>();
Q.Enqueue(null);
continue;
}
level.Add(item);
if (item.right) Q.Enqueue(item.right);
if (item.left) Q.Enqueue(item.left);
}
return output;
}

Friday, September 26, 2014

In the book “Programming Collective Intelligence” by OReilly Media, we use “Collaborative Filtering”, we review the chapter on discovering groups. Data clustering is introduced in this chapter as a method for discovering and visualizing groups of things, people or ideas that are all closely related. The neural networks, decision trees and support vector machines and Bayesian filtering reviewed earlier were part of the supervised learning methods where the inputs and outputs are used to make predictions. Clustering on the other hand is an unsupervised learning method. Unlike a neural network or a decision tree, unsupervised learning algorithms are not trained with examples of correct answers. Instead we try to find the structure and make use of it.

As an example, blogs can be clustered based on word frequencies which could be useful in searching, cataloging and discovering the huge number of blogs that are currently online. A feedparser module in python makes it easy to get the titles, links and entries from any RSS or Atom feed. With this module, its possible to extract a list of words from the title and the description and use that to count the words. With the wordcount (wc) and the number of blogs each word appeared in (apcount), we proceed to create a matrix of the word counts columns for each of the blogs as rows. These are called vectors.

With this sample data, there are two clustering methods discussed in the book. The first is the hierarchical clustering and the second is the K-means clustering. The first one yields a dendrogram and the second yields k partitions. The former works by starting with many unit clusters and then building up a hierarchy of clusters by merging two closest clusters. When the clusters are merged they show which items are closer together and hence the tightness of the cluster. To determine the closeness a couple of mathematical formula help. In this case, we could use the Euclidean distance or the Pearson co-efficient. The Euclidean distance finds the distance between two points in a multidimensional space by taking the sum of the square of the differences between the coordinates of the points and then calculating the square root of the result. The Pearson correlation co-efficient is a measure of how highly correlated the two variables are. Its generally a value between -1 and 1 where -1 means that there is a perfect inverse correlation and 1 means there is a perfect correlation while 0 means there is no correlation. It is computed with the numerator as the sum of the two variables taken together minus the average of their individual sums and this is divided by the square-root of the product of the squares of the substitutions by using the same variable instead of the other. As an aside, another way to measure similarity between two overlapping sets is to find the Tanimoto coefficient which is defined as the number of items in the intersection divided by the sum of the number of items in one of the sets and the number of items in the other set minus the number of items in the intersection of the two sets.

The K-means clustering on the other hand breaks the data into distinct groups instead of finding the structure as the hierarchical clustering does. Here the number of clusters to be generated is specified in advance by choosing randomly placed centroids. The centroids are updated to the average location of all the items assigned to them and the assignments change in each iteration and the iterations continue until there are no more changes.

When the clustered data is visualized, we use a technique called multidimensional scaling which will be used to find a two dimensional representation of the dataset. The algorithm takes the difference between every pair of items and makes a chart in which the distance between the items match those differences.

As I was walking out of a building today and stood at the curb waiting for a cab I had requested, a man pulled up and said I should not be waiting at the curb near the crosswalk because it delayed him. Really ? Just then a Platt driver crossed the road ...
Anyways reminded me to take a short break to do a coding exercise:
Given a sequence of integers find the max product of a contiguous subsequence
If (input == null || input.length <=0) throw new Exception();
int product = input[0];
Int max = product > 0 ? product: max;
For ( int I = 1; I < input.Length; i++)
If product × input [i] > product
Product = product × input [¡]
Else
Product = product[I] > 0 ? product[I] : 1;

If (product > max)
Max = product;

}
Return max;

Thursday, September 25, 2014

Making recommendations is yet another application of collective intelligence as described in the book mentioned in the previous posts. Recommendations include suggestions for online shopping, suggesting interesting web sites, or to find items such as movies or music. Generally for making a recommendation, first a group sharing similar taste is found and then the preferences of the group is used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keep tracking of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking say on a scale of 1 to 5 to denote the preferences of the people in the selected group. To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. Then the people who are close together on the chart can form a group. This is called Euclidean Distance score. Another way to calculate the similarity score is to find a correlation coefficient which is a measure of how well two sets of data fit on a straight line. By finding a best linear fit, we adjust for what is called the grade inflation which goes to say that some people may consistently grade higher than other while sharing similarity in their trends. Hence the latter scoring technique gives better results than the former.
Searching and Ranking is another set of operations ubiquitous with collective intelligence programming. Searching is usually done with crawling whether its with a fixed set of documents or the corporate intranet. After the documents are collected, they are indexed.
Crawling or spidering involves starting with a small set of pages called the seed from which links are followed and each page visited is indexed while discovering more links to follow.
Adding to the index means that we separate the words in the text and for each entry we associate it with the url. we also remove the stop words from the list. We check if the page has already been indexed. We also get an id for each of our entry. Finally, we keep track of the word locations.
Having looked at searching, we now looked at ranking. We retrieve the pages that match the queries.
The pages have not been scored when crawling. Instead we find the score when we query. Scoring can be based on word frequency, document location, and word distance. As we encounter the words, we put them in a hash table to update the frequency and the location. Then as we walk the hash table, we can build a sorted dictionary where we score them based on the frequency

If we want to crawl all the url links starting from a given page and back to the same page:
public static void GetAllPaths(Link current, Link target, ref candidate, ref List<Uri> results, // other suitable parameters)
{
if ( numberOfIterations > some Threshold) return;
var links = GetAllTheLinksOnTheCurrentPage(candidate);
if (link.Contains(target))
{
// add it to candidate
// add it to the results
// remove it from candidate
}

foreach (var link in links)
{
// add it to candidate
GetAllTheLinksOnTheCurrentPage(current, target, ref candidate, ref results);
// remove it from candidate
}

}

Wednesday, September 24, 2014

We mentioned that the book on programming collective intelligence mentions making recommendations, discovering groups, searching and ranking, optimization, document filtering, modeling with decision trees, building price models and advanced classification and finding independent features. In today's post, I will briefly outline these.
Evolving intelligence for example is genetic programming which is a machine learning technique inspired by the theory of biological evolution. This starts with a large set of programs that are somewhat good solutions and then they are made to compete for a user defined task. The programs are ranked from best to worst, and some parts of the program are altered slightly aka mutated for making it better or some parts are replaced by those from other programs aka crossover or breeding. The new population is called a generation and is measured with a fitness function for the quality of the programs and the procedure is iterative.
The worst programs are eliminated at each round. The iterations are terminated when the perfect solutions has been reached, a good enough solution has been found, the solution has not improved for several generations or the number of generations has reached a specified limit.
These programs are represented as Tree. Most programming languages are compiled into Parse Tree. These programs are similar to that. LISP programs are a way to express the Parse Tree directly. As an example of this program, we can have:
def func(x,y):
   if (x > 3):
         return y + 5
   else:
         return y + 3
In .Net, the equivalent for this would be an expression tree using System.Linq.Expressions
Expression<Func<int, bool>> lambda = x => x > 3;
or as is clearer with :
Expression e2 = Expression.GreaterThan(left, right);
left = Expression.Constant(x, typeof(int));
right = Expression.Constant(3, typeof(int));

I'm going to try out a coding exercise:
Given a mapping of characters as A = 1, Z = 26 and given a number, how many different ways can we decode it ? For eg: 12 can be AB or L.

NoSQL Databases and scalability.

Traditional databases such as SQL Server have a rich history in scalability and performance considerations for user data storage such as horizontal and vertical table and index partitioning, data compression, resource governor, IO resource governance, partition table parallelism, multiple file stream containers, NUMA aware large page memory and buffer array allocation, buffer pool extension, In-memory OLTP, delayed durability etc. In NoSQL databases, the unit of storage is the key-value collection and each row can have different number of columns from a column family. The option for performance and scalability has been to use sharding and partitions. Key Value pairs are organized according to the key. Keys in turn are assigned to a partition. Once a key is assigned to a partition, it cannot be moved to a different partition. Since it is not configurable, the number of partitions in a store is decided upfront. The number of storage nodes in use by the store can however be changed. When this happens, the store undergoes reconfiguration, the partitions are balanced between new and old shards, redistribution of partition between one shard and another takes place. The more the number of partitions the more granularities there are for the reconfiguration. It is typical to have ten to twenty partitions per shard. Since the number of partitions cannot be changed, this is often a design consideration.

The number of nodes belonging to a shard called its replication factor improves the throughput. The higher the replication factor, the faster the read because of the availability but the same is not true for writes since there is more copying involved. Once the replication factor is set, the database takes care of creating the appropriate number of replication nodes for each shard.

A topology is a collection of storage nodes, replication nodes, and the associated services. At any point of time, a deployed store has one topology. The initial topology is such that it minimizes the possibility of a single point of failure for any given shard. If the storage node hosts more than one replication node, those replication nodes will not be from the same shard. If the host machine goes down, the shard can continue for reads and writes.

Performance tuning on the NoSQL databases is an iterative process. Topologies, replication factor, shards and partitions can be changed to improve performance or to adjust for the storage nodes.

Courtesy:

Microsoft and Oracle documentations.