Wednesday, September 24, 2014

We mentioned that the book Programming Collective Intelligence covers making recommendations, discovering groups, searching and ranking, optimization, document filtering, modeling with decision trees, building price models, advanced classification, and finding independent features. In today's post, I will briefly outline some of these.
Evolving intelligence refers, for example, to genetic programming, a machine learning technique inspired by the theory of biological evolution. It starts with a large set of programs that are somewhat good solutions and makes them compete on a user-defined task. The programs are ranked from best to worst; some programs are altered slightly (mutated) in the hope of improving them, and some have parts replaced by parts of other programs (crossover, or breeding). The new population is called a generation, the quality of each program is measured with a fitness function, and the procedure is iterative.
The worst programs are eliminated at each round. The iterations terminate when the perfect solution has been reached, a good enough solution has been found, the solution has not improved for several generations, or the number of generations has reached a specified limit.
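Leaving aside how individual programs are represented, the outer loop can be sketched as follows; the fitness, mutate and crossover delegates are assumed to be supplied by the caller, and this is only an illustration, not the book's code:
using System;
using System.Collections.Generic;
using System.Linq;

static class Evolution
{
    static readonly Random random = new Random();

    // T is whatever representation a program takes (for example, an expression tree).
    // fitness: lower is better; mutate and crossover are the genetic operators supplied by the caller.
    public static T Evolve<T>(List<T> population, Func<T, double> fitness,
                              Func<T, T> mutate, Func<T, T, T> crossover,
                              int maxGenerations, double goodEnough)
    {
        for (int generation = 0; generation < maxGenerations; generation++)
        {
            population = population.OrderBy(fitness).ToList();     // rank from best to worst
            if (fitness(population[0]) <= goodEnough) break;        // good enough solution found
            int survivors = Math.Max(1, population.Count / 2);      // eliminate the worst programs
            var next = population.Take(survivors).ToList();
            while (next.Count < population.Count)                   // refill by breeding the survivors
            {
                var parent1 = next[random.Next(next.Count)];
                var parent2 = next[random.Next(next.Count)];
                next.Add(mutate(crossover(parent1, parent2)));      // crossover then a slight mutation
            }
            population = next;                                      // the new generation
        }
        return population.OrderBy(fitness).First();
    }
}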
These programs are represented as trees. Most programming languages are first compiled into a parse tree, and these programs are similar to that. LISP programs are essentially a way of expressing a parse tree directly. As an example of such a program, we can have:
def func(x, y):
    if x > 3:
        return y + 5
    else:
        return y + 3
In .Net, the equivalent for this would be an expression tree using System.Linq.Expressions
Expression<Func<int, bool>> lambda = x => x > 3;
 or, built up explicitly (the operands must be constructed before the comparison):
 ParameterExpression left = Expression.Parameter(typeof(int), "x");
 ConstantExpression right = Expression.Constant(3, typeof(int));
 Expression e2 = Expression.GreaterThan(left, right);
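To check that the hand-built tree behaves like the lambda above, it can be wrapped back into a lambda and compiled (a small illustrative addition, reusing the declarations above):
 Expression<Func<int, bool>> lambda2 = Expression.Lambda<Func<int, bool>>(e2, left);
 Func<int, bool> compiled = lambda2.Compile();
 Console.WriteLine(compiled(5)); // prints True, since 5 > 3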

I'm going to try out a coding exercise:
Given a mapping of characters A = 1 through Z = 26 and a number, how many different ways can we decode it? For example, 12 can be AB or L.
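Here is a minimal dynamic-programming sketch of one way to count the decodings (my own attempt, assuming the input string contains only digits):
        public static int NumDecodings(string digits)
        {
            if (string.IsNullOrEmpty(digits) || digits[0] == '0') return 0;
            int prev = 1, curr = 1;    // ways to decode up to the previous and the current position
            for (int i = 1; i < digits.Length; i++)
            {
                int ways = digits[i] == '0' ? 0 : curr;                // take digits[i] alone as one letter
                int twoDigit = (digits[i - 1] - '0') * 10 + (digits[i] - '0');
                if (twoDigit >= 10 && twoDigit <= 26) ways += prev;    // or pair it with the previous digit
                prev = curr;
                curr = ways;
            }
            return curr;               // e.g. NumDecodings("12") == 2 (AB or L)
        }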


NoSQL Databases and scalability.

Traditional databases such as SQL Server have a rich history of scalability and performance features for user data storage, such as horizontal and vertical table and index partitioning, data compression, resource governor, IO resource governance, partitioned table parallelism, multiple filestream containers, NUMA-aware large page memory and buffer array allocation, buffer pool extension, in-memory OLTP, delayed durability, etc. In NoSQL databases, the unit of storage is the key-value collection, and each row can have a different number of columns from a column family. The options for performance and scalability have been sharding and partitioning. Key-value pairs are organized according to the key, and keys in turn are assigned to a partition. Once a key is assigned to a partition, it cannot be moved to a different partition. Since this assignment is not configurable, the number of partitions in a store is decided upfront. The number of storage nodes in use by the store can, however, be changed. When this happens, the store undergoes reconfiguration: the partitions are balanced between new and old shards, and partitions are redistributed from one shard to another. The more partitions there are, the finer the granularity available for this reconfiguration. It is typical to have ten to twenty partitions per shard, and since the number of partitions cannot be changed later, it is an important design consideration.
The number of replication nodes belonging to a shard is called its replication factor, and raising it improves throughput. The higher the replication factor, the faster the reads, because of the added availability, but the same is not true for writes, since more copying is involved. Once the replication factor is set, the database takes care of creating the appropriate number of replication nodes for each shard.
A topology is a collection of storage nodes, replication nodes, and the associated services. At any point in time, a deployed store has one topology. The initial topology is chosen to minimize the possibility of a single point of failure for any given shard: if a storage node hosts more than one replication node, those replication nodes will not be from the same shard, so if the host machine goes down, each affected shard can continue to serve reads and writes.
Performance tuning on NoSQL databases is an iterative process. Topologies, the replication factor, shards, and partitions can be changed to improve performance or to adjust for the available storage nodes.
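To make the relationship between keys, partitions and shards concrete, here is a small illustrative sketch (not any particular vendor's implementation): keys always hash to one of a fixed number of partitions, and only the partition-to-shard mapping changes during reconfiguration.
using System;

class PartitionMap
{
    private readonly int partitionCount;        // fixed when the store is created
    private readonly int[] partitionToShard;    // reassigned when shards are added or removed

    public PartitionMap(int partitionCount, int shardCount)
    {
        this.partitionCount = partitionCount;
        partitionToShard = new int[partitionCount];
        Rebalance(shardCount);
    }

    // A key always hashes to the same partition; this assignment never changes.
    public int PartitionFor(string key) => (key.GetHashCode() & int.MaxValue) % partitionCount;

    public int ShardFor(string key) => partitionToShard[PartitionFor(key)];

    // Reconfiguration: spread the fixed set of partitions evenly over the new number of shards.
    public void Rebalance(int shardCount)
    {
        for (int p = 0; p < partitionCount; p++)
            partitionToShard[p] = p % shardCount;
    }
}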
Courtesy:

Microsoft and Oracle documentation.

Monday, September 22, 2014

A decision tree classifier builds a model that predicts a classification by following yes/no branches based on each node's criterion. It checks each item in the list against the model and predicts a category. Decision trees are notable for being easy to read and interpret. Classifying with a decision tree is generally easy, but training one is usually trickier. The divisions are usually based on entropy, which is the amount of disorder in a set and is measured as follows:
p(i) = frequency(outcome) = count(outcome) / count(total rows)
Entropy = -sum of p(i) * log2(p(i)) over all outcomes
A low entropy tells us that the set is homogeneous. To determine the dividing variable, we find the information gain based on the entropy which is determined as follows:
Weight1 = size of subset1  / size of original set
Weight2 = size of subset2 / size of original set
Gain = entropy(original) – weight1*entropy(set1) – weight2*entropy(set2). Based on the dividing variable, the first node can be created. This can then be further divided to form a decision tree.
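Expressed in code, the two formulas above might look like this sketch, where each set is represented by its count of rows per outcome (requires System, System.Collections.Generic and System.Linq):
static double Entropy(List<int> outcomeCounts)
{
    double total = outcomeCounts.Sum();
    double entropy = 0.0;
    foreach (int count in outcomeCounts)
    {
        if (count == 0) continue;
        double p = count / total;            // p(i) = count(outcome) / count(total rows)
        entropy -= p * Math.Log(p, 2);       // entropy = -sum of p(i) * log2(p(i))
    }
    return entropy;
}

// Information gain from splitting 'original' into 'set1' and 'set2' (counts per outcome).
static double Gain(List<int> original, List<int> set1, List<int> set2)
{
    double weight1 = (double)set1.Sum() / original.Sum();
    double weight2 = (double)set2.Sum() / original.Sum();
    return Entropy(original) - weight1 * Entropy(set1) - weight2 * Entropy(set2);
}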
Neural networks can also be used as classifiers. For example, a simple neural network can be used to alter the ranking of search results based on which links users have clicked in the past. It works by giving a number to every link, predicting that the link with the highest number is the one the user would click; the numbers are then used to change the rankings. There are many different kinds of neural networks. One example is the multilayer perceptron, so named because it has multiple layers of neurons: a layer of input neurons feeds into one or more layers of hidden neurons, and the output of one layer is fed as input to the next. If there are three nodes in the hidden layer, where the output of the first is negative, the last is positive, and the middle one carries the most influence for a given input, the classification will be driven mostly by the middle node. Usually a combination of the different layers determines the end result.

Support vector machines are sophisticated classification methods. They build a predictive model by finding the dividing line between two categories: the line is chosen so that the data points are as distant from it as possible, and the line with the maximum margin is taken as the best. The points that lie closest to this line are the ones that determine it and are called support vectors. Once the line is found, classifying new items is just a matter of checking which side of the line they fall on.
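Once the dividing line has been found, classifying a point amounts to checking which side of the line it falls on. A tiny sketch, assuming the weights w and intercept b describing the line have already been learned:
        // Classify a point by the sign of w . x + b.
        static int Classify(double[] w, double b, double[] x)
        {
            double score = b;
            for (int i = 0; i < w.Length; i++)
                score += w[i] * x[i];
            return score >= 0 ? +1 : -1;   // +1 for one category, -1 for the other
        }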

I take a brief break to discuss a coding exercise for Gray code.
Gray code is also known as reflected binary code: consecutive values differ in only a single bit, and the 0/1 sequence in each bit position is reflected as the count proceeds.
To convert to gray code, we write the number in its binary notation first. Say 9 is 1001.
Call the digits d1, d2, ..., dn. Working from the last digit dn backwards: if the digit to the left of the current digit is 1, substitute the current digit d with 1-d, otherwise leave it unchanged; then move on to the next digit to the left. The resulting number is the binary reflected Gray code. 9's Gray code is 1101.
The reverse conversion from a Gray code (g1, g2, ..., gn) to the binary notation of a number is done digit by digit: for position n, calculate
Sum(n) = (g1 + g2 + ... + g(n-1)) mod 2
If this computes to 1, then replace gn with 1-gn; otherwise leave it unchanged.
        public static string GrayCode(string number)
        {
            char[] binary = number.ToCharArray();
            for (int i = binary.Length - 1; i > 0; i--)
            {
                if (binary[i - 1] == '1')
                {
                    binary[i] = binary[i] == '1' ? '0' : '1'; // flip binary[i]: the Gray digit gi = di XOR d(i-1)
                }
            }
            return new String(binary);
        }
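For completeness, here is a sketch of the reverse conversion described above, which tracks the running parity of the Gray digits seen so far:
        public static string GrayToBinary(string gray)
        {
            char[] digits = gray.ToCharArray();
            int parity = 0;                                    // (g1 + ... + g(i-1)) mod 2
            for (int i = 0; i < digits.Length; i++)
            {
                char g = digits[i];                            // the original Gray digit gi
                if (parity == 1)
                    digits[i] = g == '1' ? '0' : '1';          // replace gi with 1 - gi
                if (g == '1') parity ^= 1;                     // include gi in the running parity
            }
            return new String(digits);                         // e.g. GrayToBinary("1101") == "1001"
        }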

Another coding question concerns buying and selling stocks for maximum profit. The trick here is that a stock must be bought before it is sold, and buy/sell transactions can be repeated any number of times; we simply look ahead from each buying date to the days that follow.
        public static int Profit(int[] prices)
        {
            if (prices == null || prices.Length <= 0) return 0;
            int globalProfit = 0;
            int profit = 0;                                // profit of the current run of rising prices
            for (int i = 1; i < prices.Length; i++)
            {
                if (prices[i] > prices[i - 1])
                    profit += prices[i] - prices[i - 1];   // still rising: keep accumulating
                else
                {
                    globalProfit += profit;                // price dropped: bank the run and start over
                    profit = 0;
                }
            }
            globalProfit += profit;
            return globalProfit;
        }

            var prices1 = new int [] { 3, 6, 9 };
            var prices2 = new int [] { 9, 6, 3 };
            var prices3 = new int [] { 3, 9, 6 };
            var prices4 = new int [] { 6, 9, 3 };
            var prices5 = new int [] { 6, 3, 9 };
            var prices6 = new int [] { 9, 3, 6 };
            var prices7 = new int [] { 9, 9, 9 };
           
            Console.WriteLine("expected = {0}, actual = {1}", 6, Profit(prices1));
            Console.WriteLine("expected = {0}, actual = {1}", 0, Profit(prices2));
            Console.WriteLine("expected = {0}, actual = {1}", 6, Profit(prices3));
            Console.WriteLine("expected = {0}, actual = {1}", 3, Profit(prices4));
            Console.WriteLine("expected = {0}, actual = {1}", 6, Profit(prices5));
            Console.WriteLine("expected = {0}, actual = {1}", 3, Profit(prices6));
            Console.WriteLine("expected = {0}, actual = {1}", 0, Profit(prices7));
            var prices = new int[] { 5, 4, 3, 2, 1, 8, 5, 9, 11 };
            Console.WriteLine("{0}", Profit(prices));
expected = 6, actual = 6
expected = 0, actual = 0
expected = 6, actual = 6
expected = 3, actual = 3
expected = 6, actual = 6
expected = 3, actual = 3
expected = 0, actual = 0
13


Combinations of a string:
        public static void Combine(ref List<char> input, ref List<char> candidate, ref List<List<char>> sequences, int level, int start)
        {
            for (int i = start; i < input.Count; i++)
            {
                candidate.Add(input[i]);                              // choose input[i]
                sequences.Add(new List<char>(candidate));             // record a copy of this combination
                Console.WriteLine("{0}", new string(candidate.ToArray()));
                if (i < input.Count - 1)
                    Combine(ref input, ref candidate, ref sequences, level + 1, i + 1);  // extend with later characters only
                candidate.RemoveAt(candidate.Count - 1);              // un-choose and try the next character
            }
        }
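A quick usage example for the routine above (my own illustration):
            var input = new List<char> { 'a', 'b', 'c' };
            var candidate = new List<char>();
            var sequences = new List<List<char>>();
            Combine(ref input, ref candidate, ref sequences, 0, 0);
            // prints: a, ab, abc, ac, b, bc, c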


Note that generating permutations is also recursive.

Here is another interview question.
static void PrintPrime(int countOfPrimes)
{
    int i = 2;
    int count = 0;
    while (count < countOfPrimes)
    {
        if (IsPrime(i))
        {
            Console.WriteLine("{0}", i);
            count++;
        }
        if (i == int.MaxValue) break;   // avoid overflowing past the last int
        i++;
    }
}

static bool IsPrime(int x)
{
    if (x < 2) return false;
    for (int i = 2; i * i <= x; i++)    // trial division up to the square root
        if (x % i == 0) return false;
    return true;
}

Sunday, September 21, 2014

We will start reviewing a book titled Programming Collective Intelligence from O'Reilly Media in subsequent blog posts. The book covers making recommendations, discovering groups, searching and ranking, optimization, document filtering, modeling with decision trees, building price models, advanced classification with kernel methods and SVMs, finding independent features, evolving intelligence, and a summary of algorithms. The sample code is presented in Python for readability.
As an example, a Bayesian classifier can help classify a sentence as being about the programming language for one use of the word python, or about the snake for another use of the same word. It works from the probability
P(category | document) = P(document | category) * P(category) / P(document), where P(document) can be ignored because it is the same for every category, and
P(document | category) = P(word1 | category) * P(word2 | category) * ...
All we need is a feature-extraction step that turns the data we are using for training or classification into a list of features; basically, something that takes an object and returns a list, such as a sentence converted into words. The classifier is trained on the strings, and it can then be used to predict the test data.
 docclass.getwords('python is a dynamic language')
{'python' : 1, 'dynamic':1,  'language':1}
c1 = docclass.naivebayes(docclass.getwords)
c1.setdb('test.db')
c1.train('python has dynamic types', 'language')
c1.train('pythons are constrictors', 'snake')
c1.classify('boa constrictors')
u'snake'
c1.classify('dynamic programming')
u'language'
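For comparison with the Python session above, here is a compact, illustrative naive Bayes sketch in C# (it is not the book's code); it scores each category by P(category) times the product of P(word | category), using logarithms and simple add-one smoothing to keep the arithmetic stable:
using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayes
{
    // counts[category][word] = number of training documents in the category that contain the word
    private readonly Dictionary<string, Dictionary<string, int>> counts = new Dictionary<string, Dictionary<string, int>>();
    private readonly Dictionary<string, int> docsPerCategory = new Dictionary<string, int>();

    // Feature extraction: split a document into its distinct lower-case words.
    private static IEnumerable<string> GetWords(string doc) =>
        doc.ToLower().Split(' ', ',', '.').Where(w => w.Length > 2).Distinct();

    public void Train(string doc, string category)
    {
        if (!counts.ContainsKey(category)) counts[category] = new Dictionary<string, int>();
        if (!docsPerCategory.ContainsKey(category)) docsPerCategory[category] = 0;
        docsPerCategory[category]++;
        foreach (var word in GetWords(doc))
            counts[category][word] = counts[category].TryGetValue(word, out var c) ? c + 1 : 1;
    }

    public string Classify(string doc)
    {
        int totalDocs = docsPerCategory.Values.Sum();
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (var category in docsPerCategory.Keys)
        {
            // log P(category) + sum of log P(word | category), with add-one smoothing
            double score = Math.Log((double)docsPerCategory[category] / totalDocs);
            foreach (var word in GetWords(doc))
            {
                counts[category].TryGetValue(word, out var count);
                score += Math.Log((count + 1.0) / (docsPerCategory[category] + 2.0));
            }
            if (score > bestScore) { bestScore = score; best = category; }
        }
        return best;
    }
}

// var c1 = new NaiveBayes();
// c1.Train("python has dynamic types", "language");
// c1.Train("pythons are constrictors", "snake");
// Console.WriteLine(c1.Classify("boa constrictors"));    // snake
// Console.WriteLine(c1.Classify("dynamic programming")); // language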

I have to take a short break for a coding problem and include it in this post before continuing.
How many distinct subsequences of a string S equal a given string T, and what is the most efficient way to count them? For example, rabbit appears in rabbbit in three different ways. For every match up to the length of the given pattern, we explore the remainder of the input to see if there is a full match, skipping non-matching and repeated characters, and increment a counter whenever a full match is found. Another way to do this would be to use an automaton, the Rabin-Karp method, or the KMP method.
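A dynamic-programming sketch for the count (my own illustration, using the classic O(|S|*|T|) recurrence rather than the scanning approach described above):
public static int NumDistinct(string S, string T)
{
    // dp[j] = number of ways to form the first j characters of T from the prefix of S processed so far
    var dp = new int[T.Length + 1];
    dp[0] = 1;                                   // the empty pattern can always be formed in one way
    for (int i = 0; i < S.Length; i++)
        for (int j = T.Length - 1; j >= 0; j--)  // go backwards so each S[i] is used at most once per way
            if (S[i] == T[j])
                dp[j + 1] += dp[j];
    return dp[T.Length];                         // e.g. NumDistinct("rabbbit", "rabbit") == 3
}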


void KMP(string pattern, string text, vector<int> *positions) {
    int patternLength = pattern.length();
    int textLength = text.length();
    int* next = PreProcess(pattern);
    if (next == 0) return;
    int i = 0;
    int j = 0;
    while (j < textLength)
    {
        while (true)
        {
            if (text[j] == pattern[i])       // the text matches the pattern in this state
            {
                i++;                         // move on to the next state
                if (i == patternLength)      // that was the last state: a full match
                {
                    positions->push_back(j - (i - 1));  // record where the match starts
                    i = next[i];
                }
                break;
            }
            else if (i == 0) break;          // no match in state 0, give up on this character
            else i = next[i];                // fall back via the failure function
        }
        j++;
    }
    delete[] next;                           // PreProcess allocated the table with new[]
}

int* PreProcess(string pattern) {
    int patternLength = pattern.length();
    if (patternLength == 0) return 0;
    int* next = new int[patternLength + 1];
    if (next == 0) return 0;
    next[0] = -1;  // set up for the loop below; unused by KMP

    int i = 0;
    while (i < patternLength) {
        next[i + 1] = next[i] + 1;
        while (next[i + 1] > 0 &&
               pattern[i] != pattern[next[i + 1] - 1])
            next[i + 1] = next[next[i + 1] - 1] + 1;
        i++;
    }
    return next;
}
 

Saturday, September 20, 2014

Clojure is a general-purpose language that targets the Java Virtual Machine, the CLR, and JavaScript. We briefly mentioned that it compiles to Java byte code; the ClojureCLR port, which is written in C#, enables it to target the CLR as well. Clojure is a dialect of Lisp, which lends itself to embedding query languages. It offers a software transactional memory system and a reactive agent system that help ensure correct multithreaded designs. What sets Clojure apart is its immutable data and first-class functions. It helps to see the Clojure language as a platform.
Features include
1) Dynamic development - the Read Eval Print Loop (REPL) provides a console interface to try out Clojure commands; Clojure programs need not be compiled ahead of time when used this way.
2) Functional programming - fn and defn provide ways to define first-class functions, and Clojure provides immutable lists, vectors, sets and maps, where the older version is persisted and the newer version is not a full copy.
3) Lisp - Clojure is a dialect of Lisp; for example, 'and' is a macro.
4) Runtime polymorphism - 'proxy' helps support implementation of interfaces.
5) Concurrent programming - shared state is changed in a thread-safe way with 'dosync', 'ref', 'ref-set' and 'alter', while 'def' and 'binding' support isolating changing state within threads.
6) Hosted on the JVM - compiles to JVM bytecode and supports Java interfaces and classes using 'reify' and 'proxy'.
A Stack Overflow post gives the following Clojure code for factorial:
(defn fact [x]
  (loop [n x f 1]
    (if (= n 1)
      f
      (recur (dec n) (* f n)))))
 
// HttpHeaderReader.c : Reads HTTP header and counts them
//
#include "stdafx.h"
#include "stdio.h"
#include "string.h"
#define MAX_COUNT 255   // a const int cannot size a static array in C, so use a macro

static int Hash (const char * pSrc, size_t len);
int main(int argc, char* argv[])
{
 static const char filename[] = "foo.txt";
 static int frequency[MAX_COUNT] = {0};
 FILE* fd = fopen(filename, "r");
 if ( fd != NULL )
 {
  char line[MAX_COUNT + 1];
  while ( fgets(line, sizeof line, fd) != NULL )
  {
   char* p = strchr(line, ':');
   if (p != NULL)
   {
    frequency[Hash(_strlwr(line), (size_t)(p-line))%MAX_COUNT] += 1;
   }
  }
  fclose(fd);
  char header[MAX_COUNT + 1] = {0};
  printf( "Enter the header:\n " );
  scanf("%255s", header);
  header[MAX_COUNT] = '\0';
  printf( "%s occurs %d times.\n", header, frequency[Hash(_strlwr(header), strlen(_strlwr(header)))%MAX_COUNT] );
 }
 return 0;
}
static int Hash (const char * pSrc, size_t len)
{
  size_t res = 0;
  while (len--) res = (res << 1) ^ *pSrc++; 
  return (int)res;
}

content-length: 10
 len = 14, hash = 177
user-agent: test
 len = 10, hash = 107
content-length: 14
 len = 14, hash = 177
accept: comedy
 len = 6, hash = 16
content-length: 100
 len = 14, hash = 177
content-encoding: gzip
 len = 16, hash = 134
connection: close
 len = 10, hash = 50
user-agent: test
 len = 10, hash = 107
accept: flash
 len = 6, hash = 16
user-agent: test1
 len = 10, hash = 107
content-length: 20
 len = 14, hash = 177
user-agent: test2
 len = 10, hash = 107
user-agent: test3
 len = 10, hash = 107
accept: gzip
 len = 6, hash = 16
Enter the header:
 user-aGENT
user-agent occurs 5 times.
 

Friday, September 19, 2014

Today we will review Node.js. We talked about Node.js, Connect, and Express in earlier discussions. We saw that the calls were asynchronous and that we could add Backbone for MVC on the web server for the user interface. Express supports both production and development settings. Changes made to the code require the application to be restarted; the node supervisor module helps in this regard.
Sample server code to look up a phonebook works like this:
Server.js :
var express = require('express');
var app = express();
app.set('view engine', 'ejs');
app.set('view options', { layout : false });
app.use(express.bodyParser());
app.use(app.router);
app.post('/search', function (req, res) {
    var result = words.search(req.body.pattern).result;
    res.render('result', {words: result, pattern: req.body.pattern });
});
app.listen(process.env.PORT || config.port);

function search(word) {
    // parameter validation and sanitization would go here
    var index = binarySearch(dictionary, word);
    var result = index < 0 ? [] : [dictionary[index]];
    return { result: result };
}

function binarySearch(dictionary, value)
{
   var start = 0;
   var end = dictionary.length - 1;

   while (start <= end)
   {
      var mid = (start + end) >>> 1;
      var midword = dictionary[mid];

      if (midword < value)
          start = mid + 1;
      else if (midword > value)
          end = mid - 1;
      else
          return mid;
    }
   return -(start + 1);
}

We will cover Clojure next; it provides easy access to Java frameworks. Clojure compiles to Java byte code.