Sunday, November 8, 2015

 Find the sum of all numbers that are formed from root-to-leaf paths (code); expected time complexity O(n)
void DFSSearch(Node root, List<List<int>> listOfRootToLeafPaths, List<int> curList)
{
    if (root != null)
    {
        if (curList == null) curList = new List<int>();
        curList.Add(root.value);
        if (root.left == null && root.right == null)
        {
            // copy the current path: curList keeps changing as we backtrack
            listOfRootToLeafPaths.Add(new List<int>(curList));
        }
        DFSSearch(root.left, listOfRootToLeafPaths, curList);
        DFSSearch(root.right, listOfRootToLeafPaths, curList);
        curList.RemoveAt(curList.Count - 1); // backtrack
    }
}
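The question asks for the sum, so the collected paths still need to be folded into numbers. A minimal sketch, assuming the same Node class used above (int value, left and right children) and single-digit node values (that last assumption is mine, not stated above):

int SumRootToLeafNumbers(Node root)
{
    var paths = new List<List<int>>();
    DFSSearch(root, paths, new List<int>());
    int total = 0;
    foreach (var path in paths)
    {
        int number = 0;
        foreach (var digit in path)
            number = number * 10 + digit; // place-value accumulation
        total += number;
    }
    return total;
}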


Combinations of a string of digits
int ToNumber(List<int> digits)
{
    int number = 0;
    foreach (var d in digits) number = number * 10 + d;
    return number;
}

void Combine(List<int> a, List<int> b, int start, int level, ref int sum)
{
    for (int i = start; i < a.Count; i++)
    {
        b.Add(a[i]);                      // choose a[i] for this position
        sum += ToNumber(b);               // digits weighted by their place value
        Combine(a, b, i + 1, level + 1, ref sum);
        b.RemoveAt(b.Count - 1);          // backtrack
    }
}

Each combination can also have different permutations that can be added to Sum
void Permute(List<int> a, List<int> b, bool[] used, ref int sum)
{
    if (b.Count == a.Count) { sum += ToNumber(b); return; }
    for (int i = 0; i < a.Count; i++)
    {
        if (used[i]) continue;
        used[i] = true;
        b.Add(a[i]);
        Permute(a, b, used, ref sum);
        b.RemoveAt(b.Count - 1);   // backtrack
        used[i] = false;
    }
}
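A minimal usage sketch for the two routines above (the digit values are illustrative, not from the original):

int sum = 0;
var digits = new List<int> { 1, 2, 3 };
Combine(digits, new List<int>(), 0, 0, ref sum);   // adds 1, 12, 123, 13, 2, 23, 3
sum = 0;
Permute(digits, new List<int>(), new bool[digits.Count], ref sum); // adds 123, 132, 213, 231, 312, 321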


Saturday, November 7, 2015

Machine Learning tools on NoSQL databases 
MapReduce is inherently suited for operations on BigData, and Machine Learning's statistical methods thrive on more and more data. Naturally, a programmer will look to implementing such methods on BigData using MapReduce.
What is Map-Reduce?
MapReduce is an arrangement of tasks that enables relatively easy scaling. It includes:
  1. Hardware arrangement – nodes as part of a cluster that can communicate and process in parallel
  2. File system – provides distributed storage across multiple disks
  3. Software processes – run on the various CPUs in the cluster
  4. Controller – manages the mapper and reducer tasks
  5. Mapper – assigns identical tasks to multiple CPUs, each running over its local data
  6. Reducer – aggregates the output from several mappers to form the end product
Mappers emit a key-value pair 
     Controllers sort key-value pairs by key 
     Reducers get pairs grouped by key 
The map and reduce functions are the only code the programmer needs to write; the rest of the chores are handled by the platform.
Map-reduce can be written in Python or JavaScript depending on the NoSQL database and associated technology. For example, the following databases support Map-Reduce operations:
MongoDB – this provides a mapReduce database command 
Riak – This comes with an Erlang shell that can provide this functionality 
CouchDB – this comes with document identifiers that can be used with simple Javascript functions to MapReduce 
Machine Learning methods that involve statistics usually have a summation over some expressions or term components. For example, least squares regression requires computing the sum of the squared residuals and tries to keep it at a minimum.
This suits map-reduce very well because the mappers can find partial sums of squared residuals while the reducer merely aggregates the partial sums into totals to complete the calculation. Many algorithms can be arranged in this Statistical Query Model form. In some cases, iteration is required; each iterative step then involves a map-reduce sequence.
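A minimal sketch of that pattern for least squares (the names and the fixed slope/intercept model here are illustrative, not tied to any particular framework): each mapper computes a partial sum of squared residuals over its local shard, and the reducer simply adds the partial sums.

// Mapper: partial sum of squared residuals over one shard, for a fixed linear model y ≈ slope*x + intercept.
double MapPartialSSR(double[] x, double[] y, double slope, double intercept)
{
    double partial = 0.0;
    for (int i = 0; i < x.Length; i++)
    {
        double residual = y[i] - (slope * x[i] + intercept);
        partial += residual * residual;
    }
    return partial;
}

// Reducer: the total sum of squared residuals is simply the sum of the mappers' partial sums.
double ReduceSSR(List<double> partialSums)
{
    double total = 0.0;
    foreach (var p in partialSums) total += p;
    return total;
}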
Let us now look at clustering techniques and how to fit them on map-reduce.
K-means clustering proceeds this way:
 Initialize :
            Pick K starting guesses for Centroid at random
 Iterate:
            Assign points to cluster whose centroid is closest
            Calculate cluster centroids
The corresponding map-reduce is as follows:
 Mapper – run through the local data and, for each point, determine the closest centroid. Accumulate the vector sum and the count of the points closest to each centroid. The combiner emits the centroid index as key and the tuple (partial sum, count) as value.
Reducer – for each old centroid, aggregate the partial sums and counts from all mappers and calculate the new centroid as the total sum divided by the total count.
Courtesy: Michael Bowles, PhD – Introduction to BigData
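A minimal in-memory sketch of one such iteration step, using one-dimensional points to keep it short (the types and names are assumptions for illustration, not a framework API): the mapper produces per-centroid partial sums and counts for its local shard, and the reducer combines them into new centroids.

// Mapper: for each local point, find the closest centroid and accumulate a per-centroid sum and count.
void Map(double[] shard, double[] centroids, out double[] sums, out int[] counts)
{
    sums = new double[centroids.Length];
    counts = new int[centroids.Length];
    foreach (var point in shard)
    {
        int best = 0;
        for (int k = 1; k < centroids.Length; k++)
            if (Math.Abs(point - centroids[k]) < Math.Abs(point - centroids[best]))
                best = k;
        sums[best] += point;
        counts[best]++;
    }
}

// Reducer: aggregate the partial sums and counts from all mappers and compute the new centroids.
double[] Reduce(List<double[]> allSums, List<int[]> allCounts, int k)
{
    var totalSum = new double[k];
    var totalCount = new int[k];
    for (int m = 0; m < allSums.Count; m++)
        for (int i = 0; i < k; i++) { totalSum[i] += allSums[m][i]; totalCount[i] += allCounts[m][i]; }
    var next = new double[k];
    for (int i = 0; i < k; i++)
        next[i] = totalCount[i] > 0 ? totalSum[i] / totalCount[i] : 0.0; // an empty cluster keeps 0 in this sketch
    return next;
}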

Thursday, November 5, 2015

Today we continue the discussion on NoSQL databases. We have HBase, which is written in Java. It can store billions of rows times millions of columns and is modeled after Google's BigTable. It has an Apache license. Like CouchDB, it also supports HTTP/REST. HBase is built on a framework of data storage and data processing services. The former is named HDFS, short for Hadoop Distributed File System. The latter is named MapReduce and uses a high-performance parallel data processing technique. A data warehouse named Hive and a query language named Pig are also available over this framework. Together this forms the Hadoop framework. HBase stores key-values as columns in a column family, and each row can have more than one column family. Each row need not have the same number of columns. Query predicates can be pushed down via server-side scan and get filters. Optimizations are for real-time queries. Hadoop scales horizontally in that additional commodity hardware can be added without interruption, and node failures can be compensated with redundant data. It does not guarantee ACID properties and supports forward-only parsing. It supports a rolling restart for configuration changes and minor upgrades. Its random access performance is like MySQL. It is commonly used for search engines and for analyzing log data. Wherever a BigTable-like store is required, HBase can be used.
Hypertable is another NoSQL database that is like a faster, smaller HBase. It is written in C++ and has a GPL license. It is sponsored by Baidu. Its protocol is Thrift, a C++ library, or the HQL shell. It implements Google's BigTable design. It runs on Hadoop's HDFS. It uses its own SQL-like language which it calls HQL. It can search by key, by cell, or for values in column families. Search can be limited to key/column ranges. It retains the last N historical values. The tables appear in namespaces, while map/reduce continues to be with Hadoop.
#interviewquestion
Given an array of integers, find count of minimum number of elements to be removed from the array such that the maximum element of the new array is at most twice of the minimum. O(nlogn) solution was required.
It would help to sort the elements and then remove elements from the front and the back so that the remaining subarray meets the two criteria.

The two criteria are the range of the subarray and the count of items removed
The elements may increase sharply or gradually at either end, so we must consider removing elements from both ends. We can switch ends depending on the lower and upper bounds of the range, but we cannot stop at the first range that qualifies, because another might satisfy the minimum-count criterion better.
For example, if we have 2, 3, 4, 5, 7, 9, 10, 11,
then if we take 2, 3, 4 and 11 out, we get 5, 7, 9, 10,
and this is the minimum number we need to remove.
We look at each item as the candidate minimum and take its double. We count the elements to the right that are larger than that double. Since the elements are sorted, the lookup is O(log n), although a linear scan would also do in this case.
For each element we consider as the minimum, the count of elements to be removed is the number of elements before it plus the number of elements after the last one that is at most twice its value; we take the smallest such count over all candidates.
int GetMinCountRemoval(int[] num)
{
    if (num == null || num.Length == 0) throw new ArgumentException("empty array");
    Array.Sort(num); // assumes non-negative integers, so num[i] <= 2 * num[i]
    int minCount = num.Length - 1; // worst case: keep only one element
    for (int i = 0; i < num.Length; i++) // num[i] is the candidate minimum of the kept subarray
    {
        int j = FindFirstGreaterThan(num, 2 * num[i]); // first index whose value exceeds 2 * num[i]
        int removed = i + (num.Length - j); // drop the prefix before i and the suffix from j onward
        if (removed < minCount) minCount = removed;
    }
    return minCount;
}

int FindFirstGreaterThan(int[] num, int value) // binary search on the sorted array, O(log n)
{
    int lo = 0, hi = num.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (num[mid] <= value) lo = mid + 1; else hi = mid;
    }
    return lo; // num.Length if every element is <= value
}
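A quick check of the sketch above against the earlier example:

int[] a = { 2, 3, 4, 5, 7, 9, 10, 11 };
Console.WriteLine(GetMinCountRemoval(a)); // prints 4: remove 2, 3, 4 and 11, leaving 5 7 9 10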

We continue our discussion on NoSQL databases:
Redis 3.0 is blazingly fast because its database is in memory and is disk-backed. Therefore its dataset size is limited to RAM. However, the RAM is not limited to one machine and includes that of a cluster. It supports simple values or data structures that can be looked up by a key. It supports bit operations, set operations, list operations, hashes and sorted sets. It even has Lua scriptability. It supports transactions and publisher-subscriber messaging. It comes in useful wherever the dataset size can be predicted. It has a BSD license.
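A minimal sketch of the kind of key-value, list, sorted-set and hash operations described above, using the StackExchange.Redis client for C# (the client choice and the key names are my assumptions, not from the post):

// using StackExchange.Redis;
var redis = ConnectionMultiplexer.Connect("localhost");    // single node here; clusters work similarly
IDatabase db = redis.GetDatabase();

db.StringSet("user:1:name", "alice");                      // simple key-value
string name = db.StringGet("user:1:name");

db.ListLeftPush("recent:visits", "page-42");               // list operation
db.SortedSetAdd("leaderboard", "alice", 170.0);            // sorted set scored by value
db.HashSet("user:1", "email", "alice@example.com");        // hash field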


From a discussion of the above NoSQL databases, it looks like ElasticSearch is the one that provides the most search capabilities. However, none provide the kind of search that, say, clusty.com provides. Incorporating clustering as part of the search techniques can improve the versatility of the search that these databases provide.

Wednesday, November 4, 2015

Today we continue discussing Big Data.
We compare the following NoSQL databases :
Cassandra: This has an Apache license and is written in Java. It claims to store huge datasets with a query language that is similar to SQL except that it has no joins and no aggregate functions. The queries are done by keys or key ranges, and there is a tunable parameter for determining the tradeoff between distribution and replication. Data can have an expiration set on insert. Writes can be much faster than reads. It has very good and reliable cross-datacenter replication. It enables Map/Reduce with Apache Hadoop.
MongoDB: This is written entirely in C++. It claims to retain query and indexing properties from SQL. It has an AGPL license. It enables master/slave replication and auto failover with replica sets. Sharding is built in. It uses a custom binary protocol called BSON, and queries are written as JavaScript expressions. These are executed server-side. It runs arbitrary JavaScript functions on the server side. It has a better update-in-place than CouchDB. It uses memory-mapped files for data storage. Text search is integrated. This is a convenient transition from SQL to NoSQL where we want to move away from schema and still retain the indexing and querying.
ElasticSearch: This has an advanced search capability. It stores JSON documents and versions them. It can maintain hierarchy between documents. It enables stat-groups that help with debugging. Its querying is rich and powerful, and moreover it is scriptable. It sorts by score and enables fuzzy or approximate searches.
CouchDB is a classic document and big table store. It is written in Erlang and claims to have DB consistency. It has an Apache license. It uses the HTTP/REST protocol. It supports bidirectional replication in both continuous and ad hoc modes, with conflict resolution and hence master-to-master replication. Because it supports multi-version concurrency control, it retains older versions of documents. It has embedded map/reduce and shares updates in real time with _changes.
Courtesy: kkovacs.eu

Tuesday, November 3, 2015

Comparison of SQL and NoSQL databases

The following is literature derived from slides at the Metadata Open Forum. 
NoSQL databases are known to be horizontally scalable and provide an alternative to Relational databases.  We explore the differences between both in terms of standards, characteristics and examples.  
In SQL databases, data is stored in columns and tables. Relationships are represented by primary and foreign keys. There are languages for data manipulation and data definition. Transactions are facilitated, and generally there is an abstraction from the physical layer. This means that the physical layer can change without requiring changes in the application. Applications specify what they want and not how they want it. Generally the following stages are involved:
Parsing – The query string is converted into an Abstract Syntax tree 
Binding – This resolves the column or table names such that they have ids as represented in their metadata.
Optimization - This applies a set of tree transformations to the tree so that the best possible execution plan is drawn. 
Code generation – The generated tree may be stored as is, or transformed partly into code (scalar expressions) or fully into byte code and saved in the plan cache. 
Query execution – While the generated code may be run directly, some of it may be converted into physical operators.
Cleanup – Transactions may be rolled back, locks may be released, logging as appropriate and memory may be reclaimed. 
 A NoSQL database, on the other hand, is non-relational and distributed by nature. Its characteristics include schema-free representation of data, easy replication support, a simple API and a different transaction model called BASE (Basically Available, Soft state, Eventually consistent). They are able to handle warehouse-scale data because they scale based on nodes, not merely horizontal or vertical partitions. Inserts and updates are asynchronous. They are versioned and enable optimistic concurrency. Queries start returning answers quickly because they approximate answers, return stale data, ensure availability first, provide best effort, and are aggressive, simpler and faster.
Although the consistency is weak, all nodes see the same data at the same time, and the client perceives that a set of operations has occurred all at once. In other words, they are atomic but error tolerant and can be called multiple times. Node failures do not prevent survivors from continuing to operate. Every operation must terminate in an intended response. Operations will complete, even if individual components are unavailable. NoSQL databases can appear in many forms, but fundamentally the map-reduce operations are similar. NoSQL databases can be a Column Store, Document Store, or Key-Value Store. A key-value store has a hash table of keys and values stored with keys. It provides fast access to small data values. The Map-Reduce operation consists of two phases:
Map – extracts the sets of Key-Value pairs from underlying data and potentially in parallel from multiple nodes  
Reduce – merges and sorts sets of key-value pairs, and the results may be useful for other searches. Map-Reduce operations are implemented by the application developers, not the database.
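A minimal in-memory sketch of those two phases over a key-value store (the record format and names are illustrative, not any product's API): map extracts key-value pairs from raw records, and reduce merges the values for each key, here by summing word counts.

// Map: extract (key, value) pairs from each record; in a real system this runs in parallel across nodes.
List<KeyValuePair<string, int>> Map(IEnumerable<string> records)
{
    var pairs = new List<KeyValuePair<string, int>>();
    foreach (var record in records)
        foreach (var word in record.Split(' '))
            pairs.Add(new KeyValuePair<string, int>(word, 1));
    return pairs;
}

// Reduce: group the pairs by key and merge (here, sum) the values for each key.
Dictionary<string, int> Reduce(IEnumerable<KeyValuePair<string, int>> pairs)
{
    var result = new Dictionary<string, int>();
    foreach (var pair in pairs)
    {
        int current;
        result.TryGetValue(pair.Key, out current);
        result[pair.Key] = current + pair.Value;
    }
    return result;
}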


Monday, November 2, 2015

We continue discussing the problem on the 2006 sided regular polygon described earlier as
A diagonal of a regular 2006-gon is called odd if its endpoints divide the boundary into
two parts, each composed of an odd number of sides. Sides are also regarded as odd diagonals.
Suppose the 2006-gon has been dissected into triangles by 2003 non-intersecting diagonals.
Find the maximum possible number of isosceles triangles with two odd sides.

 We used the lemma earlier that when a dissecting diagonal leaves a shorter path of the boundary of the 2006-gon that consists of n segments, then the number of iso-odd triangles with vertices on this shorter path does not exceed n/2.
 We said that if there is an iso-odd triangle PQS with all vertices on this shorter path, with PQ being the longest diagonal, then S is the summit of the triangle PQS and the vertices divide the shorter path into four segments AP, PS, SQ and QB.
Since PQ is part of an iso-odd triangle and is the longest diagonal, an iso-odd triangle cannot have vertices on both the first and the last segment. This means that every iso-odd triangle that falls on the shorter path has all its vertices on one of the four segments.
Applying the lemma we started with to each of these four segments, the total number of iso-odd triangles in these four regions is at most n/2. Furthermore, the two inner segments have an odd number of sides, which makes the inequality for the number of triangles on them strictly less than n/2 and leaves a slack of one half for each of these inner segments. Because of this excess, PQS itself is also covered by the estimate n/2 and the lemma holds.
We can argue the same when the iso-odd triangle with the longest diagonal is drawn the other way, with its third vertex not falling on the shorter path. In this case, the triangle is an acute or right triangle, not the obtuse one we had seen earlier. We then have three regions XZ, YZ and XY, where XZ and YZ are shorter than XY. For each of the three regions, we apply the lemma and infer that there are no more than n/2 iso-odd triangles in all unless XYZ is one of them. Thus the total number of triangles in the dissection is not greater than n/2.
We can also reason about the bound directly for arbitrary dissections. Let us say ABC is an iso-odd triangle with AB and BC its odd sides. This implies there is an odd number of polygon sides between A and B and between B and C. We say these sides belong to the iso-odd triangle ABC.
At least one side in each of these two groups does not belong to any other iso-odd triangle, because an iso-odd triangle has two odd groups of belonging sides and therefore owns an even number of sides in total, while each group has an odd number of sides; eliminating all sides that belong to other iso-odd triangles, we are left with at least one that does not. Let us assign these two sides, one from each group, to the triangle ABC. To each iso-odd triangle we have thus assigned a pair of sides, with no two triangles sharing an assigned side. Therefore at most 1003 such iso-odd triangles can appear.
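Writing out the counting in that last step: since each iso-odd triangle is assigned two distinct sides of the 2006-gon and no side is assigned twice,

2 \cdot N_{\text{iso-odd}} \le 2006 \quad\Longrightarrow\quad N_{\text{iso-odd}} \le 1003,

and this bound can in fact be attained, so the maximum possible number of iso-odd triangles is 1003.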