Thursday, November 5, 2015

Today we continue the discussion on NoSQL databases. We have HBase, which is written in Java. It can store billions of rows times millions of columns and is modeled after Google's BigTable. It has an Apache license. Like CouchDB, it also supports HTTP/REST. HBase is built on a framework of data storage and data processing services. The former is named HDFS, short for Hadoop Distributed File System. The latter is named MapReduce and uses a high-performance parallel data processing technique. A data warehouse named Hive and a query language named Pig are also available over this framework. Together this forms the Hadoop framework. HBase stores key-values as columns in a column family, and each row can have more than one column family. Each row need not have the same number of columns. Query predicates can be pushed down via server-side scan and get filters. It is optimized for real-time queries. Hadoop scales horizontally in that additional commodity hardware can be added without interruption, and node failures can be compensated for with redundant data. It does not guarantee ACID properties and supports forward-only parsing. It supports a rolling restart for configuration changes and minor upgrades. Its random access performance is like that of MySQL. It is commonly used for search engines and for analyzing log data. Wherever a BigTable is required, HBase can be used.
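The column-family layout above can be pictured as a sparse, nested map. The following is a minimal in-memory sketch of that data model only; the table contents, row keys, and family names are made up for illustration and this is not the HBase API:

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnFamilySketch {
    // row key -> column family -> column qualifier -> value
    static Map<String, Map<String, Map<String, String>>> buildTable() {
        Map<String, Map<String, Map<String, String>>> table = new TreeMap<>();
        table.computeIfAbsent("row1", k -> new TreeMap<>())
             .computeIfAbsent("info", k -> new TreeMap<>())
             .put("name", "alice");
        // A different row may use a different family and different columns:
        // rows need not share the same schema.
        table.computeIfAbsent("row2", k -> new TreeMap<>())
             .computeIfAbsent("metrics", k -> new TreeMap<>())
             .put("clicks", "42");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(buildTable());
    }
}
```

Note how "row1" and "row2" carry entirely different column families, which is the schema-free property described above.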
Hypertable is another NoSQL database, like a faster, smaller HBase. It is written in C++ and has a GPL license. It is sponsored by Baidu. Its protocol is Thrift, a C++ library, or the HQL shell. It implements Google's BigTable design. It runs on Hadoop's HDFS. It uses its own SQL-like language, which it calls HQL. It can search by key, by cell, or for values in column families. Search can be limited to key/column ranges. It retains the last N historical values. Tables appear in namespaces, while map/reduce continues to be handled by Hadoop.
#interviewquestion
Given an array of integers, find count of minimum number of elements to be removed from the array such that the maximum element of the new array is at most twice of the minimum. O(nlogn) solution was required.
It would help to sort the elements and then remove elements from both the front and the back that meet the two criteria.

The two criteria are the range of the subarray and the count of items removed.
The elements may increase sharply or gradually at either end, so we must consider removing elements from both ends. We can switch ends depending on the lower and upper bound of the range, but we cannot stop at the first range we encounter, because another range might satisfy the minimum-count criterion.
For example  if we have 2 3 4 5 7 9 10 11
Then if we take 2, 3, 4, and 11 out, we get 5 7 9 10
and this is the minimum number we need to remove.
We look at the first item and take its double. We count the elements on the right that are larger than this double. Since the elements are sorted, the lookup is O(log n), although a linear scan would also do in this case.
For each element we consider, we can find the count of elements to be removed. For an element on the left, we can take a range starting at that number; for an element on the right, a range ending at it. It therefore suffices to treat each element as the candidate minimum and search for the other end of its range.
int getMinCountRemoval(int[] num)
{
   Array.Sort(num);
   int n = num.Length;
   int maxKept = 0;
   for (int i = 0; i < n; i++) // treat num[i] as the minimum of the kept subarray
   {
         // binary search for the first index whose value exceeds 2 * num[i]
         int lo = i, hi = n;
         while (lo < hi)
         {
               int mid = (lo + hi) / 2;
               if (num[mid] <= 2 * num[i]) lo = mid + 1; else hi = mid;
         }
         maxKept = Math.Max(maxKept, lo - i); // elements kept in [i, lo)
   }
   return n - maxKept; // minimum number of removals, O(n log n) overall
}
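The same sorted-plus-binary-search idea can be sketched in runnable Java; the method name `minCountRemoval` and the example input are illustrative, not from the original problem statement:

```java
import java.util.Arrays;

public class MinRemoval {
    // Returns the minimum number of elements to remove so that
    // max(kept) <= 2 * min(kept). O(n log n) because of the sort
    // and the per-element binary search.
    static int minCountRemoval(int[] num) {
        Arrays.sort(num);
        int n = num.length, maxKept = 0;
        for (int i = 0; i < n; i++) {
            // first index whose value exceeds 2 * num[i]
            int lo = i, hi = n;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (num[mid] <= 2 * num[i]) lo = mid + 1; else hi = mid;
            }
            // elements in [i, lo) can all be kept with num[i] as minimum
            maxKept = Math.max(maxKept, lo - i);
        }
        return n - maxKept;
    }

    public static void main(String[] args) {
        // The worked example from above: removing 2, 3, 4, 11 keeps 5 7 9 10
        System.out.println(minCountRemoval(new int[]{2, 3, 4, 5, 7, 9, 10, 11})); // prints 4
    }
}
```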

We continue our discussion on NoSQL databases:
Redis 3.0 is blazingly fast because its database is in memory and is disk-backed. Therefore its dataset is limited to RAM. However, the RAM is not limited to one machine and includes that of a cluster. It supports simple values or data structures that can be looked up by a key. It supports bit operations, set operations, list operations, hashes, and sorted sets. It even has Lua scriptability. It supports transactions and publisher-subscriber messaging. It comes in useful wherever the dataset size can be predicted. It has a BSD license.


From a discussion of the above NoSQL databases, it looks like ElasticSearch is the one that provides the most search capabilities. However, none provide the kind of search that, say, clusty.com provides. Incorporating clustering as part of the search techniques can improve the versatility of the search that these databases provide.

Wednesday, November 4, 2015

Today we continue discussing Big Data.
We compare the following NoSQL databases :
Cassandra: This has an Apache license and is written in Java. It claims to store huge datasets with a query language that is similar to SQL, except that it has no joins and no aggregate functions. Queries are done by keys or key ranges, and there is a tunable parameter for determining the tradeoff between distribution and replication. Data can have an expiration set on insert. Writes can be much faster than reads. It has very good and reliable cross-datacenter replication. It enables Map/Reduce with Apache Hadoop.
MongoDB: This is written entirely in C++. It claims to retain query and indexing properties from SQL. It has an AGPL license. It enables master-slave replication and auto failover with replica sets. Sharding is built in. It uses a custom binary protocol called BSON, and queries are written as JavaScript expressions. These are executed server-side. It runs arbitrary JavaScript functions on the server side. It has a better update-in-place than CouchDB. It uses memory-mapped files for data storage. Text search is integrated. This is a convenient transition from SQL to NoSQL where we want to move away from schema and still retain indexing and querying.
ElasticSearch: This has an advanced search capability. It stores JSON documents and versions them. It can maintain hierarchy between documents. It enables stat-groups that help with debugging. Its querying is rich and powerful and, moreover, it is scriptable. It sorts by score and enables fuzzy or approximate searches.
CouchDB: This is a classic document and big-table store. It is written in Erlang and claims to have DB consistency. It has an Apache license. It uses the HTTP/REST protocol. It supports bidirectional replication in both continuous and ad hoc modes, with conflict resolution and hence master-to-master replication. Because it supports multiversion concurrency control, it supports older versions of documents. It has embedded map/reduce and shares updates in real time with _changes.
Courtesy: kkovacs.eu

Tuesday, November 3, 2015

Comparison of SQL and NoSQL databases.

The following is literature derived from slides at the Metadata Open Forum. 
NoSQL databases are known to be horizontally scalable and provide an alternative to Relational databases.  We explore the differences between both in terms of standards, characteristics and examples.  
In SQL databases, data is stored in columns and tables. Relationships are represented by primary and foreign keys. There is a language for data manipulation and data definition. Transactions are facilitated, and generally there is an abstraction from the physical layer. This means that the physical layer can change without requiring changes in the application. Applications specify what they want and not how they want it. Generally the following stages are involved:
Parsing – The query string is converted into an abstract syntax tree.
Binding – This resolves the column or table names so that they map to the ids represented in their metadata.
Optimization – This applies a set of tree transformations so that the best possible execution plan is drawn.
Code generation – The generated tree may be stored as is, or transformed partly into code (scalar expressions) or fully into byte code, and saved in the plan cache.
Query execution – While the generated code may be run directly, some of it may be converted into physical operators.
Cleanup – Transactions may be rolled back, locks may be released, logging occurs as appropriate, and memory may be reclaimed.
A NoSQL database, on the other hand, is non-relational and distributed by nature. Its characteristics include a schema-free representation of data, easy replication support, a simple API, and a different transaction model called BASE (Basically Available, Soft state, Eventually consistent). NoSQL databases are able to handle data akin to warehouses because they scale based on nodes, not merely horizontal or vertical partitions. Inserts and updates are asynchronous. They are versioned and enable optimistic concurrency. Queries start returning answers quickly because they approximate answers, return stale data, ensure availability first, provide best effort, and are aggressive, simpler, and faster.
Although the consistency is weak, all nodes see the same data at the same time, and the client perceives that a set of operations has occurred all at once. In other words, they are atomic but error-tolerant and can be called multiple times. Node failures do not prevent survivors from continuing to operate. Every operation must terminate in an intended response. Operations will complete even if individual components are unavailable. NoSQL databases can appear in many forms, but fundamentally the map-reduce operations are similar. NoSQL databases can be Column Stores, Document Stores, or Key-Value Stores. A key-value store has a hash table of keys and values stored with keys. It provides fast access to small data values. The Map-Reduce operation consists of two phases:
Map – extracts the sets of Key-Value pairs from underlying data and potentially in parallel from multiple nodes  
Reduce – merges and sorts sets of key-value pairs, and the results may be useful for other searches. Map-Reduce operations are implemented by the application developers, not the database.
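The two phases can be illustrated with the classic word-count example. This is a minimal single-process sketch using Java streams, not a distributed framework; the method name `wordCount` and the sample documents are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniMapReduce {
    // Map phase: emit one (word, 1) pair per word in each document.
    // Reduce phase: group the pairs by key and sum the counts.
    static Map<String, Long> wordCount(List<String> docs) {
        return docs.stream()
                   .flatMap(doc -> Arrays.stream(doc.split("\\s+")))      // map
                   .collect(Collectors.groupingBy(w -> w, TreeMap::new,
                                                  Collectors.counting())); // reduce
    }

    public static void main(String[] args) {
        // Two "documents" stand in for data spread across nodes
        System.out.println(wordCount(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real deployment the map step runs in parallel on the nodes holding the data and the reduce step merges their partial results, but the key-value contract between the phases is the same.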


Monday, November 2, 2015

We continue discussing the problem on the 2006-sided regular polygon described earlier as
A diagonal of a regular 2006-gon is called odd if its endpoints divide the boundary into
two parts, each composed of an odd number of sides. Sides are also regarded as odd diagonals.
Suppose the 2006-gon has been dissected into triangles by 2003 non-intersecting diagonals.
Find the maximum possible number of isosceles triangles with two odd sides.

We used the lemma earlier that when a dissecting diagonal leaves a shorter path of the boundary of the 2006-gon consisting of n segments, the number of iso-odd triangles with vertices on this shorter path does not exceed n/2.
We said that if there is an iso-odd triangle PQS with all vertices on this shorter path and with PQ being the longest diagonal, then S is the summit of the triangle PQS, and the vertices divide the shorter path into four segments AP, PS, SQ and QB.
Since PQ is part of an iso-odd triangle and is the longest diagonal, an iso-odd triangle cannot have vertices on both the first and the last segment. This means that every iso-odd triangle that falls on the shorter path has all its vertices on one of the four segments.
Applying the lemma we started with to each of these four segments, the sum of the triangles in all four regions is at most n/2. Furthermore, the two inner segments have an odd number of sides, which makes the inequality on the number of triangles strictly less than n/2 and leaves one out of two in excess for each of these inner edges. Because of this excess, PQS is also covered by the estimate n/2, and the lemma holds.
We can argue the same when the iso-odd triangle with the longest diagonal is drawn the other way, with its third vertex not falling on the shorter path. In this case, the triangle is an acute or right triangle, not the obtuse one we had seen earlier. And we have three regions XZ, YZ and XY, where XZ and YZ are shorter than XY. For each of the three regions, we apply the lemma and infer that there are no more than n/2 iso-odd triangles in all unless XYZ is one of them. Thus the total number of triangles in the dissection is not greater than n/2.
We can also reason this for arbitrary triangles. Let us say ABC is an iso-odd triangle with AB and BC as odd sides. This implies there is an odd number of sides between A and B and between B and C. We say these sides belong to the iso-odd triangle ABC.
At least one side in each of these two groups does not belong to any other iso-odd triangle: an iso-odd triangle has two odd sides of equal length, so the number of sides belonging to it within a group is even, while each group contains an odd number of sides. Let us assign these two sides, one in each group, to the triangle ABC. To each iso-odd triangle we have thus assigned a pair of sides, with no two triangles sharing an assigned side. Since the 2006-gon has 2006 sides, at most 1003 such iso-odd triangles can appear.
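The final count can be written out explicitly: each iso-odd triangle is assigned two of the polygon's 2006 sides and no side is assigned twice, so

```latex
\#\{\text{iso-odd triangles}\} \;\le\; \frac{2006}{2} \;=\; 1003.
```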

Friday, October 30, 2015

A diagonal of a regular 2006-gon is called odd if its endpoints divide the boundary into
two parts, each composed of an odd number of sides. Sides are also regarded as odd diagonals.
Suppose the 2006-gon has been dissected into triangles by 2003 non-intersecting diagonals.
Find the maximum possible number of isosceles triangles with two odd sides.

An isosceles triangle is called odd if it has two odd sides. A triangle in the dissection above which is both odd and isosceles is called iso-odd.
We make an observation as follows:
Let AB be one of the dissecting diagonals and let lambda be the shorter part of the boundary of the 2006-gon with endpoints A, B. Suppose that lambda consists of n segments. Then the number of iso-odd triangles with vertices on lambda does not exceed n/2.
We can prove this by induction. For the initial case of n = 2, there can only be one such triangle, so the observation holds. Let us assume it holds for all lambda of length less than n, and consider the case when lambda consists of n segments. Let PQ be the longest diagonal that is a side of an iso-odd triangle PQS with all vertices on lambda. If there is no such triangle, then we are not increasing the number of iso-odd triangles and the observation holds. If, on the other hand, there is such a triangle, then every triangle whose vertices lie on lambda is right-angled or obtuse, with the maximum reached when it forms the triangle PQS with summit S. Then we may assume that the five points A, P, S, Q, B lie on lambda and partition it into four pieces AP, PS, SQ and QB, where possibly the outer two segments reduce to a point. Since PQ is the longest diagonal and PQS is an iso-odd triangle, it cannot have vertices on both of the outer segments AP and QB. Therefore every iso-odd triangle lying within lambda has its vertices on just one of the four pieces. To each of these pieces we can apply the induction hypothesis, and adding the four inequalities, we get the required number to not exceed n/2.
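In symbols, with |AP|, |PS|, |SQ|, |QB| the segment lengths summing to n, and |PS|, |SQ| odd, the induction step reads:

```latex
\left\lfloor\frac{|AP|}{2}\right\rfloor
+\left\lfloor\frac{|PS|}{2}\right\rfloor
+\left\lfloor\frac{|SQ|}{2}\right\rfloor
+\left\lfloor\frac{|QB|}{2}\right\rfloor
+1 \;\le\; \frac{n}{2},
```

where the +1 accounts for the triangle PQS itself; the two odd inner lengths each lose 1/2 to the floor, which is exactly the slack that absorbs it.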

The intuition behind this approach is that there are only three vertices we need to pick for such triangles, and given that they can lie arbitrarily along the segments, we categorize the forms the triangle can take and where those vertices may fall. Once we have a lemma for the number of such triangles, we can easily study it in each of the categories.

Thursday, October 29, 2015

Today we review the book The 5 Choices by Kory Kogon, Adam Merrill, and Leena Rinne.
This book is about the path to extraordinary productivity when the workplace is getting a whole lot more overwhelming and exhausting. Texts, emails, interruptions, meetings, etc. inundate us and distract us from thinking clearly to make good decisions and accomplish what matters. Therefore this book shows how to manage time and energy through five fundamental choices.
The book promises to increase our productivity at the individual, team, and organization levels and to make us feel engaged, accomplished, and satisfied.
The book shares new ideas on how to be productive, how to get clear and focus on the things that matter to us, how to increase our capability to make decisions, and how to recognize that we have the ability to do extraordinary work. The five choices are enumerated as follows:
Choice 1 is to act on the important and not to react to the urgent. The authors say that we either react or we think. Reacting is the lower level, where we process our feelings and emotions for a response but don't necessarily think it out. Thinking, on the other hand, is the upper level, where we make conscious and intentional decisions. The longer we use the upper plane, the better the outcome. We have some tools to do this. FranklinCovey's Time Matrix provides the framework, and Pause-Clarify-Decide provides the process. The time matrix is a four-quadrant visualizing tool that helps us categorize activities. The top left quadrant is the quadrant of necessity and contains things that are both urgent and important. The lower left quadrant is the quadrant of distraction. This is the one that gives a negative return on your attention and energy. The lower right quadrant is the quadrant of waste. We should try not to be here at all. The upper right quadrant is the one for extraordinary productivity. To spend more time in this quadrant, we should pause to think about which item belongs in which quadrant, and once it's clear, we decide accordingly.
Choice 2 is to go for the extraordinary and not to settle for the ordinary. For each item that we place in quadrant 2, we can attach a statement or story as we do in our scrum meetings. We anchor it to a purpose and passion. This helps us tap into our motivation and higher performance.
Choice 3 is to schedule the big rocks and not to sort gravel. Quantifying our work items is akin to attaching story points to our stories. Consequently, we don't spend time on the little things that clutter up and prevent us from executing on the big items. A master task list is probably one of the best visualizing tools for this. Since the list is the master and it contains the items that we want to execute, we have all the items in our consideration, ranked appropriately.
Choice 4 is to rule the technology and not let it rule us. Very often we spend more time learning to use a tool or to work with a tool with the assumption that the tool will save us, but as a Japanese swordsman once said, being independent of your tools and being present is the first step. You can manage the information you need to process by organizing it into appointments, tasks, contacts, and notes/documents.
Choice 4 also includes a set of master moves that make our execution efficient. The first master move is to win without fighting. Here we automate much of our decision-making process. The second master move is to eliminate any additional quadrant 3 and 4 items from our inbox. The third master move is to enable more preparedness by including a link to locate, so we spend hardly any time in Q1.
Choice 5 is to fuel our fire and not to burn out. The five drivers of mental and physical energy are moving, eating, sleeping, relaxing, and connecting. Each of them is powerful in making us more productive.

Wednesday, October 28, 2015

Q: An equilateral triangle is partitioned into n^2 congruent equilateral triangles by n-1 equidistant parallels to each of its three sides. Two chess-like bishops placed at any two vertices of the small triangles are said to menace one another if they lie on the same parallel. The problem is to determine the largest number of bishops that can be placed so that none menaces another.

A: A bishop may be assigned three coordinates a, b, c, namely the number of small-triangle sides it is away from each of the three sides of the big triangle.

In other words, the condition above is that if i != j, then ai != aj, bi != bj and ci != cj.
If any pair of these coordinates were the same, the two bishops would lie on the same parallel and hence would menace each other.


For N bishops with pairwise distinct coordinates, each family of coordinates takes N distinct non-negative values, so each coordinate sum is at least 0 + 1 + ... + (N-1) = N(N-1)/2.
Summing all three coordinates, we have 3N(N-1)/2 <= sum of all (ai + bi + ci).
Since every vertex of the grid satisfies a + b + c = n, the right-hand side equals Nn, and therefore N <= 2n/3 + 1.

By solving this problem, we have obtained the first condition as in the earlier problem, and again showed that the limiting case of equality is also included as a solution.
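The pairwise-distinct coordinates give the whole counting argument as a single chain (writing N for the number of bishops):

```latex
\frac{3N(N-1)}{2} \;\le\; \sum_{i=1}^{N}\bigl(a_i+b_i+c_i\bigr) \;=\; Nn
\quad\Longrightarrow\quad
N \;\le\; \frac{2n}{3}+1.
```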

#codingexercise
Determine the type of a triangle given its edge lengths
void printTriangle(int a, int b, int c)
{
    if (a == b && b == c) printf("equilateral");
    else if (a == b || b == c || c == a) printf("isosceles");
    else printf("scalene");
}
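The same classification, written in Java with mutually exclusive branches so only one label can ever be printed (the class and method names are illustrative):

```java
public class TriangleType {
    // Classify a triangle by its edge lengths; assumes the lengths
    // already form a valid triangle.
    static String classify(int a, int b, int c) {
        if (a == b && b == c) return "equilateral";
        if (a == b || b == c || c == a) return "isosceles";
        return "scalene";
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 3, 3)); // prints equilateral
        System.out.println(classify(3, 3, 4)); // prints isosceles
        System.out.println(classify(3, 4, 5)); // prints scalene
    }
}
```

Returning a string instead of printing inside each branch also makes the ordering explicit: the equilateral check must come before the isosceles check, since an equilateral triangle satisfies both.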