Cluster computing

Saturday, June 10, 2017

Database queries are just as important as data manipulation and archival and have played an important factor in their use. The kind of queries made has driven the adoption of the SQL standard that continues till date. Even when map-Reduce is the norm for Big Data, the ability to translate SQL queries on Big Data has become popular. Query execution and optimization is heavily dependent on costing model Once the query is sent to the database engine, it is translated into an abstract syntax tree. Then it is bound/resolved to column/table names. Next it is optimized by a set of tree transformations that chooses join orders or rewrites subselects. Both the nodes and the operations might change. The tree may be inserted into the cache and hydrated with code before execution. During execution, another set of tree transformations occur make it interpreted and more stateful. Resources acquired during cleanup are then released on cleanup. On the NoSQL side, say with Hadoop ecosystem, Hive is preferred over Pig because it is very SQL-like. It abstracts the data retrieval but does not support all the operations. Hive is also way more slower than SQL queries because there is no caching involved. It translates to MapReduce and re-runs each time. We realize its value only in context of the scale out efficiencies of Hadoop. A TableScan seems appealing only when the index seeks are in the orders of magnitude more.

Data access patterns that involve locking and logging are generally limiting. When these constraints relaxed with say lock free skip lists, then range queries can actually scale more than B-Trees. These data structures together with multiversion concurrency control make it easier for the database to run in-memory which makes it all the more faster. In traditional databases, there is fixed overhead per query in terms of setting up contexts, preparing the tree for interpretations and running the query. The in-memory databases use code generation to avoid all this. Similarly MVCC gives the same efficiency and correctness as transactions. Given that SQL queries are becoming a standard for working with relational, non-relational and in-memory databases, it should be safe to bet that this will become the norm for any new additions to the group. In fact, in-memory database scale out beyond single server to cluster nodes with communications based on SQL querying. Tables are distributed with hash-partitioning. What in-memory database fail to do is become a data platform like Splunk does for machine data and analysis

#codingexercise

Reverse a singly linked list in groups of K and K+1 alternatively

eg. 1, 2, 3, 4, 5, 6,7 K= 3

3,2,1,7,6,5,4
static Node Reverse(int k, int group, ref Node root)
{
Node current = root;
Node next = null;
Node prev = null;
int count = 0;
int max = k;
if (group %2 == 1) max = k+1 ;
while (current != null && count < max)
{
next = current.next;
current.next = prev;
prev = current;
current = next;
count++;
}
if (root != null)
root.next = Reverse(k, group+1, ref next);
return prev;
}

Cluster computing

Saturday, June 10, 2017

No comments:

Post a Comment