Wednesday, May 22, 2019

Querying:

We were discussing querying over key-value collections, which are ubiquitous in document and object storage. Their querying is handled natively by the data store. In SQL terms over a relational store, this translates to a join: the key-value collection can be treated as a table whose columns are the key and the value, the desired keys for the predicate can be placed in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on matching keys.
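As a rough sketch of this idea with an in-memory stand-in (the collection contents and key names below are made up), the key-value pairs behave like a two-column table and the keys of interest act as the temporary table that drives the join:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyValueJoin
{
    public static void main(String[] args)
    {
        // The key-value collection, viewed as a table with columns (key, value).
        Map<String, String> kvTable = new LinkedHashMap<>();
        kvTable.put("user:1", "alice");
        kvTable.put("user:2", "bob");
        kvTable.put("user:3", "carol");

        // The "temporary table" holding just the keys of interest.
        List<String> keysOfInterest = List.of("user:1", "user:3");

        // The join: emit (key, value) for every key present in both tables.
        keysOfInterest.stream()
                .filter(kvTable::containsKey)
                .forEach(k -> System.out.println(k + " -> " + kvTable.get(k)));
    }
}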
There is also a difference between queries that match a single key and those that match many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching each tuple against the set of keys in the predicate improves with the IN operator, because we no longer traverse the entire list once per key; instead, each entry is matched once against the set of keys specified by the IN operator. A join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan, as the sketch below illustrates.
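A minimal sketch of that distinction, with made-up keys and sizes: the IN-style predicate still visits every entry once and only the per-entry membership test is cheap, while a join driven by the key set probes only the keys of interest.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class InVersusJoin
{
    public static void main(String[] args)
    {
        Map<String, Integer> kv = new HashMap<>();
        for (int i = 0; i < 100_000; i++) kv.put("k" + i, i);

        Set<String> wanted = Set.of("k7", "k77", "k777");

        // IN-style predicate: the whole collection is still iterated once;
        // only the per-entry comparison against the key set is O(1).
        AtomicLong visitedByIn = new AtomicLong();
        kv.forEach((k, v) -> {
            visitedByIn.incrementAndGet();
            if (wanted.contains(k)) { /* entry qualifies */ }
        });

        // Join driven by the key set: only the keys of interest are probed,
        // so the range actually touched shrinks to the size of the set.
        AtomicLong visitedByJoin = new AtomicLong();
        for (String k : wanted)
        {
            visitedByJoin.incrementAndGet();
            if (kv.containsKey(k)) { /* entry qualifies */ }
        }

        System.out.println("IN visited " + visitedByIn + " entries, join probed " + visitedByJoin);
    }
}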
Presto from Facebook, a distributed SQL query engine, can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition the work the way MapReduce does; instead, it executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, streaming data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator takes the incoming query from the user, draws up the plan, and assigns resources.
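As a small illustration, a client can submit a query to the Presto coordinator over JDBC and let it plan and fan the work out to workers. This assumes the Presto JDBC driver is on the classpath, and the coordinator host, catalog, schema, and table name below are made up for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoQueryExample
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical coordinator endpoint, catalog and schema; substitute your own.
        String url = "jdbc:presto://coordinator.example.com:8080/hive/default";
        Properties props = new Properties();
        props.setProperty("user", "analyst");

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             // The coordinator plans this query and streams results back as stages complete.
             ResultSet rs = stmt.executeQuery("SELECT key, value FROM kv_events LIMIT 10"))
        {
            while (rs.next())
            {
                System.out.println(rs.getString("key") + " -> " + rs.getString("value"));
            }
        }
    }
}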
The difference between batch processing and stream processing is not only about latency but also about consistency, aggregation, and the type of data to query. If there are latency requirements on the retrieval results, batch processing may not be the way to go, because it may take an arbitrary amount of time before the results for a batch are complete. Stream processing, on the other hand, can be very fast because it shows results as they become available.

Similarly, consistency also places hard requirements on the choice of processing. Strict consistency has traditionally been associated with relational databases. Eventual consistency has been made possible in distributed storage with the help of consensus algorithms such as Paxos. Big Data is usually associated with eventual consistency. Facebook's Presto has made this leap for social media data, which usually runs on the order of petabytes.

The type of data that makes its way to such stores is usually document-like. However, neither batch nor stream processing necessarily differentiates on the type of data.

The processing is restricted by the batch mode or the stream mode, but because it can query heterogeneous data stores, it avoids extract-transform-load (ETL) operations.

Node GetSuccessor(Node root)
{
    if (root == null) return root;
    if (root.right != null)
    {
        // Successor is the leftmost node of the right subtree.
        Node current = root.right;
        while (current.left != null)
            current = current.left;
        return current;
    }
    // No right subtree: walk up until we arrive from a left child;
    // that ancestor is the in-order successor (null if root was the maximum).
    Node parent = root.parent;
    while (parent != null && parent.right == root)
    {
        root = parent;
        parent = parent.parent;
    }
    return parent;
}
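For completeness, here is a minimal sketch of the Node type and a small driver exercising GetSuccessor above; the tree contents are only an illustrative assumption.

class Node
{
    int value;
    Node left, right, parent;
    Node(int value) { this.value = value; }
}

// Example tree:        20
//                     /  \
//                    10   30
//                      \
//                       15
// The in-order successor of 15 is 20, and the successor of 20 is 30.
void demo()
{
    Node root = new Node(20);
    root.left = new Node(10);        root.left.parent = root;
    root.right = new Node(30);       root.right.parent = root;
    root.left.right = new Node(15);  root.left.right.parent = root.left;

    System.out.println(GetSuccessor(root.left.right).value); // prints 20
    System.out.println(GetSuccessor(root).value);            // prints 30
}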

#codingexercise
The number of elements in the successive perimeter adjacencies (concentric layers) of an m x n matrix is given by 2m+2n-4, 2m+2n-12, 2m+2n-20, and so on, with each inner layer holding 8 fewer elements than the one outside it.
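A quick sketch to check the pattern, assuming the k-th layer (0-indexed) is the perimeter of the inner (m-2k) x (n-2k) submatrix:

public class PerimeterLayers
{
    // Number of elements on the k-th (0-indexed) perimeter layer of an m x n matrix,
    // i.e. the perimeter of the inner (m-2k) x (n-2k) submatrix.
    static int layerCount(int m, int n, int k)
    {
        int rows = m - 2 * k;
        int cols = n - 2 * k;
        if (rows <= 0 || cols <= 0) return 0;           // layer does not exist
        if (rows == 1 || cols == 1) return rows * cols; // degenerate single row or column
        return 2 * rows + 2 * cols - 4;                 // drops by 8 for each inner layer
    }

    public static void main(String[] args)
    {
        // For a 6 x 7 matrix this prints 22, 14, 6, matching 2m+2n-4, 2m+2n-12, 2m+2n-20.
        for (int k = 0; k < 3; k++)
        {
            System.out.println(layerCount(6, 7, k));
        }
    }
}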
