Thursday, May 23, 2019

The use of the SQL query language is a matter of convenience. As long as the datasets can be enumerated, SQL can be used to query the data. This has been demonstrated by utilities like LogParser. The universal appeal of SQL is that it is well-established and convenient for many users. These include not only developers and testers but also applications that find it easy to switch persistence layers when the query language is standard.
In fact, querying has become such an important aspect of problem solving that analysts agree choosing the right tool for it is usually eighty percent of getting to a solution. The other twenty percent is understanding the problem we wish to solve.
To illustrate how important querying is, let us take a look at a few topologies for organizing data at any scale. If we have large enterprise data accumulations in a data warehouse, whether on-premises or in the cloud, the standard way of interacting with it has been SQL. Similarly, in-memory databases scale out beyond a single server to cluster nodes with communication based on SQL querying. Distributed databases send and receive data based on distributed queries, again written in SQL. Whether it is a social networking company or an online retail giant, data marshaled through services is eventually queried with SQL. Over the last two decades, the momentum of people adopting SQL to query data has only snowballed. This is reflected in applications and public company offerings that have embraced the cloud and clusters in their design.
Perhaps the biggest innovation in query languages has been the use of user-defined operators and computations to perform the work associated with the data. These have resulted in stored logic, such as stored procedures, written in a variety of languages. With the advent of machine learning and data mining, databases have added support for new languages and packages as well as algorithms that are now available right out of the box and shipped with the tool.
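As a small, hedged sketch of invoking such stored logic from a client, the standard JDBC call syntax below assumes a hypothetical procedure named score_customers and an illustrative connection URL; it is not tied to any particular engine.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

// Minimal sketch: invoking stored logic through the standard JDBC call syntax.
// The connection URL and the procedure name (score_customers) are illustrative
// assumptions; any database that supports stored procedures would look similar.
public class StoredLogicExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo", "user", "secret");
             CallableStatement call = conn.prepareCall("{call score_customers(?)}")) {
            call.setInt(1, 2019);   // pass a year parameter to the stored procedure
            call.execute();         // the procedure runs inside the database engine
        }
    }
}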

Wednesday, May 22, 2019

Querying:

We were discussing querying over key-value collections, which are ubiquitous in document and object storage. Their querying is handled natively by the data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-value collection can be considered a table with a key column and a value column. The desired keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
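A minimal sketch of that temporary-key-table join over JDBC follows; the table kv(k, v), the temporary table wanted_keys, and the key names are illustrative assumptions, and the temporary-table syntax varies by database.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of the temporary-key-table join, assuming a key-value table kv(k, v).
// Table and column names are illustrative assumptions.
public class TempKeyTableJoin {
    static void lookup(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TEMPORARY TABLE wanted_keys (k VARCHAR(64))");
            st.execute("INSERT INTO wanted_keys VALUES ('color'), ('size')");
            // The join restricts the work to the keys of interest instead of
            // testing every key-value pair against a long predicate.
            try (ResultSet rs = st.executeQuery(
                    "SELECT kv.k, kv.v FROM kv JOIN wanted_keys w ON kv.k = w.k")) {
                while (rs.next()) {
                    System.out.println(rs.getString("k") + " = " + rs.getString("v"));
                }
            }
        }
    }
}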
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching one tuple against the set of keys in the predicate improves when we use an IN operator, because we don't have to traverse the entire list multiple times. Instead, each entry is matched against the set of keys specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
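The contrast can be sketched in memory: repeated equality predicates rescan the list once per key, an IN-style predicate tests each entry once against a key set, and a join-style lookup probes only the keys of interest. The names below are illustrative; this is not the planner of any particular engine.

import java.util.*;

// Minimal sketch contrasting repeated scans, an IN-style set match,
// and a join-style lookup over a key-value list.
public class KeyValueMatching {
    record Entry(String key, String value) {}

    // Equivalent of repeated "key == k" predicates: one full scan per wanted key.
    static List<Entry> repeatedEquals(List<Entry> data, List<String> wanted) {
        List<Entry> out = new ArrayList<>();
        for (String k : wanted)
            for (Entry e : data)
                if (e.key().equals(k)) out.add(e);
        return out;
    }

    // Equivalent of an IN predicate: a single pass, each entry tested against the key set.
    static List<Entry> inPredicate(List<Entry> data, Set<String> wanted) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : data)
            if (wanted.contains(e.key())) out.add(e);
        return out;
    }

    // Equivalent of a join against a temporary key table: build a map once,
    // then probe only for the keys of interest.
    static List<Entry> hashJoin(List<Entry> data, List<String> wanted) {
        Map<String, List<Entry>> byKey = new HashMap<>();
        for (Entry e : data) byKey.computeIfAbsent(e.key(), k -> new ArrayList<>()).add(e);
        List<Entry> out = new ArrayList<>();
        for (String k : wanted) out.addAll(byKey.getOrDefault(k, List.of()));
        return out;
    }

    public static void main(String[] args) {
        List<Entry> data = List.of(new Entry("a", "1"), new Entry("b", "2"), new Entry("c", "3"));
        System.out.println(hashJoin(data, List.of("a", "c")));
    }
}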
Presto from Facebook, a distributed SQL query engine, can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition work with MapReduce; instead, it executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, streaming data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator takes the incoming query from the user and draws up the plan and the assignment of resources.
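As a small usage sketch, such ad hoc queries reach the coordinator through the ordinary JDBC surface (the presto-jdbc driver); the host, catalog, schema, and table below are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Sketch of an ad hoc query against a Presto coordinator over JDBC.
// The host, catalog, schema, and table names are illustrative assumptions.
public class PrestoAdhocQuery {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "analyst");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:presto://coordinator.example.com:8080/hive/web", props);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT page, count(*) AS hits FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " " + rs.getLong("hits"));
            }
        }
    }
}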
The difference between batch processing and stream processing is not only about latency but also about consistency, aggregation, and the type of data being queried. If there are latency requirements on the retrieval results, batch processing may not be the way to go, because it may take an arbitrary amount of time before the results for a batch are complete. Stream processing, on the other hand, can be much faster because it shows results as they become available.

Similarly, consistency also places hard requirements on the choice of processing. Strict consistency has traditionally been associated with relational databases. Eventual consistency has been made possible in distributed storage with the help of consensus algorithms such as Paxos. Big Data is usually associated with eventual consistency. Facebook's Presto has made the leap for social media data, which usually runs to petabytes.

The type of data that makes its way to such stores is usually document-like. However, batch or stream processing does not necessarily differentiate between kinds of data.

The processing is restricted to either batch mode or stream mode, but because the engine can query heterogeneous data stores, it avoids extract-transform-load operations.

// Minimal node with a parent pointer, as assumed by this snippet.
class Node {
    Node left, right, parent;
    int value;
}

// Returns the in-order successor of a node in a binary search tree that keeps parent pointers.
Node getSuccessor(Node root)
{
    if (root == null) return null;
    // If there is a right subtree, the successor is its leftmost node.
    if (root.right != null)
    {
        Node current = root.right;
        while (current.left != null)
            current = current.left;
        return current;
    }
    // Otherwise, climb until we arrive from a left child; that ancestor is the successor.
    Node parent = root.parent;
    while (parent != null && parent.right == root)
    {
        root = parent;
        parent = parent.parent;
    }
    return parent;
}

#codingexercise
The number of elements in the perimeter of an m x n matrix and in each successive inner ring is given by 2m+2n-4, 2m+2n-12, 2m+2n-20, and so on, decreasing by 8 per ring.
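A quick check of the formula, treating ring k = 0 as the outer perimeter of an m x n matrix:

// Number of elements in the k-th concentric ring of an m x n matrix (k = 0 is the perimeter).
// Valid while both remaining sides are larger than 1.
static int ringSize(int m, int n, int k) {
    int rows = m - 2 * k;
    int cols = n - 2 * k;
    return 2 * rows + 2 * cols - 4;   // k = 0 gives 2m+2n-4, k = 1 gives 2m+2n-12, k = 2 gives 2m+2n-20
}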

Tuesday, May 21, 2019

Querying:

We were discussing querying over key-value collections, which are ubiquitous in document and object storage. Their querying is handled natively by the data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-value collection can be considered a table with a key column and a value column. The desired keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
Without the analogy of the join, the key-value collections would require standard query operators such as a where clause that tests for a match against a set of keys. This is rather expensive compared to the join, because we do this over a large list of key-values, possibly with repeated iterations over the entire list for matches against one or more keys in the provided set.
Most key-value collections are scoped; they are not necessarily part of one large global list. Such key-values become scoped to the document or the object. The document is typically in one of two forms, JSON or XML. The JSON format has its own path query language, JMESPath, and XML also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators, without requiring the path expressions inherent to a document format such as JSON or XML.
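For instance, a path-based query over a key scoped to a single XML document can be issued with the standard Java XPath API; the document content and the path below are illustrative assumptions.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Path-based lookup of a key scoped to a single XML document, using the
// standard javax.xml.xpath API. The document content is an illustrative assumption.
public class ScopedPathQuery {
    public static void main(String[] args) throws Exception {
        String xml = "<item><color>red</color><size>large</size></item>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The path expression is scoped to this document only, not to a global key list.
        String color = xpath.evaluate("/item/color/text()", doc);
        System.out.println(color);   // prints "red"
    }
}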
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching one tuple against the set of keys in the predicate improves when we use an IN operator, because we don't have to traverse the entire list multiple times. Instead, each entry is matched against the set of keys specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
Presto from Facebook, a distributed SQL query engine, can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition work with MapReduce; instead, it executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, streaming data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator takes the incoming query from the user and draws up the plan and the assignment of resources.


Monday, May 20, 2019

Querying:

Key-value collections are ubiquitous in document and object storage. Their querying is handled natively by the data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-value collection can be considered a table with a key column and a value column. The desired keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
Without the analogy of the join, the key-value collections would require standard query operators such as a where clause that tests for a match against a set of keys. This is rather expensive compared to the join, because we do this over a large list of key-values, possibly with repeated iterations over the entire list for matches against one or more keys in the provided set.
Any match for a key, whether in a join or against a list, requires efficient traversal over all the keys. This is only possible if the keys are already arranged in sorted order and are searched with a divide-and-conquer technique, where the half of the range that cannot contain the key is ignored and the search is repeated on the remaining range. This technique, binary search, is efficient but only works when the large list of data is sorted. A useful data structure for this purpose is the B+ tree, which stores a set of keys as a unit while organizing the units in a tree-like manner.
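A minimal sketch of that divide-and-conquer lookup over sorted keys follows; a B+ tree applies the same idea node by node.

// Binary search over sorted keys: each comparison discards the half of the
// range that cannot contain the key. Returns the index of the key, or -1.
static int findKey(String[] sortedKeys, String key) {
    int lo = 0, hi = sortedKeys.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = sortedKeys[mid].compareTo(key);
        if (cmp == 0) return mid;
        if (cmp < 0) lo = mid + 1;   // key lies in the upper half
        else hi = mid - 1;           // key lies in the lower half
    }
    return -1;
}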
Most key-value collections are scoped; they are not necessarily part of one large global list. Such key-values become scoped to the document or the object. The document is typically in one of two forms, JSON or XML. The JSON format has its own path query language, JMESPath, and XML also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators, without requiring the path expressions inherent to a document format such as JSON or XML.
However, there are no query languages specific to objects, because the notion of an object is one of encapsulation and opacity, as opposed to documents, which are open. Objects are not limited to binary images or files; they can be text-based, and often the benefit of using object storage rather than a document store is its availability over web protocols. For example, an object store supports the S3 control and data path, which is universally recognized as web-accessible storage. A document store, on the other hand, provides easy programmability alongside conventional SQL statements and is popular for syntax like insertMany() and find(). Streaming queries are supported in both document and object stores with the help of cursors.
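As a usage sketch of that web-accessible data path, the AWS SDK for Java (v1) addresses objects by bucket and key; the bucket and key names below are illustrative assumptions.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Sketch of the S3 data path: objects are addressed by bucket and key over HTTP.
// The bucket and key names are illustrative assumptions; credentials and region
// come from the default provider chain.
public class ObjectStoreAccess {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        s3.putObject("demo-bucket", "notes/query.txt", "key-value text payload");
        String body = s3.getObjectAsString("demo-bucket", "notes/query.txt");
        System.out.println(body);
    }
}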
#codingexercise
When we find the kth element spiral-wise in a square matrix, we can skip over whole rings rather than walk every element.
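A minimal sketch of that idea, assuming a clockwise spiral that starts at the top-left corner and a 0-based k:

// Find the k-th (0-based) element visited in a clockwise spiral of an n x n
// matrix, skipping over whole rings instead of walking every cell.
static int kthSpiral(int[][] m, int k) {
    int n = m.length;
    int r = 0;
    // Skip complete rings that lie entirely before position k.
    while (true) {
        int s = n - 2 * r;                       // side length of the current ring
        int count = (s == 1) ? 1 : 4 * (s - 1);  // elements in this ring
        if (k < count) break;
        k -= count;
        r++;
    }
    int s = n - 2 * r;
    if (s == 1) return m[r][r];
    if (k < s) return m[r][r + k];                          // top row, left to right
    k -= s;
    if (k < s - 1) return m[r + 1 + k][r + s - 1];          // right column, top to bottom
    k -= (s - 1);
    if (k < s - 1) return m[r + s - 1][r + s - 2 - k];      // bottom row, right to left
    k -= (s - 1);
    return m[r + s - 2 - k][r];                             // left column, bottom to top
}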

Sunday, May 19, 2019

Flink security is based on Kerberos authentication. Consequently, Flink does not publish documentation on integration with Keycloak or Active Directory. This document helps identify the requirements for deploying Flink, by itself or as part of a product, in an organizational setting. All access to storage artifacts is assumed to be integrated with role-based access control. If internal accounts are used for communication with storage providers, there will be a role-to-role mapping between Flink and the tier 2 storage provider.

Keycloak is an OpenID Connect identity provider. It provides authentication, albeit limited authorization, over several membership providers, including Active Directory. It has an identity broker that facilitates authentication over a variety of providers and reuses the authentication context across sites.
Flink was primarily designed to work with Kafka, HDFS, HBase, and ZooKeeper. It satisfies the security requirements for all of these connectors by providing Kerberos-based authentication. Each connector corresponds to its own security module, so Kerberos may be turned on for each one independently. Each security module has to pass its security configuration to the connector; otherwise, the operating system account of the cluster is used to authenticate with it. Kerberos tickets and ticket renewal span long periods of time, which is often required for tasks performed on data streams. In standalone mode, only a keytab file is necessary; a keytab helps renew tickets and does not expire in such timeframes. Without a keytab or cluster involvement to renew tickets, only the ticket cache provided by kinit is available.
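A minimal sketch of the Kerberos-related settings expressed through Flink's Configuration API follows; the keytab path, principal, and login contexts are illustrative assumptions, and the same keys can also be placed in flink-conf.yaml.

import org.apache.flink.configuration.Configuration;

// Sketch of Kerberos settings for Flink's security modules. The keytab path,
// principal, and login contexts are illustrative assumptions.
public class FlinkKerberosConfig {
    public static Configuration kerberosSettings() {
        Configuration conf = new Configuration();
        conf.setString("security.kerberos.login.keytab", "/etc/security/keytabs/flink.keytab");
        conf.setString("security.kerberos.login.principal", "flink/worker@EXAMPLE.COM");
        // Enable Kerberos independently per connector via login contexts (e.g. ZooKeeper, Kafka).
        conf.setString("security.kerberos.login.contexts", "Client,KafkaClient");
        conf.setBoolean("security.kerberos.login.use-ticket-cache", false);
        return conf;
    }
}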

Saturday, May 18, 2019

Sequence Analysis: 
Data is increasing more than ever and at a fast pace. Algorithms for data analysis are required to become more robust, efficient, and accurate. Specializations in databases and higher-end processors suitable for artificial intelligence have contributed to improvements in data analysis. Data mining techniques discover patterns in the data and are useful for predictions, but they tend to require traditional databases.
Sequence databases are highly specialized, and even though they can be supported by the B-tree data structure that many contemporary databases use, they tend to be larger than many commercial databases. In addition, algorithms for mining sequential rules have focused on generating all such rules. These algorithms produce an enormous number of redundant rules; the large number not only makes mining inefficient, it also hampers iteration. Such algorithms depend on patterns obtained from earlier frequent-pattern mining algorithms. However, if the rules are normalized and redundancies removed, they become efficient to store and use with a sequence database.
The data structures used for sequence rules have evolved; a dynamic bit vector is now an alternative. The mining process typically involves a prefix tree. Early data processing stages tend to prune, clean, and canonicalize the data, and these steps reduce the number of rules.
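As a hedged sketch of the prefix-tree step only (not the dynamic-bit-vector algorithm), the following trie counts how many keyword sequences share each prefix; the class and sample sequences are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal prefix tree (trie) over keyword sequences: each path from the root is a
// sequence prefix, and each node counts how many input sequences share that prefix.
// This sketches the prefix-tree step only, not a full sequential-rule miner.
public class SequencePrefixTree {
    static class TrieNode {
        final Map<String, TrieNode> children = new HashMap<>();
        int count;
    }

    private final TrieNode root = new TrieNode();

    void addSequence(List<String> keywords) {
        TrieNode node = root;
        for (String word : keywords) {
            node = node.children.computeIfAbsent(word, w -> new TrieNode());
            node.count++;   // one more sequence passes through this prefix
        }
    }

    int prefixCount(List<String> prefix) {
        TrieNode node = root;
        for (String word : prefix) {
            node = node.children.get(word);
            if (node == null) return 0;
        }
        return node.count;
    }

    public static void main(String[] args) {
        SequencePrefixTree tree = new SequencePrefixTree();
        tree.addSequence(List.of("storage", "query", "index"));
        tree.addSequence(List.of("storage", "query", "plan"));
        System.out.println(tree.prefixCount(List.of("storage", "query")));   // prints 2
    }
}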
In the context of text mining, sequences have had limited application because the ordering of words has never been important for determining topics. However, salient keywords that are coherent enough for a topic, regardless of their locations, tend to form sequences rather than groups. Finding semantic information with word vectors does not help with this ordering; the two are independent. Moreover, word vectors are formed with a predetermined set of dimensions, and these dimensions do not increase significantly as more text is seen. There is no method for vectors to be redefined with increasing dimensions as the text progresses.
The number and scale of dynamic groups of word vectors can be arbitrarily large. The ordering of the words can remain alphabetical. These words can then map to a word-vector table where the features are predetermined, giving the table a fixed number of columns instead of leaving it as an arbitrarily wide table.
Since there is a lot of content with similar usage of words and phrases, and almost everyone uses the language in simple day-to-day English, there is a high likelihood that some of these groups will stand out in terms of usage and frequency. When we have exhaustively collected and persisted frequently co-occurring words in a "groupopedia" interpreted from a large corpus, with no limit on the number of words in a group and the groups persisted in sorted order of their frequencies, then we have a two-fold ability: first, to shred a given text into predetermined groups, thereby instantly recognizing topics; and second, to add to pseudo word vectors, where groups translate to vectors in the vector table.
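A minimal sketch of the counting half of that idea, assuming whitespace-tokenized text and word pairs as the simplest form of a group; the sample corpus is illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of collecting co-occurring word pairs from a corpus and keeping
// them sorted by frequency. A real grouping would use larger groups, keyword
// extraction, and a persisted "groupopedia"; this only shows the counting step.
public class CooccurrenceGroups {
    public static void main(String[] args) {
        List<String> corpus = List.of(
                "object storage supports web access",
                "object storage supports durability",
                "document storage supports queries");
        Map<String, Integer> pairCounts = new HashMap<>();
        for (String line : corpus) {
            String[] words = line.toLowerCase().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                for (int j = i + 1; j < words.length; j++) {
                    String pair = words[i].compareTo(words[j]) < 0
                            ? words[i] + " " + words[j] : words[j] + " " + words[i];
                    pairCounts.merge(pair, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(pairCounts.entrySet());
        ranked.sort((a, b) -> b.getValue() - a.getValue());   // most frequent groups first
        ranked.stream().limit(5).forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}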
Sequence mining is relatively new. It holds promise for keywords rather than regular text. Keyword extraction is itself performed with a variety of statistical and mining techniques; therefore, sequence mining adds value only when keyword extraction is good.
#codingexercise
https://github.com/ravibeta/ecs-samples/tree/master/ecs-go-certs

Friday, May 17, 2019


Object storage has established itself as a "standard storage" in the enterprise and the cloud. As it brings many storage best practices to provide durability, scalability, availability, and low cost to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some of the NoSQL stores can be overlaid on top of object storage and discuss an example with column storage. We focus on the use case of columns because they are not relational and find many applications similar to the use cases of object storage. They also tend to become large, with significant read-write access. Object storage then transforms from a storage layer participating in vectorized executions to one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don't want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle the processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that columns are similar to fragmented data transfers and show how object storage can serve as both the source and the destination of columns.
Column storage gained popularity because cells could be grouped in columns rather than rows. Reads and writes operate over columns, enabling fast data access and aggregation. Its need for storage is not very different from that of applications requiring object storage. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this juncture it is important to facilitate data transfer between objects and columns.
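The benefit of that column-wise grouping can be sketched as follows; the record shape and values are illustrative assumptions.

// Sketch of why column grouping speeds up aggregation: summing one field over a
// row-oriented layout touches every record, while a columnar layout scans one
// contiguous array of just that field.
public class ColumnVsRow {
    record Sale(String region, String product, double amount) {}

    public static void main(String[] args) {
        Sale[] rows = {
                new Sale("west", "disk", 120.0),
                new Sale("east", "disk", 80.0),
                new Sale("west", "tape", 40.0)
        };
        // Row-oriented aggregation: every whole record is visited to read one field.
        double rowSum = 0;
        for (Sale s : rows) rowSum += s.amount();

        // Column-oriented aggregation: the same field stored as one contiguous column.
        double[] amountColumn = {120.0, 80.0, 40.0};
        double colSum = 0;
        for (double a : amountColumn) colSum += a;

        System.out.println(rowSum + " " + colSum);
    }
}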
File systems have long been the destination for storing artifacts on disk, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files. Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would be more usable than files.
Since column storage overlays tier 2 storage on top of blocks, files, and blobs, it is already transferring data to object storage. However, the reverse is not as frequent, although objects in a storage class can continue to be serialized to columns in a continuous manner. This is also symbiotic for the audiences of both kinds of storage.
As compute, network, and storage overlap to expand the possibilities on each frontier at cloud scale, message passing has become ubiquitous functionality. While libraries like protocol buffers and solutions like RabbitMQ are becoming popular, flows and their queues can be given native support in unstructured storage. With data finding universal storage in object storage, we can now focus on treating nodes as objects and edges as keys, so that the web-accessible node can participate in column processing and keep it exclusively compute-oriented.