Saturday, May 25, 2019

A piece of the puzzle:
This essay discusses connecting the public cloud with a third-party multi-factor authentication (MFA) provider as an insight into identity-related technologies in modern computing. Many organizations use multi-factor authentication for their applications. At the same time, they expect the machines deployed in their private cloud to be joined to their corporate network. If this private cloud were hosted on a public cloud as a virtual private cloud, it would require some form of Active Directory connector. This AD connector is a proxy that connects to the organization's on-premises Active Directory, which serves as the membership registry. By configuring the connector to work with a third-party MFA provider such as Okta, we centralize all the access requests and streamline the process.
Each MFA provider makes an agent available for download, and it typically speaks the LDAP protocol to the membership registry instance. The agent is installed on a server with access to the domain controller.
We can eliminate login and password hassles by connecting the public cloud resources to the organization's membership provider so that the existing corporate credentials can be used to sign in.
Furthermore, for new credentials, this lets us automatically provision, update, or de-provision public cloud accounts when we update the organization's membership provider on any Windows Server with access to the domain controller.
Thus a single corporate account can bridge public and private clouds for a unified sign-in experience.
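As a minimal sketch of the kind of check such an agent performs, the snippet below validates a corporate credential over LDAP using the python ldap3 package. The domain controller host, domain suffix, and account names are placeholders, not values from any particular MFA product.

# Minimal sketch: validating a corporate credential over LDAP, the same protocol an
# MFA provider's AD agent speaks to the on-premises membership registry.
# Requires the third-party ldap3 package; host and names below are placeholders.
from ldap3 import ALL, Connection, Server

def validate_corporate_login(username: str, password: str) -> bool:
    server = Server("ldaps://dc.example.corp", get_info=ALL)  # domain controller (placeholder)
    user_dn = f"{username}@example.corp"                      # UPN-style bind identity
    conn = Connection(server, user=user_dn, password=password)
    try:
        return conn.bind()  # True only if the directory accepts the credential
    finally:
        conn.unbind()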

Friday, May 24, 2019

The following is a summary of some of the core concepts of Kubernetes as required to write an operator. Before we begin, Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services. It is often a strategic decision for any company because it decouples the application from the hosts so that the same application can work elsewhere with minimal disruption to its use.
An operator is a way of automating the deployment of an application on a Kubernetes cluster. It is written with the help of template source code generated by a tool called the operator-sdk. The tool builds three components: custom resources, APIs, and controllers. The custom resources are usually declarative definitions of the Kubernetes resources required by the application, grouped as suited for the deployment of the application. The API is for the custom service required to deploy the application, and the controller watches for this service.
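As a hedged illustration of what such a custom resource could look like from the client side, the sketch below uses the official Kubernetes Python client to create an instance of a hypothetical MyApp kind that the operator's controller would watch; the group, version, plural, and spec fields are invented for the example and assume the corresponding CRD is already installed.

# Sketch: creating an instance of a hypothetical custom resource watched by an operator.
# Group, version, kind, plural, and spec fields are illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

my_app = {
    "apiVersion": "apps.example.com/v1alpha1",
    "kind": "MyApp",
    "metadata": {"name": "demo"},
    "spec": {"replicas": 2, "image": "example/myapp:1.0"},
}

api.create_namespaced_custom_object(
    group="apps.example.com", version="v1alpha1",
    namespace="default", plural="myapps", body=my_app,
)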
Kubernetes does not limit the types of applications that are supported; it provides building blocks to the application. Containers only help isolate modules of the application into well-defined boundaries that can run with operating-system-level virtualization.
Kubernetes exposes a set of APIs that are used internally by the command-line tool called kubectl and externally by other applications. This API follows the regular REST conventions and is versioned with path qualifiers such as v1, v1alpha1, or v1beta1; the alpha and beta qualifiers are used with extensions to the APIs.
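For a quick look at those versioned paths, the sketch below calls the same REST endpoints kubectl uses, assuming a local kubectl proxy is running on its default port 8001 to handle authentication.

# Sketch: the versioned REST paths behind kubectl, called directly through kubectl proxy.
import requests

base = "http://127.0.0.1:8001"
pods = requests.get(f"{base}/api/v1/namespaces/default/pods").json()   # core group, v1
deploys = requests.get(f"{base}/apis/apps/v1/deployments").json()      # named group, apps/v1
print(len(pods["items"]), len(deploys["items"]))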
Kubernetes supports imperative commands, imperative object configuration, and declarative object configuration. These are different approaches to managing objects. The first approach operates on live objects and is recommended for development environments. The latter two are configurations that operate on individual files or a set of files, and they are better suited for production environments.
Namespaces seclude the names of resources. They cannot be nested within one another. They provide a means to divide cluster resources between multiple users.
Most Kubernetes resources, such as pods, services, replication controllers, and others, are in some namespace. However, low-level resources such as nodes and persistent volumes are not in any namespace.
Kubernetes control plane communication is bidirectional between the cluster and its master. The master hosts an apiserver that is configured to listen for remote connections. The apiserver reaches out to the kubelets to fetch logs, attach to running pods, and provide the port-forwarding functionality. The apiserver manages nodes, pods, and services.

Thursday, May 23, 2019

The use of the SQL query language is a matter of convenience. As long as the dataset can be enumerated, SQL can be used to query the data; this has been demonstrated by utilities like LogParser. The universal appeal of SQL is that it is well established and convenient for many, including not only developers and testers but also applications that find it easy to switch persistence layers when the query language is standard.
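As a small sketch of that idea, a few parsed request records, standing in for LogParser input, are loaded into an in-memory SQLite table and queried with ordinary SQL.

# Sketch: any enumerable dataset can be queried with SQL once it is given a relational
# surface; an in-memory SQLite table stands in for a parsed log here.
import sqlite3

rows = [("GET", "/index.html", 200), ("GET", "/missing", 404), ("POST", "/login", 200)]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (verb TEXT, path TEXT, status INTEGER)")
db.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)

for verb, count in db.execute(
        "SELECT verb, COUNT(*) FROM requests WHERE status = 200 GROUP BY verb"):
    print(verb, count)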
In fact, querying has become such an important aspect of problem solving that analysts agree choosing the right tool for it is usually eighty percent of getting to a solution; the remaining twenty percent is understanding the problem we wish to solve.
To illustrate how important querying is, let us look at a few topologies for organizing data at any scale. If we have large enterprise data accumulations with a data warehouse, whether on-premises or in the cloud, the standard way of interacting with it has been SQL. Similarly, in-memory databases scale out beyond a single server to cluster nodes with communications based on SQL querying. Distributed databases send and receive data based on distributed queries, again written in SQL. Whether it is a social networking company or a giant online store, data marshaled through services is eventually queried with SQL. Over the last two decades, the momentum of people adopting SQL to query data has only snowballed. This is reflected in applications and public company offerings that have taken to the cloud and clusters in their design.
Perhaps the biggest innovation in query languages has been the use of user-defined operators and computations to perform the work associated with the data. These have resulted in stored logic, such as stored procedures, written in a variety of languages. With the advent of machine learning and data mining, query engines have added support for new languages and packages, as well as algorithms that are now available right out of the box and ship with the tool.
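As a sketch of such stored logic, the snippet below registers a user-defined function with SQLite so that a custom computation runs inside the SQL statement itself; the function and table names are illustrative.

# Sketch: a user-defined function made callable from SQL, in the spirit of stored logic.
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.create_function("log2", 1, lambda x: math.log2(x) if x and x > 0 else None)

db.execute("CREATE TABLE sizes (name TEXT, bytes INTEGER)")
db.executemany("INSERT INTO sizes VALUES (?, ?)", [("a", 1024), ("b", 65536)])
print(list(db.execute("SELECT name, log2(bytes) FROM sizes")))  # [('a', 10.0), ('b', 16.0)]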

Wednesday, May 22, 2019

Querying:

We were discussing querying on key-value collections, which are ubiquitous in document and object storage. Their querying is handled natively by each data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with key and value columns. The desired keys for the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching one tuple against the set of keys in the predicate improves when we use an IN operator, because we do not have to traverse the entire list multiple times; instead, each entry is matched against the set of keys specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
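The sketch below writes the same key lookup both ways against SQLite: once with an IN predicate over the full key-value table and once as a join against a temporary table holding only the wanted keys; the table and key names are illustrative.

# Sketch: IN predicate versus join against a temp table of wanted keys.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
db.executemany("INSERT INTO kv VALUES (?, ?)", [(f"k{i}", f"v{i}") for i in range(1000)])

wanted = ["k3", "k42", "k999"]

# IN predicate: every candidate row is tested against the set of keys.
in_rows = db.execute(
    f"SELECT k, v FROM kv WHERE k IN ({','.join('?' * len(wanted))})", wanted).fetchall()

# Join: the planner can drive the lookup from the small table of keys.
db.execute("CREATE TEMP TABLE wanted_keys (k TEXT PRIMARY KEY)")
db.executemany("INSERT INTO wanted_keys VALUES (?)", [(k,) for k in wanted])
join_rows = db.execute(
    "SELECT kv.k, kv.v FROM kv JOIN wanted_keys ON kv.k = wanted_keys.k").fetchall()

assert sorted(in_rows) == sorted(join_rows)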
Presto from Facebook, a distributed SQL query engine, can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition based on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, pipelining the data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator taking the incoming query from the user draws up the plan and the assignment of resources.
The difference between batch processing and stream processing is not only about latency but also about consistency, aggregation, and the type of data to query. If there are latency requirements on the retrieval results, batch processing may not be the way to go, because it may take an arbitrary amount of time before the results for a batch are complete. On the other hand, stream processing can be very fast because it shows the results as they become available.

Similarly, consistency also places hard requirements on the choice of processing. Strict consistency has traditionally been associated with relational databases. Eventual consistency has been made possible in distributed storage with the help of consensus algorithms such as Paxos. Big data is usually associated with eventual consistency. Facebook's Presto has made the leap for social media data, which usually runs in the order of petabytes.

The type of data that makes its way to such stores is usually document-like. However, batch or stream processing does not necessarily differentiate the data.

The processing is restricted to the batch mode or the stream mode, but because it can query heterogeneous data stores, it avoids extract-transform-load operations.

Node GetSuccessor(Node root)
{
    // In-order successor in a binary search tree whose nodes carry parent pointers.
    if (root == null) return root;
    if (root.right != null)
    {
        // Successor is the leftmost node of the right subtree.
        Node current = root.right;
        while (current.left != null)
            current = current.left;
        return current;
    }
    // Otherwise climb until we arrive from a left child; that ancestor is the successor.
    Node parent = root.parent;
    while (parent != null && parent.right == root)
    {
        root = parent;
        parent = parent.parent;
    }
    return parent;
}

#codingexercise
The number of elements in the successive perimeter rings of an mxn matrix is given by 2m+2n-4, 2m+2n-12, 2m+2n-20, and so on; each inner ring has eight fewer elements than the one enclosing it.

Tuesday, May 21, 2019

Querying:

We were discussing querying on key-value collections, which are ubiquitous in document and object storage. Their querying is handled natively by each data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with key and value columns. The desired keys for the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
Without the analogy of the join, the key-value collections would require standard query operators like a where clause that tests for a match against a set of keys. This is rather expensive compared to the join because we do it over a large list of key-values, possibly with repeated iterations over the entire list for matches against one or more keys in the provided set.
Most key-value collections are scoped; they are not necessarily in one large global list. Such key-values become scoped to the document or the object. The document may be in one of two forms, JSON or XML. The JSON format has its own query language, referred to as JMESPath, and XML also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators, without requiring the path expressions inherent to a document format such as JSON or XML.
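As a small illustration of such path-based querying over scoped key-values, the sketch below evaluates JMESPath expressions against a JSON document using the jmespath Python package; the document shape is invented for the example.

# Sketch: JMESPath queries over key-values scoped to a JSON document.
import jmespath

doc = {
    "object": {
        "metadata": {"owner": "alice", "tier": "gold"},
        "tags": [{"key": "env", "value": "prod"}, {"key": "team", "value": "storage"}],
    }
}

print(jmespath.search("object.metadata.owner", doc))                  # alice
print(jmespath.search("object.tags[?key=='env'].value | [0]", doc))   # prod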
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching one tuple against the set of keys in the predicate improves when we use an IN operator, because we do not have to traverse the entire list multiple times; instead, each entry is matched against the set of keys specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
Presto from Facebook, a distributed SQL query engine, can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition based on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, pipelining the data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator taking the incoming query from the user draws up the plan and the assignment of resources.


Monday, May 20, 2019

Querying:

Key-value collections are ubiquitous in document and object storage. Their querying is handled natively by each data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with key and value columns. The desired keys for the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
Without the analogy of the join, the key-value collections would require standard query operators like a where clause that tests for a match against a set of keys. This is rather expensive compared to the join because we do it over a large list of key-values, possibly with repeated iterations over the entire list for matches against one or more keys in the provided set.
Any match for a key, whether in a join or against a list, requires efficient traversal over all the keys. This is only possible if the keys are already arranged in sorted order, so that they can be searched with divide and conquer: the range that cannot contain the key is discarded and the search repeats on the remaining range. This technique of binary search is efficient only when the large list of data is kept ordered. A useful data structure for this purpose is the B+ tree, which stores a set of keys as a unit while organizing the units in a tree-like manner.
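A minimal sketch of that search over sorted keys, using Python's bisect module to halve the range instead of scanning the whole list:

# Sketch: membership probe by binary search over sorted keys, the idea behind B+ tree lookups.
from bisect import bisect_left

def contains(sorted_keys, key):
    i = bisect_left(sorted_keys, key)  # first position where key could be inserted
    return i < len(sorted_keys) and sorted_keys[i] == key

keys = sorted(["alpha", "delta", "kappa", "omega", "sigma"])
print(contains(keys, "kappa"), contains(keys, "zeta"))  # True False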
Most key-value collections are scoped; they are not necessarily in one large global list. Such key-values become scoped to the document or the object. The document may be in one of two forms, JSON or XML. The JSON format has its own query language, referred to as JMESPath, and XML also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators, without requiring the path expressions inherent to a document format such as JSON or XML.
However, there are no query languages specific to objects, because the notion of an object is one of encapsulation and opacity, as opposed to documents, which are open. Objects are not limited to binary images or files; they can be text-based, and often the benefit of using object storage over a document store is its availability over web protocols. For example, an object store supports the S3 control and data path that is universally recognized as web-accessible storage. A document store, on the other hand, provides easy programmability options compared to conventional SQL statements, and its usage is popular for syntax like insertMany() and find(). Stream queries are supported in both document and object stores with the help of a cursor.
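As a hedged sketch of the two access styles, the snippet below reads an object over an S3-compatible web API with boto3 and writes the same text to a document store with pymongo's insert_many and find calls; the endpoint, bucket, database, and collection names are placeholders.

# Sketch: object store over the web API versus document store with query-style calls.
# Endpoint, bucket, database, and collection names are placeholders.
import boto3
from pymongo import MongoClient

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")
body = s3.get_object(Bucket="notes", Key="2019/05/20.txt")["Body"].read()

coll = MongoClient("mongodb://localhost:27017")["notes"]["entries"]
coll.insert_many([{"day": "2019-05-20", "text": body.decode("utf-8")}])
print(coll.find_one({"day": "2019-05-20"})["day"])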
#codingexercise
When we find the kth element spiral-wise in a square matrix, we can skip whole rings of elements rather than walking them.
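A sketch of that skipping approach for an n x n matrix: each ring holds 4*(side-1) elements (one element when the side is 1), so whole rings are subtracted from k before indexing into the ring that contains the answer.

# Sketch: k-th element (1-based) of an n x n matrix in clockwise spiral order,
# skipping whole rings by size instead of walking every element.
def kth_spiral(matrix, k):
    n = len(matrix)
    r = 0
    while True:
        side = n - 2 * r
        ring = 1 if side == 1 else 4 * (side - 1)
        if k <= ring:
            break
        k -= ring              # skip this ring entirely
        r += 1
    top, bottom = r, n - 1 - r
    k -= 1                     # 0-based offset within the ring
    if k < side:                        # top row, left to right
        return matrix[top][top + k]
    k -= side
    if k < side - 1:                    # right column, top to bottom
        return matrix[top + 1 + k][bottom]
    k -= side - 1
    if k < side - 1:                    # bottom row, right to left
        return matrix[bottom][bottom - 1 - k]
    k -= side - 1
    return matrix[bottom - 1 - k][top]  # left column, bottom to top

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print([kth_spiral(m, k) for k in range(1, 10)])  # [1, 2, 3, 6, 9, 8, 7, 4, 5]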

Sunday, May 19, 2019

Flink security is based on Kerberos authentication. Consequently, it does not publish documentation on integration with Keycloak or Active Directory. This document helps identify the requirements around deploying any product using Flink, or Flink by itself, in an organizational setting. All access to storage artifacts is considered to be integrated with role-based access control. If internal accounts are used for communication with storage providers, there will be a role-to-role mapping between Flink and the tier-2 storage provider.

Keycloak is an OpenID Connect identity provider. It provides authentication, albeit limited authorization, over several membership providers, including Active Directory. It has a service broker that facilitates authentication over a variety of providers and reuses the authentication context across sites.
Flink was primarily designed to work with Kafka, HDFS, HBase, and ZooKeeper. It satisfies the security requirements for all these connectors by providing Kerberos-based authentication. Each connector corresponds to its own security module, so Kerberos may be turned on for each independently. Each security module has to call out its cross-site security configuration to the connector; otherwise, the operating-system account of the cluster is used to authenticate with the cluster. Kerberos tickets and their renewal help span the long periods of time often required for tasks performed on data streams. In standalone mode, only a keytab file is necessary; a keytab helps renew tickets and does not expire in such a timeframe. Without a keytab or cluster involvement to renew tickets, only the ticket cache provided by kinit is available, and it is subject to ticket expiry.
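For reference, a minimal sketch of the relevant flink-conf.yaml entries when Kerberos is enabled with a keytab; the path, principal, and contexts are placeholders that depend on the connectors in use.

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /etc/security/keytabs/flink.keytab
security.kerberos.login.principal: flink@EXAMPLE.COM
# JAAS contexts that receive the credentials, e.g. ZooKeeper (Client) and Kafka
security.kerberos.login.contexts: Client,KafkaClient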