Friday, May 31, 2019

We continue our discussion of the Kubernetes framework.

Kubernetes can scale the number of pods an instance supports up or down. Functionality such as load balancing, API gateways, the NGINX ingress controller and others is important to applications. These routines are therefore provided out of the box by the Kubernetes framework. The observation here is that this is a constant feedback cycle: feedback from the applications improves the offerings from the host.
An example of this cycle can be seen with the operator-sdk. Originally, operators were meant to make it easy to deploy applications. While there are several tools to facilitate deployment, Kubernetes proposed doing it via operators. Applications started out with one operator, but today they tend to write more than one. It is in recognition of this fact that Kubernetes now has features to support operators dedicated to metrics. These metrics operators are new even for the operator-sdk, which as a tool generates boilerplate code for most applications.
The Kubernetes framework does not need to bundle all the value additions from routines performed across applications. Instead, it can pass the data through to hosts such as the public cloud and leverage the technologies of the host and the cloud. This technique allows offloading health and performance monitoring to external layers that may already have significant acceptance and consistency.
No new tools, plugins, add-ons or packages are needed by the application when Kubernetes supports these routines. At the same time, applications can choose when to evaluate the conditions for distributing their modules into parts. This frees up the applications and their packages, which are increasingly written to be hosted on their own pods.
Separation of the pods also improves modularity and reuse across application clients, providing the advantages of isolation, easier troubleshooting and maintenance.

Thursday, May 30, 2019

Kubernetes is a container framework, and we were discussing emerging trends in application development.


Storage is a layer best left outside the application logic. The applications continue to streamline their business logic while the storage layer provides best practices. When the storage layer is managed, its appeal grows. For applications that use the container framework, the storage is generally network accessible. There, tasks such as storage virtualization, replication and maintenance can be converged into a layer external to the application and reachable over the network.
Storage is not the only such layer. The container framework benefits from its own platform as a service. Routines such as monitoring, healing and auditing can all be delegated to plugins shipped out of the box with the container framework, and these bundles can come in different sizes.
Kubernetes itself is very helpful for applications of all sizes. These plugins are merely add-ons.
Kubernetes can scale the number of pods an instance supports up or down. Functionality such as load balancing, API gateways, the NGINX ingress controller and others is important to applications. These routines are therefore provided out of the box by the Kubernetes framework. The observation here is that this is a constant feedback cycle: feedback from the applications improves the offerings from the host.
An example of this cycle can be seen with the operator-sdk. Originally, operators were meant to make it easy to deploy applications. While there are several tools to facilitate deployment, Kubernetes proposed doing it via operators. Applications started out with one operator, but today they tend to write more than one. It is in recognition of this fact that Kubernetes now has features to support operators dedicated to metrics. These metrics operators are new even for the operator-sdk, which as a tool generates boilerplate code for most applications.

Wednesday, May 29, 2019

This design, and the relaxation of performance requirements for applications hosted on Kubernetes, facilitates different connectors, not just volume mounts. Just as log appenders publish logs to a variety of destinations, connectors help persist data written by the application to a variety of storage providers using consolidators, queues, caches and mechanisms that know how and when to write the data.
Unfortunately, the native Kubernetes API supports no form of storage connector other than the VolumeMount, but it does allow services to be written as Kubernetes applications that accept data published over HTTP(S), just as a time series database server accepts all kinds of events over the network. The configuration of the endpoint, the binding of the service and the contract associated with the service vary from app to app. This may call for a well-known consolidator app that can provide different storage classes supporting different application profiles. Appenders and connectors are popular design patterns that get reused often and justify their business value.
The shared data volume can be made read-only and accessible only to the pods, which facilitates access restrictions. While authentication, authorization and audit can be enabled for storage connectors, they will still require RBAC access. Therefore, service accounts become necessary with storage connectors. A side benefit of this security is that accesses can now be monitored and alerted on.
Storage is a layer best left outside the application logic. The applications continue to streamline their business logic while the storage layer provides best practices. When the storage layer is managed, its appeal grows. For applications that use the container framework, the storage is generally network accessible. There, tasks such as storage virtualization, replication and maintenance can be converged into a layer external to the application and reachable over the network.
Storage is not the only such layer. The container framework benefits from its own platform as a service. Routines such as monitoring, healing and auditing can all be delegated to plugins shipped out of the box with the container framework, and these bundles can come in different sizes.
Kubernetes itself is very helpful for applications of all sizes. These plugins are merely add-ons.

Tuesday, May 28, 2019

Kubernetes provides a familiar notion of a shared storage system with the help of VolumeMounts accessible from each container. The idea is that a shared file system may be considered local to the container and reused regardless of the container. File system protocols have always facilitated local and remote file storage with their support for distributed file systems. This allows databases, configurations and secrets to be available on disk across containers and provides a single point of maintenance. Most storage, regardless of the access protocol – file system, HTTP(S), block or stream – essentially moves data to storage, so there is transfer and latency involved.
The only question has been what latency and I/O throughput are acceptable for the application, and this has guided decisions about storage systems, appliances and their integrations. When the storage is tightly coupled with the compute, such as between a database server and a database file, the reads and writes measured by performance benchmarks require careful arrangement of bytes: their packing, organization, indexes, checksums and error codes. But most applications hosted on Kubernetes don’t have the same requirements as a database server.
This design, and the relaxation of performance requirements for applications hosted on Kubernetes, facilitates different connectors, not just volume mounts. Just as log appenders publish logs to a variety of destinations, connectors help persist data written by the application to a variety of storage providers using consolidators, queues, caches and mechanisms that know how and when to write the data.
Unfortunately, the native Kubernetes API supports no form of storage connector other than the VolumeMount, but it does allow services to be written as Kubernetes applications that accept data published over HTTP(S), just as a time series database server accepts all kinds of events over the network. The configuration of the endpoint, the binding of the service and the contract associated with the service vary from app to app. This may call for a well-known consolidator app that can provide different storage classes supporting different application profiles. Appenders and connectors are popular design patterns that get reused often and justify their business value.
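As an illustration of the appender/connector idea above, the following is a minimal, hedged sketch in C# of a connector that batches key-value records and posts them over HTTP(S) to a consolidator service. The endpoint URL, record shape and batching policy are assumptions for illustration; they are not part of any Kubernetes API.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// An appender-style storage connector: it decides how and when to write,
// while the application only calls AppendAsync().
class HttpStorageConnector
{
    private readonly HttpClient _client = new HttpClient();
    private readonly Uri _endpoint;                    // hypothetical consolidator service
    private readonly List<Dictionary<string, string>> _buffer = new List<Dictionary<string, string>>();
    private readonly int _batchSize;

    public HttpStorageConnector(string endpoint, int batchSize = 100)
    {
        _endpoint = new Uri(endpoint);
        _batchSize = batchSize;
    }

    public async Task AppendAsync(Dictionary<string, string> record)
    {
        _buffer.Add(record);
        if (_buffer.Count >= _batchSize)
            await FlushAsync();                        // when to write
    }

    public async Task FlushAsync()
    {
        if (_buffer.Count == 0) return;
        var payload = new StringContent(
            JsonSerializer.Serialize(_buffer), Encoding.UTF8, "application/json");
        await _client.PostAsync(_endpoint, payload);   // how to write
        _buffer.Clear();
    }
}

An application hosted in a pod would construct this with the consolidator's service address, for example new HttpStorageConnector("http://consolidator.default.svc/ingest"), and leave retries, authentication and the choice of storage class to the consolidator.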
The shared data volume can be made read-only and accessible only to the pods, which facilitates access restrictions. While authentication, authorization and audit can be enabled for storage connectors, they will still require RBAC access. Therefore, service accounts become necessary with storage connectors. A side benefit of this security is that accesses can now be monitored and alerted on.

Monday, May 27, 2019

The following is a continuation of the summary of some of the core concepts of Kubernetes.

Namespaces provide a scope for resource names. They cannot be nested within one another. They provide a means to divide resources between multiple users.

Most Kubernetes resources such as pods, services, replication controllers and others are in some namespace. However, low-level resources such as nodes and persistent volumes are not in any namespace.

Kubernetes control plane communication is bidirectional between the cluster and its master. The master hosts an apiserver that is configured to listen for remote connections. The apiserver reaches out to the kubelets to fetch logs, attach to running pods, and provide port-forwarding functionality. The apiserver manages nodes, pods and services.

Kubernetes has cluster-level logging. This stores all of the container logs and sends them to a central log store. The centralized store is then easy to search or browse via an interface. Common kubectl commands are also included. The log file (for example, log-file.log) goes through rotation, and the “kubectl logs” command uses this log file.

System components do not always run in a container. In cases where systemd is available, the logs are written to journald. The node-level logging agent runs on each node. A sidecar container streams to stdout but picks up logs from an application container using a logging agent.

Logs can also be directly written from the application to a backend log store.




Sunday, May 26, 2019

Today I discuss a coding exercise:
Let us traverse an m x n matrix spirally to find the kth element. A typical method for this would look like:
int GetKth(int[,] A, int m, int n, int k)
{
    if (n < 1 || m < 1) return -1;
    if (k <= m)                          // top row, left to right
        return A[0, k-1];
    if (k <= m+n-1)                      // right column, top to bottom
        return A[k-m, m-1];
    if (k <= m+n-1+m-1)                  // bottom row, right to left
        return A[n-1, m-1-(k-(m+n-1))];
    if (k <= m+n-1+m-1+n-2)              // left column, bottom to top
        return A[n-1-(k-(m+n-1+m-1)), 0];
    return GetKth(A.SubArray(1,1,m-2,n-2), m-2, n-2, k-(2*n+2*m-4)); // recurse on the inner matrix
}
Notice that this makes incremental, albeit slow, progress towards the goal, peeling one small but meaningful perimeter at a time until the finish.
Instead, we could also skip ahead. This unpeels the spiral by skipping several adjacent rows and columns at a time. The value of k has to be in the upper half of the number of elements in the matrix before this is useful.
When k is in this range, it can be reduced by eight, four or two adjacent perimeters' worth of elements at a time until it fits within half of the given matrix, at which point the above method to walk the spiral can be used. If we skip adjacent perimeters from the outermost of the m×n matrix, the numbers of elements we skip over are 2m+2n-4, 2m+2n-12, 2m+2n-20, and so on. In such cases we can quickly reduce k till we can walk the perimeter spiral of the inner matrix starting from the top left.

This follows the pattern 2m + 2(n-2), 2(m-2) + 2(n-4), 2(m-4) + 2(n-6), …

while (k - (4*m + 4*n - 16) > m*n/2) {
    k -= 4*m + 4*n - 16;   // two adjacent perimeters: (2m+2n-4) + (2m+2n-12)
    m = m - 4;
    n = n - 4;
}

This pattern can be rewritten as 2(m + (n-2)), 2((m-2) + (n-4)), 2((m-4) + (n-6)), …
which can be written as 2(m+n-2(0+1)), 2(m+n-2(1+2)), 2(m+n-2(2+3)), …
so the i-th perimeter (i = 0, 1, 2, …) has 2m + 2n - 4(i + i + 1) elements,
which simplifies to 2m + 2n - 8i - 4, and skipping the first j perimeters removes Sum(i = 0 … j-1)(2m + 2n - 8i - 4) elements.
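Putting this together, here is a hedged sketch in the style of GetKth above that jumps over whole perimeters arithmetically instead of recursing one peel at a time. It assumes k is 1-based, that the matrix still has at least two rows and two columns at every peeled level, and that SubArray is the same hypothetical helper used earlier. The closed form j(2m + 2n - 4) - 4j(j - 1) is the sum above taken over the first j perimeters.

int ElementsInOuterPerimeters(int m, int n, int j)
{
    // total elements in the outermost j perimeters of an m x n matrix
    // = Sum(i = 0 .. j-1)(2m + 2n - 8i - 4) = j(2m + 2n - 4) - 4j(j - 1)
    return j*(2*m + 2*n - 4) - 4*j*(j - 1);
}

int GetKthBySkipping(int[,] A, int m, int n, int k)
{
    int j = 0;
    // skip whole perimeters while k lies beyond them and an inner matrix remains
    while (m - 2*(j+1) > 1 && n - 2*(j+1) > 1 && k > ElementsInOuterPerimeters(m, n, j+1))
        j = j + 1;
    k = k - ElementsInOuterPerimeters(m, n, j);
    return GetKth(A.SubArray(j, j, m - 2*j, n - 2*j), m - 2*j, n - 2*j, k);
}

The loop still advances one perimeter per iteration, but it only does arithmetic and never touches the matrix; since the skipped count is quadratic in j, one could even solve for j directly instead of looping.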


Saturday, May 25, 2019

A piece of the puzzle:
This essay talks about connecting the public cloud with a third-party multi-factor authentication (MFA) provider as an insight into identity-related technologies in modern computing. Many organizations use multi-factor authentication for their applications. At the same time, they expect the machines deployed in their private cloud to be joined to their corporate network. If this private cloud were to be hosted on a public cloud as a virtual private cloud, it would require some form of Active Directory connector. This AD connector is a proxy which connects to the on-premises Active Directory that serves the entire organization as a membership registry. By configuring the connector to work with a third-party MFA provider like Okta, we centralize all the access requests and streamline the process.
Each MFA provider makes an agent available for download, and it typically speaks the LDAP protocol with the membership registry instance. The agent is installed on a server with access to the domain controller.
We can eliminate login and password hassles by connecting the public cloud resources to the organization’s membership provider so that the existing corporate credentials can be used to log in.
Furthermore, for new credentials, this lets us automatically provision, update or de-provision public cloud accounts when we update the organization’s membership provider on any Windows Server with access to the domain controller.
Thus a single corporate account can bridge public and private clouds for a unified sign-in experience.

Friday, May 24, 2019

The following is a summary of some of the core concepts of Kubernetes as required to write an operator. Before we begin, Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services. It is often a strategic decision for any company because it decouples the application from the hosts so that the same application can work elsewhere with minimal disruption to its use.
An operator is a way of automating the deployment of an application on a Kubernetes cluster. It is written with the help of template source code generated by a tool called the operator-sdk. The tool builds three components – custom resources, APIs and controllers. The custom resources are usually declarative definitions of the Kubernetes resources required by the application, grouped as suited for its deployment. The API is for the custom service required to deploy the application, and the controller watches this service.
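To make the controller's role concrete, here is a minimal, framework-agnostic sketch of the reconcile pattern an operator encodes. The type and helper names are hypothetical (the operator-sdk itself generates Go, not C#); the point is only that the controller watches the declared spec and drives the actual state toward it.

// Hypothetical custom resource spec: the desired state declared by the user.
interface ICustomResource
{
    string Name { get; }
    int DesiredReplicas { get; }
}

class ReconcileLoop
{
    // Called for every watch event on the custom resource.
    public void Reconcile(ICustomResource desired)
    {
        int actual = GetActualReplicas(desired.Name);
        if (actual < desired.DesiredReplicas)
            ScaleUp(desired.Name, desired.DesiredReplicas - actual);
        else if (actual > desired.DesiredReplicas)
            ScaleDown(desired.Name, actual - desired.DesiredReplicas);
        // otherwise the cluster already matches the declared spec
    }

    int GetActualReplicas(string name) { /* query the cluster */ return 0; }
    void ScaleUp(string name, int delta) { /* create pods */ }
    void ScaleDown(string name, int delta) { /* delete pods */ }
}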
Kubernetes does not limit the types of applications that are supported. It provides building blocks to the application. Containers only help isolate modules of the application into well-defined boundaries that can run with operating-system-level virtualization.
Kubernetes exposes a set of APIs that are used internally by the command-line tool called kubectl. They are also used externally by other applications. The API follows regular REST conventions and is versioned with path qualifiers such as v1, v1alpha1 or v1beta1 – the latter two are used with extensions to the APIs.
Kubernetes supports imperative commands, imperative object configuration, and declarative object configuration. These are different approaches to managing objects. The first approach operates on live objects and is recommended for development environments. The latter two operate on individual files or sets of files and are better suited for production environments.
Namespaces provide a scope for resource names. They cannot be nested within one another. They provide a means to divide resources between multiple users.
Most Kubernetes resources such as pods, services, replication controllers and others are in some namespace. However, low-level resources such as nodes and persistent volumes are not in any namespace.
Kubernetes control plane communication is bidirectional between the cluster and its master. The master hosts an apiserver that is configured to listen for remote connections. The apiserver reaches out to the kubelets to fetch logs, attach to running pods, and provide port-forwarding functionality. The apiserver manages nodes, pods and services.

Thursday, May 23, 2019

The use of the SQL query language is a matter of convenience. As long as the datasets can be enumerated, SQL can be used to query the data. This has been demonstrated by utilities like LogParser. The universal appeal of SQL is that it is well-established and convenient for many – not only developers and testers but also applications that find it easy to switch persistence layers when the query language is standard.
In fact, querying has become such an important aspect of problem solving that analysts agree choosing the right tool for it is usually eighty percent of getting to a solution. The other twenty percent is merely understanding the problem we wish to solve.
To illustrate how important querying is, let us take a look at a few topologies for organizing data at any scale. If we have large enterprise data accumulations in a data warehouse, whether on-premises or in the cloud, the standard way of interacting with it has been SQL. Similarly, in-memory databases scale out beyond a single server to cluster nodes with communications based on SQL querying. Distributed databases send and receive data based on distributed queries, again written in SQL. Whether it is a social networking company or an online store giant, data marshaled through services is eventually queried with SQL. Over the last two decades, the momentum of people adopting SQL to query data has only snowballed. This is reflected in applications and public company offerings that have taken to the cloud and clusters in their design.
Perhaps the biggest innovation in query languages has been the use of user-defined operators and computations to perform the work associated with the data. These have resulted in stored logic such as stored procedures, which are written in a variety of languages. With the advent of machine learning and data mining algorithms, query languages have gained support for new languages and packages as well as algorithms that are now available right out of the box and shipped with the tool.

Wednesday, May 22, 2019

Querying:

We were discussing querying over key-value collections, which are ubiquitous in documents and object storage. Their querying is handled natively per data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with key and value columns. The desired keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. It is only the efficiency of matching one tuple with the set of keys in the predicate that improves when we use an IN operator, because we don’t have to traverse the entire list multiple times. Instead, each entry is matched against the set of keys in the predicate specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
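To make the contrast concrete, here is a small, hedged C# LINQ sketch over an in-memory key-value list (the collections and keys are hypothetical). The Contains call plays the role of the IN operator and tests every entry against the key set, while the join is driven by the smaller key list.

using System;
using System.Collections.Generic;
using System.Linq;

class KeyValueQueryDemo
{
    static void Main()
    {
        var keyValues = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("color", "red"),
            new KeyValuePair<string, string>("size", "large"),
            new KeyValuePair<string, string>("region", "us-west"),
        };
        var keysOfInterest = new List<string> { "color", "region" };

        // IN-style predicate: every key-value entry is tested against the key set.
        var byContains = keyValues
            .Where(kv => keysOfInterest.Contains(kv.Key))
            .ToList();

        // Join-style: the match is driven by the (smaller) key list, which gives
        // the execution a chance to build a hash table and narrow the range.
        var byJoin = keysOfInterest
            .Join(keyValues, key => key, kv => kv.Key, (key, kv) => kv)
            .ToList();

        Console.WriteLine($"IN-style matches: {byContains.Count}, join matches: {byJoin.Count}");
    }
}

A query optimizer makes the same kind of choice when the keys of interest live in a temporary table that can be hash-joined against the key-value table.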
Presto from Facebook – a distributed SQL query engine – can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition based on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once while pipelining the data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator taking the incoming query from the user draws up the plan and the assignment of resources.
The difference between batch processing and stream processing is not only about latency but also about consistency, aggregation, and the type of data to query. If there are latency requirements on the retrieval results, batch processing may not be the way to go because it may take an arbitrary amount of time before the results for a batch are complete. On the other hand, stream processing may go really fast because it shows results as they become available.

Similarly, consistency also places hard requirements on the choice of processing. Strict consistency has traditionally been associated with relational databases. Eventual consistency has been made possible in distributed storage with the help of consensus algorithms such as Paxos. Big data is usually associated with eventual consistency. Facebook’s Presto has made the leap for social media data, which usually runs on the order of petabytes.

The type of data that makes its way to such stores is usually document-like. However, batch or stream processing does not necessarily differentiate the data.

The processing is restricted to batch mode or stream mode, but because it can query heterogeneous datastores, it avoids extract-transform-load operations.

Node GetSuccessor(Node root)
{
    if (root == null) return root;
    if (root.right != null)
    {
        // successor is the leftmost node of the right subtree
        Node current = root.right;
        while (current != null && current.left != null)
            current = current.left;
        return current;
    }
    // otherwise walk up until we arrive from a left child
    Node parent = root.parent;
    while (parent != null && parent.right == root)
    {
        root = parent;
        parent = parent.parent;
    }
    return parent;
}

#codingexercise
The numbers of elements in the successive perimeters of an m x n matrix are 2m+2n-4, 2m+2n-12, 2m+2n-20, …, decreasing by 8 each time.

Tuesday, May 21, 2019

Querying:

We were discussing querying over key-value collections, which are ubiquitous in documents and object storage. Their querying is handled natively per data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with key and value columns. The desired keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
Without the analogy of the join, the key-value collections will require standard query operators like a where clause that tests for a match against a set of keys. This is rather expensive compared to the join because we do this over a large list of key-values, possibly with repeated iterations over the entire list for matches against one or more keys in the provided set.
Most key-value collections are scoped. They are not necessarily in a large global list; such key-values become scoped to the document or the object. The document may be in one of two forms – JSON or XML. The JSON format has its own query language, JMESPath, and XML also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators without requiring the path syntax inherent to a document format such as JSON or XML.
There is also a difference in the queries when we match a single key versus many keys. For example, when we use the == operator versus the IN operator in the query statement, the size of the list of key-values to be iterated does not shrink. It is only the efficiency of matching one tuple with the set of keys in the predicate that improves when we use an IN operator, because we don’t have to traverse the entire list multiple times. Instead, each entry is matched against the set of keys in the predicate specified by the IN operator. The use of a join, on the other hand, reduces the size of the range significantly and gives the query execution a chance to optimize the plan.
Presto from Facebook – a distributed SQL query engine – can operate on streams from various data sources, supporting ad hoc queries in near real time. It does not partition based on MapReduce and executes the query with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once while pipelining the data between stages as it becomes available. This reduces end-to-end time while maximizing parallelization via stages on large data sets. A coordinator taking the incoming query from the user draws up the plan and the assignment of resources.


Monday, May 20, 2019

Querying:

Key-value collections are ubiquitous in documents and object storage. Their querying is handled natively per data store. This translates to what is popularly described in SQL over a relational store as a join, where the key-values can be considered a table with key and value columns. The desired keys to include in the predicate can be put in a separate temporary table holding just the keys of interest, and a join can be performed between the two based on the match between the keys.
Without the analogy of the join, the key-value collections will require standard query operators like a where clause that tests for a match against a set of keys. This is rather expensive compared to the join because we do this over a large list of key-values, possibly with repeated iterations over the entire list for matches against one or more keys in the provided set.
Any match for a key, regardless of whether it is in a join or against a list, requires efficient traversal over the keys. This is only possible if the keys are already arranged in sorted order and are searched using divide and conquer, where the range that cannot contain the key is ignored and the search repeated on the remaining range. This technique of binary search is an efficient algorithm that only works when the large list of data is sorted. A useful data structure for this purpose is the B+ tree, which allows a set of keys to be stored or treated as a unit while organizing the units in a tree-like manner.
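A minimal sketch of that divide-and-conquer lookup, assuming the keys are already held in a sorted array:

int FindKey(string[] sortedKeys, string key)
{
    int lo = 0, hi = sortedKeys.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = string.CompareOrdinal(sortedKeys[mid], key);
        if (cmp == 0) return mid;      // key found
        if (cmp < 0) lo = mid + 1;     // ignore the lower half
        else hi = mid - 1;             // ignore the upper half
    }
    return -1;                         // key not present
}

A B+ tree applies the same idea hierarchically: each node holds a sorted unit of keys, and the search descends only the subtree whose range can contain the key.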
Most key-value collections are scoped. They are not necessarily in a large global list; such key-values become scoped to the document or the object. The document may be in one of two forms – JSON or XML. The JSON format has its own query language, JMESPath, and XML also supports path-based queries. When the key-values are scoped, they can be efficiently searched by an application using standard query operators without requiring the path syntax inherent to a document format such as JSON or XML.
However, there are no query languages specific to objects because the notion of an object is one of encapsulation and opacity, as opposed to documents, which are open. Objects are not limited to binary images or files. They can be text based, and often the benefit of using object storage rather than a document store is availability over web protocols. For example, an object store supports the S3 control and data path that is universally recognized as web-accessible storage. A document store, on the other hand, provides easy programmability options alongside conventional SQL statements, and its usage is popular for syntax like insertMany() and find(). Stream queries are supported in both document and object stores with the help of cursors.
#codingexercise
When we find the kth element spiral-wise in a square matrix, we can skip over whole perimeters rather than walk the elements.

Sunday, May 19, 2019

Flink security is based on Kerberos authentication. Consequently, it does not publish documentation on integration with Keycloak or Active Directory. This document helps identify the requirements around deploying any product using Flink, or Flink by itself, in an organizational setting. All access to storage artifacts is considered integrated with role-based access control. If internal accounts are used for communication with storage providers, there will be a role-to-role mapping between Flink and the tier 2 storage provider.

Keycloak is an OpenID Connect identity provider. It provides authentication, albeit limited authorization, over several membership providers including Active Directory. It has a service broker that facilitates authentication over a variety of providers and reuses the context for authentication across sites.
Flink was primarily designed to work with Kafka, HDFS, HBase and ZooKeeper. It satisfies the security requirements for all these connectors by providing Kerberos-based authentication. Each connector corresponds to its own security module, so Kerberos may be turned on for each independently. Each security module has to call out its cross-site security configuration to the connector; otherwise the operating system account of the cluster is used to authenticate with the cluster. Kerberos tickets and their renewal span long periods of time, which is often required for tasks performed on data streams. In standalone mode, only a keytab file is necessary. A keytab helps renew tickets, and keytabs do not expire in such timeframes. Without a keytab or cluster involvement to renew tickets, only the ticket cache provided by kinit is available, which suffices only for shorter-lived deployments.

Saturday, May 18, 2019

Sequence Analysis: 
Data is increasing faster than ever. Algorithms for data analysis are required to become more robust, efficient and accurate. Specializations in databases and higher-end processors suitable for artificial intelligence have contributed to improvements in data analysis. Data mining techniques discover patterns in the data and are useful for predictions, but they tend to require traditional databases.
Sequence databases are highly specialized, and even though they can be supported by the B-tree data structure that many contemporary databases use, they tend to be larger than many commercial databases. In addition, algorithms for mining sequential rules focus on generating all sequential rules. These algorithms produce an enormous number of redundant rules. The large number not only makes mining inefficient; it also hampers iteration. Such algorithms depend on patterns obtained from earlier frequent-pattern mining algorithms. However, if the rules are normalized and redundancies removed, they become efficient to store and use with a sequence database.
The data structures used for sequence rules have evolved; a dynamic bit vector is now an alternative. The data mining process involves a prefix tree. Early data processing stages tend to prune, clean and canonicalize, and these steps reduce the number of rules.
In the context of text mining, sequences have had limited applications because the ordering of words has never been important for determining the topics. However, salient keywords regardless of their locations and coherent enough for a topic tend to form sequences rather than groups. Finding semantic information with word vectors does not help with this ordering.  They are two independent variables. And the word vectors are formed only with a predetermined set of dimensions. These dimensions do not increase significantly with progressive text. There is no method for vectors to be redefined with increasing dimensions as text progresses.  
The number and scale of dynamic groups of word vectors can be arbitrarily large. The ordering of the words can remain alphabetical. These words can then map to a word vector table where the features are predetermined giving the table a rather fixed number of columns instead of leaving it to be a big table.  
Since there is a lot of content with similar usage of words and phrases, and almost everyone uses the language in day-to-day simple English, there is a higher likelihood that some of these groups will stand out in terms of usage and frequency. When we have exhaustively collected and persisted frequently co-occurring words in a groupopedia as interpreted from a large corpus, with no limit on the number of words in a group and the groups persisted in sorted order of their frequencies, then we have a two-fold ability: first to shred a given text into pre-determined groups, thereby instantly recognizing topics, and second to add to pseudo word vectors where groups translate into vectors in the vector table.
Sequence Mining is relatively new. It holds promise for keywords and not regular texts. Keyword extraction is itself performed with a variety of statistical and mining techniques. Therefore, sequence mining adds value only when keyword extraction is good. 
#codingexercise
https://github.com/ravibeta/ecs-samples/tree/master/ecs-go-certs

Friday, May 17, 2019


Object storage has established itself as a “standard storage” in the enterprise and cloud. As it brings many storage best practices – durability, scalability, availability and low cost – to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some NoSQL stores can be overlaid on top of object storage and discuss an example with Column storage. We focus on the use case of columns because they are not relational and find many applications similar to the use cases of object storage. They also tend to become large with significant read-write access. Object storage then transforms from being a storage layer participating in vectorized executions to one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don’t want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that Columns are similar to fragmented data transfer and that object storage can serve as both the source and the destination of Columns.
Column storage gained popularity because cells could be grouped in columns rather than rows. Reads and writes are over columns, enabling fast data access and aggregation. Its need for storage is not very different from that of applications requiring object storage. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this juncture it is important to facilitate data transfer between objects and Columns.
File systems have long been the destination for storing artifacts on disk, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files. Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would become more usable than files.
Since Column storage overlays on tier 2 storage on top of blocks, files and blobs, it is already transferring data to object storage. However, the reverse is not as frequent, although objects in a storage class can continue to be serialized to Columns in a continuous manner. It is also symbiotic for the audiences of both kinds of storage.
As compute, network and storage overlap to expand the possibilities in each frontier at cloud scale, message passing has become ubiquitous functionality. While libraries like protocol buffers and solutions like RabbitMQ are becoming popular, flows and their queues can be given native support in unstructured storage. With data finding universal storage in object storage, we can now focus on treating nodes as objects and edges as keys so that the web-accessible node can participate in Column processing and keep it exclusively compute-oriented.
Let us look at some of the search queries that are typical of searching the logs of an identity provider:  
1) Query to report on callers from different origins - The login screen for the identity provider may be visited from different domains. The count of requests from each of these origins can be easily found by looking for the referrer domains and adding a count for each occurrence (a sketch of this follows after the list).  
  
2) Determine users who used two-step verification and their success rate - The growth in popularity of one-time passcodes over CAPTCHAs and other challenge questions could be plotted on a graph as a trend by tagging the requests that performed these operations. One of the reasons one-time passcodes are popular is that, unlike other forms, they have less chance of going wrong. The user is guaranteed to get a fresh code, and the code will successfully authenticate the user. OTP is used in many workflows for this purpose. 

3) Searching for login attempts - The above scenario also leads us to evaluate the conditions where customers ended up re-attempting because the CAPTCHA or their interaction on the page did not work. A hash of failures and their counts will determine which of these is a significant source of error. One outcome is that we may discover some forms of challenge are not suitable for the user. In these cases, it is easier to migrate the user to other workflows. 
  
4) Modifications made to account state - One of the best indicators of fraudulent activity is the pattern of access to account state, whether reads or writes. For example, the address, zip code and payment methods of the account change less frequently than the user’s password. If these do change often for a user, and from different sources, they may lead to fraud detection.  
Logs, clickstreams and metrics are only some of the ways to gain insight into customer activities related to their identity. While there may be other ways, identity provides a unique perspective to any holistic troubleshooting.  
Identity also provides boundless possibilities for analysis as a data source. 
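For the first query above, here is a hedged C# sketch of the per-origin count. The access-log path, the whitespace-separated line format, and the position of the referrer field are assumptions for illustration only.

using System;
using System.IO;
using System.Linq;

class ReferrerOriginReport
{
    static void Main()
    {
        // Count login-screen requests per referrer origin from a hypothetical access log.
        var countsByOrigin = File.ReadLines("identity-access.log")
            .Select(line => line.Split(' '))
            .Where(fields => fields.Length >= 2 &&
                             Uri.IsWellFormedUriString(fields[1], UriKind.Absolute))
            .GroupBy(fields => new Uri(fields[1]).Host)
            .ToDictionary(g => g.Key, g => g.Count());

        foreach (var entry in countsByOrigin.OrderByDescending(e => e.Value))
            Console.WriteLine($"{entry.Key}: {entry.Value}");
    }
}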

Thursday, May 16, 2019

Identity Impersonation continued:
Perhaps the most difficult task in mitigating identity threats is detecting identity spoofing and impersonation. Identity is determined only from credentials. If the credentials become compromised, there is nothing to prevent someone from assuming the identity of another.

There are several mitigations available to the owner for this purpose. Yet the sad truth is that even the best mitigation has been known to be overcome or circumvented. These mitigations include improving forms of credentials in terms of what the owner knows or has. The latter has taken the form of one-time passcodes, fingerprints and other biometrics, and keys. The systems that recognize these forms of authentication become increasingly complex, often to the hacker’s advantage.

However, impersonation is sometimes necessary and even desirable. Systems frequently allow workers to impersonate the user so that they can proceed with the security context of the user. In such cases, identity is already established prior to impersonation. The mechanism is usually easier within the system boundary than outside it, primarily because the system can exchange tokens representing users.

The stealing of identity can cause significant trouble for the owner, as is widely known from those who suffer credit card fraud where someone else impersonates the user. In the digital world, the compromise of identity often implies a risk that goes beyond personal computing. Most resources are on the network, and a hacker can easily gain access to privileged areas of the code.

There are two ways of looking at this, and they correspond to two different kinds of organizations – the white hat and the black hat communities that study and expose these vulnerabilities.

They have a large membership and their findings are well-covered in publications, news reports and circulations.

Software makers, including those with IAM modules, often release patches based on their findings. Some of them even have their own penetration testers who use old and new findings to test the security of the product.


Wednesday, May 15, 2019

Identity Metrics 

Metrics can accumulate over time. They are best stored in a time-series database, not only because they are machine data but also because they have a propensity toward a timeline. Since they are lightweight in size but frequently published, metrics are fine-grained data that can burden the network and the servers if the flow is not planned. 

Metrics for identity cover more than when a user signs in or signs out. There is a lot of business value in collecting reports on a user’s login activity across sites and applications. For instance, it helps determine the user’s interests, which can help with predictions and advertising. Users themselves want to subscribe to alerts and notifications for their activities. On the other hand, there is plenty of troubleshooting support from metrics. Generally, logs have been the source for offline analysis; however, metrics also support reporting in the form of beautiful charts and graphs. 

As with many metrics, there is a sliding window over timestamp-based datapoint collections for the same metric. Although metrics support flexible naming conventions, prefixes and paths, the same metric may have a lot of datapoints over time. A sliding window presents a limited range that can help with aggregation functions such as latest, maximum, minimum and count. Queries on metrics are facilitated with the help of annotations on the metric data and pre-defined metadata. These queries can use any language, but queries using search operators are preferred for their similarity to shell-based execution environments. 
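A hedged sketch of the sliding-window aggregation described above, over a hypothetical in-memory list of datapoints (a real time-series store computes this server-side):

using System;
using System.Collections.Generic;
using System.Linq;

class DataPoint
{
    public DateTime Timestamp;
    public double Value;
}

class MetricWindow
{
    // Aggregate the datapoints that fall within the sliding window ending at 'now'.
    public static void Report(List<DataPoint> points, TimeSpan window, DateTime now)
    {
        var inWindow = points
            .Where(p => p.Timestamp > now - window && p.Timestamp <= now)
            .OrderBy(p => p.Timestamp)
            .ToList();
        if (inWindow.Count == 0) { Console.WriteLine("no datapoints in window"); return; }
        Console.WriteLine($"latest={inWindow.Last().Value} max={inWindow.Max(p => p.Value)} " +
                          $"min={inWindow.Min(p => p.Value)} count={inWindow.Count}");
    }
}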
Identity metrics will generally flow into a time-series database along with other metrics from the same timeline. These databases have collection agents in the form of forwarders; the agents handle a limited set of loads, so they scale based on the system that generates the metrics. The metrics are usually sent over the wire as a single HTTP request per metric or per batch of metrics. That is why other protocols, such as syslog for the transfer of files or blobs, are sometimes preferred.  

Metrics, unlike logs, are ephemeral. They are only useful for a certain period, after which either cumulative metrics are regenerated or they are ignored. With the retention period being small, metrics are served best by near-real-time queries. Often users find reports and notifications more convenient than queries. 


int GetKthAntiClockWise(int[,] A, int m, int n, int k)
{
    if (n < 1 || m < 1) return -1;
    if (k <= n)                           // left column, top to bottom
        return A[k-1, 0];
    if (k <= n+m-1)                       // bottom row, left to right
        return A[n-1, k-n];
    if (k <= n+m-1+n-1)                   // right column, bottom to top
        return A[n-1-(k-(n+m-1)), m-1];
    if (k <= n+m-1+n-1+m-2)               // top row, right to left
        return A[0, m-1-(k-(n+m-1+n-1))];
    return GetKthAntiClockWise(A.SubArray(1,1,m-2,n-2), m-2, n-2, k-(2*n+2*m-4)); // recurse on the inner matrix
}


Tuesday, May 14, 2019

Identity as a delegated application.
Most identity and access management solutions are tied to some membership provider where the request from the user can be authenticated and even authorized. This represents a singleton global instance in an organization that wants to authenticate its members. Such a paradigm inevitably leads to centralized, global, database-oriented mechanisms. This is primarily because most sign-in requests from users come with credentials that the user has or knows. However, such credentials are also stored in a vault or secure store. Consequently, the authentication process is merely an automated process of retrieving the credentials and validating them against an IAM provider.
Although vaults, membership providers and authentication modules are centralized for several users, they can also be dedicated to a single user. Such a use case drives applications such as mobile wallets, passports and keychains that can be personal rather than a central repository. It is these use cases that significantly expand the notion of the authentication module as not necessarily tied to a single entity or worker but rather a coordination between dedicated and global instances.
A personal assistant closer to the user and dedicated to the user can take the credentials once and extend them to all the realms that the user navigates to. It can sign the user in and out seamlessly, allowing greater mobility and productivity than ever before. The applications that interact with the personal assistant can do so over a variety of protocols and workflows, enabling possibilities that were not available earlier.
A distributed authentication framework has to be differentiated by virtue of the user it serves. If the membership provider is distributed rather than centralized, that is unknown to the user. While this may be a significant distributed-computing perspective, it is not personalized and certainly does not break up the well-established design of traditional systems.

Monday, May 13, 2019


Columnar store overlay over object storage
Object storage has established itself as a “standard storage” in the enterprise and cloud. As it brings many storage best practices – durability, scalability, availability and low cost – to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some NoSQL stores can be overlaid on top of object storage and discuss an example with Column storage. We focus on the use case of columns because they are not relational and find many applications similar to the use cases of object storage. They also tend to become large with significant read-write access. Object storage then transforms from being a storage layer participating in vectorized executions to one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don’t want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that Columns are similar to fragmented data transfer and that object storage can serve as both the source and the destination of Columns.
Column storage gained popularity because cells could be grouped in columns rather than rows. Reads and writes are over columns, enabling fast data access and aggregation. Its need for storage is not very different from that of applications requiring object storage. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this juncture it is important to facilitate data transfer between objects and Columns. 
File systems have long been the destination for storing artifacts on disk, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files. Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would become more usable than files.  
Since Column storage overlays on tier 2 storage on top of blocks, files and blobs, it is already transferring data to object storage. However, the reverse is not as frequent, although objects in a storage class can continue to be serialized to Columns in a continuous manner. It is also symbiotic for the audiences of both kinds of storage. 
Object storage offers better features and cost management as it continues to stand out against most competitors in unstructured storage. The processors lower the costs of usage so that the total cost of ownership is also lowered, making object storage a whole lot more profitable to end users.