Saturday, May 18, 2019

Sequence Analysis: 
Data is growing faster than ever, and algorithms for data analysis are required to become more robust, efficient and accurate in response. Specialized databases and higher-end processors suited to artificial intelligence have contributed to improvements in data analysis. Data mining techniques discover patterns in the data and are useful for predictions, but they tend to require traditional databases. 
Sequence databases are highly specialized, and even though they can be supported by the B-Tree data structure that many contemporary databases use, they tend to be larger than many commercial databases. In addition, algorithms for mining sequential rules typically focus on generating all possible rules. These algorithms produce an enormous number of redundant rules. The large number not only makes mining inefficient; it also hampers iteration. Such algorithms depend on patterns obtained from earlier frequent-pattern mining algorithms. However, if the rules are normalized and redundancies removed, they become efficient to store and use with a sequence database. 
The data structures used for sequence rules have evolved, and a dynamic bit vector is now an alternative. The mining process typically involves a prefix tree. Early processing stages prune, clean and canonicalize the data, which reduces the number of rules. 
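As a rough illustration of the prefix-tree idea, here is a minimal sketch in C#; the class and method names are invented for this example, and a real miner would also maintain bit vectors per node to track which sequences contribute to the support.

using System.Collections.Generic;

// A minimal prefix tree (trie) over item sequences: shared prefixes are stored once,
// and each node carries a support count that rises with every sequence passing through it.
class SequencePrefixTree
{
    private readonly Dictionary<string, SequencePrefixTree> children = new Dictionary<string, SequencePrefixTree>();
    public int Support { get; private set; }

    public void Insert(IEnumerable<string> sequence)
    {
        var node = this;
        foreach (var item in sequence)
        {
            if (!node.children.TryGetValue(item, out var child))
            {
                child = new SequencePrefixTree();
                node.children[item] = child;
            }
            node = child;
            node.Support++;
        }
    }

    public int SupportOf(IEnumerable<string> prefix)
    {
        var node = this;
        foreach (var item in prefix)
            if (!node.children.TryGetValue(item, out node)) return 0;
        return node.Support;
    }
}

A miner built on such a tree would prune prefixes whose support falls below a threshold before generating rules, which is what keeps the stored rule set small and non-redundant.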
In the context of text mining, sequences have had limited application because the ordering of words has rarely mattered for determining topics. However, salient keywords that are coherent enough to describe a topic, regardless of their locations, tend to form sequences rather than groups. Finding semantic information with word vectors does not help with this ordering; the two are independent. Moreover, word vectors are formed with a predetermined set of dimensions, and these dimensions do not grow significantly as the text progresses. There is no method for redefining the vectors with increasing dimensions as more text is seen. 
The number and scale of dynamic groups of word vectors can be arbitrarily large. The ordering of the words can remain alphabetical. These words can then map to a word vector table where the features are predetermined, giving the table a fixed number of columns rather than leaving it as an arbitrarily wide table. 
Since a lot of content uses similar words and phrases, and almost everyone uses plain everyday English, there is a high likelihood that some of these groups will stand out in terms of usage and frequency. When we have exhaustively collected frequently co-occurring words from a large corpus and persisted them in a groupopedia, with no limit on the number of words in a group and with the groups persisted in sorted order of their frequencies, we gain a two-fold ability: first, to shred a given text into pre-determined groups and thereby instantly recognize topics; and second, to add to pseudo word vectors, where groups translate into vectors in the vector table. 
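A toy sketch of the groupopedia idea in C#, with the simplifying assumption that each sentence's distinct words form one group and that the group key is the alphabetically ordered word list; the names are made up for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

static class Groupopedia
{
    // Counts word groups (one per sentence, words kept in alphabetical order so the same
    // set always maps to the same key) and returns them sorted by descending frequency.
    public static List<KeyValuePair<string, int>> Build(IEnumerable<string> sentences)
    {
        var counts = new Dictionary<string, int>();
        foreach (var sentence in sentences)
        {
            var words = sentence.ToLowerInvariant()
                                .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries)
                                .Distinct()
                                .OrderBy(w => w);
            var key = string.Join(" ", words);
            counts[key] = counts.TryGetValue(key, out var c) ? c + 1 : 1;
        }
        return counts.OrderByDescending(kv => kv.Value).ToList();
    }
}

The most frequent groups then double as pseudo word vectors: each group translates into an entry in the vector table, and a new text is shredded by checking which groups it contains.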
Sequence mining is relatively new. It holds promise for keywords rather than full texts. Keyword extraction is itself performed with a variety of statistical and mining techniques. Therefore, sequence mining adds value only when the keyword extraction is good. 
#codingexercise
https://github.com/ravibeta/ecs-samples/tree/master/ecs-go-certs

Friday, May 17, 2019


Object storage has established itself as a “standard storage” in the enterprise and cloud. As it brings many storage best practices to provide durability, scalability, availability and low cost to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some NoSQL stores can be overlaid on top of object storage and discuss an example with column storage. We focus on the use case of columns because they are not relational and find many applications similar to the use cases of object storage. They also tend to become large, with significant read-write access. Object storage then transforms from a storage layer that merely participates in vectorized executions to one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don't want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that columns are similar to fragmented data transfers and show how object storage can serve as both the source and the destination of columns.
Column storage gained popularity because cells could be grouped in columns rather than rows. Reads and writes are over columns, enabling fast data access and aggregation. Their need for storage is not very different from that of applications requiring object storage. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this juncture it is important to facilitate data transfer between objects and columns.
File systems have long been the destination for storing artifacts on disk, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files. Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would become more usable than files.
Since column storage overlays on Tier 2 storage on top of blocks, files and blobs, it is already transferring data to object storage. However, the reverse is not as frequent, although objects in a storage class can continue to be serialized to columns in a continuous manner. This is also symbiotic for the audiences of both storage systems.
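To make the overlay concrete, here is a sketch only: each column chunk is serialized and written as an object whose key encodes the table, column and starting row, so a reader can aggregate one column without touching the others. The IObjectStore interface is hypothetical and stands in for any S3- or ECS-style client.

using System;
using System.Linq;

// Hypothetical object-store interface; any S3/ECS-style client could sit behind it.
interface IObjectStore
{
    void Put(string key, byte[] value);
    byte[] Get(string key);
}

// Writes a column of integers as fixed-size chunks, one object per chunk.
// The key layout table/column/startRow is illustrative, not a standard.
class ColumnOverlay
{
    private readonly IObjectStore store;
    private const int ChunkSize = 1024;

    public ColumnOverlay(IObjectStore store) { this.store = store; }

    public void WriteColumn(string table, string column, int[] values)
    {
        for (int start = 0; start < values.Length; start += ChunkSize)
        {
            var chunk = values.Skip(start).Take(ChunkSize).ToArray();
            var bytes = chunk.SelectMany(v => BitConverter.GetBytes(v)).ToArray();
            store.Put($"{table}/{column}/{start}", bytes);    // one object per column chunk
        }
    }

    public long SumColumn(string table, string column, int rowCount)
    {
        long sum = 0;
        for (int start = 0; start < rowCount; start += ChunkSize)
        {
            var bytes = store.Get($"{table}/{column}/{start}");
            for (int i = 0; i < bytes.Length; i += sizeof(int))
                sum += BitConverter.ToInt32(bytes, i);        // aggregate without reading other columns
        }
        return sum;
    }
}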
As compute, network and storage overlap to expand the possibilities of each frontier at cloud scale, message passing has become a ubiquitous functionality. While libraries like protocol buffers and solutions like RabbitMQ are becoming popular, flows and their queues can be given native support in unstructured storage. With data finding universal storage in object storage, we can now focus on making nodes objects and edges keys, so that a web-accessible node can participate in column processing while keeping that processing exclusively compute-oriented.
Let us look at some search queries that are typical when searching the logs of an identity provider (a small sketch follows the list):  
1) Query to report on callers from different origins - The login screen of the identity provider may be visited from different domains. The count of requests from each of these origins can easily be found by looking at the referrer domains and adding a count for each occurrence.  
  
2) Determine users who used two-step verification and their success rate - The growth in popularity of one-time passcodes over captchas and other challenge questions could be plotted as a trend by tagging the requests that performed these operations. One of the reasons one-time passcodes are popular is that, unlike other forms, they have less chance of going wrong. The user is guaranteed to get a fresh code, and the code will successfully authenticate the user. OTP is used in many workflows for this reason. 

3) Searching for login attempts - The above scenario also leads us to evaluate the conditions under which customers ended up re-attempting because the captcha or their interaction on the page did not work. A breakdown of failures and their counts will determine which of these is a significant source of error. One outcome is that we may discover that some forms of challenge are not suitable for the user. In these cases, it is easier to migrate the user to other workflows. 
  
4) Modifications made to account state - Some of the best indicators of fraudulent activity are the patterns of access to account state, whether reads or writes. For example, the address, zip code and payment methods of the account change less frequently than the user's password. If these do change often for a user, and from different sources, they may contribute to fraud detection.  
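For the first two queries above, a small sketch of the counting step in C#, assuming the log lines have already been parsed into records; the LogRecord shape is made up for this illustration.

using System;
using System.Collections.Generic;
using System.Linq;

// A made-up parsed log record; real identity-provider logs carry many more fields.
record LogRecord(string Referrer, string User, bool UsedOtp, bool Succeeded);

static class LoginReports
{
    // Query 1: count login-page requests per referrer domain.
    public static Dictionary<string, int> RequestsByOrigin(IEnumerable<LogRecord> logs) =>
        logs.GroupBy(r => new Uri(r.Referrer).Host)
            .ToDictionary(g => g.Key, g => g.Count());

    // Query 2: success rate of requests that used one-time passcodes.
    public static double OtpSuccessRate(IEnumerable<LogRecord> logs)
    {
        var otp = logs.Where(r => r.UsedOtp).ToList();
        return otp.Count == 0 ? 0.0 : (double)otp.Count(r => r.Succeeded) / otp.Count;
    }
}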
Logs, clickstreams and metrics are only some of the ways to gain insight into customer activities related to their identity. While there may be other ways, identity provides a unique perspective for any holistic troubleshooting.  
Identity also provides boundless possibilities for analysis as a data source. 

Thursday, May 16, 2019

Identity Impersonation continued:
Perhaps the most difficult task in mitigating identity threats is detecting identity spoofing and impersonation. Identity is determined only from credentials. If the credentials become compromised, there is nothing to prevent someone from assuming the identity of another.

There are several mitigations available to the owner for this purpose. Yet the sad truth is that even the best mitigation has been known to be overcome or circumvented. These mitigations include improving the forms of credentials in terms of what the owner knows or has. The latter has taken the form of one-time passcodes, fingerprints and other biometrics, and keys. The systems that recognize these forms of authentication become increasingly complex, often to the hacker's advantage.

However, impersonation is sometimes necessary and even desirable. Systems frequently allow workers to impersonate the user so that they can proceed with the security context of the user. In such cases, identity is already established prior to impersonation. The mechanism is usually easier within the system boundary than outside it, primarily because the system can exchange tokens representing users.

The theft of identity can cause significant trouble for the owner, as is widely known from those who suffer fraudulent credit card activity where someone else impersonates the user. In the digital world, the compromise of identity often implies a risk that goes beyond personal computing. Most resources are on the network, and a hacker can easily gain access to privileged areas of the code.

There are two ways of looking at this, and they correspond to two different kinds of organizations, such as the white hat and black hat communities that study and expose these vulnerabilities.

They have a large membership and their findings are well-covered in publications, news reports and circulations.

Software makers, including those with IAM modules, often release patches based on these findings. Some of them even have their own penetration testers who use old and new findings to test the security of the product.


Wednesday, May 15, 2019

Identity Metrics 

Metrics can accumulate over time. They are best stored in a time-series database, not only because they are machine data but also because they have a propensity to form a timeline. Since they are small in size but frequent in publication, metrics are fine-grained data that can tax the network and the servers if the flow is not planned. 

Metrics for identity cover more than when a user signs in or signs out. There is a lot of business value in collecting reports on a user's login activity across sites and applications. For instance, it helps determine a user's interests, which can help with predictions and advertising. Users themselves want to subscribe to alerts and notifications for their activities. On the other hand, there is plenty of troubleshooting support from metrics. Generally, logs have been the source for offline analysis; however, metrics also support reporting in the form of charts and graphs. 

As with many metrics, there is a sliding window for timestamp-based datapoint collections for the same metric. Although metrics support flexible naming conventions, prefixes and paths, the same metric may accumulate a lot of datapoints over time. A sliding window presents a limited range that can help with aggregation functions such as latest, maximum, minimum and count. Queries on metrics are facilitated with the help of annotations on the metric data and pre-defined metadata. These queries can use any language, but queries using search operators are preferred for their similarity to shell-based execution environments. 
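A minimal sketch of such a sliding window in C#, keeping only the datapoints inside the window and deriving latest, maximum, minimum and count from them; the types are invented for this example.

using System;
using System.Collections.Generic;
using System.Linq;

// A timestamped datapoint for a single metric; invented for this sketch.
record DataPoint(DateTime Timestamp, double Value);

class SlidingWindow
{
    private readonly TimeSpan window;
    private readonly Queue<DataPoint> points = new Queue<DataPoint>();

    public SlidingWindow(TimeSpan window) { this.window = window; }

    public void Add(DataPoint p)
    {
        points.Enqueue(p);
        // drop datapoints that have fallen out of the window relative to the newest one
        while (p.Timestamp - points.Peek().Timestamp > window)
            points.Dequeue();
    }

    public int Count => points.Count;
    public double Latest => points.Last().Value;
    public double Maximum => points.Max(q => q.Value);
    public double Minimum => points.Min(q => q.Value);
}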
Identity metrics will generally flow into a time-series database along with other metrics from the same timeline. These databases have collection agents in the form of forwarders, and each handles a limited load, so they scale with the system that generates the metrics. The metrics are usually sent over the wire as a single HTTP request per metric or as a batch of metrics. That is why other protocols, such as syslog, or the transfer of files or blobs, are preferred.  
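Where per-metric requests are too chatty, a batching forwarder is the usual remedy. The sketch below buffers datapoints and ships them as one HTTP POST per batch; the collector URL and JSON payload shape are placeholders, not any particular product's API.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Buffers metrics and sends one HTTP request per batch instead of one per metric.
class MetricBatcher
{
    private static readonly HttpClient http = new HttpClient();
    private readonly List<object> buffer = new List<object>();
    private const int BatchSize = 100;

    public async Task RecordAsync(string name, double value)
    {
        buffer.Add(new { name, value, timestamp = DateTime.UtcNow });
        if (buffer.Count >= BatchSize)
            await FlushAsync();
    }

    public async Task FlushAsync()
    {
        if (buffer.Count == 0) return;
        var payload = new StringContent(JsonSerializer.Serialize(buffer), Encoding.UTF8, "application/json");
        await http.PostAsync("https://collector.example.com/metrics", payload);   // placeholder endpoint
        buffer.Clear();
    }
}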

Metrics, unlike logs, are ephemeral. They are useful only for a certain period, after which either cumulative metrics are regenerated or they are ignored. With the retention period being small, metrics are best served from near-real-time queries. Often users find reports and notifications more convenient than queries. 


// Returns the k-th element (1-based) of an n x m matrix A traversed anti-clockwise:
// down the first column, along the bottom row, up the last column, back along the
// top row, then recursing into the inner sub-matrix for the remaining elements.
// A.SubArray(row, col, width, height) is assumed to be a helper that extracts the
// inner sub-matrix; it is not part of the standard library.
int GetKthAntiClockWise(int[,] A, int m, int n, int k)
{
    if (n < 1 || m < 1) return -1;
    if (k <= n)                         // first column, top to bottom
        return A[k-1, 0];
    if (k <= n+m-1)                     // bottom row, left to right
        return A[n-1, k-n];
    if (k <= n+m-1+n-1)                 // last column, bottom to top
        return A[n-1-(k-(n+m-1)), m-1];
    if (k <= n+m-1+n-1+m-2)             // top row, right to left
        return A[0, m-1-(k-(n+m-1+n-1))];
    return GetKthAntiClockWise(A.SubArray(1, 1, m-2, n-2), m-2, n-2, k-(2*n+2*m-4));
}


Tuesday, May 14, 2019

Identity as a delegated application.
Most Identity and Access Management solutions are tied to some membership provider where the request from the user can be authenticated and even authorized. It represents a singleton, global instance in an organization that wants to authenticate its members. Such a paradigm inevitably leads to centralized, global, database-oriented mechanisms. This is primarily because most sign-in requests from users come with credentials that the user has or knows. However, such credentials are also stored in a vault or secure store. Consequently, the authentication process is merely an automation that involves retrieving the credentials and validating them against an IAM provider.
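As a sketch of that automation, using invented interfaces that stand in for a secret store and an IAM provider (neither corresponds to a specific product):

using System.Threading.Tasks;

// Hypothetical stand-ins for a secret store and an IAM/membership provider.
interface ICredentialVault { Task<string> GetSecretAsync(string userId); }
interface IIamProvider     { Task<bool> ValidateAsync(string userId, string secret); }

// The authentication step reduces to retrieving the stored credential
// and validating it against the IAM provider on the user's behalf.
class DelegatedAuthenticator
{
    private readonly ICredentialVault vault;
    private readonly IIamProvider iam;

    public DelegatedAuthenticator(ICredentialVault vault, IIamProvider iam)
    {
        this.vault = vault;
        this.iam = iam;
    }

    public async Task<bool> SignInAsync(string userId)
    {
        var secret = await vault.GetSecretAsync(userId);
        return await iam.ValidateAsync(userId, secret);
    }
}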
Although vaults, membership providers and authentication modules are centralized for many users, they can also be dedicated to a single user. Such a use case drives applications such as mobile wallets, passports and keychains, which can be personal rather than a central repository. It is these use cases that significantly expand the notion of the authentication module as not necessarily tied to a single entity or worker, but rather a coordination between dedicated and global instances.
A personal assistant, closer to the user and dedicated to the user, can take the credentials once and extend them to all the realms the user navigates to. It can sign the user in and out seamlessly, allowing greater mobility and productivity than ever before. The applications that interact with the personal assistant can do so over a variety of protocols and workflows, enabling possibilities that were not available earlier.
Distributed authentication frameworks have to be differentiated by the user they serve. If the membership provider is distributed rather than centralized, that is invisible to the user. While this may be a significant distributed-computing perspective, it is not personalized and certainly does not break up the well-established design of traditional systems.

Monday, May 13, 2019


Columnar store overlay over object storage
Object storage has established itself as a “standard storage” in the enterprise and cloud. As it brings many storage best practices to provide durability, scalability, availability and low cost to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some NoSQL stores can be overlaid on top of object storage and discuss an example with column storage. We focus on the use case of columns because they are not relational and find many applications similar to the use cases of object storage. They also tend to become large, with significant read-write access. Object storage then transforms from a storage layer that merely participates in vectorized executions to one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don't want to maintain local storage or want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that columns are similar to fragmented data transfers and show how object storage can serve as both the source and the destination of columns.
Column storage gained popularity because cells could be grouped in columns rather than rows. Reads and writes are over columns, enabling fast data access and aggregation. Their need for storage is not very different from that of applications requiring object storage. However, as object storage makes inroads into vectorized execution, the data transfers become increasingly fragmented and continuous. At this juncture it is important to facilitate data transfer between objects and columns. 
File systems have long been the destination for storing artifacts on disk, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files. Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would become more usable than files.  
Since column storage overlays on Tier 2 storage on top of blocks, files and blobs, it is already transferring data to object storage. However, the reverse is not as frequent, although objects in a storage class can continue to be serialized to columns in a continuous manner. This is also symbiotic for the audiences of both storage systems. 
Object storage offers better features and cost management as it continues to stand out against most competitors in unstructured storage. The processors lower the cost of usage so that the total cost of ownership is also lowered, making object storage a whole lot more economical for end users.