Cluster computing: April 2019

Tuesday, April 30, 2019

We continue discussing keys and certificates. They secure data by using the public key to encrypt and the private key to decrypt. The certificate serves as a stamp of authority and can include the public key. The certificate then becomes usable to secure the ends of a channel such as https. Keys and certificates may be bundled as keystores and truststores.

A keystore is a combination of a key and a certificate. It is made available as a file with a pfx or p12 extension. Many applications prefer accepting a keystore rather than keys and certificates separately. The keystore is essentially a header and a collection of bags. One bag may contain the private key while another may contain the certificate. There can be more than one certificate in a bag.

The truststore is merely a collection of certificates to be trusted. It could include a certificate chain if the certificates are signed.

The keystore and truststore can be one and the same if the connections are internal. In this case, the client and the server share the same key and certificate. Mutual authentication, on the other hand, is one where the server and the client present different certificates. In the TLS 1.2 sequence of message exchanges for mutual authentication, the client opens with its hello message and the server responds. First, the server sends its hello message. Next it sends its certificate, followed by a request for the client’s certificate and lastly the server-side hello done message. The client responds first with its certificate. Then it sends the key material for the session in the client key exchange message. Then it sends the certificate verify message and the change cipher spec message. Lastly it sends the client-side finished message. The server closes the mutual authentication with its own change cipher spec message and the server-side finished message.
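As a rough illustration of the two roles in mutual authentication, the sketch below configures a server and a client with Python's standard ssl module. The function names and the optional file arguments are assumptions made for this example; the certificate and key files would have to exist before they could be loaded.

```python
import ssl


def make_server_context(certfile=None, keyfile=None, cafile=None):
    """Build a server-side context that demands a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED makes the server ask for the client's certificate and
    # fail the handshake when no valid certificate is presented.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if certfile:
        # Keystore role: the server's own certificate and private key.
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if cafile:
        # Truststore role: the certificates the server is willing to trust.
        ctx.load_verify_locations(cafile=cafile)
    return ctx


def make_client_context(certfile=None, keyfile=None, cafile=None):
    """Build a client-side context that verifies the server and can present a certificate."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if certfile:
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if cafile:
        ctx.load_verify_locations(cafile=cafile)
    return ctx


server_ctx = make_server_context()
client_ctx = make_client_context()
```

When the connections are internal, both sides could be handed the same files, which is the case where the keystore and the truststore are one and the same.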

Common issues encountered when generating these bundles are mentioned here: https://github.com/ravibeta/go-pkcs12/commit/db4cf640b9698ad37d9d170a9a75bf49d7425b71 line 427

Monday, April 29, 2019

Today we discuss keys and certificates. They are used to secure data by using the public key to encrypt and the private key to decrypt. The certificate is used as a stamp of authority. Certificates can include the public key. The certificate then becomes usable to secure the ends of a channel such as https.
Keys and certificates are therefore as important to keep safe as passwords. These keys and certificates can be cut as many times as necessary and for different scopes and purposes. When the number of such secrets increases, they have to be managed in a vault or a secret management system. There are external key managers available for this purpose.
Let us take a look at their formats. The choice of encryption algorithm produced different content and formats for the keys and certificates. Over time, different versions became recognized as standards even for the same encryption parameters. Finally, users began requesting them to be bundled as keystores and truststores.
A keystore is a combination of a key and a certificate. It is made available as a file with a pfx or p12 extension. Many applications prefer accepting a keystore rather than keys and certificates separately. The keystore is essentially a header and a collection of bags. One bag may contain the private key while another may contain the certificate. There can be more than one certificate in a bag.
The truststore is merely a collection of certificates to be trusted. It could include a certificate chain if the certificates are signed.
The generation of these bundles is specified in the RFC and performed with command line tools such as openssl and keytool. Not all languages have libraries to generate these bundles since it is generally not logic to be included in an application. If the keys and certificates are cut once, they don't need to be modified again.
Common issues encountered when generating these bundles are mentioned here: https://github.com/ravibeta/go-pkcs12/commit/db4cf640b9698ad37d9d170a9a75bf49d7425b71 line 427

Sunday, April 28, 2019

Summary of a book titled "The Four" written by Scott Galloway.

This book is about the DNA of giants and how awe-inspiring they are compared to others. Given their size, they certainly give the rest of the businesses a whole new meaning. These majestic companies are:

1) Apple: Cash on hand is nearly the GDP of Denmark. The author uses the metaphor of sex.

2) Amazon: Market capitalization is about 748 B greater than the sum of all others. The author uses the metaphor of consumption for Amazon.

3) Facebook: Market capitalization per employee is about 21 million. The author uses this company as a symbol of Love since it has perfected the art of interactions that make people happy and “Happiness is love”.

4) Google: It ages in reverse as it accrues data, harnessing the power of 2 billion people in terms of what they want and what they choose. The author uses the metaphor of God.

With the use of metaphors, the author makes reference to the four horsemen. These four drive prices down, not up and definitely take the profit from others.

The only competitors the four face are each other. They are also in a race with each other to become the operating systems of our lives. And they never fight on other people’s terms.

The author describes the strengths of each of the four in the initial part of the book and the story behind their successes. It is easy to relate to the stories and anecdotes, and the perspective from the author is both fresh and humorous.

Amazon, for instance, is said to increase its stock value at the price of decreasing the stock value of every other retailer. This has been made possible with cheaper capital for a longer period of time than any firm in modern times.

With its momentum, Amazon has expanded and become a leader in the cloud computing, delivery and fulfilment businesses. Amazon's CEO is very daring. He can make a crazy idea practical. The floating warehouses are a bold example.

Google’s market capitalization of 773 billion is greater than the combined sum of the next eight biggest media companies. It bolsters confidence in its followers by demarcating the paid from the organic search results. And the followers keep growing, with one in six queries never having been asked before. Facebook has been valued at 508 billion and has the single largest social media reach, at nearly 2.2 billion people. At the top of the layers depicting a marketing funnel, there is an awareness layer and Facebook has flooded it.

Famously, Facebook can guess when you are in love by observing the number of timeline posts, which increases when you are single and decreases when you are in a relationship.

Facebook is the source of news for over 67% of Americans. Google was up 60% and Facebook was up 43% when the rest of the digital advertisers were down 3%. Google and Facebook are known to be a duopoly. It may surprise us that while Twitter captures 82% of the post, Facebook captures 92% of the interactions on social media.

Apple controlled roughly 19 percent of the smartphone market capitalization but captured 87 percent of the global smartphone profits in 2016. Apple has single-handedly put luxury within everybody’s reach. Among the Four Horsemen, Apple seems to have the best genetics, thriving past the original founder and management team.

The source of wealth for the ten richest people in Europe seems to be Zara, L’Oreal, H&M, LVMH, Nutella, Aldi, Lidl, Trader Joe’s, Luxottica, and Crate & Barrel. They represent luxury and retail more than any other industry, and Apple has mastered the luxury brand.

Saturday, April 27, 2019

Sequences and translations to vectors

Vectors are very useful representations of entities in terms of a limited number of chosen dimensions. Sequences are formed from different elements but each sequence can be described by a vector. The choice of dimensions helps imbue the vectors with some latent significance. When the sequences are uniformly mapped to vectors, they become easy to cluster.

Clusters help in finding groups of sequences. They represent the salient topics within the possible groups. This makes it efficient to determine hidden content within sequences.

Vectors lend meaning not just by their dimensions but also by the weight matrix associated with them. A softmax classifier helps assign these weights.

Sequence to vector translation can use a CBOW architecture, which predicts a sequence based on the surrounding sequences, or the skip-gram architecture, which predicts the surrounding sequences based on the current sequence, as long as sequences are treated as units that occur intact together in a collection. The prediction is made with the softmax function and it is summarized in a formulation as:
p(w_O | w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w \in W} \exp({v'_{w}}^{\top} v_{w_I})}
where v_w and v'_w are the input and output vector representations of sequence w in database W, w_I is the current sequence and w_O is a surrounding sequence. The denominator is the sum of such exponentials over the entire sequence database. In practice this expression is heavily optimized, but the idea is that each sequence tries to predict every sequence around it. This results in what is called sequence embedding.
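The softmax formulation above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration; the function name, the toy vectors and the use of plain lists are assumptions for the example and not part of any particular embedding library.

```python
import math


def softmax_probability(output_vectors, input_vector, target):
    """Probability of the target sequence given the current sequence.

    output_vectors holds one output vector per sequence in the database,
    input_vector is the input vector of the current sequence, and target
    is the index of the surrounding sequence being predicted.
    """
    # Dot product of every output vector with the input vector.
    scores = [sum(o * i for o, i in zip(vec, input_vector)) for vec in output_vectors]
    # Subtracting the largest score keeps exp() from overflowing
    # and does not change the resulting ratio.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    return exps[target] / sum(exps)


# Toy database of three sequences with two-dimensional vectors.
output_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
input_vector = [0.5, -0.5]
probabilities = [softmax_probability(output_vectors, input_vector, t) for t in range(3)]
```

The denominator touches every sequence in the database, which is why practical implementations replace it with cheaper approximations.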

#codingexercise

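Node GetSuccessor(Node root)
{
    // In-order successor of a node in a binary search tree with parent pointers.
    if (root == null) return null;
    if (root.right != null)
    {
        // The successor is the leftmost node of the right subtree.
        Node current = root.right;
        while (current.left != null)
            current = current.left;
        return current;
    }
    // Otherwise climb until we arrive at a parent from its left child.
    Node parent = root.parent;
    while (parent != null && parent.right == root)
    {
        root = parent;
        parent = parent.parent;
    }
    return parent;
}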

Friday, April 26, 2019

Programmability of sequences
Data formats and data operators make programmability easier with sequences. Data formats like xml and json have a language for specifying a search path. For example, the json representation enables JMESPath (pronounced James Path) queries where elements can be extracted and a search can be specified via the search operator. These are primarily helpful in Application Programming Interfaces.
Data operators like the standard query operators, which include sorting, filtering, quantifier, projection, join, partitioning, grouping, generation, equality, element, concatenation, and aggregation operators, improve data access and enable a variety of queries and data manipulation.
APIs and SDKs complete the programmability for consumption by applications. These include implementations specific to languages. The availability of an SDK in the programmer's language of choice enables applications to be written easily.
APIs and SDKs also make it easy for the application traffic to be identified and monitored, and they support troubleshooting with the help of apiKeys, request parameters, caller contexts, http proxies and other such techniques.
#codingexercise

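Node GetPredecessor(Node root)
{
    // In-order predecessor of a node in a binary search tree.
    if (root == null) return null;
    if (root.left != null)
    {
        // The predecessor is the rightmost node of the left subtree.
        Node current = root.left;
        while (current.right != null)
            current = current.right;
        return current;
    }
    // GetParent is assumed to return the parent of the given node.
    Node parent = GetParent(root);
    while (parent != null && parent.left == root)
    {
        root = parent;
        parent = GetParent(root);
    }
    return parent;
}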

Thursday, April 25, 2019

Kubernetes Secrets
Kubernetes is a system for managing containerized applications. It facilitates deployment and scaling. As part of deployment, applications have to set up accounts, passwords and other secrets. These secrets have to be made available as files or environment variables for the deployment to go through.
The infrastructure provides a standard way of keeping these secrets. The secret is specified declaratively and then subsequently made available to the application. It can also be created dynamically by the application code.
The secret is kept in the form of base64 encoded data. The secret can be written out to a file on a mounted volume for use by the application. In such a case, it is preferable to mark the volume as read only. The secret is accessible to any pod in the same namespace that references it.
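To make the base64 encoding concrete, here is a minimal sketch that builds the declarative form of a secret as a Python dictionary. The helper name, the secret name and the credential values are made up for illustration; only the manifest layout (apiVersion, kind, metadata, type, data) follows the Kubernetes Secret format.

```python
import base64
import json


def build_secret_manifest(name, values):
    """Return a Secret manifest whose data values are base64 encoded."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


# Hypothetical credentials, used only to show the encoding.
manifest = build_secret_manifest("db-credentials", {"username": "admin", "password": "s3cret"})
print(json.dumps(manifest, indent=2))
```

Base64 is an encoding and not encryption, so anyone who can read the manifest can recover the values. That is one more reason to restrict who can read secrets and to mount them read only.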
Since the data is stored base64 encoded, almost any file within the 1 MiB size limit of an individual secret can be passed as a secret. However, large files are generally not preferable as secrets. Instead, they could be password protected and the passwords could become secrets.
There is no limit to the number of secrets that can be passed to the pods. However, it is better to enumerate them so that they remain in one place for consistent treatment.
Common forms of secrets involve account names, passwords, groups, identifiers, keys, certificates, keystores and truststores. These secrets are typically at most a few thousand bytes in size. However, the type of secret determines its handling. Passwords can be managed in an external vault. Certificates, for example, can be managed by a certificate manager. Certificates can come from different issuers. The ACME issuer supports obtaining certificates from an ACME server. The CA issuer supports issuing certificates using a signing key pair. The Vault issuer supports issuing certificates using a common vault. Self-signed certificates are issued privately. The Venafi issuer supports issuing certificates from a cloud or a platform instance.
Although Kubernetes manages the secrets, a consolidator can help with specific secret types. The libraries for this, such as cert-manager, are quite popular and well documented. The use of libraries also reduces the code the application needs to manage these specific types of secrets. The external dependencies for generating secrets are similar to any other dependency of the application code, so these can be registered and maintained in one registry.
All external dependencies involve some maintenance as they go through their own revisions. However, an application deployment that is tested with a fixed set of dependency versions will not need to revise its dependencies.

Get the preorder successor of a node. A successor is the node that comes right after the given node in the preorder traversal.
Node GetPreorderSuccessor(Node root, Node x)
{
    if (root == null || x == null) return null;
    // Flatten the tree in preorder, then return the node that follows x.
    var list = new List<Node>();
    ToPreOrder(list, root);
    int index = list.IndexOf(x);
    if (index >= 0 && index < list.Count - 1)
        return list[index + 1];
    return null; // x is not in the tree or is the last node visited
}
static void ToPreOrder(List<Node> nodes, Node root)
{
    if (root == null) return;
    nodes.Add(root);
    ToPreOrder(nodes, root.left);
    ToPreOrder(nodes, root.right);
}

Wednesday, April 24, 2019

The case for data mining over sequences.
Large sets of sequences can be partitioned. When they are stored in horizontally partitioned tables, they can participate in data mining just the same as any other data.
While machine learning uses concepts such as supervised and unsupervised classifiers, it can be understood as a set of algorithms. Data mining, on the other hand, uses those and other algorithms in conjunction with a database so that the data can be queried to yield a result set that summarizes the findings. These result sets can then be drawn on charts and represented on dashboards.
Yet data mining and machine learning are separate domains in themselves. Machine learning may find use with text analysis, images and other static data that is not represented in tables. Data mining, on the other hand, translates most data into something that can be stored in a database, and this has worked well for organizations that want to safeguard their data. Moreover, we can view the difference as that between a top-down and a bottom-up view. For example, when we use statistics for building a regression model, we are binding different parameters together to mean something and tuning the model with experimental data. A machine learning algorithm such as a decision tree classifier, on the other hand, is built from the data as it is made available. The output from a machine learning algorithm may be input for a data mining process. Some of the machine learning algorithms are forms of batch processing while data mining techniques may be applied in a streaming manner.
Both data mining and machine learning have been domain specific, such as in the finance, retail or telecommunications industries. These tools integrate the domain specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface. Among these, visual data mining is popular for its designer style user interface that renders data, results and process in a graphical and usually interactive presentation.
Visualization tools such as the Grafana stack, used for viewing elaborate charts and eye candy, only require read permissions on the data since they execute queries on the results to fetch the data for making the charts.
Some of the algorithms included with models and predictions used in data mining fall in the following categories:
Classification algorithms - for finding similar groups based on discrete variables
Regression algorithms - for finding statistical correlations on continuous variables from attributes
Segmentation algorithms - for dividing into groups with similar properties
Association algorithms - for finding correlations between different attributes in a data set
Sequence Analysis Algorithms - for finding groups via paths in sequences

Tuesday, April 23, 2019

The fixed length and varying length sequences

When the number of elements in a sequence does not change, it is a fixed length sequence. The elements of the sequence are similar to N-grams in natural language processing. They come with the advantage that the storage can look up elements of the sequence based on position. This is helpful when combined with storing a sequence in the sorted order of its elements.

While sequences are stored in a columnar manner with the entire sequence in a single column, the fixed number of elements makes it easy to shred the sequence into a table. This helps in improving the operations performed on the table. Previously, each sequence representing a string could only participate in query operations when it was indexed. The indexes on the sequences helped in fast lookup on the same table. If the table is joined to itself, it helps in matching sequences to each other. With elements shredded into a table, the sequences become far more efficient for querying.
Sequences may also have group identifiers associated with them. With this organization, a group can help in finding a sequence and a sequence can help in finding an element. With the help of relations, we can perform standard query operations in a builder pattern.

This is still not as amenable to querying as standard tables with known attributes for sequences. Only when the sequences are transformed into a sparse table, where all the possible elements appear as columns, does it become easier to perform queries in the traditional, well understood manner. Their representation is now similar to word vectors with a limited number of dimensions.
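The sparse table described above can be illustrated with a small sketch. It assumes sequences are given as lists of elements and that each cell holds the number of times an element occurs in a sequence; the function name and the sample data are made up for the example.

```python
def to_sparse_table(sequences):
    """Turn sequences into a table with one column per distinct element.

    Returns the sorted column names and one row of counts per sequence.
    Most cells are zero when the set of possible elements is large.
    """
    columns = sorted({element for seq in sequences for element in seq})
    rows = [[seq.count(column) for column in columns] for seq in sequences]
    return columns, rows


columns, rows = to_sparse_table([["a", "b", "a"], ["b", "c"]])
for row in rows:
    print(dict(zip(columns, row)))
```

Each row now has the same known attributes, so filters and aggregations can be written against column names instead of positions within a sequence.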

Another representation is the variable length record form, where each sequence is a list of elements and the elements repeat across sequences. This representation helps with merging and splitting sequences.

Monday, April 22, 2019

The techniques for reducing noise from sequences:
Any algorithm can create clusters from data. However, clusters are only good when each cluster is cohesive, meaningful and separate from other clusters. There is a measure for the goodness of fit of clusters, the sum of squares of errors, and good clustering reduces it. This gives a quantitative assessment of clusters.
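As a small worked example of this measure, the sketch below computes the sum of squares of errors for two ways of grouping the same four points. The function names and the sample points are assumptions chosen for illustration.

```python
def centroid(points):
    """Mean position of a non-empty list of equally sized points."""
    count = len(points)
    return [sum(point[d] for point in points) / count for d in range(len(points[0]))]


def sum_of_squared_errors(clusters):
    """Sum over all clusters of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        for point in points:
            total += sum((x - m) ** 2 for x, m in zip(point, center))
    return total


# The same four points grouped in two different ways.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(sum_of_squared_errors(tight), sum_of_squared_errors(loose))
```

The grouping that keeps nearby points together has the lower value, which is what makes the measure usable for comparing clusterings.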
Sequences behave very much the same way. Any set of sequences can be formed from a combination of elements. This explodes the number of sequences possible, and without a quantitative measure of their usefulness, the sequences cannot be filtered. When each sequence is given a value of this measure, the sequences can be checked against a threshold that separates the noise from the meaningful sequences.
Sequences can also be clustered just like any other entities. The clustering of sequences helps in determining those that represent a cohesive property, while outliers represent insignificance that can be ignored. With good clustering, where the latent semantics of the sequences are included, the size and density of clusters represent the most significant collections. If the clustering technique were to simultaneously perform the representation of sequences as vectors and the clustering of these vectors, it may even result in a noise cluster that draws all the outliers into itself. This enables cleaner formation of clusters with most of the outliers in the noise cluster. The noise cluster can then be ignored.
Therefore, a usefulness metric for each sequence, a good choice of distance metric, a proper vectorization of sequences that brings out their latent meaning, and a good clustering algorithm can together efficiently remove noise from the overall large set of sequences that can be generated.
Another useful metric for this purpose is the F-score, which combines precision and recall. Precision is the ratio of successful classifications to all the classifications made, which rewards selective labelling. Recall is the ratio of successful classifications to the actual number of sequences that should have been selected. The F-score ranks the classifier by twice the product of precision and recall divided by their sum. This further improves the use of clusters to select the sequences.
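The precision, recall and F-score definitions above translate directly into code. This is a minimal sketch; the function name and the sequence identifiers are hypothetical.

```python
def precision_recall_f1(selected, relevant):
    """Score a selection of sequences against the set that is truly meaningful."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)  # successful classifications
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four sequences were selected; two of the three meaningful ones are among them.
precision, recall, f1 = precision_recall_f1({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s5"})
```

Here precision is 2/4 and recall is 2/3, so the F-score is 4/7, which sits between the two and is pulled toward the lower one.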

Sunday, April 21, 2019

The serialization and deserialization of sequences

Sequences can be stored as literals, requiring no serialization and deserialization. However, associated data structures such as the B-Tree and the Radix Tree can be serialized and deserialized. We look at some of the usages here.

Ranges of sequences are efficiently represented in B-Trees. This data is generally not exported and is kept local to each node. It is easy to send the data across the wire as a full representation of each sequence with additional metadata. The size of the sequence on the wire does not matter when the clients can tolerate as much latency as is necessary to transmit the sequence.

When large sets of sequences need to be transferred, they can be compressed in archives and sent across the wire. This too does not have to treat each sequence separately. However, when the archives start exceeding a threshold in size, they no longer scale efficiently with the number of sequences. In such cases, efficient packing and unpacking become necessary.

Serializing and deserializing of data does have to be per sequence but packing and unpacking of data does not. There are techniques beyond compression algorithms for shortening sequences with the help of representations that can significantly reduce the space used. If we group the sequences, then more efficient storage techniques become possible, which work by removing redundant elements and encoding them in a way that they can be reconstructed later.
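One way to remove redundant elements across a group of sequences is front coding, where each sequence in a sorted group stores only the part that differs from the previous one. The post does not name a specific technique, so this is a swapped-in illustration, and the function names and sample strings are assumptions.

```python
def pack(sequences):
    """Front-code a group of sequences.

    Each entry is (number of leading elements shared with the previous
    sequence in sorted order, remaining suffix).
    """
    packed, previous = [], ""
    for seq in sorted(sequences):
        shared = 0
        limit = min(len(previous), len(seq))
        while shared < limit and previous[shared] == seq[shared]:
            shared += 1
        packed.append((shared, seq[shared:]))
        previous = seq
    return packed


def unpack(packed):
    """Rebuild the sorted group of sequences from its front-coded form."""
    sequences, previous = [], ""
    for shared, suffix in packed:
        seq = previous[:shared] + suffix
        sequences.append(seq)
        previous = seq
    return sequences


group = ["ACGTAA", "ACGTCC", "ACGG"]
packed = pack(group)
restored = unpack(packed)
```

Packing works on the group as a whole, so an individual sequence can only be recovered by walking the entries before it. That is the trade-off against per-sequence serialization.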

Saturday, April 20, 2019

#codingexercise
AVL tree delete.
Node Remove(Node root, int k)
{
    if (root == null) return null;
    if (k < root.data)
        root.left = Remove(root.left, k);
    else if (k > root.data)
        root.right = Remove(root.right, k);
    else
    {
        // Found the node: replace it with the minimum of its right subtree.
        Node left = root.left;
        Node right = root.right;
        // root is unlinked here; in C++ this is where it would be deleted.
        if (right == null) return left;
        Node min = FindMin(right);
        min.right = RemoveMin(right);
        min.left = left;
        return Balance(min);
    }
    return Balance(root);
}
Node RemoveMin(Node root)
{
    // Unlink the leftmost node and rebalance on the way back up.
    if (root.left == null)
        return root.right;
    root.left = RemoveMin(root.left);
    return Balance(root);
}
Node FindMin(Node root)
{
    return (root.left != null) ? FindMin(root.left) : root;
}

Courtesy: https://kukuruku.co/hub/cpp/avl-trees,
Geeksforgeeks and my implementations using them

Friday, April 19, 2019

The predicates on sequences

Queries on sequences involve selection, projection and aggregation. In the case of selection, the number of sequences to select from may be very large. A simple iteration-based selection can itself become very costly. Unless the range is restricted, selection can become sequential and slow. At the same time, partitioning the sequences can help parallelize the task. The degree of parallelism can be left to the system to decide based on the range and the number of ranges, assuming a fixed cost per range. Selection becomes cheaper if we can discount sequences that do not have a particular element. Bloom filters help with this purpose.
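A Bloom filter answers "definitely not present" or "possibly present" for an element, which is what allows a partition of sequences to be skipped during selection. The sketch below is a minimal version built on hashlib; the class name, the bit array size and the number of hash functions are arbitrary choices for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive hash_count bit positions by salting one hash function.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def might_contain(self, item):
        # False means the item was never added; True means it may have been.
        return all((self.bits >> position) & 1 for position in self._positions(item))


# One filter per partition, filled with the elements seen in its sequences.
partition_filter = BloomFilter()
for element in ["A", "C", "G"]:
    partition_filter.add(element)
```

A selection for sequences containing a given element can then skip every partition whose filter returns False for it and scan only the rest.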

Queries for projection involve elements of the sequences or the attributes of the sequences. In these cases, it is helpful to use the elements as columns. Since the elements do not have to be limited to a small set, any collection that holds the elements encountered is sufficient. If the elements can be accessed by a well-known position in each sequence, it is helpful, but this is rarely the case. Therefore, a range of sequences is transformed into a matrix of elements, which then makes it easier to operate on.

A reverse lookup based on the inverted list of elements helps when there is a limited number of elements to process. All standard query operations may be performed on the lists against the elements. This is useful for aggregations such as counting.
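The inverted list can be sketched as a dictionary from each element to the set of sequences that contain it. The function name and the sample sequences are assumptions for the example, and sequences are identified simply by their position in the input.

```python
from collections import defaultdict


def build_inverted_index(sequences):
    """Map each element to the ids of the sequences in which it appears."""
    index = defaultdict(set)
    for sequence_id, sequence in enumerate(sequences):
        for element in sequence:
            index[element].add(sequence_id)
    return index


index = build_inverted_index(["ACG", "CGT", "AAT"])
# Counting: the number of sequences that contain each element.
counts = {element: len(ids) for element, ids in index.items()}
# Selection: sequences that contain both "A" and "T".
both = index["A"] & index["T"]
```

Counting and intersection run against the lists alone, so the sequences themselves never have to be scanned again.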

Aggregations can be performed using the map-reduce method. Since it works on partitions in parallel, it is faster than serial processing. Aggregations have the advantage that the results are smaller than the data, so they consume less space.

Prefix trees help with sequence comparisons based on prefixes. Unlike joins, where the values have to match, prefix trees help with unrelated comparisons. Prefix trees also determine the level of the match between sequences, and this is helpful to determine how close two sequences are. The distance between two sequences is the distance between their leaves in the prefix tree. This notion of a similarity measure is also helpful in giving a quantitative metric that can be used for clustering. Common techniques for clustering involve assigning sequences to the nearest cluster and forming cohesive clusters by reducing the sum of squares of errors.
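The leaf-to-leaf distance can be computed without materializing the prefix tree. The sketch assumes one tree node per element, so that a sequence's leaf sits at a depth equal to its length, and it counts the edges on the path between two leaves; both function names are hypothetical.

```python
def common_prefix_length(a, b):
    """Number of leading elements that two sequences share."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


def prefix_tree_distance(a, b):
    """Edges on the path between the leaves of a and b in a prefix tree.

    The path climbs from one leaf to the node where the sequences
    diverge and then descends to the other leaf.
    """
    shared = common_prefix_length(a, b)
    return (len(a) - shared) + (len(b) - shared)
```

For example, "ACGT" and "ACGA" diverge at the last element and are two edges apart, while sequences with no common prefix are separated by the sum of their lengths.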

#codingexercise

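Node Remove(Node root, int k)
{
    // AVL tree delete; FindMin, RemoveMin and Balance are the helpers from the April 20 exercise.
    if (root == null) return null;
    if (k < root.data)
        root.left = Remove(root.left, k);
    else if (k > root.data)
        root.right = Remove(root.right, k);
    else
    {
        // Found the node: replace it with the minimum of its right subtree.
        Node left = root.left;
        Node right = root.right;
        // root is unlinked here; in C++ this is where it would be deleted.
        if (right == null) return left;
        Node min = FindMin(right);
        min.right = RemoveMin(right);
        min.left = left;
        return Balance(min);
    }
    return Balance(root);
}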