Cluster computing

Thursday, April 25, 2019

Kubernetes Secrets
Kubernetes is a system for managing containerized applications. It facilitates deployment and scaling. As part of deployment, applications have to set up accounts, passwords and other secrets. These secrets must be made available as files or environment variables for the deployment to go through.
The infrastructure provides a standard way of keeping these secrets. The secret is specified declaratively and then subsequently made available to the application. It can also be specified dynamically by the application code.
The secret is kept in the form of base64 encoded data. The secret can be written out to a file on a mounted volume for use by the application. In such a case, it is preferable to mark the volume as read only. The secret can be made accessible to any pod in the same namespace that references it.
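As a minimal sketch of the base64 step, the following Python builds a Secret manifest as a plain dictionary. The helper name `make_secret_manifest` and the sample secret name and values are hypothetical; only the `apiVersion`, `kind`, `metadata`, `type` and `data` fields follow the Kubernetes Secret schema, and nothing is sent to a cluster.

```python
import base64
import json


def make_secret_manifest(name, values):
    """Build a Secret manifest whose data values are base64 encoded.

    `name` and `values` are illustrative inputs chosen for this sketch.
    """
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": encoded,
    }


manifest = make_secret_manifest(
    "db-credentials", {"username": "admin", "password": "s3cret"}
)
print(json.dumps(manifest, indent=2))
```

Base64 is an encoding and not encryption, so anyone who can read the manifest can decode the values.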
Almost any file can be base64 encoded and passed as a secret, although Kubernetes caps an individual secret at 1 MiB. Large files are generally not suitable as secrets; instead, they could be password protected and the passwords could become secrets.
There is no limit to the number of secrets that can be passed to the pods. However, it is better to enumerate them so that they remain in one place for consistent treatment.
Common forms of secrets involve account names, passwords, groups, identifiers, keys, certificates, keystores and truststores. These secrets are usually at most a few thousand bytes in size. However, the type of secret determines its handling. Passwords can be managed in an external vault. Certificates, for example, can be managed by a certificate manager. Certificates can come from different issuers. An ACME issuer obtains certificates from an ACME server. A CA issuer issues certificates using a signing key pair. A Vault issuer issues certificates from a vault. A self-signed issuer issues certificates privately. A Venafi issuer issues certificates from a Venafi cloud or platform instance.
Although Kubernetes manages the secrets, a consolidator can help with specific secret types. Libraries for this, such as cert-manager, are quite popular and well documented. Using them also reduces the application code needed to manage these specific types of secrets. The external dependencies for generating secrets are similar to any other dependency of the application code, so they can be registered and maintained in one registry.
All external dependencies involve some maintenance as they go through their own revisions. However, an application deployment that is tested with a fixed set of dependency versions will not need to revise its dependencies.

Get the preorder successor of a node. (A successor is the node that comes right after the given node in the preorder traversal.)
Node GetPreorderSuccessor(Node root, Node x)
{
    if (root == null || x == null) return null;
    // Flatten the tree into preorder and pick the node that follows x.
    // The root is a valid input: its successor is its first child.
    var list = new List<Node>();
    ToPreOrder(list, root);
    int index = list.IndexOf(x);
    if (index >= 0 && index < list.Count - 1)
        return list[index + 1];
    return null; // x is not in the tree or is the last node in preorder
}

static void ToPreOrder(List<Node> nodes, Node root)
{
    if (root == null) return;
    nodes.Add(root);
    ToPreOrder(nodes, root.left);
    ToPreOrder(nodes, root.right);
}
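
The same lookup can be done without materializing the whole traversal. The following Python sketch is an alternative to the list-based approach, not a port of it: it walks the tree iteratively with an explicit stack and returns the node visited right after the target. The `Node` class and the function name `preorder_successor` are assumptions made for this illustration.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def preorder_successor(root, x):
    """Return the node visited right after `x` in preorder, or None."""
    if root is None or x is None:
        return None
    stack = [root]
    previous = None
    while stack:
        node = stack.pop()
        if previous is x:
            return node
        previous = node
        # Push the right child first so that the left child is visited first.
        if node.right is not None:
            stack.append(node.right)
        if node.left is not None:
            stack.append(node.left)
    return None  # x is missing or is the last node in preorder


# Tree with preorder 1, 2, 4, 5, 3.
n4, n5, n3 = Node(4), Node(5), Node(3)
n2 = Node(2, n4, n5)
n1 = Node(1, n2, n3)
print(preorder_successor(n1, n5).data)
```

The traversal stops as soon as the successor is found, so nodes after it are never visited.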

Wednesday, April 24, 2019

The case for data mining over sequences.
Large sets of sequences can be partitioned. When they are stored in horizontally partitioned tables, they can participate in data mining just the same as any other data.
While machine learning uses concepts such as supervised and unsupervised classifiers, it can be understood as a set of algorithms. Data mining, on the other hand, uses those and other algorithms in conjunction with a database so that the data can be queried to yield a result set that summarizes the findings. These result sets can then be drawn on charts and represented on dashboards.
Yet data mining and machine learning are separate domains in themselves. Machine learning may find use with text analysis, images and other static data that is not represented in tables. Data mining, on the other hand, translates most data into something that can be stored in a database, and this has worked well for organizations that want to safeguard their data. Moreover, we can view the difference as a top-down versus a bottom-up view. For example, when we use statistics for building a regression model, we are binding different parameters together into a meaningful relationship and tuning it with experimental data. A machine learning algorithm such as a decision tree classifier, on the other hand, builds its model from the data as it is made available. The output from a machine learning algorithm may be input for a data mining process. Some of the machine learning algorithms are forms of batch processing, while data mining techniques may be applied in a streaming manner.
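To make the regression remark concrete, here is a minimal sketch of fitting a straight line to experimental data with ordinary least squares. The function name `fit_line` and the sample points are made up for illustration; a real model would usually have more parameters and noisier data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```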
Both data mining and machine learning tools have been domain specific, such as in the finance, retail or telecommunications industries. These tools integrate domain-specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface. Among these, visual data mining is popular for its designer-style user interface that renders data, results and process in a graphical and usually interactive presentation.
Visualization tools such as the Grafana stack, used for viewing elaborate charts and eye candy, only require read permissions on the data, as they execute queries on the results to fetch the data for making the charts.
Some of the algorithms included with models and predictions used in data mining fall in the following categories:
Classification algorithms - for finding similar groups based on discrete variables
Regression algorithms - for finding statistical correlations on continuous variables from attributes
Segmentation algorithms - for dividing into groups with similar properties
Association algorithms - for finding correlations between different attributes in a data set
Sequence Analysis Algorithms - for finding groups via paths in sequences

Tuesday, April 23, 2019

The fixed length and varying length sequences

When the number of elements in a sequence does not change, it is a fixed length sequence. The elements of the sequence are similar to N-grams in natural language processing. They come with the advantage that the storage can look up elements of the sequence based on position. This is helpful when combined with storing a sequence in the sorted order of its elements.
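
The N-gram comparison can be sketched in a few lines of Python. The function name `ngrams` and the sample tokens are illustrative only. Because every resulting sequence has the same length, any position can be read without checking bounds.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a fixed-length tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


trigrams = ngrams(["the", "quick", "brown", "fox", "jumps"], 3)

# Every sequence has exactly 3 elements, so position 1 always exists.
middle_elements = [seq[1] for seq in trigrams]
print(trigrams)
print(middle_elements)
```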

While sequences are stored in a columnar manner with the entire sequence in a single column, the fixed number of elements makes it easy to shred the sequence into a table. This helps in improving the operations performed on the table. Previously, each sequence representing a string could only participate in query operations when it was indexed. The indexes on the sequences helped in fast lookup on the same table. If the table is joined to itself, it helps in matching sequences to each other. With elements shredded into a table, the sequences become far more efficient to query.
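
A minimal sketch of shredding and self-joining, using Python's built-in `sqlite3` module with an in-memory database. The table name `seq_elements`, its columns and the sample sequences are assumptions made for this example.

```python
import sqlite3

sequences = {1: ["a", "b", "c"], 2: ["a", "x", "c"], 3: ["d", "e", "f"]}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seq_elements (seq_id INTEGER, position INTEGER, element TEXT)"
)
# Shred: one row per (sequence, position, element).
conn.executemany(
    "INSERT INTO seq_elements VALUES (?, ?, ?)",
    [(sid, pos, el) for sid, els in sequences.items() for pos, el in enumerate(els)],
)

# Self-join: count the positions at which two different sequences agree.
matches = conn.execute(
    """
    SELECT a.seq_id, b.seq_id, COUNT(*) AS shared
    FROM seq_elements a
    JOIN seq_elements b
      ON a.position = b.position
     AND a.element = b.element
     AND a.seq_id < b.seq_id
    GROUP BY a.seq_id, b.seq_id
    ORDER BY a.seq_id, b.seq_id
    """
).fetchall()
print(matches)
```

Sequences 1 and 2 agree at two positions, while sequence 3 matches nothing and so does not appear in the result.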
Sequences may also have group identifiers associated with them. With this organization, a group can help in finding a sequence and a sequence can help in finding an element. With the help of relations, we can perform standard query operations in a builder pattern.

This is still not as amenable to querying as standard tables with known attributes for sequences. Only when the sequences are transformed into a sparse table, where all the possible elements appear as columns, does it become easier to perform queries in the traditional, well-understood manner. Their representation is now similar to word vectors with a limited number of dimensions.
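
The sparse table can be sketched as follows, assuming a small made-up set of sequences. Each distinct element becomes a column and each sequence becomes a row of 0/1 flags, so a predicate on an element turns into a predicate on a column.

```python
sequences = [["a", "b", "c"], ["a", "c"], ["d"]]

# One column per distinct element, in a stable sorted order.
columns = sorted({el for seq in sequences for el in seq})

# One row per sequence; a cell is 1 when the element occurs in the sequence.
table = [[1 if col in seq else 0 for col in columns] for seq in sequences]

# A traditional column predicate: which sequences contain "c"?
c_index = columns.index("c")
rows_with_c = [i for i, row in enumerate(table) if row[c_index] == 1]
print(columns)
print(table)
print(rows_with_c)
```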

Another representation is the variable length record form, where each sequence is a list of elements and the elements repeat across sequences. This representation helps with merging and splitting sequences.

Monday, April 22, 2019

The techniques for reducing noise from sequences:
Any algorithm can create clusters from data. However, clusters are only good when each cluster is cohesive, meaningful and separate from other clusters. There is a measure for the goodness of fit for clusters: the sum of squared errors, which a good clustering reduces. This gives a quantitative assessment of clusters.
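As a rough sketch of that measure, the following Python computes the sum of squared errors for a clustering, assuming Euclidean distance, non-empty clusters and points given as tuples of numbers. The function name and the two sample clusterings are illustrative.

```python
def sum_squared_errors(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))
    return total


# The same four points grouped in two different ways.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
loose = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
print(sum_squared_errors(tight), sum_squared_errors(loose))
```

The grouping that keeps nearby points together has the much smaller error, which is what makes the measure usable for comparing clusterings.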
Sequences behave very much the same way. Any set of sequences can be formed from a combination of elements. This explodes the number of sequences possible and without a quantitative measure of their usefulness, the sequences cannot be filtered. The presence of this measure enables sequences to be checked against a threshold that can separate the noise from the meaningful sequences as long as each sequence is given a value of this measure.
Sequences can also be clustered just like any other entities. The clustering of sequences helps in determining those that represent a cohesive property, while outliers represent insignificance that can be ignored. With good clustering where the latent semantics of the sequences are included, the size and density of clusters represent the most significant collections. If the clustering technique were to simultaneously perform the conversion of sequences to vectors and the clustering of these vectors, it may even result in a noise cluster that draws the outliers into itself. This enables cleaner formation of clusters with most of the outliers in the noise cluster. The noise cluster can then be ignored.
Therefore, a usefulness metric for each sequence, a good choice of distance metric, a proper vectorization of sequences that brings out their latent meaning and a good clustering algorithm can together efficiently remove noise from the large overall set of sequences that can be generated.
Another useful metric for this purpose is the F-score, which combines precision and recall. Precision is the ratio of successful classifications to all the classifications made, which reflects how selective the labelling is. Recall is the ratio of successful classifications to the actual number of sequences that should have been labelled. The F-score ranks the classifier by taking twice the product of precision and recall as a fraction of their sum. This further improves the use of clusters to select the sequences.
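The F-score described above can be sketched from three counts. The function name and the sample counts are hypothetical; the formula is the usual balanced F-score, twice the product of precision and recall divided by their sum.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-score) for one class."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# 8 sequences labelled correctly, 2 labelled wrongly, 8 missed.
precision, recall, f1 = f_score(8, 2, 8)
print(precision, recall, f1)
```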

Sunday, April 21, 2019

The serialization and deserialization of sequences

Sequences can be stored as literals, requiring no serialization and deserialization. However, associated data structures such as a B-Tree or a Radix Tree can be serialized and deserialized. We look at some of the usages here.

Ranges of sequences are efficiently represented in B-Trees. This data is generally not exported and is kept local to each node. It is easy to send the data across the wire as a full representation of each sequence with additional metadata. The size of the sequence on the wire does not matter when the clients can tolerate as much latency as is necessary to transmit the sequence.

When large sets of sequences need to be transferred, they can be compressed in archives and sent across the wire. This too does not have to treat each sequence separately. However, when the archives start exceeding a threshold in size, they no longer scale efficiently with the number of sequences. In such cases, efficient packing and unpacking become necessary.

Serializing and deserializing of data does have to be per sequence, but packing and unpacking of data does not. There are techniques beyond compression algorithms for shortening sequences with the help of representations that can significantly reduce the space used. If we group the sequences, it becomes easy to apply more efficient storage techniques, which work by removing redundant elements and encoding them in a way that lets them be reconstructed later.
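
One such technique, chosen here purely as an illustration, is front coding: within a sorted group, each sequence is stored as the length of the prefix it shares with the previous sequence plus the remaining suffix. The function names `pack` and `unpack` and the sample strings are assumptions for this sketch.

```python
def pack(sorted_sequences):
    """Front-code strings as (shared_prefix_length, suffix) pairs."""
    packed, previous = [], ""
    for seq in sorted_sequences:
        shared = 0
        limit = min(len(seq), len(previous))
        while shared < limit and seq[shared] == previous[shared]:
            shared += 1
        packed.append((shared, seq[shared:]))
        previous = seq
    return packed


def unpack(packed):
    """Rebuild the original strings from the front-coded pairs."""
    sequences, previous = [], ""
    for shared, suffix in packed:
        previous = previous[:shared] + suffix
        sequences.append(previous)
    return sequences


group = ["ACGT", "ACGTA", "ACTG", "TTGA"]
packed = pack(group)
print(packed)
print(unpack(packed))
```

Sorting the group first is what makes neighbouring sequences share long prefixes; the round trip itself works for any order.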

Saturday, April 20, 2019

#codingexercise
AVL tree delete.
// Balance (not shown) restores the AVL height invariant using rotations.
Node Remove(Node root, int k)
{
    if (root == null) return null;
    if (k < root.data)
        root.left = Remove(root.left, k);
    else if (k > root.data)
        root.right = Remove(root.right, k);
    else
    {
        // Found the node to delete; the garbage collector reclaims it.
        Node left = root.left;
        Node right = root.right;
        if (right == null) return left;
        // Replace the deleted node with the minimum of its right subtree.
        Node min = FindMin(right);
        min.right = RemoveMin(right);
        min.left = left;
        return Balance(min);
    }
    return Balance(root);
}

// Detaches the leftmost node of the subtree and rebalances on the way up.
Node RemoveMin(Node root)
{
    if (root.left == null)
        return root.right;
    root.left = RemoveMin(root.left);
    return Balance(root);
}

Node FindMin(Node root)
{
    return root.left != null ? FindMin(root.left) : root;
}
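
The delete routine relies on a Balance step that is not shown. Below is a self-contained Python sketch of the whole operation, including one common way to write the balancing with stored heights and single or double rotations. It is an illustration under those assumptions rather than the exact code from the referenced sources, and the `insert` helper is included only so that the example can build a tree.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None
        self.height = 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(p):
    q = p.left
    p.left, q.right = q.right, p
    update(p)
    update(q)
    return q


def rotate_left(q):
    p = q.right
    q.right, p.left = p.left, q
    update(q)
    update(p)
    return p


def balance(node):
    """Recompute the height and rotate if the subtree heights differ by 2."""
    update(node)
    factor = height(node.right) - height(node.left)
    if factor == 2:
        if height(node.right.left) > height(node.right.right):
            node.right = rotate_right(node.right)
        return rotate_left(node)
    if factor == -2:
        if height(node.left.right) > height(node.left.left):
            node.left = rotate_left(node.left)
        return rotate_right(node)
    return node


def insert(root, k):
    if root is None:
        return Node(k)
    if k < root.data:
        root.left = insert(root.left, k)
    else:
        root.right = insert(root.right, k)
    return balance(root)


def find_min(root):
    return find_min(root.left) if root.left else root


def remove_min(root):
    if root.left is None:
        return root.right
    root.left = remove_min(root.left)
    return balance(root)


def remove(root, k):
    if root is None:
        return None
    if k < root.data:
        root.left = remove(root.left, k)
    elif k > root.data:
        root.right = remove(root.right, k)
    else:
        left, right = root.left, root.right
        if right is None:
            return left
        successor = find_min(right)
        successor.right = remove_min(right)
        successor.left = left
        return balance(successor)
    return balance(root)


root = None
for key in [5, 3, 8, 1, 4, 7, 9, 2, 6]:
    root = insert(root, key)
for key in [5, 1, 9]:
    root = remove(root, key)
```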

Courtesy: https://kukuruku.co/hub/cpp/avl-trees,
Geeksforgeeks and my implementations using them

Friday, April 19, 2019

The predicates on sequences

Queries on sequences involve selection, projection and aggregation. In the case of selection, the number of sequences to select from may be very large. A simple iteration-based selection can itself become very costly. Unless the range is restricted, selection can become sequential and slow. At the same time, partitioning the sequences can help parallelize the task. The degree of parallelism can be left to the system to decide based on the range and the number of ranges, assuming a fixed cost per range. Selection is faster if we can discount sequences that do not have a particular element. Bloom filters help with this purpose.
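
A minimal sketch of that idea: one small Bloom filter per partition, so that partitions which certainly do not contain an element can be skipped. The class, its default sizes, the hashing scheme (salted SHA-256) and the partition names are all choices made for this illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


partitions = {"p0": [["a", "b"], ["b", "c"]], "p1": [["x", "y"], ["y", "z"]]}
filters = {}
for name, seqs in partitions.items():
    bf = BloomFilter()
    for seq in seqs:
        for element in seq:
            bf.add(element)
    filters[name] = bf

# Only partitions whose filter reports a possible hit need to be scanned.
candidates = [name for name, bf in filters.items() if bf.might_contain("c")]
print(candidates)
```

A Bloom filter never reports a false negative, so no matching partition is skipped; an occasional false positive only costs an unnecessary scan.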

Queries for projection involve elements of the sequences or the attributes of the sequences. In these cases, it is helpful to use the elements as columns. Since the elements do not have to be limited to a small set, any collection that can hold the elements encountered is sufficient. If the elements can be accessed by a well-known position in each sequence, it is helpful, but this is rarely the case. Therefore, a range of sequences is transformed into a matrix of elements, which is then easier to operate on.

A reverse lookup based on the inverted list of elements helps when there is a limited number of elements to process. All standard query operations may be performed on the lists against the elements. This is useful for aggregations such as counting.
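
The inverted list can be sketched with a dictionary from element to the set of sequence identifiers that contain it. The sample data and variable names are illustrative.

```python
from collections import defaultdict

sequences = {1: ["a", "b", "c"], 2: ["a", "c"], 3: ["b", "d"]}

# Inverted list: element -> identifiers of the sequences containing it.
inverted = defaultdict(set)
for seq_id, elements in sequences.items():
    for element in elements:
        inverted[element].add(seq_id)

# Aggregation: in how many sequences does each element occur?
counts = {element: len(ids) for element, ids in inverted.items()}

# Selection: sequences that contain both "a" and "c".
both = inverted["a"] & inverted["c"]
print(counts)
print(both)
```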

Aggregations can be performed using the map-reduce method. Since it works on partitions in parallel, it is faster than serial processing. Aggregations have the advantage that the results are smaller than the data, so they consume less space.
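
A small sketch of the map-reduce shape for counting elements, using only the Python standard library. The map step here runs in a plain loop; in a real system each partition could be handled by a separate worker. The partition contents are made up.

```python
from collections import Counter
from functools import reduce

partitions = [
    [["a", "b"], ["a", "c"]],
    [["b", "c"], ["c", "d"]],
]


def map_partition(partition):
    """Map step: count the elements within a single partition."""
    counts = Counter()
    for seq in partition:
        counts.update(seq)
    return counts


# Each partial result is independent of the others.
partials = [map_partition(p) for p in partitions]

# Reduce step: merge the small partial counts into one result.
totals = reduce(lambda left, right: left + right, partials, Counter())
print(dict(totals))
```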

Prefix trees help with sequence comparisons based on prefixes. Unlike joins, where the values have to match, prefix trees help with unrelated comparisons. Prefix trees also determine the level of the match between the sequences, and this is helpful to determine how close two sequences are. The distance between two sequences is the distance between their leaves in the prefix tree. This similarity measure also gives a quantitative metric that can be used for clustering. Common techniques for clustering involve assigning sequences to the nearest cluster and forming cohesive clusters by reducing the sum of squared errors.
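
The distance between two sequences in a prefix tree can be computed without building the tree, since the path between their nodes goes up to the deepest common ancestor and back down. The sketch below assumes one tree edge per element; the function names are illustrative.

```python
def common_prefix_length(a, b):
    """Number of leading elements that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def trie_distance(a, b):
    """Edges on the path between the nodes for a and b in a prefix tree."""
    shared = common_prefix_length(a, b)
    return (len(a) - shared) + (len(b) - shared)


print(trie_distance("abcd", "abxy"))
print(trie_distance("abc", "abcde"))
```

When one sequence is a prefix of the other, its node is an interior node rather than a leaf, and the same formula still gives the number of edges between the two nodes.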

#codingexercise

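// FindMin, RemoveMin and Balance are helper routines that are not shown here.
Node Remove(Node root, int k)
{
    if (root == null) return null;
    if (k < root.data)
        root.left = Remove(root.left, k);
    else if (k > root.data)
        root.right = Remove(root.right, k);
    else
    {
        // Found the node to delete; the garbage collector reclaims it.
        Node left = root.left;
        Node right = root.right;
        if (right == null) return left;
        // Replace the deleted node with the minimum of its right subtree.
        Node min = FindMin(right);
        min.right = RemoveMin(right);
        min.left = left;
        return Balance(min);
    }
    return Balance(root);
}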