Sunday, April 21, 2019

The serialization and deserialization of sequences 

Sequences can be stored as literals, requiring no serialization or deserialization. However, associated data structures such as B-Trees and Radix Trees can be serialized and deserialized. We look at some of their usages here.

Ranges of sequences are efficiently represented in B-Trees. This data is generally not exported and is kept local to each node. It is easy to send the data across the wire as a full representation of each sequence with additional metadata. The size of the sequence on the wire does not matter when the clients can tolerate as much latency as is necessary to transmit the sequence.
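
A minimal sketch of such a wire representation, assuming a hypothetical Sequence record carrying an id as metadata and a string payload; the length-prefixed framing is one illustrative choice, not a prescribed format:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: a sequence plus minimal metadata (its id).
struct Sequence {
    uint64_t id;
    std::string data;
};

// Append a little-endian 64-bit value.
static void putU64(std::vector<uint8_t>& out, uint64_t n) {
    for (int i = 0; i < 8; ++i) out.push_back(static_cast<uint8_t>(n >> (8 * i)));
}

// Serialize one sequence: id, then length-prefixed payload.
std::vector<uint8_t> serialize(const Sequence& s) {
    std::vector<uint8_t> out;
    putU64(out, s.id);
    putU64(out, s.data.size());
    out.insert(out.end(), s.data.begin(), s.data.end());
    return out;
}

// Deserialize reverses the framing.
Sequence deserialize(const std::vector<uint8_t>& in) {
    Sequence s{};
    size_t p = 0;
    for (int i = 0; i < 8; ++i) s.id |= static_cast<uint64_t>(in[p++]) << (8 * i);
    uint64_t n = 0;
    for (int i = 0; i < 8; ++i) n |= static_cast<uint64_t>(in[p++]) << (8 * i);
    s.data.assign(in.begin() + p, in.begin() + p + n);
    return s;
}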

When large sets of sequences need to be transferred, they can be compressed into archives and sent across the wire. This too does not have to treat each sequence separately. However, when the archives exceed a threshold in size, they no longer scale efficiently to the number of sequences. In such cases, efficient packing and unpacking become necessary.

Serializing and deserializing data has to happen per sequence, but packing and unpacking of data does not. There are techniques beyond compression algorithms for shortening sequences with the help of representations that can significantly reduce the space used. If we group the sequences, more efficient storage techniques become available, which work by removing redundant elements and encoding them in a way that they can be reconstructed later.
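
One illustrative example of such a redundancy-removing representation is front coding: if a group of sequences is kept sorted, each entry only stores how many leading characters it shares with its predecessor plus its remaining suffix. This is just a sketch of one possible encoding:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Encode a group: (shared-prefix length, remaining suffix) per entry.
std::vector<std::pair<size_t, std::string>> frontEncode(std::vector<std::string> group) {
    std::sort(group.begin(), group.end());   // sorting maximizes shared prefixes
    std::vector<std::pair<size_t, std::string>> out;
    std::string prev;
    for (const auto& s : group) {
        size_t k = 0;
        while (k < prev.size() && k < s.size() && prev[k] == s[k]) ++k;
        out.emplace_back(k, s.substr(k));
        prev = s;
    }
    return out;
}

// Decode reconstructs each sequence from its predecessor's prefix.
std::vector<std::string> frontDecode(const std::vector<std::pair<size_t, std::string>>& enc) {
    std::vector<std::string> out;
    std::string prev;
    for (const auto& e : enc) {
        std::string s = prev.substr(0, e.first) + e.second;
        out.push_back(s);
        prev = s;
    }
    return out;
}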

Saturday, April 20, 2019

#codingexercise
AVL tree delete.
struct Node { int data; Node* left; Node* right; };  // height bookkeeping omitted; see the cited article
Node* Balance(Node* root);   // rebalancing helper, defined in the cited article

Node* FindMin(Node* root);
Node* RemoveMin(Node* root);

Node* Remove(Node* root, int k)
{
    if (root == nullptr) return nullptr;
    if (k < root->data)
        root->left = Remove(root->left, k);
    else if (k > root->data)
        root->right = Remove(root->right, k);
    else
    {
        Node* left = root->left;
        Node* right = root->right;
        delete root;
        if (!right) return left;
        // Replace the deleted node with the minimum of its right subtree.
        Node* min = FindMin(right);
        min->right = RemoveMin(right);   // detach the minimum from the right subtree
        min->left = left;
        return Balance(min);
    }
    return Balance(root);
}

Node* RemoveMin(Node* root)
{
    // The leftmost node has no left child; splice in its right subtree.
    if (root->left == nullptr)
        return root->right;
    root->left = RemoveMin(root->left);
    return Balance(root);
}

Node* FindMin(Node* root)
{
    return root->left != nullptr ? FindMin(root->left) : root;
}

Courtesy: https://kukuruku.co/hub/cpp/avl-trees, GeeksforGeeks, and my implementations using them

Friday, April 19, 2019

The predicates on sequences 

Queries on sequences involve selection, projection, and aggregation. In the case of selection, the number of sequences to select from may be very large, so even a simple iteration-based selection can become very costly. Unless the range is restricted, selection becomes sequential and slow. At the same time, partitioning the sequences can help parallelize the task. The degree of parallelism can be left to the system to decide based on the ranges and the number of ranges, assuming a fixed cost per range. Selection becomes cheaper if we can discount sequences that do not contain a particular element. Bloom filters help with this purpose.
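
A minimal Bloom filter sketch for discounting sequences that cannot contain an element; the filter size and the two hash positions here are illustrative choices, not a prescription:

#include <bitset>
#include <functional>
#include <string>

// Fixed-size Bloom filter over the elements seen in a sequence.
class BloomFilter {
    std::bitset<8192> bits;
    // Two hash positions derived from std::hash; illustrative only.
    size_t h1(const std::string& e) const { return std::hash<std::string>{}(e) % bits.size(); }
    size_t h2(const std::string& e) const { return std::hash<std::string>{}(e + "#") % bits.size(); }
public:
    void add(const std::string& e) { bits.set(h1(e)); bits.set(h2(e)); }
    // false means "definitely absent"; true means "possibly present".
    bool mayContain(const std::string& e) const { return bits.test(h1(e)) && bits.test(h2(e)); }
};

During selection, any sequence whose filter returns false for the probe element can be skipped without being scanned at all.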

Queries for projection involve the elements of the sequences or the attributes of the sequences. In these cases, it is helpful to use the elements as a column. Since the elements do not have to be limited to a small set, any collection that holds the elements encountered is sufficient. If the elements can be accessed by a well-known position in each sequence, that helps, but this is rarely the case. Therefore, a range of sequences is transformed into a matrix of elements, which then makes it easier to operate on.
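
A sketch of that transformation, under the assumption that a sequence is simply a vector of string elements; ragged rows are padded with an empty marker:

#include <algorithm>
#include <string>
#include <vector>

// Turn a range of sequences (rows) into columns of elements per position.
std::vector<std::vector<std::string>> toColumns(const std::vector<std::vector<std::string>>& seqs) {
    size_t width = 0;
    for (const auto& s : seqs) width = std::max(width, s.size());
    std::vector<std::vector<std::string>> cols(width);
    for (const auto& s : seqs)
        for (size_t i = 0; i < width; ++i)
            cols[i].push_back(i < s.size() ? s[i] : "");   // pad ragged rows
    return cols;
}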

A reverse lookup based on an inverted list of elements helps when there is a limited number of elements to process. All standard query operations may be performed on the lists against the elements. This is useful for aggregations such as counting.
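
A sketch of building such an inverted list, again assuming sequences are vectors of string elements; the length of each posting list is then a count ready for aggregation:

#include <map>
#include <string>
#include <vector>

// Inverted list: element -> ids of the sequences that contain it.
// Note: an element repeated within one sequence posts more than once; dedupe if needed.
std::map<std::string, std::vector<size_t>> invert(const std::vector<std::vector<std::string>>& seqs) {
    std::map<std::string, std::vector<size_t>> index;
    for (size_t id = 0; id < seqs.size(); ++id)
        for (const auto& e : seqs[id])
            index[e].push_back(id);
    return index;
}

With this in hand, a count aggregation is just index[element].size().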

Aggregations can be performed using the map-reduce method. Since it works on partitions, it is better than serial processing. Aggregations have the advantage that the results are smaller than the data, so they consume less space.
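
A map-reduce shaped sketch of one such aggregation, a total element count: each partition is mapped to a partial count in parallel and the partials are reduced into a single number. The partitioning itself is assumed to be given:

#include <future>
#include <string>
#include <vector>

using Sequence = std::vector<std::string>;
using Partition = std::vector<Sequence>;

size_t countElements(const std::vector<Partition>& partitions) {
    // Map: one asynchronous partial count per partition.
    std::vector<std::future<size_t>> partials;
    for (const auto& part : partitions)
        partials.push_back(std::async(std::launch::async, [&part] {
            size_t n = 0;
            for (const auto& seq : part) n += seq.size();
            return n;
        }));
    // Reduce: fold the partial counts; the result is far smaller than the data.
    size_t total = 0;
    for (auto& f : partials) total += f.get();
    return total;
}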

Prefix trees help with sequence comparisons based on prefixes. Unlike joins, where the values have to match, prefix trees help with unrelated comparisons. Prefix trees also determine the level of the match between the sequences, and this is helpful to determine how close two sequences are. The distance between two sequences is the distance between the leaves of the prefix tree. This notion of a similarity measure is also helpful to give a quantitative metric that can be used for clustering. Common techniques for clustering involve assigning sequences to the nearest cluster and forming cohesive clusters by reducing the sum of squared errors.
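
A sketch of that leaf-to-leaf distance: in a trie with one character per edge, the path between two leaves runs up to their longest common prefix and back down, so the distance can be computed directly without materializing the tree. This is one illustrative interpretation of the measure:

#include <algorithm>
#include <string>

// Distance between two sequences seen as leaves of a prefix tree:
// steps up from a to the longest common prefix, then down to b.
size_t prefixDistance(const std::string& a, const std::string& b) {
    size_t lcp = 0;
    size_t m = std::min(a.size(), b.size());
    while (lcp < m && a[lcp] == b[lcp]) ++lcp;
    return (a.size() - lcp) + (b.size() - lcp);
}

Identical sequences get distance zero, and sequences that diverge at the root get the sum of their lengths, which matches the intuition that a longer shared prefix means closer sequences.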

#codingexercise
struct Node { int data; Node* left; Node* right; };  // height bookkeeping omitted; see the cited AVL article
Node* Balance(Node* root);   // rebalancing helper from the cited AVL article
Node* RemoveMin(Node* root);
Node* FindMin(Node* root);

Node* Remove(Node* root, int k)
{
    if (root == nullptr) return nullptr;
    if (k < root->data)
        root->left = Remove(root->left, k);
    else if (k > root->data)
        root->right = Remove(root->right, k);
    else
    {
        Node* left = root->left;
        Node* right = root->right;
        delete root;
        if (!right) return left;
        // Replace the deleted node with the minimum of its right subtree.
        Node* min = FindMin(right);
        min->right = RemoveMin(right);   // detach the minimum from the right subtree
        min->left = left;
        return Balance(min);
    }
    return Balance(root);
}

Thursday, April 18, 2019

The sequence joins

Unlike relational tables, sequences are generally listed in a columnar manner. Since each sequence represents a string, indexes on the sequences help with fast lookups in the same table. If the table is joined to itself, it helps match sequences to each other.
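
A sketch of that self-join through an index, assuming an exact-match hash index keyed on the sequence string:

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hash index over the sequence column: string -> row ids.
using Index = std::unordered_map<std::string, std::vector<size_t>>;

Index buildIndex(const std::vector<std::string>& seqs) {
    Index idx;
    for (size_t i = 0; i < seqs.size(); ++i)
        idx[seqs[i]].push_back(i);
    return idx;
}

// Self-join: pairs of distinct rows holding the same sequence.
std::vector<std::pair<size_t, size_t>> selfJoin(const Index& idx) {
    std::vector<std::pair<size_t, size_t>> pairs;
    for (const auto& entry : idx)
        for (size_t i = 0; i + 1 < entry.second.size(); ++i)
            for (size_t j = i + 1; j < entry.second.size(); ++j)
                pairs.emplace_back(entry.second[i], entry.second[j]);
    return pairs;
}
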
Prefix trees help with sequence comparisons based on prefixes. Unlike joins, where the values have to match, prefix trees help with unrelated comparisons. Prefix trees also determine the level of the match between the sequences, and this is helpful to determine how close two sequences are. The distance between two sequences is the distance between the leaves of the prefix tree. This notion of a similarity measure is also helpful to give a quantitative metric that can be used for clustering.
Common techniques for clustering involve assigning sequences to the nearest cluster and forming cohesive clusters by reducing the sum of squared errors.

Besides these usual forms of representation, there is nothing preventing us from breaking up the sequences into an elements table with a relation to the sequences table. Similarly, sequences may also have group identifiers associated with them. With this organization, a group can help in finding a sequence, and a sequence can help in finding an element. With the help of these relations, we can perform standard query operations in a builder pattern, as sketched below.
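
A minimal sketch of such a builder over a hypothetical element relation; the fluent filter-then-select shape is the point, not the particular schema:

#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Row { size_t groupId; size_t seqId; std::string element; };  // hypothetical relation

// Minimal fluent query builder over the element relation.
class Query {
    std::vector<Row> rows;
    std::vector<std::function<bool(const Row&)>> filters;
public:
    explicit Query(std::vector<Row> r) : rows(std::move(r)) {}
    Query& whereGroup(size_t g) { filters.push_back([g](const Row& r) { return r.groupId == g; }); return *this; }
    Query& whereSeq(size_t s) { filters.push_back([s](const Row& r) { return r.seqId == s; }); return *this; }
    std::vector<Row> select() const {
        std::vector<Row> out;
        for (const auto& r : rows) {
            bool keep = true;
            for (const auto& f : filters) keep = keep && f(r);
            if (keep) out.push_back(r);
        }
        return out;
    }
};

Usage follows the builder shape: Query(rows).whereGroup(7).whereSeq(42).select() narrows from group to sequence to elements.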

Wednesday, April 17, 2019

The metrics for Sequence analysis:

Sequences can be generated in large numbers. Their processing can take arbitrary time. Therefore, there is a need to monitor and report progress on all activities associated with sequences.

Metrics for sequences can be duration-based, which includes elapsed time. If there are a million records and clustering them takes more time with one algorithm than another, elapsed time can help determine the right choice.
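
A small sketch of capturing that elapsed time; runKMeans and runAgglomerative below are hypothetical clustering calls standing in for the two algorithms being compared:

#include <chrono>

// Measure wall-clock time of a unit of work; illustrative harness.
template <typename F>
double elapsedSeconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
    return d.count();
}

// e.g. compare elapsedSeconds(runKMeans) against elapsedSeconds(runAgglomerative)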

Metrics for sequences can also include the count of sequences. If the number of sequences processed stalls, or they are processed far too quickly for results, that points to some inconsistency. The metrics in this case help to diagnose and troubleshoot.

Metrics can also be scoped to partitions while global ones are maintained separately. Metrics can also have tags or namespaces associated with the same physical resource.

Metrics can support a variety of aggregations such as sum(), average(), and so on. These can be executed at different scopes as well as globally. Metrics may be passed as parameters in the form of a time-series array.
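
A sketch of tag-scoped metrics with sum() and average() aggregations at both the scoped and the global level; the string tag standing in for a partition or namespace is an assumption for illustration:

#include <map>
#include <numeric>
#include <string>
#include <vector>

// A metric is a time series of samples, scoped by a tag (e.g. a partition name).
class Metrics {
    std::map<std::string, std::vector<double>> series;
public:
    void record(const std::string& tag, double value) { series[tag].push_back(value); }
    double sum(const std::string& tag) const {
        auto it = series.find(tag);
        if (it == series.end()) return 0.0;
        return std::accumulate(it->second.begin(), it->second.end(), 0.0);
    }
    double average(const std::string& tag) const {
        auto it = series.find(tag);
        if (it == series.end() || it->second.empty()) return 0.0;
        return sum(tag) / it->second.size();
    }
    // A global aggregate folds over every tagged scope.
    double globalSum() const {
        double total = 0.0;
        for (const auto& kv : series) total += sum(kv.first);
        return total;
    }
};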

When sequence rules are discovered, they are listed one after the other. There is no effort to normalize them as they are inserted. The ability to canonicalize the groups can be taken on by background tasks. Together, the online and offline data modifications may run only as part of an intermediate processing stage, where the preprocessing and postprocessing steps involve cleaning and prefix generation. Metrics give visibility into these operations.

Tuesday, April 16, 2019

The reporting on Sequence Analysis:
Sequence analysis follows similar channels of reporting as any other data store. The online transactional aspect of reporting is separated from read-only reporting stacks. Unless the reporting requires temporary or permanent storage, it can pretty much be read-only. Reporting stacks for sequences can form good charts and graphs by virtue of the stack they use. Most stacks provide conventional forms of representing trends and patterns, while some go the extra length to showcase cloud-map charts and interactive visualizations including drilldowns.
Sequence visualizations also need to support sequence legends. The sequences don't have aliases or identifiers, so they tend to clutter up the chart or graph. When the aliases are represented via legends, the size and number of entries in the legend grows considerably. Therefore, they need to be stored separately from the chart or graph.
Sequences are formed from elements. These elements are also of interest, as they tend to repeat across sequences. Any visualization targeting the sequence becomes more interesting when it can also show patterns in elements. This overlay of charts is not typical for many other plots. The sequence patterns and trends can even be shown in a custom manner via programming stacks such as JavaScript and jQuery.

While element and sequence patterns are interesting, they may find their way into separate charts. It is not necessary to overlay them on top of each other in all cases, and keeping them side by side allows the user to go back and forth between the two.

Monday, April 15, 2019

The unification of processing over sequences:
While we have elaborated on dedicated storage types for sequences, processing over these storage types has merely been enumerated as sequential, batch, or stream-oriented depending on the data store. However, in this section, we would like to say that processing is for analytics, and it is not easy to confine it to specific object storages. Analytical packages like Apache Flink are tending toward combining options for processing while remaining champions of, say, stream processing.
There are a few advantages to this scenario for the developers engaging in analysis. First, they have a broad range of capabilities handy from the same package. The code that they write is more maintainable, written once, and targets all the capabilities via a single package. Second, the package decouples the processing from the storage concerns, allowing algorithms to change for the same strategy and the same data set. Third, it is easier for the developers to target the data set with the same package if they do not have to concern themselves with scaling to large data sizes. When the data sizes increase by orders of magnitude, code to process the sequences changes considerably, but if the onus is taken by a package rather than by custom code from the developer, then it improves considerably, requiring little or no attention in the future. Finally, the packages themselves are tried and tested, with integration to other packages or storage products that serve tier-2 storage over streams, blobs, files, or blocks. This makes it more appealing to use the same package for a variety of purposes. Open source packages have demonstrated code reusability far more than build-your-own options. Code written against open source packages is also suitable for being published to other projects.
When groups and sequences run into large numbers, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages, similar to log indexes, in that they can participate in index creation and term-based search while remaining web-accessible and with unlimited storage from the product. The latter aspect allows all groups to be stored without any planning for capacity, which can be increased without limitation. Streams, on the other hand, are continuous, and this helps with processing the groups in windows or segments. Both have their benefits.