Cluster computing

Wednesday, April 10, 2019

Sequences:
We were discussing sequence databases. Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in the millions, if not more. In this section, we focus on the storage concerns for sequences.
The choice of data structure assists with the processing of sequences. When sequences are numerous, they can be collected in batches, and these batches can be stored as blobs or files. Blobs have several advantages, similar to log indexes: they can participate in index creation and term-based search while remaining web accessible, and the storage product places no limit on how much can be stored. The latter aspect allows all groups to be stored without any capacity planning, since capacity can be increased as needed. Streams, on the other hand, are continuous, which helps with processing the groups in windows or segments. Both have their benefits; object storage is better suited for making the sequences web accessible and iterable.
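To make the idea of processing a continuous stream in windows concrete, here is a minimal sketch. The class `SlidingWindows` and its `windows` method are hypothetical names introduced only for this illustration, and the sketch assumes the stream is exposed as a plain `Iterator` with no null elements.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for illustration; not part of any storage product.
final class SlidingWindows {
    // Consumes the stream once and returns every full window of `size`
    // consecutive elements, in arrival order. Elements must be non-null
    // because ArrayDeque rejects nulls.
    static <T> List<List<T>> windows(Iterator<T> stream, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        Deque<T> window = new ArrayDeque<>(size);
        List<List<T>> result = new ArrayList<>();
        while (stream.hasNext()) {
            window.addLast(stream.next());
            if (window.size() > size) {
                window.removeFirst(); // slide: drop the oldest element
            }
            if (window.size() == size) {
                result.add(new ArrayList<>(window)); // snapshot the current window
            }
        }
        return result;
    }
}
```

A real stream processor would likely hand each window to a callback instead of collecting them all in memory; the list is used here only to keep the sketch easy to inspect.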

When sequence rules are discovered, they are listed one after the other. There is no effort to normalize them as they are inserted. Canonicalizing the groups can be left to background tasks. Groups and sequences tend to look very similar when they are repeatedly found and collected. The patterns may also be frequent, and groups may vary very little from one to another. A prefix tree helps in determining these variations. Using a prefix tree is convenient for the background tasks, and the normalization can keep pace with the online insertions with little lag.
Together, the online and offline data modifications may run only as part of an intermediate processing stage, where the preprocessing and postprocessing steps involve cleaning and prefix generation.
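As a rough illustration of how a prefix tree can expose groups that differ only slightly, the sketch below inserts sequences into a trie and reports how long a prefix a new sequence shares with those already stored. The class `SequenceTrie` and its methods are hypothetical names chosen for this example, and elements are assumed to be strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library.
final class SequenceTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int count; // number of inserted sequences that end at this node
    }

    private final Node root = new Node();

    // Inserts one sequence element by element and returns how many times
    // this exact sequence has been inserted so far.
    int insert(List<String> sequence) {
        Node node = root;
        for (String element : sequence) {
            node = node.children.computeIfAbsent(element, k -> new Node());
        }
        return ++node.count;
    }

    // Returns the length of the longest prefix of `sequence` already in the trie.
    int sharedPrefixLength(List<String> sequence) {
        Node node = root;
        int length = 0;
        for (String element : sequence) {
            node = node.children.get(element);
            if (node == null) {
                break;
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        SequenceTrie trie = new SequenceTrie();
        trie.insert(Arrays.asList("a", "b", "c"));
        trie.insert(Arrays.asList("a", "b", "d"));
        // The new sequence diverges from the stored ones only at its last element.
        System.out.println(trie.sharedPrefixLength(Arrays.asList("a", "b", "e"))); // prints 2
    }
}
```

A background task could use a count greater than one to spot repeated sequences, and a long shared prefix to spot near-duplicates that are candidates for canonicalization.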

Sequence storage can enable different search operators to run against the same storage. If only a narrow range of sequences is to be evaluated, even a LogParser-like utility can enable SQL commands to be run against the sequences.

Sequence storage in streams can work with stream insights, a stream processing package. Any stream analysis software can be used. Sequences are not just about scalar elements. Each element may be better understood as a vector with latent information, represented in terms of its associations with other elements. When sequences form pseudo-elements, they have a collective representation as a vector with latent information. The dimensions of the vector may be fixed, but a sequence can have an arbitrary number of elements. While we allow this number to vary, we prefer not to let it exceed a limit, so that sequences stay manageable. If a sequence is very large and occurs frequently, we can treat it as a combination of sequences, each holding at most the maximum number of elements. This keeps sequence lengths bounded and moves the choice of sequences into admission control.
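The bounding step described above can be sketched as follows. This is a minimal illustration under assumptions: the class `BoundedSequences`, the method `split`, and the parameter `maxElements` are hypothetical names, and the limit itself would be whatever admission control chooses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class BoundedSequences {
    // Splits a sequence into consecutive pieces of at most `maxElements` each.
    // Only the last piece may be shorter than the limit.
    static <T> List<List<T>> split(List<T> sequence, int maxElements) {
        if (maxElements <= 0) {
            throw new IllegalArgumentException("maxElements must be positive");
        }
        List<List<T>> pieces = new ArrayList<>();
        for (int start = 0; start < sequence.size(); start += maxElements) {
            int end = Math.min(start + maxElements, sequence.size());
            // Copy the view so each piece is independent of the original list.
            pieces.add(new ArrayList<>(sequence.subList(start, end)));
        }
        return pieces;
    }
}
```

For instance, splitting a seven-element sequence with a limit of three yields pieces of three, three, and one element.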
#codingexercise
Sort the integers of array A1 in the order defined by A2. Elements of A1 that do not appear in A2 follow at the end, in ascending order.

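import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public List<Integer> sort(List<Integer> A1, List<Integer> A2) {
    // Work on a sorted copy so that binary search applies and the caller's list is untouched.
    List<Integer> sorted = new ArrayList<>(A1);
    Collections.sort(sorted);
    boolean[] used = new boolean[sorted.size()];
    List<Integer> result = new ArrayList<>();
    for (Integer a : A2) {
        int index = Collections.binarySearch(sorted, a);
        if (index < 0) continue; // a does not occur in A1
        // binarySearch may land anywhere in a run of duplicates; rewind to the first one.
        while (index > 0 && sorted.get(index - 1).equals(a)) index--;
        // Emit every occurrence of a that has not been emitted yet.
        for (; index < sorted.size() && sorted.get(index).equals(a) && !used[index]; index++) {
            result.add(sorted.get(index));
            used[index] = true;
        }
    }
    // Append, in ascending order, the elements of A1 that never appear in A2.
    for (int i = 0; i < sorted.size(); i++) {
        if (!used[i]) result.add(sorted.get(i));
    }
    return result;
}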
