Cluster computing

Thursday, April 11, 2019

public List <Integer> sort (List <Integer> A1, List <Integer> A2) {

List <Integer> result = new ArrayList <>();

int start = 0;

A1.sort();

for (Integer a: A2) {

for (Integer index = binarySearch(A1, start, A1.size()-1, a); start < A1.size () && index != -1; start = index +1) {

result.add (A1 [index]);

}

start = 0;

}

if ( result.size () < A1.size () ) {

result.addAll (A1.getRange (result.size ()).sort ());

}

return result;

}

int binarySearch(List <integer> input, int start, int end, int val)
{
If (start > end) return –1;
int mid = (start + end)/2;
if (input[mid] == val) return mid;
if (start == end && input[mid] != val) return -1;
if (input[mid] < val)
return binarySearch(nums, mid+1, end, val);
else
return binarySearch(nums, start, mid, val);
}

Wednesday, April 10, 2019

Sequences:
We were discussing sequence databases. Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in millions if not more. In this section, we focus on the storage concerns for sequences.
Choice of data structure assists with the processing of sequences. When sequences run in large number, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages similar to log indexes in that they can participate in index creation and term-based search while remaining web accessible and with unlimited storage from the product. The latter aspect allows all groups to be stored without any planning for capacity which can be increased with no limitations. Streams on the other hand are continuous and this helps with the groups processing in windows or segments. Both have their benefits and the object storage is better suited for making the sequences web accessible and iterable.

When sequences rules are discovered, they are listed one after the other. There is no effort to normalize them as they are inserted. The ability to canonicalize the groups can be taken on by background tasks. Groups and sequences tend to look very similar when they are repeatedly found and collected. The patterns may also be frequent. Groups may also vary very little from one to another. The prefix tree helps in determining these variations. Using a prefix tree is conveninent for the background tasks and the normalization can keep pace with the online insertions with little lag.
Together the online and offline data modifications may run only as part of an intermediate stage processing where preprocessing and postprocessing steps involve cleaning and prefix generation.

Sequence storage can enable different search operators to run against the same storage. If there is a narrow range of sequences that are to be evaluated, even a LogParser like utility can enable SQL commands to be run against the sequences.

Sequence storage in streams can work with stream insights – a stream processing package. Any stream analysis software can be used. Sequences are not just about scalar elements. Each element may be better understood as a vector with latent information and representation in terms of associations with other elements. When the sequences form pseudo elements, they have a collective representation as a vector with latent information. The dimensions of the vector may be fixed but the sequences can have arbitrary number of elements. While we allow this number to vary we prefer not to let the number exceed a limit so that sequences are manageable. If a sequence is very large and occurs frequently, we can consider it as a combination of sequences with maximum number of elements in each. This helps keep the sequence lengths bounded and helps move the choice of sequences into admission control.
#codingexercise
Sort integers of the array A1 by those of A2.

public List <Integer> sort (List <Integer> A1, List <Integer> A2) {

List <Integer> result = new ArrayList <>();

int start = 0;

A1.sort();

for (Integer a: A2) {

for (Integer index = binarySearch(A1, start, A1.size()-1, a); start < A1.size () && index != -1; start = index +1) {

result.add (A1 [index]);

}

start = 0;

}

if ( result.size () < A1.size () ) {

result.addAll (A1.getRange (result.size ()).sort ());

}

return result;

}

Tuesday, April 9, 2019

Sequences and Groups:
Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in millions if not more. In this section, we focus on the storage concerns for sequences.
Sequences and groups are very similar. Groups are unordered collection of elements whereas the sequences have an order. We use them interchangeably unless specifically calling out either for the absence or presence of ordering. Groups are efficiently represented in terms of elements and their tags. A table of elements can have a group identifier and the groups become easy to form by finding elements with the same group id. This works well for small number of entries. It doesn’t for large number of entries.
If the groups were limited size, we could store the elements along the columns and use a boolean to represent associativity in a group. However, groups may not always have the same elements. These groups then become variable length records.
Instead each group can be considered a string representation of a pseudo entity and stored the same way as entities. However, associated with sequences, we may have a prefix tree to determine similar groups based on prefixes. Bloom filters may help determine sets of groups.
Therefore, choice of data structure assists with the processing. When groups and sequences run in large number, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages similar to log indexes in that they can participate in index creation and term-based search while remaining web accessible and with unlimited storage from the product. The latter aspect allows all groups to be stored without any planning for capacity which can be increased with no limitations. Streams on the other hand are continuous and this helps with the groups processing in windows or segments. Both have their benefits and the object storage has better implementation of storage best practice.

Monday, April 8, 2019

Uses of object storage with Sequences:

Sequence databases are niche storage. They do not find everyday use in commercial systems because traditional relational and non-relational databases provide the ability to store large sets of data and their indexes. Their availability in the cloud has allowed the notion of unlimited data in representations such as BigTable. Sequences however do not expand along the columns of a table. Instead they run to the order of billions in the number of rows.

The processing for large sets of rows has remained somewhat similar so traditional databases served for all data including sequences with their tables. Sequence processing however has deviated from this conventional analytical stack. Sequences involve prefix tree. The processing stages tend to prune, clean and perform canonicalization before the sequence patterns are discovered. Even a dynamic bit vector datastructure or a bloom filter is used to determine whether an element is part of a sequence or not.

Generation of sequences is also a multi-stage processing. It involves discovering elements for the sequences prior to collecting the sequences. Such extraction of elements requires cleaning, stemming and even running neural nets so that they can be weighted before they are extracted. Sequences merely help with the formation of groups. Sometimes the ordering is important and at other times they degenerate to groups.

Groups have had limited application in analysis because groups proliferated and there is no good way to determine what is important and what isn’t. There is also no easy way to tell how many elements should remain in a group or what to exclude. This makes groups difficult for analysis as opposed to vectorization of elements where their latent power and associations are easier to form patterns with data mining techniques. Statistical and other forms of analysis also prefer vectorization. Graphs and page ranks also work better on elements that are vectors rather than scalars.

However, groups do have the ability to form pseudo elements and these elements can also participate in the formation of graphs and their analysis via page ranks. The use of a search engine with the web resources is a demonstration that page ranking can assign weights to elements. With the help of groups as pseudo elements, there is some categorization which can lead to hierarchies or levels. These hierarchies or levels add value to the otherwise flat representation of resource rankings where only the top few ever get noticed and the remaining ignored.

The meaningfulness of unordered groups or ordered sequences can improve the search as well as the prediction of queries. This has been the underlying basis for collaborative filtering where users are genuinely interested in viewing items that others have viewed similar to what they were trying to find. Therefore, groups and sequences hold a lot of promise.

Object storage is a representation of infinite web accessible storage of key value collections in a hierarchical namespace that is well suited for groups and collections. With the compute resources available to access the storage directly over the web and the object storage demonstrating the best practice of the storage industry, the analysis using groups and sequences becomes much more agile with such storage and compute.