Cluster computing

Thursday, April 11, 2019

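import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public List<Integer> sort(List<Integer> A1, List<Integer> A2) {
    // Work on a sorted copy so that binary search applies and the input is left untouched.
    List<Integer> sorted = new ArrayList<>(A1);
    Collections.sort(sorted);
    boolean[] used = new boolean[sorted.size()];
    List<Integer> result = new ArrayList<>();
    for (Integer a : A2) {
        int index = binarySearch(sorted, 0, sorted.size() - 1, a);
        if (index == -1) continue;
        // Binary search may land anywhere in a run of equal values; rewind to its start.
        while (index > 0 && sorted.get(index - 1).equals(a)) index--;
        for (; index < sorted.size() && sorted.get(index).equals(a) && !used[index]; index++) {
            result.add(sorted.get(index));
            used[index] = true;
        }
    }
    // Elements of A1 that never appear in A2 follow in ascending order.
    for (int i = 0; i < sorted.size(); i++) {
        if (!used[i]) result.add(sorted.get(i));
    }
    return result;
}

// Returns the index of some occurrence of val in input[start..end], or -1 if absent.
int binarySearch(List<Integer> input, int start, int end, int val) {
    if (start > end) return -1;
    int mid = start + (end - start) / 2;
    if (input.get(mid) == val) return mid;
    if (start == end) return -1;
    if (input.get(mid) < val)
        return binarySearch(input, mid + 1, end, val);
    else
        return binarySearch(input, start, mid, val);
}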

Wednesday, April 10, 2019

Sequences:
We were discussing sequence databases. Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in the millions, if not more. In this section, we focus on the storage concerns for sequences.
The choice of data structure assists with the processing of sequences. When sequences run to large numbers, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages similar to log indexes: they can participate in index creation and term-based search while remaining web accessible, and the storage product offers unlimited capacity. The latter aspect allows all groups to be stored without capacity planning, because capacity can be increased as needed. Streams, on the other hand, are continuous, and this helps with processing the groups in windows or segments. Both have their benefits, and object storage is better suited for making the sequences web accessible and iterable.

When sequence rules are discovered, they are listed one after the other. There is no effort to normalize them as they are inserted. Canonicalizing the groups can be taken on by background tasks. Groups and sequences tend to look very similar when they are repeatedly found and collected. The patterns may also be frequent, and groups may vary very little from one to another. The prefix tree helps in determining these variations. Using a prefix tree is convenient for the background tasks, and the normalization can keep pace with the online insertions with little lag.
Together, the online and offline data modifications may run only as part of an intermediate processing stage, where the preprocessing and postprocessing steps involve cleaning and prefix generation.
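
How a prefix tree exposes near-identical sequences can be sketched as follows. This is only a minimal illustration, assuming sequences of string elements; the class SequenceTrie and its method names are hypothetical and are not taken from any particular sequence database.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical prefix tree over sequences of string elements.
final class SequenceTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int count; // number of inserted sequences that pass through this node
    }

    private final Node root = new Node();

    // Insert one sequence, bumping the count on the root and on every prefix.
    void insert(List<String> sequence) {
        Node node = root;
        node.count++;
        for (String element : sequence) {
            node = node.children.computeIfAbsent(element, k -> new Node());
            node.count++;
        }
    }

    // Number of inserted sequences that start with the given prefix.
    int countWithPrefix(List<String> prefix) {
        Node node = root;
        for (String element : prefix) {
            node = node.children.get(element);
            if (node == null) return 0;
        }
        return node.count;
    }

    // Length of the longest prefix of the candidate that is already in the tree.
    int sharedPrefixLength(List<String> candidate) {
        Node node = root;
        int length = 0;
        for (String element : candidate) {
            node = node.children.get(element);
            if (node == null) break;
            length++;
        }
        return length;
    }
}
```

A background task could use sharedPrefixLength to decide whether a newly inserted sequence is a small variation of one that is already stored.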

Sequence storage can enable different search operators to run against the same storage. If only a narrow range of sequences is to be evaluated, even a LogParser-like utility can enable SQL commands to be run against the sequences.

Sequence storage in streams can work with stream insights, a stream processing package. Any stream analysis software can be used. Sequences are not just about scalar elements. Each element may be better understood as a vector with latent information, represented in terms of its associations with other elements. When sequences form pseudo elements, they too have a collective representation as a vector with latent information. The dimensions of the vector may be fixed, but a sequence can have an arbitrary number of elements. While we allow this number to vary, we prefer not to let it exceed a limit so that sequences stay manageable. If a sequence is very large and occurs frequently, we can treat it as a combination of sequences, each with at most the maximum number of elements. This keeps the sequence lengths bounded and moves the choice of sequences into admission control.
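
Bounding the sequence length can be sketched as below. This is a minimal illustration under the assumption that a long sequence is simply cut into consecutive pieces; SequenceSplitter and maxElements are hypothetical names, and a real admission control step may apply other criteria.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that keeps sequence lengths bounded.
final class SequenceSplitter {
    // Split a sequence into consecutive chunks of at most maxElements each.
    static <T> List<List<T>> split(List<T> sequence, int maxElements) {
        if (maxElements <= 0) {
            throw new IllegalArgumentException("maxElements must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < sequence.size(); from += maxElements) {
            int to = Math.min(from + maxElements, sequence.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(sequence.subList(from, to)));
        }
        return chunks;
    }
}
```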
#codingexercise
Sort integers of the array A1 by those of A2.

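public List<Integer> sort(List<Integer> A1, List<Integer> A2) {
    // Work on a sorted copy so that binary search applies and the input is left untouched.
    List<Integer> sorted = new ArrayList<>(A1);
    Collections.sort(sorted);
    boolean[] used = new boolean[sorted.size()];
    List<Integer> result = new ArrayList<>();
    for (Integer a : A2) {
        // binarySearch is the helper listed with the April 11 entry.
        int index = binarySearch(sorted, 0, sorted.size() - 1, a);
        if (index == -1) continue;
        // Binary search may land anywhere in a run of equal values; rewind to its start.
        while (index > 0 && sorted.get(index - 1).equals(a)) index--;
        for (; index < sorted.size() && sorted.get(index).equals(a) && !used[index]; index++) {
            result.add(sorted.get(index));
            used[index] = true;
        }
    }
    // Elements of A1 that never appear in A2 follow in ascending order.
    for (int i = 0; i < sorted.size(); i++) {
        if (!used[i]) result.add(sorted.get(i));
    }
    return result;
}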

Tuesday, April 9, 2019

Sequences and Groups:
Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in the millions, if not more. In this section, we focus on the storage concerns for sequences.
Sequences and groups are very similar. Groups are unordered collections of elements, whereas sequences have an order. We use the terms interchangeably unless specifically calling out the absence or presence of ordering. Groups are efficiently represented in terms of elements and their tags. A table of elements can have a group identifier, and the groups become easy to form by finding elements with the same group id. This works well for a small number of entries. It doesn’t for a large number of entries.
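
Forming groups from an element table with a group identifier can be sketched as below. This is a minimal in-memory illustration; the Row type and the formGroups method are hypothetical stand-ins for a table and a query that finds elements with the same group id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a table of elements tagged with a group id.
final class GroupTable {
    static final class Row {
        final String element;
        final int groupId;

        Row(String element, int groupId) {
            this.element = element;
            this.groupId = groupId;
        }
    }

    // Collect the elements that share a group id, keeping first-seen order of ids.
    static Map<Integer, List<String>> formGroups(List<Row> rows) {
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            groups.computeIfAbsent(row.groupId, id -> new ArrayList<>()).add(row.element);
        }
        return groups;
    }
}
```

Every call scans all rows, which hints at why this representation stops working well once the number of entries grows large.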
If the groups were of limited size, we could store the elements along the columns and use a boolean to represent membership in a group. However, groups may not always have the same elements. These groups then become variable-length records.
Instead, each group can be considered a string representation of a pseudo entity and stored the same way as entities. However, associated with sequences, we may have a prefix tree to determine similar groups based on prefixes. Bloom filters may help determine whether a group belongs to a set of groups.
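
A Bloom filter over the string representation of groups can be sketched as below. This is a minimal illustration built on java.util.BitSet, assuming each group has already been serialized to a string key; GroupBloomFilter and the simple double-hashing scheme are assumptions made for the example, not a tuned implementation.

```java
import java.util.BitSet;

// Hypothetical Bloom filter keyed by the string representation of a group.
final class GroupBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    GroupBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String groupKey) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(groupKey, i));
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String groupKey) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(groupKey, i))) return false;
        }
        return true;
    }
}
```

A negative answer is exact, so the filter can cheaply rule out groups before a more expensive lookup; a positive answer still needs to be confirmed against the stored groups.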
Therefore, the choice of data structure assists with the processing. When groups and sequences run to large numbers, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages similar to log indexes: they can participate in index creation and term-based search while remaining web accessible, and the storage product offers unlimited capacity. The latter aspect allows all groups to be stored without capacity planning, because capacity can be increased as needed. Streams, on the other hand, are continuous, and this helps with processing the groups in windows or segments. Both have their benefits, and object storage offers a better implementation of storage best practices.

Monday, April 8, 2019

Uses of object storage with Sequences:

Sequence databases are niche storage. They do not find everyday use in commercial systems because traditional relational and non-relational databases provide the ability to store large sets of data and their indexes. Their availability in the cloud has allowed the notion of unlimited data in representations such as BigTable. Sequences, however, do not expand along the columns of a table. Instead, they run to the order of billions of rows.

The processing for large sets of rows has remained somewhat similar, so traditional databases have served for all data, including sequences, with their tables. Sequence processing, however, has deviated from this conventional analytical stack. Sequences involve a prefix tree. The processing stages tend to prune, clean and perform canonicalization before the sequence patterns are discovered. Even a dynamic bit vector data structure or a Bloom filter is used to determine whether an element is part of a sequence or not.

Generation of sequences is also a multi-stage process. It involves discovering elements for the sequences prior to collecting the sequences. Such extraction of elements requires cleaning, stemming and even running neural nets so that the elements can be weighted before they are extracted. Sequences merely help with the formation of groups. Sometimes the ordering is important, and at other times sequences degenerate to groups.

Groups have had limited application in analysis because groups proliferate and there is no good way to determine what is important and what isn’t. There is also no easy way to tell how many elements should remain in a group or what to exclude. This makes groups harder to analyze than vectorized elements, whose latent information and associations make it easier to form patterns with data mining techniques. Statistical and other forms of analysis also prefer vectorization. Graphs and page ranks also work better on elements that are vectors rather than scalars.

However, groups do have the ability to form pseudo elements, and these elements can also participate in the formation of graphs and their analysis via page ranks. The use of a search engine with web resources is a demonstration that page ranking can assign weights to elements. With the help of groups as pseudo elements, there is some categorization, which can lead to hierarchies or levels. These hierarchies or levels add value to the otherwise flat representation of resource rankings, where only the top few ever get noticed and the rest are ignored.

The meaningfulness of unordered groups or ordered sequences can improve the search as well as the prediction of queries. This has been the underlying basis for collaborative filtering, where users are genuinely interested in viewing items that others have viewed and that are similar to what they were trying to find. Therefore, groups and sequences hold a lot of promise.
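
The link between groups and collaborative filtering can be sketched with simple co-view counts. This is a minimal illustration, assuming each group is the set of items one user viewed; CoViewRecommender and coViewCounts are hypothetical names, and real recommenders weight and normalize these counts in ways not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical co-view counter: each group is the set of items one user viewed.
final class CoViewRecommender {
    // For the given item, count how often every other item shares a group with it.
    static Map<String, Integer> coViewCounts(List<Set<String>> viewGroups, String item) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> group : viewGroups) {
            if (!group.contains(item)) continue;
            for (String other : group) {
                if (!other.equals(item)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Items with the highest counts are the ones most often viewed alongside the given item, which is the signal a recommendation would be built on.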

Object storage represents infinite, web-accessible storage of key-value collections in a hierarchical namespace, which is well suited to groups and collections. With compute resources able to access the storage directly over the web, and with object storage demonstrating the best practice of the storage industry, analysis using groups and sequences becomes much more agile.

Sunday, April 7, 2019

Today we take a break from discussing the best practice from storage engineering.

Sequence Analysis:
Data is increasing more than ever and at a fast pace. Algorithms for data analysis are required to become more robust, efficient and accurate. Specializations in databases and higher-end processors suitable for artificial intelligence have contributed to improvements in data analysis. Data mining techniques discover patterns in the data and are useful for predictions, but they tend to require traditional databases.
Sequence databases are highly specialized, and even though they can be supported by the B-Tree data structure that many contemporary databases use, they tend to be larger than many commercial databases. In addition, algorithms for mining sequential rules focus on generating all sequential rules. These algorithms produce an enormous number of redundant rules. The large number not only makes mining inefficient, it also hampers iterations. Such algorithms depend on patterns obtained from earlier frequent pattern mining algorithms. However, if the rules are normalized and the redundancies removed, they can be stored and used efficiently with a sequence database.
The data structures used for sequence rules have evolved. The use of a dynamic bit vector data structure is now an alternative. The data mining process involves a prefix tree. Early data processing stages tend to prune, clean and perform canonicalization, and these have reduced the number of rules.
In the context of text mining, sequences have had limited applications because the ordering of words has never been important for determining the topics. However, salient keywords regardless of their locations and coherent enough for a topic tend to form sequences rather than groups. Finding semantic information with word vectors does not help with this ordering. They are two independent variables. And the word vectors are formed only with a predetermined set of dimensions. These dimensions do not increase significantly with progressive text. There is no method for vectors to be redefined with increasing dimensions as text progresses.
The number and scale of dynamic groups of word vectors can be arbitrarily large. The ordering of the words can remain alphabetical. These words can then map to a word vector table where the features are predetermined giving the table a rather fixed number of columns instead of leaving it to be a big table.
Since there is a lot of content with similar usage of words and phrases, and almost everyone uses simple day-to-day English, there is a higher likelihood that some of these groups will stand out in terms of usage and frequency. Suppose we exhaustively collect and persist frequently co-occurring words in a groupopedia, as interpreted from a large corpus, with no limit on the number of words in a group and with the groups persisted in sorted order of their frequencies. Then we have a two-fold ability: first, to shred a given text into predetermined groups, thereby instantly recognizing topics; and second, to add pseudo word vectors, where groups translate to vectors in the vector table.
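
Shredding a text into predetermined groups can be sketched as below. This is a minimal illustration, assuming the groupopedia is available as a list of word sets already sorted by descending frequency and that a group matches when all of its words occur in the text; GroupShredder and shred are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical matcher of a text against groups stored most frequent first.
final class GroupShredder {
    // Return the stored groups whose words all occur in the text, in stored order.
    static List<Set<String>> shred(String text, List<Set<String>> groupsByFrequency) {
        // Lower-case the text and split on runs of non-word characters.
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        List<Set<String>> matched = new ArrayList<>();
        for (Set<String> group : groupsByFrequency) {
            if (words.containsAll(group)) {
                matched.add(group);
            }
        }
        return matched;
    }
}
```

Because the groups are scanned in frequency order, the first matches are the most commonly seen groups, which is one plausible way to surface the dominant topics first.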

#codingexercise
Yesterday's coding question continued:
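Since A2 is a small array, we can use it directly to order the elements:
public List<Integer> relativeSort(List<Integer> A1, List<Integer> A2) {
    List<Integer> result = new ArrayList<>();
    boolean[] used = new boolean[A1.size()];
    for (Integer a : A2) {
        // Collect every unused occurrence of a, scanning A1 from left to right.
        for (int index = findFirst(A1, a, 0); index != -1; index = findFirst(A1, a, index + 1)) {
            if (!used[index]) {
                result.add(A1.get(index));
                used[index] = true;
            }
        }
    }
    // Whatever A2 did not claim is appended in ascending order.
    List<Integer> rest = new ArrayList<>();
    for (int i = 0; i < A1.size(); i++) {
        if (!used[i]) rest.add(A1.get(i));
    }
    Collections.sort(rest);
    result.addAll(rest);
    return result;
}

// Returns the index of the first occurrence of a at or after start, or -1 if there is none.
public int findFirst(List<Integer> A1, int a, int start) {
    for (int i = start; i < A1.size(); i++) {
        if (A1.get(i) == a) {
            return i;
        }
    }
    return -1;
}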
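
As an alternative to rescanning A1 for every element of A2, the same ordering can be produced by counting. This is a different technique from the linear scan in this entry, offered only as a sketch; the class name RelativeSortByCounting is hypothetical, and it assumes the lists contain no null elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical counting-based variant of the relative sort exercise.
final class RelativeSortByCounting {
    static List<Integer> relativeSort(List<Integer> a1, List<Integer> a2) {
        // TreeMap keeps the distinct values of a1 in ascending order.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (Integer value : a1) {
            counts.merge(value, 1, Integer::sum);
        }
        List<Integer> result = new ArrayList<>(a1.size());
        // Emit values in the order given by a2, removing each as it is consumed.
        for (Integer key : a2) {
            Integer count = counts.remove(key);
            if (count == null) continue; // not in a1, or already emitted
            for (int i = 0; i < count; i++) result.add(key);
        }
        // Remaining values follow in ascending order.
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) result.add(entry.getKey());
        }
        return result;
    }
}
```

For A1 = [2, 1, 2, 5, 7, 1, 9, 3, 6, 8, 8] and A2 = [2, 1, 8, 3], this yields [2, 2, 1, 1, 8, 8, 3, 5, 6, 7, 9].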