Cluster computing

Sunday, April 14, 2019

Given an integer array, generate two adjacent subsequences, if possible, where one is strictly increasing and another is strictly decreasing.

Pair<List<Integer>, List<Integer>> getSubsequences(List<Integer> input) {

for (int I = 0; I < input.size(); I++) {

List<Integer> before = input.subList(0, i);

List<Integer> after = input.subList(i+1, input.size()-1);

if (isIncreasing(before) && isDecreasing(after)) {

return new Pair<List<Integer>, List<integer>(

input.subList(0, i),

input.subList(i+1, input.size()-1));

}

Return null;

}

boolean isIncreasing(List<Integer> input) {

List<integer> sorted = new ArrayList<Integer>(input);

sorted.sort();

Return sorted.equals(input);

}

Boolean isDecreasing(List<Integer> input) {

List<Integer> reversed = new ArrayList<>(Lists.reverse(input)); Return IsIncreasing(reversed); }

If the subsequences don’t have to be adjacent, we can a subsequence to be within the other. In such case, getSubsequences would expand to include the discovered subsequence and the remainder.

For (int i = 0; I < input.size(); I++)

{

For (int j = I+1; j < input.size(); j++)

{

List<Integer> section = input.subList(I,j);

If (isIncreasing(section) {

List<Integer> remainder = new ArrayList<>();

remainder.addAll(input.subList(0,I));

remainder.addAll(input.subList(j+1,input.size()-1));

if (isDecreasing(remainder)) {

return new Pair<List<Integer>, List<Integer>>(section, remainder);

}

Saturday, April 13, 2019

The transformation of sequences:

In the sections following this one, we are describing storage and querying for sequences. However, transfer to storage is not always a linear online data access. Sequences tend to be processed, pruned, cleaned and deduplicated before they arrive into the storage. Those systems that handle this pre-storage processing may choose to supply the data in batches and with Extract-Transform-Load kind of operations. We look at some of these transformations first prior to the storage in a particular format.

The transformations are primarily between the forms of lists or prefix-trees. The lists hold independent entries and the prefix-trees holds organizations based on prefixes. The prefix trees are useful for finding similar sequences. The lists also allow grouping if there were inverted lists between elements and their parent sequences.

Another representation is in variable length record form where each sequence is a list of elements and the elements repeat across sequences. This representation helps sequences merging and splitting.

Index generation is another aspect of sequence parsing. Although indexes are stored separately, they are only meaningful as long as there are sequences that were used to make the indexes. The indexes are not necessarily a data transformation but efficient representation of indexes enables significant gains in storage and compute and is therefore mentioned with transformations.

Sequences may also be represented in various data formats such as xml and json. These are primarily helpful in Application Programming Interfaces. For example, json representation enables JMESPath (pronounced James Path) query where elements can be extracted and search can be specified via the search operator.

Friday, April 12, 2019

The language of querying Sequences:
The language for querying of data remains the same whether they are sequences or entities. A lot of commercial analysis stacks are built on top of the standard query language. Even datastores that are not relational tend to pick up an adapter that facilitates SQL. This is more for the convenience of the developers rather than any other requirement. This query language is supported by standard query operators in most languages. Many developers are able to switch the data source without modifying their queries. This lets them write their applications once and run them anywhere.
The above holds true for regular dataSets. They don’t hold true for large datasets. In such cases, we have a similar problem as we do with entities in online transactional versus analytical data stores. The accumulation of data in warehouses is not easy to surmount in a simple query operation and the query execution time may take a lot of time. There are two approaches for overcoming the size. Define a snowflake schema in the analytical data store so that the queries can target facts and dimensions hierarchically. Second, de-normalize the schema, shard it and save it on clusters as NoSQL stores and run map-reduce algorithms on it. Both of these approaches are perfectly valid for sequences too and definitely depend on the nature of the queries. That said, since the data is sequence, it does represent somewhat uniform data to manage and is better suited for partitioning. We look at the sequences as a columnar store in such cases and suitably modify the queries as well as its execution. There are a few advantages that come with such queries. First, the query execution returns result early. The results don’t have to come all at once. Second, they can be immensely parallelized. Third the data is independent of each other due to its canonicalized deduplicated atomic storage. Fourth, the data does not have to be represent hierarchy or relations. It can support tags and labels that are added later on. The representation of data is most suitable for transferring to other stores such as stores dedicated for analysis. Fifth, the extract-transform-load operations are simpler to write and execute.
Query performance is not the only reason to store the data in a columnar manner. The data is likely to be processed with a variety of techniques that may include statistical, data mining, machine learning, natural language processing and so on. Each of these packages require certain routines on the data that are best done proprietary to their systems. In such cases, the processing does not benefit from keeping the data accessible on a shared volume in a format that makes accesses race with one another. As long as the accesses can be separated and streamlined, the processing system can use the same data format. There is also the nice benefit that horizontally partitioned columnar representation of independent sequences is easier to import locally to the compute resource where the efficiency for the processing improves. There is no limit to the data that can be processed by a single compute resource if it can be processed a set of one or more sequence at a time.

Thursday, April 11, 2019

public List <Integer> sort (List <Integer> A1, List <Integer> A2) {

List <Integer> result = new ArrayList <>();

int start = 0;

A1.sort();

for (Integer a: A2) {

for (Integer index = binarySearch(A1, start, A1.size()-1, a); start < A1.size () && index != -1; start = index +1) {

result.add (A1 [index]);

}

start = 0;

}

if ( result.size () < A1.size () ) {

result.addAll (A1.getRange (result.size ()).sort ());

}

return result;

}

int binarySearch(List <integer> input, int start, int end, int val)
{
If (start > end) return –1;
int mid = (start + end)/2;
if (input[mid] == val) return mid;
if (start == end && input[mid] != val) return -1;
if (input[mid] < val)
return binarySearch(nums, mid+1, end, val);
else
return binarySearch(nums, start, mid, val);
}

Wednesday, April 10, 2019

Sequences:
We were discussing sequence databases. Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in millions if not more. In this section, we focus on the storage concerns for sequences.
Choice of data structure assists with the processing of sequences. When sequences run in large number, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages similar to log indexes in that they can participate in index creation and term-based search while remaining web accessible and with unlimited storage from the product. The latter aspect allows all groups to be stored without any planning for capacity which can be increased with no limitations. Streams on the other hand are continuous and this helps with the groups processing in windows or segments. Both have their benefits and the object storage is better suited for making the sequences web accessible and iterable.

When sequences rules are discovered, they are listed one after the other. There is no effort to normalize them as they are inserted. The ability to canonicalize the groups can be taken on by background tasks. Groups and sequences tend to look very similar when they are repeatedly found and collected. The patterns may also be frequent. Groups may also vary very little from one to another. The prefix tree helps in determining these variations. Using a prefix tree is conveninent for the background tasks and the normalization can keep pace with the online insertions with little lag.
Together the online and offline data modifications may run only as part of an intermediate stage processing where preprocessing and postprocessing steps involve cleaning and prefix generation.

Sequence storage can enable different search operators to run against the same storage. If there is a narrow range of sequences that are to be evaluated, even a LogParser like utility can enable SQL commands to be run against the sequences.

Sequence storage in streams can work with stream insights – a stream processing package. Any stream analysis software can be used. Sequences are not just about scalar elements. Each element may be better understood as a vector with latent information and representation in terms of associations with other elements. When the sequences form pseudo elements, they have a collective representation as a vector with latent information. The dimensions of the vector may be fixed but the sequences can have arbitrary number of elements. While we allow this number to vary we prefer not to let the number exceed a limit so that sequences are manageable. If a sequence is very large and occurs frequently, we can consider it as a combination of sequences with maximum number of elements in each. This helps keep the sequence lengths bounded and helps move the choice of sequences into admission control.
#codingexercise
Sort integers of the array A1 by those of A2.

public List <Integer> sort (List <Integer> A1, List <Integer> A2) {

List <Integer> result = new ArrayList <>();

int start = 0;

A1.sort();

for (Integer a: A2) {

for (Integer index = binarySearch(A1, start, A1.size()-1, a); start < A1.size () && index != -1; start = index +1) {

result.add (A1 [index]);

}

start = 0;

}

if ( result.size () < A1.size () ) {

result.addAll (A1.getRange (result.size ()).sort ());

}

return result;

}

Tuesday, April 9, 2019

Sequences and Groups:
Sequence databases support specialized processing of sequences with the help of new data structures that are usually not found in traditional storage systems. Sequences tend to number in millions if not more. In this section, we focus on the storage concerns for sequences.
Sequences and groups are very similar. Groups are unordered collection of elements whereas the sequences have an order. We use them interchangeably unless specifically calling out either for the absence or presence of ordering. Groups are efficiently represented in terms of elements and their tags. A table of elements can have a group identifier and the groups become easy to form by finding elements with the same group id. This works well for small number of entries. It doesn’t for large number of entries.
If the groups were limited size, we could store the elements along the columns and use a boolean to represent associativity in a group. However, groups may not always have the same elements. These groups then become variable length records.
Instead each group can be considered a string representation of a pseudo entity and stored the same way as entities. However, associated with sequences, we may have a prefix tree to determine similar groups based on prefixes. Bloom filters may help determine sets of groups.
Therefore, choice of data structure assists with the processing. When groups and sequences run in large number, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages similar to log indexes in that they can participate in index creation and term-based search while remaining web accessible and with unlimited storage from the product. The latter aspect allows all groups to be stored without any planning for capacity which can be increased with no limitations. Streams on the other hand are continuous and this helps with the groups processing in windows or segments. Both have their benefits and the object storage has better implementation of storage best practice.