Thursday, April 18, 2019

The sequence joins

Unlike relational tables, sequences are generally stored in a columnar manner. Since each sequence represents a string, indexes on the sequence column enable fast lookups within the same table. When the table is joined to itself, those indexes help match sequences to each other.
Prefix trees help with sequence comparisons based on prefixes. Unlike joins, where the values have to match exactly, prefix trees support comparisons between sequences that are merely similar. Prefix trees also determine the level at which two sequences diverge, which is helpful for determining how close they are. The distance between two sequences is the distance between their leaves in the prefix tree. This notion of a similarity measure also provides a quantitative metric that can be used for clustering.
Common techniques for clustering involve assigning sequences to the nearest cluster and forming cohesive clusters by reducing the sum of squared errors.
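As a sketch of these ideas (all class and method names here are illustrative, not from any particular library), the leaf distance in a prefix tree reduces to the depth of the longest shared prefix between two sequences, which can serve directly as a similarity score for nearest-cluster assignment:

```java
import java.util.*;

// Illustrative sketch: longest-common-prefix depth as a similarity score,
// used for nearest-cluster assignment. Names are hypothetical.
public class PrefixSimilarity {
    // Similarity = length of the longest common prefix of two sequences.
    static int sharedPrefixDepth(String a, String b) {
        int depth = 0;
        while (depth < a.length() && depth < b.length()
                && a.charAt(depth) == b.charAt(depth)) {
            depth++;
        }
        return depth;
    }

    // Each sequence is assigned to the cluster representative with which
    // it shares the deepest prefix.
    static Map<String, List<String>> cluster(List<String> representatives,
                                             List<String> sequences) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String rep : representatives) clusters.put(rep, new ArrayList<>());
        for (String seq : sequences) {
            String best = representatives.get(0);
            for (String rep : representatives) {
                if (sharedPrefixDepth(seq, rep) > sharedPrefixDepth(seq, best)) {
                    best = rep;
                }
            }
            clusters.get(best).add(seq);
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, List<String>> c = cluster(
                Arrays.asList("ACGT", "TTGA"),
                Arrays.asList("ACGA", "TTGC", "ACTT"));
        System.out.println(c); // {ACGT=[ACGA, ACTT], TTGA=[TTGC]}
    }
}
```

A full implementation would walk an explicit trie rather than compare strings pairwise, but the similarity metric is the same.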

Besides these usual forms of representation, there is nothing preventing us from breaking the sequences up into an elements table related to a sequences table. Similarly, sequences may also have group identifiers associated with them. With this organization, a group helps find its sequences and a sequence helps find its elements. With the help of these relations, we can perform standard query operations in a builder pattern.
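A minimal sketch of this group-to-sequence-to-element breakdown might look as follows, with the standard Stream operators standing in for the builder-style queries (the class and method names are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the group -> sequence -> element relations,
// queried fluently with the standard Stream operators.
public class SequenceCatalog {
    // group id -> sequence ids, and sequence id -> ordered elements
    final Map<String, List<String>> groups = new HashMap<>();
    final Map<String, List<String>> sequences = new HashMap<>();

    void addSequence(String groupId, String seqId, List<String> elements) {
        groups.computeIfAbsent(groupId, g -> new ArrayList<>()).add(seqId);
        sequences.put(seqId, elements);
    }

    // A group finds its sequences; a sequence finds its elements.
    List<String> elementsOfGroup(String groupId) {
        return groups.getOrDefault(groupId, List.of()).stream()
                .flatMap(seqId -> sequences.get(seqId).stream())
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SequenceCatalog c = new SequenceCatalog();
        c.addSequence("g1", "s1", List.of("a", "b"));
        c.addSequence("g1", "s2", List.of("b", "c"));
        System.out.println(c.elementsOfGroup("g1")); // [a, b, c]
    }
}
```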

Wednesday, April 17, 2019

The metrics for Sequence analysis:

Sequences can be generated in large numbers. Their processing can take arbitrary time. Therefore, there is a need to monitor and report progress on all activities associated with sequences.

Metrics for sequences can be duration based, which includes elapsed time. If there are a million records and clustering them takes more time with one algorithm than with another, elapsed time can help determine the right choice.

Metrics for sequences can also include counts of sequences. If the number of sequences processed stalls, or if they are processed far too quickly to have produced results, that points to some inconsistency. The metrics in this case help to diagnose and troubleshoot.

Metrics can also be scoped to partitions while global ones are maintained separately. Metrics can also have tags or namespaces associated with the same physical resource.

Metrics can support a variety of aggregations such as sum(), average() and so on. These can be executed at different scopes as well as globally. Metrics may be passed as parameters in the form of a time-series array.
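The tag-scoped and global aggregations described above could be sketched as follows (a toy in-memory recorder; any real deployment would use a metrics library, and all names here are made up):

```java
import java.util.*;

// Illustrative metrics sketch: values recorded per tag (e.g. per partition),
// with sum/average at tag scope and a separate global aggregation.
public class SequenceMetrics {
    final Map<String, List<Double>> series = new HashMap<>();

    void record(String tag, double value) {
        series.computeIfAbsent(tag, t -> new ArrayList<>()).add(value);
    }

    double sum(String tag) {
        return series.getOrDefault(tag, List.of()).stream()
                .mapToDouble(Double::doubleValue).sum();
    }

    double average(String tag) {
        return series.getOrDefault(tag, List.of()).stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Global aggregation across all tags/partitions.
    double globalSum() {
        return series.keySet().stream().mapToDouble(this::sum).sum();
    }

    public static void main(String[] args) {
        SequenceMetrics m = new SequenceMetrics();
        m.record("partition-1", 120.0); // elapsed ms for a batch
        m.record("partition-1", 80.0);
        m.record("partition-2", 50.0);
        System.out.println(m.average("partition-1")); // 100.0
        System.out.println(m.globalSum());            // 250.0
    }
}
```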

When sequence rules are discovered, they are listed one after the other. There is no effort to normalize them as they are inserted. The work of canonicalizing the groups can be taken on by background tasks. Together, the online and offline data modifications may run only as part of an intermediate processing stage, where preprocessing and postprocessing steps involve cleaning and prefix generation. Metrics give visibility into these operations.

Tuesday, April 16, 2019

The reporting on Sequence Analysis:
Sequence analysis follows similar channels of reporting as any other data store. The online transactional aspect of reporting is separated from read-only reporting stacks. Unless the reporting requires temporary or permanent storage, it can be almost entirely read-only. Reporting stacks for sequences can produce good charts and graphs by virtue of the stack they use. Most stacks provide conventional representations of trends and patterns, while some go the extra length to showcase cloud-map charts and interactive visualizations including drilldowns.
Sequence visualizations also need to support legends. Raw sequences don't have short aliases or identifiers, so they tend to clutter up a chart or graph. When aliases are introduced and mapped back via legends, the size and number of entries in the legend grow considerably. Therefore, legends need to be stored separately from the chart or graph.
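One way to picture this separation: the chart shows only short generated aliases, while the legend map (stored separately) resolves each alias back to its full sequence. A minimal sketch, with made-up names:

```java
import java.util.*;

// Sketch: assign short aliases to long sequences; the chart displays the
// aliases while this legend map, kept apart from the chart, resolves them.
public class SequenceLegend {
    static Map<String, String> buildLegend(List<String> sequences) {
        Map<String, String> legend = new LinkedHashMap<>();
        int n = 1;
        for (String seq : sequences) {
            legend.put("S" + n++, seq);
        }
        return legend;
    }

    public static void main(String[] args) {
        System.out.println(buildLegend(Arrays.asList("ACGTTTGA", "TTGACCGT")));
        // {S1=ACGTTTGA, S2=TTGACCGT}
    }
}
```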
Sequences are formed from elements. These elements are also of interest as they tend to repeat across sequences. Any visualization targeting sequences becomes more interesting when it can show patterns in elements as well. This overlay of charts is not typical for many other plots. Sequence patterns and trends can even be shown in a custom manner via programming stacks such as JavaScript and jQuery.

While element and sequence patterns are interesting, they may find their way into separate charts. It is not necessary to overlay them on top of each other in all cases, and keeping them side by side allows the user to go back and forth between the two.

Monday, April 15, 2019

The unification of processing over sequences:
While we have elaborated on dedicated storage types for sequences, processing over these storage types has merely been enumerated as sequential, batch, or stream-oriented depending on the data store. In this section, however, we note that processing is for analytics and is not easy to confine to specific object storages. Analytical packages like Apache Flink are tending towards combining processing options while remaining champions of, say, stream processing.
There are a few advantages to this scenario for developers engaging in analysis. First, they have a broad range of capabilities handy from the same package. The code they write is more maintainable, written once, and targets all the capabilities via a single package. Second, the package decouples processing from storage concerns, allowing algorithms to change for the same strategy and the same data set. Third, it is easier for developers to target the data set with the same package if they do not have to concern themselves with scaling to large data sizes. When data sizes increase by orders of magnitude, code to process the sequences changes considerably; but if the onus is taken on by a package rather than by the developer's custom code, the code improves considerably and requires little or no attention in the future. Finally, the packages themselves are tried and tested, with integrations to other packages or storage products that serve tier-2 storage over streams, blobs, files, or blocks. This makes it more appealing to use the same package for a variety of purposes. Open-source packages have demonstrated code reusability far more than build-your-own options, and code written against them is also easier to publish and share.
When groups and sequences run in large numbers, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages, similar to log indexes, in that they can participate in index creation and term-based search while remaining web accessible and offering virtually unlimited storage from the product. The latter aspect allows all groups to be stored without any capacity planning, since capacity can be increased without limitation. Streams, on the other hand, are continuous, and this helps with processing groups in windows or segments. Both have their benefits.
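The window-or-segment processing mentioned above can be sketched with fixed-size windows (streaming engines such as Flink also offer time-based and sliding windows; the class name here is made up):

```java
import java.util.*;

// Sketch of window/segment processing over an incoming run of sequences.
// Fixed-size count windows are assumed for simplicity.
public class WindowedProcessor {
    static List<List<String>> windows(List<String> stream, int windowSize) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < stream.size(); start += windowSize) {
            int end = Math.min(start + windowSize, stream.size());
            result.add(new ArrayList<>(stream.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        // Two full windows and one partial trailing window.
        System.out.println(windows(incoming, 2)); // [[s1, s2], [s3, s4], [s5]]
    }
}
```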



Sunday, April 14, 2019

Given an integer array, generate two adjacent subsequences, if possible, where one is strictly increasing and another is strictly decreasing. 
Pair<List<Integer>, List<Integer>> getSubsequences(List<Integer> input) { 
        // Try every split point: the prefix must be strictly increasing and
        // the suffix strictly decreasing. Pair is assumed to be a simple
        // two-field holder (e.g. javafx.util.Pair).
        for (int i = 0; i <= input.size(); i++) { 
                List<Integer> before = input.subList(0, i); 
                List<Integer> after = input.subList(i, input.size()); 
                if (isIncreasing(before) && isDecreasing(after)) { 
                        return new Pair<List<Integer>, List<Integer>>(before, after); 
                } 
        } 
        return null; 
} 
boolean isIncreasing(List<Integer> input) { 
        // Strictly increasing: every element exceeds its predecessor.
        for (int k = 1; k < input.size(); k++) { 
                if (input.get(k) <= input.get(k - 1)) return false; 
        } 
        return true; 
} 
boolean isDecreasing(List<Integer> input) { 
        // Strictly decreasing: every element is below its predecessor.
        for (int k = 1; k < input.size(); k++) { 
                if (input.get(k) >= input.get(k - 1)) return false; 
        } 
        return true; 
} 
If the subsequences don't have to be adjacent, we can allow one subsequence to lie within the other. In that case, getSubsequences expands to return the discovered subsequence and the remainder. 
for (int i = 0; i < input.size(); i++) { 
        for (int j = i + 1; j <= input.size(); j++) { 
                List<Integer> section = input.subList(i, j); 
                if (isIncreasing(section)) { 
                        // The remainder is everything outside the section.
                        List<Integer> remainder = new ArrayList<>(); 
                        remainder.addAll(input.subList(0, i)); 
                        remainder.addAll(input.subList(j, input.size())); 
                        if (isDecreasing(remainder)) { 
                                return new Pair<List<Integer>, List<Integer>>(section, remainder); 
                        } 
                } 
        } 
} 
return null; 

Saturday, April 13, 2019

The transformation of sequences: 

In the sections following this one, we describe storage and querying for sequences. However, transfer to storage is not always linear online data access. Sequences tend to be processed, pruned, cleaned, and deduplicated before they arrive in storage. Systems that handle this pre-storage processing may choose to supply the data in batches with Extract-Transform-Load kinds of operations. We look at some of these transformations before turning to storage in a particular format.  

The transformations are primarily between the forms of lists and prefix trees. Lists hold independent entries and prefix trees hold organizations based on prefixes. Prefix trees are useful for finding similar sequences. Lists also allow grouping if there are inverted lists between elements and their parent sequences. 
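The inverted list mentioned here can be sketched as follows: given each sequence as a list of elements, invert the mapping so each element points to the sequences that contain it (illustrative names throughout):

```java
import java.util.*;

// Sketch of an inverted list: mapping each element to the set of parent
// sequences that contain it, which lets plain lists support grouping.
public class InvertedList {
    static Map<String, Set<String>> invert(Map<String, List<String>> sequences) {
        // TreeMap/TreeSet keep the output deterministic for display.
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : sequences.entrySet()) {
            for (String element : e.getValue()) {
                index.computeIfAbsent(element, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> seqs = new HashMap<>();
        seqs.put("s1", Arrays.asList("a", "b"));
        seqs.put("s2", Arrays.asList("b", "c"));
        System.out.println(invert(seqs)); // {a=[s1], b=[s1, s2], c=[s2]}
    }
}
```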

Another representation is the variable-length record form, where each sequence is a list of elements and elements repeat across sequences. This representation helps with merging and splitting sequences.  
Index generation is another aspect of sequence parsing. Although indexes are stored separately, they are only meaningful as long as the sequences used to build them exist. The indexes are not strictly a data transformation, but efficient representation of indexes enables significant gains in storage and compute, and is therefore mentioned alongside transformations.  

Sequences may also be represented in various data formats such as XML and JSON. These are primarily helpful in Application Programming Interfaces. For example, a JSON representation enables JMESPath (pronounced "James Path") queries, where elements can be extracted and searches can be specified via the search operator. 

Friday, April 12, 2019

The language of querying Sequences:
The language for querying data remains the same whether the data consists of sequences or entities. A lot of commercial analysis stacks are built on top of the standard query language. Even data stores that are not relational tend to pick up an adapter that facilitates SQL. This is more for the convenience of developers than any other requirement. This query language is supported by standard query operators in most languages. Many developers are able to switch the data source without modifying their queries. This lets them write their applications once and run them anywhere.
The above holds true for regular datasets. It does not hold true for large datasets. In such cases, we have a similar problem as with entities in online transactional versus analytical data stores. The accumulation of data in warehouses is not easy to surmount in a simple query operation, and query execution may take a long time. There are two approaches to overcoming the size. First, define a snowflake schema in the analytical data store so that queries can target facts and dimensions hierarchically. Second, de-normalize the schema, shard it, save it on clusters as NoSQL stores, and run map-reduce algorithms on it. Both of these approaches are perfectly valid for sequences too, and the choice definitely depends on the nature of the queries. That said, since the data consists of sequences, it is fairly uniform to manage and better suited for partitioning. We look at the sequences as a columnar store in such cases and suitably modify the queries as well as their execution. There are a few advantages that come with such queries. First, query execution returns results early; the results don't have to come all at once. Second, the queries can be immensely parallelized. Third, the data entries are independent of each other due to canonicalized, deduplicated, atomic storage. Fourth, the data does not have to represent hierarchy or relations; it can support tags and labels that are added later on. This representation of data is most suitable for transferring to other stores, such as stores dedicated to analysis. Fifth, the extract-transform-load operations are simpler to write and execute.
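The parallelism claim can be illustrated in miniature: because independent, deduplicated sequences partition cleanly, a query can run over partitions in parallel and aggregate partial results. The partitioning scheme and names below are purely illustrative:

```java
import java.util.*;

// Sketch: a query over horizontally partitioned, independent sequences,
// executed per partition in parallel and aggregated at the end.
public class PartitionedQuery {
    // Count sequences matching a prefix predicate across all partitions.
    static long countMatching(List<List<String>> partitions, String prefix) {
        return partitions.parallelStream()
                .mapToLong(p -> p.stream()
                        .filter(s -> s.startsWith(prefix))
                        .count())
                .sum();
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("ACG", "ACT", "TTA"),
                Arrays.asList("ACA", "GGT"));
        System.out.println(countMatching(partitions, "AC")); // 3
    }
}
```

Because each partition is processed independently, partial counts are available as soon as any partition finishes, which mirrors the early-results advantage noted above.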
Query performance is not the only reason to store the data in a columnar manner. The data is likely to be processed with a variety of techniques that may include statistics, data mining, machine learning, natural language processing, and so on. Each of these packages requires certain routines on the data that are best done within its own system. In such cases, the processing does not benefit from keeping the data accessible on a shared volume in a format that makes accesses race with one another. As long as the accesses can be separated and streamlined, the processing system can use the same data format. There is also the nice benefit that a horizontally partitioned, columnar representation of independent sequences is easier to import locally to the compute resource, where the efficiency of the processing improves. There is no limit to the data that can be processed by a single compute resource if it can be processed one set of sequences at a time.