Cluster computing

Saturday, April 13, 2019

The transformation of sequences:

In the sections following this one, we describe storage and querying for sequences. However, data does not always reach storage through linear, online access. Sequences tend to be processed, pruned, cleaned and deduplicated before they arrive in storage. The systems that handle this pre-storage processing may choose to supply the data in batches, using Extract-Transform-Load (ETL) style operations. We look at some of these transformations first, before turning to storage in a particular format.
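As a minimal sketch of such a pre-storage pass, the following cleans, deduplicates and batches sequences before a load step. The function names (`clean`, `transform`, `batches`), the sample data and the batch size are assumptions made for illustration, not part of any particular system.

```python
from typing import Iterable, Iterator, List, Tuple

Sequence = Tuple[str, ...]


def clean(raw: Iterable[str]) -> Sequence:
    """Strip whitespace and drop empty elements (the prune/clean step)."""
    return tuple(e.strip() for e in raw if e and e.strip())


def transform(raw_sequences: Iterable[Iterable[str]]) -> Iterator[Sequence]:
    """Yield cleaned sequences, skipping empties and exact duplicates."""
    seen = set()
    for raw in raw_sequences:
        seq = clean(raw)
        if not seq or seq in seen:
            continue
        seen.add(seq)
        yield seq


def batches(items: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Group items into lists of at most `size` for a batched load."""
    batch: List[Sequence] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical input: one duplicate (after cleaning) and one empty sequence.
raw = [["a", " b ", "c"], ["a", "b", "c"], ["", " "], ["a", "b"], ["x", "y"]]
loaded = list(batches(transform(raw), size=2))
```

Here the load step is only simulated by collecting the batches; a real pipeline would hand each batch to the storage layer.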

The transformations are primarily between two forms: lists and prefix trees. Lists hold independent entries, while prefix trees organize sequences by their shared prefixes. Prefix trees are useful for finding similar sequences. Lists also allow grouping when inverted lists are maintained between elements and their parent sequences.
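The two forms can be sketched side by side. This is an illustrative sketch only: the `TrieNode` class, the helper names and the choice of integer sequence ids are assumptions, and "similar" is taken here to mean "sharing a prefix".

```python
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}       # element -> TrieNode
        self.sequence_ids = []   # ids of sequences that end at this node


def build_trie(sequences):
    """Convert a list of sequences into a prefix tree."""
    root = TrieNode()
    for seq_id, seq in enumerate(sequences):
        node = root
        for element in seq:
            node = node.children.setdefault(element, TrieNode())
        node.sequence_ids.append(seq_id)
    return root


def with_prefix(root, prefix):
    """Return the ids of all sequences that start with `prefix`."""
    node = root
    for element in prefix:
        node = node.children.get(element)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.sequence_ids)
        stack.extend(current.children.values())
    return sorted(found)


def build_inverted_list(sequences):
    """Map each element to the ids of its parent sequences."""
    inverted = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for element in seq:
            inverted[element].add(seq_id)
    return inverted


sequences = [("a", "b", "c"), ("a", "b", "d"), ("x", "b")]
trie = build_trie(sequences)
inverted = build_inverted_list(sequences)
```

With this data, `with_prefix(trie, ("a", "b"))` finds the two sequences that share that prefix, while `inverted["b"]` groups all three sequences that contain the element `b`.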

Another representation is the variable-length record form, where each sequence is a list of elements and the same elements may repeat across sequences. This representation helps with merging and splitting sequences.
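One possible reading of this form is sketched below: repeated elements are stored once in a shared table, and each record is a variable-length list of element ids, so merging is concatenation and splitting is slicing. The `SequenceStore` class and its methods are hypothetical names chosen for this example.

```python
class SequenceStore:
    """Variable-length records over a shared element table (illustrative)."""

    def __init__(self):
        self.element_ids = {}  # element -> integer id
        self.elements = []     # integer id -> element
        self.records = []      # each record is a list of element ids

    def add(self, seq):
        """Store a sequence and return its record id."""
        record = []
        for element in seq:
            if element not in self.element_ids:
                self.element_ids[element] = len(self.elements)
                self.elements.append(element)
            record.append(self.element_ids[element])
        self.records.append(record)
        return len(self.records) - 1

    def get(self, record_id):
        """Expand a record back into its elements."""
        return [self.elements[i] for i in self.records[record_id]]

    def merge(self, first, second):
        """Append a new record that concatenates two existing records."""
        self.records.append(self.records[first] + self.records[second])
        return len(self.records) - 1

    def split(self, record_id, at):
        """Append two new records holding the parts before and after `at`."""
        record = self.records[record_id]
        self.records.append(record[:at])
        self.records.append(record[at:])
        return len(self.records) - 2, len(self.records) - 1


store = SequenceStore()
first = store.add(["a", "b", "c"])
second = store.add(["b", "c", "d"])
merged = store.merge(first, second)
left, right = store.split(merged, 2)
```

Neither merge nor split touches the element table, which is why this layout makes both operations cheap.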

Index generation is another aspect of sequence parsing. Although indexes are stored separately, they are meaningful only as long as the sequences used to build them still exist. Indexes are not strictly a data transformation, but an efficient index representation yields significant savings in storage and compute, so it is mentioned alongside the transformations.
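To make the point about efficient index representation concrete, here is a small sketch that builds an element-to-sequence index and then delta-encodes each posting list. Delta encoding is one common compaction technique, chosen here as an assumption; the text above does not name a specific scheme, and the function names are illustrative.

```python
def build_index(sequences):
    """Map each element to the sorted ids of the sequences containing it."""
    index = {}
    for seq_id, seq in enumerate(sequences):
        for element in set(seq):
            index.setdefault(element, []).append(seq_id)
    return index  # ids are appended in ascending order, so lists stay sorted


def delta_encode(postings):
    """Store gaps between consecutive ids instead of the ids themselves."""
    previous, deltas = 0, []
    for value in postings:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Recover the original ids by accumulating the gaps."""
    total, postings = 0, []
    for delta in deltas:
        total += delta
        postings.append(total)
    return postings


sequences = [("a", "b"), ("b", "c"), ("a", "c"), ("b",)]
index = build_index(sequences)
compact = {element: delta_encode(ids) for element, ids in index.items()}
```

The gaps are small numbers that compress well, but the decoded ids still only make sense while the sequences they point to are retained.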

Sequences may also be represented in data formats such as XML and JSON. These are primarily helpful in Application Programming Interfaces. For example, a JSON representation enables JMESPath (pronounced "James Path") queries, where an expression extracts and filters elements and is evaluated through the library's search function.
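A brief sketch of the JSON form, using only the Python standard library. The document shape (`sequences`, `id`, `elements`) is an assumption for illustration. The list comprehension stands in for a JMESPath query; with the third-party `jmespath` package, a roughly equivalent call would be `jmespath.search("sequences[?contains(elements, 'b')].id", document)`.

```python
import json

# Hypothetical API payload: each sequence carries an id and its elements.
sequences = [
    {"id": 1, "elements": ["a", "b", "c"]},
    {"id": 2, "elements": ["x", "b"]},
    {"id": 3, "elements": ["x", "y"]},
]

# Serialize for transport, then parse as an API client would.
payload = json.dumps({"sequences": sequences})
document = json.loads(payload)

# Plain-Python stand-in for a query: ids of sequences containing "b".
ids_with_b = [
    entry["id"]
    for entry in document["sequences"]
    if "b" in entry["elements"]
]
```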
