Tuesday, April 23, 2019

The fixed length and varying length sequences

When the number of elements in a sequence does not change, it is a fixed length sequence.  The elements of the sequence are similar to N-gram in natural language processing They come with the advantage that the storage can look up elements of the sequence based on position. This is helpful when combined with storing a sequence in the sorted order of elements .

While sequences are stored in a columnar manner with the entire sequence in a single column the fixed number of elements makes it easy to shred the sequence into a table This helps in improving the operations taken on the table. Previously, each sequence representing a string could only participate in query operations when they were indexed. The  indexes on the sequences helped in fast lookup on the same table. If the table is joined to itself, it helps matching sequences to each other. With elements shredded into a table the sequences become far more efficient for querying
Sequences may also have group identifiers associated with them. With this organization, a group can help in finding a sequence and a sequence can help in finding an element. With the help of relations, we can perform standard query operations in a builder pattern.

This is still not amenable to querying as standard tables with known attributes for sequences. Only when the sequences is transformed into sparse table where all the possible elements appear as columns, it becomes easier to perform queries in the traditional well understood manner. Their representation now is similar to word vectors with limited number of dimensions

Another representation is in variable length record form where each sequence is a list of elements and the elements repeat across sequences. This representation helps sequences merging and splitting.



The techniques for reducing noise from sequences:

No comments:

Post a Comment