Cluster computing

Sequences and Groups:
Sequence databases support specialized processing of sequences with the help of data structures that are usually not found in traditional storage systems. Sequences tend to number in the millions, if not more. In this section, we focus on the storage concerns for sequences.
Sequences and groups are very similar. Groups are unordered collections of elements, whereas sequences have an order. We use the two terms interchangeably unless we specifically call out the absence or presence of ordering. Groups are efficiently represented in terms of elements and their tags. A table of elements can carry a group identifier, and groups then become easy to form by finding the elements with the same group id. This works well for a small number of entries. It does not work as well for a large number of entries.
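The group-identifier approach described above can be sketched in a few lines. This is a minimal illustration, not a storage design: the `rows` table and the `form_groups` helper are hypothetical names introduced here, and the table is assumed to fit in memory.

```python
from collections import defaultdict

# Hypothetical element table: each row is (element, group_id).
rows = [("a", 1), ("b", 2), ("c", 1), ("d", 3), ("e", 2)]

def form_groups(rows):
    """Collect the elements that share a group id into one unordered group."""
    groups = defaultdict(set)
    for element, group_id in rows:
        groups[group_id].add(element)
    return dict(groups)

groups = form_groups(rows)
```

Every row is visited once, which is fine for a small table. With millions of entries the same lookup would need an index or a different layout.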
If the groups were of limited size, we could store the elements along the columns and use a boolean to represent membership in a group. However, groups do not always have the same elements, so these groups become variable-length records.
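To make the column layout concrete, here is a small sketch that assumes a fixed, known universe of elements. `UNIVERSE`, `to_membership_row`, and `from_membership_row` are hypothetical names used only for illustration.

```python
# Assumed fixed universe of elements; each one becomes a column.
UNIVERSE = ["a", "b", "c", "d"]

def to_membership_row(group):
    """Encode a group as one boolean per column of the universe."""
    return [element in group for element in UNIVERSE]

def from_membership_row(row):
    """Decode a boolean row back into the group it represents."""
    return {element for element, present in zip(UNIVERSE, row) if present}

row = to_membership_row({"a", "c"})
```

The row width is tied to the universe, so an element outside `UNIVERSE` cannot be represented. That is where the fixed-column layout stops working and variable-length records come in.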
Instead, each group can be considered the string representation of a pseudo entity and stored the same way as entities. In addition, for sequences we may maintain a prefix tree to find similar groups based on shared prefixes. Bloom filters may help test whether a group belongs to a set of groups.
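The following sketch shows one way the three ideas above could fit together: a canonical string key for a group, a prefix tree over sequences, and a Bloom filter over group keys. All class and function names are hypothetical, the key format (sorted elements joined by commas) is an assumption, and the filter parameters are arbitrary rather than tuned.

```python
import hashlib

def group_key(group):
    """Assumed canonical string for an unordered group: sorted, comma-joined."""
    return ",".join(sorted(group))

class PrefixTree:
    """Each node records the ids of the sequences that pass through it."""
    def __init__(self):
        self.children = {}
        self.sequence_ids = []

    def insert(self, sequence, sequence_id):
        node = self
        node.sequence_ids.append(sequence_id)
        for item in sequence:
            node = node.children.setdefault(item, PrefixTree())
            node.sequence_ids.append(sequence_id)

    def with_prefix(self, prefix):
        """Return the ids of all inserted sequences that start with prefix."""
        node = self
        for item in prefix:
            node = node.children.get(item)
            if node is None:
                return []
        return list(node.sequence_ids)

class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))

tree = PrefixTree()
tree.insert(["a", "b", "c"], 1)
tree.insert(["a", "b", "d"], 2)
tree.insert(["a", "x"], 3)

seen = BloomFilter()
seen.add(group_key({"b", "a"}))
```

A positive answer from `might_contain` only means the group may be present and still has to be confirmed against the stored records. A negative answer is definitive.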
Therefore, the choice of data structure assists with the processing. When groups and sequences run into large numbers, they can be collected in batches. When these batches are stored, they can be blobs or files. Blobs have several advantages, similar to log indexes: they can participate in index creation and term-based search while remaining web accessible, and they come with practically unlimited storage from the product. This last aspect allows all groups to be stored without any capacity planning, since capacity can be increased as needed. Streams, on the other hand, are continuous, which helps with processing groups in windows or segments. Both have their benefits, and object storage has the better implementation of storage best practices.
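As a rough sketch of the batching step, the code below cuts a stream of groups into fixed-size batches and serializes each batch into a byte payload that could be written as a blob or a file. The helper names and the JSON-lines encoding are assumptions made for illustration, and no particular object store API is implied.

```python
import json
from itertools import islice

def batches(groups, batch_size):
    """Yield consecutive lists of at most batch_size groups from any iterable."""
    iterator = iter(groups)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def to_blob(batch):
    """Serialize a batch as JSON lines, one group per line (assumed format)."""
    return "\n".join(json.dumps(sorted(group)) for group in batch).encode("utf-8")

def from_blob(blob):
    """Read the groups back out of a payload produced by to_blob."""
    return [set(json.loads(line)) for line in blob.decode("utf-8").splitlines()]

stream = [{"a", "b"}, {"c"}, {"d", "e"}]
blobs = [to_blob(batch) for batch in batches(stream, 2)]
```

Because `batches` consumes an iterator lazily, the same function can serve as a simple non-overlapping window over a continuous stream.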

Tuesday, April 9, 2019
