Friday, April 12, 2019

The language of querying Sequences:
The language for querying data remains the same whether the data consists of sequences or entities. A lot of commercial analysis stacks are built on top of the standard query language, and even datastores that are not relational tend to pick up an adapter that supports SQL. This is more for the convenience of developers than for any other requirement. The query language is backed by standard query operators in most programming languages, so many developers can switch the data source without modifying their queries. This lets them write their applications once and run them anywhere.
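As a minimal sketch of this portability, the snippet below issues the same SQL through a Python DB-API connection; swapping the sqlite3 adapter for any other compliant driver leaves the query untouched. The sequences table and its columns are hypothetical.

import sqlite3

# The same SQL runs unchanged against any DB-API 2.0 compliant store;
# switching to another adapter only changes how the connection is created.
QUERY = "SELECT label, COUNT(*) FROM sequences GROUP BY label"

def run_query(connection, query=QUERY):
    cursor = connection.cursor()
    cursor.execute(query)
    return cursor.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequences (id INTEGER, label TEXT)")
    conn.executemany("INSERT INTO sequences VALUES (?, ?)",
                     [(1, "a"), (2, "a"), (3, "b")])
    print(run_query(conn))  # e.g. [('a', 2), ('b', 1)]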
The above holds true for regular datasets. It does not hold true for large datasets. In such cases we face a similar problem as with entities in online transactional versus analytical data stores: the data accumulated in warehouses cannot be covered by a simple query operation, and query execution may take a long time. There are two approaches to overcoming the size. First, define a snowflake schema in the analytical data store so that queries can target facts and dimensions hierarchically. Second, de-normalize the schema, shard it, save it on clusters as NoSQL stores, and run map-reduce algorithms over it. Both approaches are perfectly valid for sequences too, and the choice depends on the nature of the queries. That said, since the data consists of sequences, it is fairly uniform to manage and better suited for partitioning. In such cases we treat the sequences as a columnar store and modify the queries as well as their execution accordingly.
There are a few advantages that come with such queries. First, query execution can return results early; the results do not have to arrive all at once. Second, the queries can be heavily parallelized. Third, the sequences are independent of one another thanks to their canonicalized, deduplicated, atomic storage. Fourth, the data does not have to represent hierarchy or relations; it can support tags and labels that are added later, and this representation is well suited for transfer to other stores, such as those dedicated to analysis. Fifth, the extract-transform-load operations are simpler to write and execute. The sketch after this paragraph illustrates the first two of these advantages.
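The following sketch runs the same aggregation over horizontally partitioned sequence data in parallel, with per-partition results arriving as soon as each shard finishes. The partition layout (one CSV file per shard with sequence_id and label columns) is an assumption made for illustration.

from multiprocessing import Pool
import csv, glob

def count_labels(partition_path):
    # Aggregate one shard independently of all the others.
    counts = {}
    with open(partition_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["label"]] = counts.get(row["label"], 0) + 1
    return counts

def query(partition_glob="sequences/part-*.csv"):
    totals = {}
    with Pool() as pool:
        # imap_unordered yields each shard's result as soon as it is ready,
        # so partial results are available before the whole scan completes.
        for partial in pool.imap_unordered(count_labels, glob.glob(partition_glob)):
            for label, n in partial.items():
                totals[label] = totals.get(label, 0) + n
    return totals

if __name__ == "__main__":
    print(query())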
Query performance is not the only reason to store the data in a columnar manner. The data is likely to be processed with a variety of techniques, including statistics, data mining, machine learning, natural language processing, and so on. Each of these packages requires certain routines on the data that are best performed within its own system. In such cases, the processing does not benefit from keeping the data on a shared volume in a format where accesses race with one another. As long as the accesses can be separated and streamlined, the processing systems can use the same data format. There is also the benefit that a horizontally partitioned, columnar representation of independent sequences is easier to import locally onto the compute resource, which improves processing efficiency. There is no limit to the data that a single compute resource can process if it is processed one set of sequences at a time, as sketched below.
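A minimal sketch of that last point: streaming the partitioned sequences through a single machine one shard at a time keeps memory use bounded no matter how large the total dataset is. The file layout, column names, and the per-sequence routine are hypothetical stand-ins for whatever analysis package is applied.

import csv, glob

def iter_sequences(partition_glob="sequences/part-*.csv"):
    # Yield one sequence record at a time, one shard at a time,
    # so only a single partition is ever open on the compute resource.
    for path in sorted(glob.glob(partition_glob)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row

def process(record):
    # Placeholder for a statistical / ML / NLP routine applied per sequence.
    return len(record.get("tokens", ""))

if __name__ == "__main__":
    total = sum(process(r) for r in iter_sequences())
    print(total)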
