Separating the data records to be processed from the optimization of the operations on those records makes it possible to reduce the overall processing time by parallelizing the queries across separate workers. Because the record partitions are not shared and only the results are, each worker merely has to report its partial result to a central accumulator, which cuts the sequential time roughly by a factor of the number of workers.
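As a rough sketch of this scatter-gather arrangement, the following Python snippet partitions the records, runs the same query on each partition in a worker process, and merges only the per-partition counts in a central accumulator; the count_matches predicate and the partitioning scheme are illustrative assumptions, not a prescribed implementation.

from multiprocessing import Pool

def count_matches(partition):
    # Worker-side query: count records that satisfy a predicate.
    return sum(1 for record in partition if record.get("status") == "error")

def run_parallel(partitions, workers=4):
    with Pool(processes=workers) as pool:
        # Each worker processes its own partition; raw records are not shared.
        partial_results = pool.map(count_matches, partitions)
    # The central accumulator merges only the reported results.
    return sum(partial_results)

if __name__ == "__main__":
    records = [{"status": "ok"}, {"status": "error"}] * 1000
    partitions = [records[i::4] for i in range(4)]
    print(run_parallel(partitions))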
Query languages, tools, and products come with nuances, built-ins, and features that further help to analyze, optimize, and rewrite queries so that they perform better. Some form of query execution statistics is also made available by the store itself or by profiling tools. Efficiency can also be improved by breaking up a query's structure and introducing pipelining, so that the results passed from one stage to the next can be studied.
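For example, SQLite exposes its plan through EXPLAIN QUERY PLAN, which can be read from Python's standard sqlite3 module; other stores offer analogous facilities such as EXPLAIN or ANALYZE through their own tooling. The table and index below are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.execute("CREATE INDEX idx_kind ON events (kind)")

# The reported plan shows whether the (rewritten) query can use the index.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT count(*) FROM events WHERE kind = 'error'"):
    print(row)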
Pipelined execution involves the following stages:
1) acquisition
2) extraction
3) integration
4) analysis
5) interpretation
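A minimal sketch of these five stages chained together might look like the following; the stage bodies are placeholders, and a real pipeline would stream records between stages rather than materialize every intermediate result.

import json

def acquisition():
    # Stage 1: pull raw lines from a source (here, hard-coded samples).
    return ['{"user": "a", "value": "1"}', '{"user": "b", "value": "2"}']

def extraction(raw_lines):
    # Stage 2: parse the raw payloads into records.
    return [json.loads(line) for line in raw_lines]

def integration(records):
    # Stage 3: reconcile records with a common schema or reference data.
    return [{**r, "source": "web"} for r in records]

def analysis(records):
    # Stage 4: compute the quantities of interest.
    return {"total": sum(int(r["value"]) for r in records)}

def interpretation(summary):
    # Stage 5: turn the analysis into a human-readable conclusion.
    return f"observed total = {summary['total']}"

print(interpretation(analysis(integration(extraction(acquisition())))))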
The challenges facing pipelined execution involve:
1) Scale: A pipeline must hold up against a data tsunami. In
addition, data flow may fluctuate, and the pipeline must withstand the ebb
and flow. Data may be measured in rate, duration, and size, and the pipeline
may need to become elastic. Column stores and time series databases have
typically helped with this.
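One way a stage can absorb that ebb and flow is sketched below: a bounded queue applies backpressure to the producer while a pool of consumers drains it and writes onward (to, say, a column store or time series database). The queue size and worker count are illustrative assumptions.

import queue
import threading
import time

buffer = queue.Queue(maxsize=1000)   # bounded: the producer blocks when full

def producer(n):
    for i in range(n):
        buffer.put({"ts": time.time(), "value": i})   # blocks under pressure

def consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        # ... write the item onward, e.g. to a column store ...
        buffer.task_done()

workers = [threading.Thread(target=consumer) for _ in range(4)]
for w in workers:
    w.start()
producer(10_000)
buffer.join()                 # wait until every item has been handled
for _ in workers:
    buffer.put(None)          # sentinel to stop each consumer
for w in workers:
    w.join()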
2) Heterogeneity: Data types, in both format and semantics,
have evolved widely. The pipeline must at least support primitives
such as variants, objects, and arrays. Support for different data stores has
become more important as services have become microservices that are independent of
each other in their implementations, including data storage. Data and their
services also carry a large legacy that needs to be accommodated in the
pipeline. ETL operations and structured storage are still the norm in many
industries.
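A small sketch of the variant problem: the same logical field may arrive as a scalar, an array, or an object depending on the producing service, and the pipeline normalizes all three to one shape. The field and values are illustrative.

def normalize_tags(value):
    # Accept scalar, array, and object variants of a "tags" field.
    if value is None:
        return []
    if isinstance(value, list):                      # array variant
        return [str(v) for v in value]
    if isinstance(value, dict):                      # object variant
        return [f"{k}={v}" for k, v in value.items()]
    return [str(value)]                              # scalar variant

print(normalize_tags("prod"))
print(normalize_tags(["prod", "eu-west"]))
print(normalize_tags({"env": "prod", "region": "eu-west"}))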
3) Extension: ETL may require flexibility both in extending its logic
and in how it is deployed on clusters versus servers. Microservices are much easier
to write, and they became popular alongside Big Data storage. Together they have
bound compute and storage into their own verticals, with data stores expanding
in number and variety. Queries written
in one service now need to be rewritten in another, while the pipeline may
or may not support data virtualization.
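A thin virtualization layer can soften this: one logical filter is translated per backing store so the same query does not have to be hand-written in every service. The translators and store names below are illustrative rather than any particular product's API, and a real translator would bind parameters instead of interpolating strings.

def to_sql(field, value):
    # Illustrative only; a real implementation would use parameter binding.
    return f"SELECT * FROM events WHERE {field} = '{value}'"

def to_document_filter(field, value):
    return {"collection": "events", "filter": {field: value}}

TRANSLATORS = {"warehouse": to_sql, "document_store": to_document_filter}

def logical_query(store, field, value):
    # Route one logical query to the store-specific dialect.
    return TRANSLATORS[store](field, value)

print(logical_query("warehouse", "kind", "error"))
print(logical_query("document_store", "kind", "error"))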
4) Timeliness: Both synchronous and asynchronous processing
need to be facilitated so that some data transfers can run online while
others are relegated to the background. Publisher-subscriber message queues
may be used for this. Individual services and
brokers do not scale the way cluster-based message queues do. It might take
nearly a year to bring the data into the analytics system and only a month to run
the analysis; while the benefit to the user may be immense, their patience for
the overall elapsed time may be thin.
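A sketch of mixing the online and background paths with a publisher-subscriber queue, using Python's asyncio: the urgent item is handled inline with the request, while bulk items are published to a queue that a subscriber task drains in the background. The split between urgent and bulk work here is an illustrative assumption.

import asyncio

async def subscriber(q):
    while True:
        item = await q.get()
        if item is None:
            break
        await asyncio.sleep(0.01)     # simulate a background transfer
        q.task_done()

async def main():
    q = asyncio.Queue()
    background = asyncio.create_task(subscriber(q))
    # Online path: handle the urgent item synchronously with the request.
    urgent_result = {"id": 1, "status": "processed"}
    # Background path: publish bulk items for the subscriber to drain.
    for i in range(100):
        q.put_nowait({"id": i})
    await q.join()                    # wait for the background drain
    q.put_nowait(None)                # stop the subscriber
    await background
    return urgent_result

print(asyncio.run(main()))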
5) Privacy: A user's location, personally identifiable
information, and data from location-based services must be redacted. This
involves not only parsing for such data but also doing so repeatedly,
starting with admission control at the integration boundaries.
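A minimal sketch of redaction at admission control: each record is scanned for personally identifiable fields before it crosses the integration boundary, and the same filter can be re-applied at later stages. The field list and the email pattern are illustrative assumptions.

import re

PII_FIELDS = {"email", "phone", "location"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(record):
    # Drop known PII fields and scrub embedded email addresses from text.
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean

print(redact({"user": "a", "email": "a@example.com",
              "note": "contact a@example.com", "value": 3}))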