Application role in vectorized execution:
The analytical software landscape has been trending toward newer forms of pipelined execution. The notion of pipelined execution stems from the fact that data can only be analyzed once it is in a form the analytics software can understand. This has inevitably meant getting the data into a system, preferably one with a single, unified, and effectively infinite store. Earlier this meant loading data via Extract-Transform-Load (ETL) with tools long used in the industry. Nowadays, a pipeline is built to capture data, provide streaming access, and provide infinite storage. Companies are showing increased use of event-based processing and pipelined execution.
Pipelined execution involves the following stages:
1) acquisition
2) extraction
3) integration
4) analysis
5) interpretation
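To make the stages concrete, here is a minimal sketch that treats them as composable generator functions over a stream of records. All names and the toy aggregation are illustrative assumptions, not any particular product's API:

    import json
    from typing import Any, Dict, Iterable, Iterator

    Record = Dict[str, Any]

    def acquire(source: Iterable[bytes]) -> Iterator[bytes]:
        # 1) acquisition: pull raw payloads from a source (file, socket, queue).
        yield from source

    def extract(raw: Iterator[bytes]) -> Iterator[Record]:
        # 2) extraction: parse raw bytes into structured records.
        for payload in raw:
            yield json.loads(payload)

    def integrate(records: Iterator[Record]) -> Iterator[Record]:
        # 3) integration: normalize records from different sources onto one schema.
        for r in records:
            r.setdefault("source", "unknown")
            yield r

    def analyze(records: Iterator[Record]) -> Dict[str, int]:
        # 4) analysis: a toy aggregation counting records per source.
        counts: Dict[str, int] = {}
        for r in records:
            counts[r["source"]] = counts.get(r["source"], 0) + 1
        return counts

    def interpret(counts: Dict[str, int]) -> None:
        # 5) interpretation: present the result to the user.
        for source, n in sorted(counts.items()):
            print(source, "->", n, "records")

    raw = [b'{"source": "web", "v": 1}', b'{"v": 2}', b'{"source": "web", "v": 3}']
    interpret(analyze(integrate(extract(acquire(raw)))))

Because each stage is a generator, records stream through all five stages one at a time instead of being materialized between steps, which is the essence of pipelined execution.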
The challenges facing pipelined execution include:
1) Scale: A pipeline must hold up against a data tsunami. In addition, data flow may fluctuate, and the pipeline must hold against the ebb and the flow. Data may be measured in rate, duration, and size, and the pipeline may need to become elastic. Column stores and time-series databases have typically helped with this (a minimal backpressure sketch appears after this list).
2) Heterogeneity: Data types, their formats, and their semantics have evolved to a wide degree. The pipeline must at least support primitives such as variants, objects, and arrays. Support for different data stores has become more important as services have become microservices, independent of each other in their implementations, including data storage. Data and their services carry a large legacy, which also needs to be accommodated in the pipeline; ETL operations and structured storage are still the norm for many industries (see the normalization sketch after this list).
3) Extension: ETL may require flexibility in extending logic and in running on clusters versus single servers. Microservices are much easier to write, and they became popular alongside Big Data storage. Together they have bound compute and storage into their own verticals, with data stores expanding in number and variety. Queries written against one service now need to be rewritten for another, while the pipeline may or may not support data virtualization (a plugin-style transform registry is sketched after this list).
4) Timeliness: Both synchronous and asynchronous processing need to be facilitated so that some data transfers can run online while others are relegated to the background. Publisher-subscriber message queues may be used in this regard; standalone services and brokers do not scale the way cluster-based message queues do. It might take nearly a year to get the data into the analytics system and only a month for the analysis itself. While the benefit to the user may be immense, their patience for the overall elapsed time may be thin (an asynchronous pub-sub sketch appears after this list).
5) Privacy: A user's location, personally identifiable information, and data gathered by location-based services are required to be redacted. This involves not only parsing for such data but also doing it over and over, starting from admission control at the boundaries of integration (a redaction sketch closes the examples below).
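On scale (item 1): the sketch below shows one way a pipeline can stay elastic under a bursty producer, using a bounded in-memory queue whose put() blocks when full, so backpressure slows the producer to the consumer's drain rate. The sizes and rates are made-up illustration values:

    import queue
    import threading
    import time

    # Bounded buffer: put() blocks when full, so a bursty producer is slowed
    # to the consumer's steady drain rate instead of exhausting memory.
    buffer = queue.Queue(maxsize=100)

    def producer():
        for i in range(1000):          # a burst far larger than the buffer
            buffer.put(i)              # blocks whenever the consumer falls behind
        buffer.put(None)               # sentinel: end of stream

    def consumer():
        while True:
            item = buffer.get()
            if item is None:
                break
            time.sleep(0.001)          # simulate steady-rate processing

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()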
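On heterogeneity (item 2): a minimal sketch of coercing variants, objects, and arrays onto one record shape at ingestion. The target shape, wrapping scalars under a "value" key, is an assumption chosen for the example:

    import json
    from typing import Any, Dict, List

    def normalize(value: Any) -> List[Dict[str, Any]]:
        # Coerce a variant payload (scalar, object, or array) into a list of objects.
        if isinstance(value, list):          # array: flatten element by element
            out: List[Dict[str, Any]] = []
            for element in value:
                out.extend(normalize(element))
            return out
        if isinstance(value, dict):          # object: already structured
            return [value]
        return [{"value": value}]            # scalar variant: wrap it

    # Three differently shaped inputs converge on one record shape.
    for raw in ('42', '{"id": 1}', '[{"id": 2}, 7]'):
        print(normalize(json.loads(raw)))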
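On extension (item 3): one common way to keep transform logic extensible is a registry that new transforms plug into without touching the pipeline core; the same lookup then runs unchanged on a single server or on cluster workers. This is a generic pattern sketch, not any specific ETL tool's API:

    from typing import Callable, Dict, List

    Record = Dict[str, object]
    Transform = Callable[[Record], Record]

    TRANSFORMS: Dict[str, Transform] = {}

    def transform(name: str) -> Callable[[Transform], Transform]:
        # Decorator: register a named transform without touching the pipeline core.
        def register(fn: Transform) -> Transform:
            TRANSFORMS[name] = fn
            return fn
        return register

    @transform("lowercase_keys")
    def lowercase_keys(record: Record) -> Record:
        return {k.lower(): v for k, v in record.items()}

    @transform("drop_nulls")
    def drop_nulls(record: Record) -> Record:
        return {k: v for k, v in record.items() if v is not None}

    def run(record: Record, steps: List[str]) -> Record:
        # The same lookup works on a single server or on each cluster worker.
        for step in steps:
            record = TRANSFORMS[step](record)
        return record

    print(run({"Name": "Ada", "Team": None}, ["lowercase_keys", "drop_nulls"]))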
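On timeliness (item 4): a small publisher-subscriber sketch using Python's asyncio, with an online lane processed inline and a background lane deferred to tasks. A production pipeline would put a real cluster-based message queue between the two sides; the in-process queue and the lane names here are assumptions for the demo:

    import asyncio

    async def handle_background(i: int) -> None:
        await asyncio.sleep(0.1)                      # deferred, off the critical path
        print("background: finished item", i)

    async def publisher(q: asyncio.Queue) -> None:
        for i in range(4):
            lane = "online" if i % 2 == 0 else "background"
            await q.put((lane, i))
        await q.put(None)                             # sentinel: stream finished

    async def subscriber(q: asyncio.Queue) -> None:
        pending = []
        while (msg := await q.get()) is not None:
            lane, i = msg
            if lane == "online":
                print("online: processed item", i, "inline")   # user-facing path
            else:
                pending.append(asyncio.create_task(handle_background(i)))
        await asyncio.gather(*pending)                # drain background work

    async def main() -> None:
        q: asyncio.Queue = asyncio.Queue()
        await asyncio.gather(publisher(q), subscriber(q))

    asyncio.run(main())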
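On privacy (item 5): a redaction sketch of the kind that would run at admission control and again at each integration boundary. The field names and the single email regex are placeholders; real PII detection requires curated and evolving rule sets:

    import re
    from typing import Any, Dict

    # Placeholder rules; real PII detection needs curated, evolving rule sets.
    PII_FIELDS = {"ssn", "email", "latitude", "longitude"}
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def redact(record: Dict[str, Any]) -> Dict[str, Any]:
        # Run at admission control, then again at every integration boundary.
        clean: Dict[str, Any] = {}
        for key, value in record.items():
            if key.lower() in PII_FIELDS:
                clean[key] = "[REDACTED]"             # structured PII, caught by field name
            elif isinstance(value, str):
                clean[key] = EMAIL_RE.sub("[REDACTED]", value)  # PII inside free text
            else:
                clean[key] = value
        return clean

    print(redact({"email": "a@b.com", "note": "contact a@b.com", "count": 3}))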