Cluster computing

Tuesday, December 18, 2018

Today we continue discussing the best practice from storage engineering:

191) A pipeline must hold against a data tsunami. In addition, data flow may fluctuate and the pipeline must hold against the ebb and the flow. Data may be measured in rate, duration and size and the pipeline may need to become elastic. If the storage tier cannot accommodate a single pipeline from being elastic, it must mention its Service Level Agreement clearly

192) Data and their services have a large legacy in the form of existing products, process and practice. A storage tier cannot disrupt any of these and therefore must provide flexibility to handle such diverse data transfers as Extract-Transform-Load and analytical pipeline.

193) Extension: ETL may require flexibility in extending logic and in their usages on clusters versus servers. Microservices are much easier to be written. They became popular with Big Data storage. Together they have bound compute and storage into their own verticals with data stores expanding in number and variety. Queries written in one service now need to be written in another service while the pipeline may or may not support data virtualization. Depending on the nature of the pipeline, the storage tier may also change.

194) Both synchronous and asynchronous processing need to be facilitated so that some data transfers can be run online while others may be relegated to the background. Publisher-subscriber message queues may be used in this regard. Services and brokers do not scale as opposed to cluster- based message queues. It might take nearly a year to fetch the data into the analytics system and only a month for the analysis. While the benefit for the user may be immense, their patience for the overall time elapsed may be thin. Consequently, a storage tier can at best not require frequent and repeated data transfers from the user. Object Storage, for instance, handles multi-zone application automatically

195) User’s location, personally identifiable information and location-based services are required to be redacted. This involves not only parsing for such data but also doing it over and over starting from the admission control on the boundaries of integration. If the storage tier stores any of these information in the clear during or after the data transfer from the pipeline, it will not only be a security violation but also fail compliance.

Cluster computing

Tuesday, December 18, 2018

No comments:

Post a Comment