Sunday, October 20, 2019

Storage products are a natural data destination for workflows that originate outside the product. Their own operations also generate logs, and these can be stored in the same product. Doing so gives the logs portability, availability, and participation in the data plane, with the guarantees that are typical for data. It also greatly improves introspection of the product, because logs, metrics, and events can be saved directly in the store itself, and queries can be written against this introspection data. In particular, Flink can be used with any tier 2 storage, such as Aliyun Object Storage Service or stream storage. These stores can be treated as append-only data storage, with skip-level access introduced for efficient querying.

Taking the use case of storing cluster logs within the storage product, suppose we write a Flink application to do the following: continuously monitor the number of delays in accessing the tier 2 storage and the distribution of those delays; analyze data storage and data access patterns by segment; leverage a cache where it helps to avoid reading data segments directly from the store; and, finally, add additional syntax to segments to improve query performance.
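The monitoring step above can be sketched in plain Python before it is ported to a Flink job. This is only an illustration of the aggregation logic, not Flink API code: the log line format, the threshold, the bucket boundaries, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical log line format: "<epoch_ms> tier2_access latency_ms=<n>"
LINE = re.compile(r"^(?P<ts>\d+) tier2_access latency_ms=(?P<ms>\d+)$")

# Assumed histogram bucket lower bounds, in milliseconds.
BUCKETS_MS = (100, 250, 500, 1000)
# An access counts as a delay once it reaches the lowest bucket.
DELAY_THRESHOLD_MS = BUCKETS_MS[0]


def bucket_for(latency_ms: int) -> str:
    """Return the label of the highest bucket whose lower bound is reached."""
    lower = max(b for b in BUCKETS_MS if b <= latency_ms)
    return f">={lower}ms"


def summarize_delays(lines: Iterable[str]) -> Tuple[int, Dict[str, int]]:
    """Count delayed tier 2 accesses and build their latency distribution."""
    delayed = 0
    histogram: Counter = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not tier 2 access records
        latency_ms = int(match.group("ms"))
        if latency_ms < DELAY_THRESHOLD_MS:
            continue  # fast enough, not a delay
        delayed += 1
        histogram[bucket_for(latency_ms)] += 1
    return delayed, dict(histogram)


sample = [
    "1000 tier2_access latency_ms=20",
    "1001 tier2_access latency_ms=120",
    "1002 tier2_access latency_ms=300",
    "unrelated log line",
    "1003 tier2_access latency_ms=1500",
    "1004 tier2_access latency_ms=130",
]
count, distribution = summarize_delays(sample)
print(count, distribution)
```

In a streaming job, the same counting would typically run per time window rather than over a finite list, so that the count and distribution are refreshed continuously.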

We propose that segment creation strategies do not affect the augmentation of segments with skip-level access. Only a handful of skip levels, at distances that are powers of 2, are required to speed up access to segments. Back references are not required, provided that each new segment is ensured as many look-ahead segments as its highest skip level. Groups of segments may be treated as units for higher-level strategies, which introduces a secondary layer of segment nomenclature.
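A minimal sketch of this idea follows, reading the skip distances as powers of two (an interpretation of the proposal, not a confirmed design). Each segment carries forward links only, and a seek greedily takes the largest link that does not overshoot the target. The `Segment` class, the `SKIP_LEVELS` values, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed: a handful of skip distances, each a power of two.
SKIP_LEVELS = (1, 2, 4, 8)


@dataclass
class Segment:
    index: int  # position of the segment in the append-only stream
    # Forward links only: skip distance -> index of the segment that far ahead.
    forward: Dict[int, int] = field(default_factory=dict)


def build_segments(count: int) -> List[Segment]:
    """Create `count` segments and attach forward skip links where they fit."""
    segments = [Segment(index=i) for i in range(count)]
    for seg in segments:
        for step in SKIP_LEVELS:
            target = seg.index + step
            if target < count:  # link only to segments that already exist
                seg.forward[step] = target
    return segments


def seek(segments: List[Segment], start: int, target: int) -> List[int]:
    """Return the segment indices visited while moving from start to target."""
    if not 0 <= start <= target < len(segments):
        raise ValueError("need 0 <= start <= target < number of segments")
    path = [start]
    current = start
    while current < target:
        remaining = target - current
        # Take the largest forward link that does not pass the target.
        step = max(s for s in segments[current].forward if s <= remaining)
        current = segments[current].forward[step]
        path.append(current)
    return path


segments = build_segments(32)
path = seek(segments, 0, 21)
print(path)  # [0, 8, 16, 20, 21]
```

Here segment 21 is reached in 4 hops instead of 21 sequential reads, and no back references are needed because every hop moves forward.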

Stream readers read the stream from beginning to end. The number of readers is allowed to scale arbitrarily because computation is cheap and can be parallelized. However, data access and storage performance suffer as a result. When the higher-level semantics of segments are maintained, the specific techniques for accessing and retrieving segments become irrelevant.


With the number of readers reduced, data access performance follows the machine efficiency curve.
