Tuesday, October 8, 2019

This post explores optimal access patterns for stream storage. The equivalent for web-accessible storage has proven most efficient with batched, uniformly periodic writes, such as those used by object storage. Streams are read from beginning to end, which has generally not been a problem because the workers doing the reads can scale out with low overhead. Writes always occur at the end of the stream; temporary storage in BookKeeper, with coordination from ZooKeeper, keeps these writes append-only.

However, statistics collected by BookKeeper and ZooKeeper can immensely improve the discoverability of access patterns. These statistics are based on a table of streams and their accessor counts, with attributes. As the workload varies, it may show that some streams are read-heavy, such as those used for analysis, while others are write-heavy, such as those carrying IoT traffic. The statistics may also show that some segments within a stream are more popular than others. Whatever the distribution, the idea that heavily used segments benefit from a cache is not disputed.
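As a sketch of the idea above, hot segments can be identified from per-segment read counters. The counter names, threshold, and admission rule here are hypothetical illustrations, not actual BookKeeper or ZooKeeper APIs:

```python
# Hypothetical per-segment read counters, as might be aggregated from
# the statistics tables mentioned above (segment names are illustrative).
reads = {"stream0/seg1": 900, "stream0/seg2": 40, "stream1/seg1": 60}

def hot_segments(counts, threshold=0.5):
    """Segments that individually account for more than `threshold`
    of all reads are candidates for cache admission."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [seg for seg, n in counts.items() if n / total > threshold]

# stream0/seg1 carries 90% of the reads, so it alone qualifies here.
```

A real system would decay these counters over time so the cache tracks the current workload rather than its entire history.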

The cache for streams is a write-through cache because writes occur only at the end of the stream. These writes are hardly a concern for the cache, and their performance cost is mitigated by BookKeeper; batched writes from BookKeeper are no different from a periodic backup schedule.
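A minimal sketch of this write-through behavior follows, with a plain dict standing in for the backing store (the class and method names are assumptions for illustration, not a real stream-store API):

```python
class WriteThroughSegmentCache:
    """Write-through cache for stream appends: every append goes to the
    backing store (a stand-in for the BookKeeper-backed tier) and the
    cached tail is updated in the same call, so tail reads never see
    stale data."""

    def __init__(self, store):
        self.store = store   # backing append-only store: segment -> events
        self.tail = {}       # cached tail: segment -> events

    def append(self, segment, event):
        self.store.setdefault(segment, []).append(event)  # write through
        self.tail.setdefault(segment, []).append(event)   # keep cache fresh

    def read_tail(self, segment):
        return self.tail.get(segment, [])

store = {}
cache = WriteThroughSegmentCache(store)
cache.append("stream0/seg1", "e1")
cache.append("stream0/seg1", "e2")
```

Because the write path updates both tiers in one call, the cache never needs invalidation for appends, which is what makes the write-through choice cheap for an append-only stream.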

The cache layer for the stream store benefits from being close to the stream store. Segments can be identified by a naming convention or by a hash.
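One way such a hash could be used, sketched under the assumption of a fixed pool of cache nodes (the scheme is illustrative, not taken from any particular stream store):

```python
import hashlib

def cache_node_for(segment_name, num_nodes):
    """Map a segment name to a cache node with a stable hash, so the
    same segment always lands on the same node regardless of which
    client computes the mapping."""
    digest = hashlib.sha256(segment_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes
```

A modulo mapping reshuffles most segments when the pool size changes; a production system would likely use consistent hashing instead, but the stable-name-to-node idea is the same.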

Reference: a case study for event streams:
https://1drv.ms/w/s!Ashlm-Nw-wnWu2hMRA7zp__mivoo

