Thursday, July 2, 2020

Stream store discussion continued

S3 storage differs from stream storage in that S3 does not permit an append operation. Data from existing objects can be combined into a new object, but the object store leaves it to the user to do so. Streams, by contrast, hold continuous and unbounded data, and events can be appended to a stream. The type of read and write access determines which storage is appropriate. S3 storage has always been web accessible, while stream storage supports the gRPC protocol. Participation in a data pipeline is not only about access but also about how the data is held and forwarded: in the former case, batch-oriented processing pertains to the creation of individual objects, while in the latter, events flow from queues into the stream store continuously, with window-based processing.
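
To make the contrast concrete, here is a minimal Java sketch of the two access models; the ObjectStore and StreamStore interfaces are illustrative assumptions, not the actual S3 or stream client APIs.

import java.util.Iterator;

// Hypothetical interfaces contrasting the two access models.
interface ObjectStore {
    void put(String key, byte[] object);       // whole-object write; no append
    byte[] get(String key);                    // whole-object read
}

interface StreamStore {
    void append(String stream, byte[] event);  // per-event append to an unbounded stream
    Iterator<byte[]> read(String stream);      // continuous per-event read
}

An object must be rewritten in full to change it, while a stream grows one event at a time, which is what makes it suitable for continuous, window-based pipelines.
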
This comparison suggests that APIs on the data path be defined on a per-event basis. This granularity is useful for iteration and works well for forwarding to unbounded storage. The stream serves this purpose very well, and its readers and writers operate on a per-event basis. It might therefore seem counterintuitive to implement a copyStream operation, but the operation is valid for any collection of events, including the container itself. The stream is translated to containers on tier 2, which already support this operation. Providing this operator at the streamManager API level serves the devOps requirements for data transfer and brings programmability improvements to extract-transform-load operators.
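
Since copyStream is proposed here rather than an existing API, the following is only a sketch of how such an operator might delegate to the per-segment copy that tier 2 already supports; the Tier2 interface and its listSegments and copySegment methods are hypothetical names.

import java.util.List;

// Hypothetical sketch: a stream-level copy that reduces to per-segment
// copies on tier 2, which already supports copy on its containers.
final class StreamCopier {
    interface Tier2 {
        List<String> listSegments(String stream);         // segments backing a stream
        void copySegment(String srcSegment, String dst);  // tier-2 copy primitive
    }

    private final Tier2 tier2;

    StreamCopier(Tier2 tier2) { this.tier2 = tier2; }

    // Copy every segment of the source stream into the destination by
    // delegating to the tier-2 copy primitive.
    void copyStream(String source, String destination) {
        for (String segment : tier2.listSegments(source)) {
            tier2.copySegment(segment, destination + "/" + segment);
        }
    }
}

Because tier 2 already supports copy on its containers, the stream-level operator reduces to an iteration over segments, which is why the operation remains valid for any collection of events.
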
The above arguments in favor of a copyStream operation also strengthen the case for data tiering via hot-warm-cold transitions, where sealed streams become candidates for retention on cheaper and hybrid storage. This transition, which ages the data for backup or archival, is only possible on segments or collections of events.
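
As a rough illustration of such a transition, the sketch below ages sealed streams by elapsed time; the Tier enum, thresholds, and method names are assumptions made for illustration, not part of any stream store API.

import java.time.Duration;
import java.time.Instant;

// Hypothetical aging policy: only sealed streams transition to cheaper tiers.
enum Tier { HOT, WARM, COLD }

final class AgingPolicy {
    private final Duration warmAfter;
    private final Duration coldAfter;

    AgingPolicy(Duration warmAfter, Duration coldAfter) {
        this.warmAfter = warmAfter;
        this.coldAfter = coldAfter;
    }

    // Open streams stay hot; sealed streams move to warm and then cold
    // storage as their age since sealing crosses the thresholds.
    Tier tierFor(boolean sealed, Instant sealedAt, Instant now) {
        if (!sealed) return Tier.HOT;
        Duration age = Duration.between(sealedAt, now);
        if (age.compareTo(coldAfter) >= 0) return Tier.COLD;
        if (age.compareTo(warmAfter) >= 0) return Tier.WARM;
        return Tier.HOT;
    }
}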
