Friday, July 3, 2020

Stream store client discussion

There are more runtime execution statistics and metrics available, but size and count are included in the top-level queries.
S3 storage differs from stream storage in that S3 does not permit an append operation. Data from existing objects can be combined into a new object, but the object store leaves it to the user to do so. Streams, on the other hand, hold continuous and unbounded data, and events can be appended to the stream. The type of read and write access determines the choice of one storage or the other. S3 storage has always been web accessible, while stream storage supports the gRPC protocol. Participation in a data pipeline is not only about access but also about how the data is held and forwarded. In the former case, batch-oriented processing pertains to the creation of individual objects, while in the latter, events flow from queues into the stream store in a continuous manner with window-based processing.
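As a minimal sketch of the per-event append model, assuming the Pravega Java client is the stream store client under discussion, the following writes a single event to a stream; the controller endpoint, scope, and stream names are placeholders.

import java.net.URI;
import java.util.UUID;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class AppendExample {
    public static void main(String[] args) {
        // Placeholder endpoint and names; substitute the actual controller, scope, and stream.
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();

        try (StreamManager streamManager = StreamManager.create(config)) {
            streamManager.createScope("demo");
            streamManager.createStream("demo", "events",
                    StreamConfiguration.builder()
                            .scalingPolicy(ScalingPolicy.fixed(1)).build());
        }

        // Events are appended one at a time; an object store offers no equivalent append.
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("demo", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "events", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            writer.writeEvent(UUID.randomUUID().toString(), "sample-event");
            writer.flush();
        }
    }
}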
This comparison leads to defining the APIs on the data path on a per-event basis. This is useful for iteration and works well for forwarding to an unbounded store. The stream serves this purpose very well, and its readers and writers operate on a per-event basis. So it might seem counterintuitive to implement a copyStream, but the operation is valid for any collection of events, including the container itself. The stream translates to containers on tier 2, which already support this operation. Provisioning this operator at the streamManager API level enhances the devOps workflow for data transfer and brings programmability improvements to extract-transform-load operators.
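The copyStream operator proposed here does not exist in any current streamManager API; the following interface is only an illustration of what its surface might look like, and every name in it is hypothetical.

import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical extension of the stream manager surface. copyStream is the
 * operator proposed above and is not part of any existing client API.
 */
public interface StreamCopyManager {

    /**
     * Copies all events of a source stream into a destination stream,
     * delegating to the tier 2 container copy where available and falling
     * back to per-event reads and writes otherwise.
     */
    CompletableFuture<Void> copyStream(String sourceScope, String sourceStream,
                                       String destinationScope, String destinationStream);
}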
The above arguments in favor of a copyStream operation also strengthen the case for data tiering via hot-warm-cold transitions, where sealed streams become ready for retention on cheaper or hybrid storage. This transition, which ages the data for backup or archival, is only possible on segments or collections of events.
Archival to tertiary storage usually happens from tier 2 and not directly from the streams. Streams can be shortened based on a retention period, and the data in the unretained segmentRange then goes into cold storage. The storage for this data is then in the form of a bounded segmentRange, which is more suitable for archival. An archival policy between heterogeneous systems does not make a copy of the stream; it writes out the data read from the source stream and then truncates the stream.
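A minimal sketch of this shortening, assuming the Pravega StreamManager: the stream is sealed (the hot-to-warm transition mentioned above) and then truncated at a retention cut. How the StreamCut for the retention boundary is obtained is left out, and the helper and its name are illustrative.

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.StreamCut;

public class RetentionHelper {

    /**
     * Seals the stream so no more events can be appended, then truncates it at
     * the supplied cut so everything before the cut is released from the stream
     * and can be aged out to cold storage from tier 2.
     */
    public static void sealAndTruncate(ClientConfig config, String scope,
                                       String stream, StreamCut retainFrom) {
        try (StreamManager streamManager = StreamManager.create(config)) {
            streamManager.sealStream(scope, stream);                  // no further appends
            streamManager.truncateStream(scope, stream, retainFrom);  // drop data before the cut
        }
    }
}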
The archival implementation requires administrative intervention. On the other hand, applications for archival can be written and authorized on a stream-by-stream basis. These applications would need a programmatic way to copy a segmentRange. Readers and writers can do this on an event-by-event basis, while the historical reader can read a segmentRange between the head and tail streamcuts. If a backup of the entire stream is required, then a copyStream is helpful, because those streams can then be handed over to an administrator or automation for export to tertiary storage.
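A minimal sketch of such an application-level copy, assuming the Pravega batch client: every segmentRange between the head and tail of the source stream is read and its events are re-written to a destination stream. The class name and serializer choice are illustrative, and the destination stream is assumed to already exist.

import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.util.Iterator;

public class SegmentRangeCopier {

    /**
     * Reads every segmentRange of the source stream between its head and tail
     * (StreamCut.UNBOUNDED on both ends) and re-writes each event to the
     * destination stream. This is the event-by-event copy described above;
     * ordering across parallel segments is not preserved.
     */
    public static void copy(ClientConfig config, String scope,
                            String sourceStream, String destinationStream) {
        try (BatchClientFactory batchClient = BatchClientFactory.withScope(scope, config);
             EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     destinationStream, new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {

            Iterator<SegmentRange> segments = batchClient
                    .getSegments(Stream.of(scope, sourceStream),
                            StreamCut.UNBOUNDED, StreamCut.UNBOUNDED)
                    .getIterator();

            while (segments.hasNext()) {
                try (SegmentIterator<String> events =
                             batchClient.readSegment(segments.next(), new UTF8StringSerializer())) {
                    while (events.hasNext()) {
                        writer.writeEvent(events.next());
                    }
                }
            }
            writer.flush();
        }
    }
}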
For thousands of streams, the above method may not scale, since we are archiving on a stream-by-stream basis. A copyScope method can only target all the streams within a scope; the ability to select such streams across scopes for custom archival belongs to an application.
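To illustrate that selection, a sketch that walks a scope and copies only the streams an application-supplied filter accepts, reusing the per-stream copy from the previous sketch; a hypothetical copyScope would be the special case where the filter accepts everything. The class and method names are placeholders.

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.Stream;

import java.util.Iterator;
import java.util.function.Predicate;

public class ScopeArchiver {

    /**
     * Enumerates the streams in a scope and copies the ones selected by the
     * application-supplied filter into a sibling archive stream.
     */
    public static void archiveScope(ClientConfig config, String scope,
                                    Predicate<Stream> filter, String archiveSuffix) {
        try (StreamManager streamManager = StreamManager.create(config)) {
            Iterator<Stream> streams = streamManager.listStreams(scope);
            while (streams.hasNext()) {
                Stream stream = streams.next();
                if (filter.test(stream)) {
                    SegmentRangeCopier.copy(config, scope,
                            stream.getStreamName(),
                            stream.getStreamName() + archiveSuffix);
                }
            }
        }
    }
}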
