S3 object storage differs from stream storage in that S3 does not permit an append operation. Data from existing objects can be combined into a new object, but the object store leaves it to the user to do so. Streams, by contrast, are continuous and unbounded, and events can be appended to them. The type of read and write access determines the choice between the two. S3 storage has always been web accessible, while stream storage supports the gRPC protocol. Participation in a data pipeline is not only about access but also about how the data is held and forwarded. In the object-store case, batch-oriented processing pertains to the creation of individual objects, while in the stream case, events flow from queues into the stream store continuously and are processed with windows.
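A minimal sketch of the workaround this implies on the object-store side, assuming the AWS SDK for Java v2 (the bucket and key names are hypothetical): since S3 has no append, extending an object means reading it back, concatenating on the client, and writing a whole new object.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3AppendWorkaround {
    // S3 has no append: to extend an object we must read it back,
    // concatenate in the client, and put a whole new object.
    public static void append(S3Client s3, String bucket, String key, byte[] extra) {
        byte[] existing = s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();
        byte[] combined = new byte[existing.length + extra.length];
        System.arraycopy(existing, 0, combined, 0, existing.length);
        System.arraycopy(extra, 0, combined, existing.length, extra.length);
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromBytes(combined));
    }
}
```

A stream store avoids this round trip entirely: the writer simply appends the next event.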
This comparison motivates defining the APIs on the data path on a per-event basis. Per-event access is useful for iteration and works well when forwarding to unbounded storage. The stream serves this purpose very well, and its readers and writers operate on a per-event basis. So it might seem counterintuitive to implement a copyStream operation, but the operation is valid for any collection of events, including the stream itself. The stream translates to containers on tier 2, which already support this operation. Provisioning this operator at the StreamManager API level serves DevOps requirements for data transfer and improves the programmability of extract-transform-load operators.
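A minimal sketch of a per-event copyStream, assuming a Pravega-style client API (the scope, stream names, and controller URI are hypothetical, and a production version would handle checkpoints and retries):

```java
import java.net.URI;
import java.util.UUID;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class CopyStream {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();
        String scope = "examples";
        String group = "copy-" + UUID.randomUUID();
        try (ReaderGroupManager rgm = ReaderGroupManager.withScope(scope, config)) {
            rgm.createReaderGroup(group, ReaderGroupConfig.builder()
                    .stream(Stream.of(scope, "source")).build());
        }
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, config);
             EventStreamReader<String> reader = factory.createReader("r1", group,
                     new UTF8StringSerializer(), ReaderConfig.builder().build());
             EventStreamWriter<String> writer = factory.createEventWriter("destination",
                     new UTF8StringSerializer(), EventWriterConfig.builder().build())) {
            EventRead<String> event;
            // Copy event by event until the reader returns nothing within the timeout.
            while ((event = reader.readNextEvent(2000)).getEvent() != null) {
                writer.writeEvent(event.getEvent());
            }
            writer.flush();
        }
    }
}
```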
The above arguments in favor of a copyStream operation also strengthen the case for data tiering via hot-warm-cold transitions, where sealed streams become eligible for retention on cheaper or hybrid storage. This transition, which ages the data for backup or archival, is possible only on segments or other bounded collections of events.
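A sketch of how such aging could be set up, again assuming a Pravega-style StreamManager (the seven-day retention, scope, and stream name are illustrative assumptions):

```java
import java.net.URI;
import java.time.Duration;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.RetentionPolicy;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class TieringSetup {
    public static void main(String[] args) {
        try (StreamManager streamManager = StreamManager.create(URI.create("tcp://localhost:9090"))) {
            // Age out data older than seven days so it can move to cheaper storage.
            StreamConfiguration cfg = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(1))
                    .retentionPolicy(RetentionPolicy.byTime(Duration.ofDays(7)))
                    .build();
            streamManager.updateStream("examples", "source", cfg);
            // Sealing prevents further appends, making the stream a bounded
            // collection that is ready for the hot-warm-cold transition.
            streamManager.sealStream("examples", "source");
        }
    }
}
```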
Archival to tertiary storage usually happens from tier 2 and not directly from the streams. Streams can be shortened based on a retention period, and the data in the unretained segmentRange then goes to cold storage. That data takes the form of a bounded segmentRange, which is better suited for archival. An archival policy between heterogeneous systems does not make a copy of the stream; it writes out the data read from the source stream and then truncates the stream.
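A sketch of that write-then-truncate flow, assuming a Pravega-style StreamManager; the archiveSegmentRange helper is a placeholder for the cold-storage write (the read side appears in the segmentRange sketch after the next paragraph):

```java
import java.net.URI;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.StreamCut;

public class RetentionArchiver {
    // Writes the unretained range to cold storage, then truncates the
    // source stream at the cut so the data exists only in the archive.
    public static void archiveAndTruncate(URI controller, String scope, String stream,
                                          StreamCut retentionCut) {
        archiveSegmentRange(scope, stream, retentionCut); // placeholder: cold-storage write
        try (StreamManager manager = StreamManager.create(controller)) {
            manager.truncateStream(scope, stream, retentionCut);
        }
    }

    private static void archiveSegmentRange(String scope, String stream, StreamCut upTo) {
        // Read the events below the cut (see the batch-read sketch below)
        // and write them to the tertiary store.
    }
}
```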
The archival implementation requires administrative intervention. On the other hand, applications for archival can be written and authorized on a stream-by-stream basis. These applications need a programmatic way to copy a segmentRange. Readers and writers can do this on an event-by-event basis, while a historical reader can read a segmentRange between the head and tail streamcuts. If a backup of the entire stream is required, a copyStream comes in handy, because the copied streams can then be handed over to an administrator or to automation for export to tertiary storage.
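A sketch of such a historical read, assuming a Pravega-style batch client, where unbounded streamcuts select everything between head and tail (the scope and stream names are hypothetical):

```java
import java.net.URI;
import java.util.Iterator;
import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class SegmentRangeReader {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();
        try (BatchClientFactory batch = BatchClientFactory.withScope("examples", config)) {
            // UNBOUNDED cuts on both ends cover the head-to-tail range.
            Iterator<SegmentRange> ranges = batch.getSegments(Stream.of("examples", "source"),
                    StreamCut.UNBOUNDED, StreamCut.UNBOUNDED).getIterator();
            while (ranges.hasNext()) {
                SegmentRange range = ranges.next();
                try (SegmentIterator<String> events =
                             batch.readSegment(range, new UTF8StringSerializer())) {
                    while (events.hasNext()) {
                        System.out.println(range.getSegmentId() + ": " + events.next());
                    }
                }
            }
        }
    }
}
```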
For thousands of streams, the above method may not scale, since we are archiving on a stream-by-stream basis. A copyScope method can target only all the streams in a scope. The ability to select streams across scopes for custom archival belongs to an application. This application can package the streams with metadata for offline export by an administrator. The metadata takes the same form as any other event in the stream. This addition improves the packaging of the export and the visibility into it, so the streams can be restored if necessary. The StreamStore can handle data export and import to its tier 2, but export and import across storage systems belongs to an application. The application does not have to be a client of one stream store; it can be a client of two stream stores, where one store acts as the source and the other as the destination. A StreamStore can also act as a source or a destination when an alternate type of store is used as the destination or source, respectively.
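A sketch of such an application acting as a client of two stores, assuming a Pravega-style StreamManager; the writeExportMetadata and copyStream helpers are hypothetical, with copyStream standing in for the per-stream operation sketched earlier:

```java
import java.net.URI;
import java.util.Iterator;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.Stream;

public class CopyScope {
    // Source and destination are two independent stream stores; this
    // application is a client of both, so the export does not depend on
    // either store knowing about the other.
    public static void copyScope(URI sourceController, URI destController, String scope) {
        try (StreamManager source = StreamManager.create(sourceController)) {
            Iterator<Stream> streams = source.listStreams(scope);
            while (streams.hasNext()) {
                Stream stream = streams.next();
                // Prepend a metadata event, in the same form as any other
                // event, describing the export so it can be restored later.
                writeExportMetadata(destController, stream);   // hypothetical helper
                copyStream(sourceController, destController, stream); // hypothetical helper
            }
        }
    }

    private static void writeExportMetadata(URI dest, Stream stream) { /* ... */ }
    private static void copyStream(URI src, URI dest, Stream stream) { /* ... */ }
}
```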
Since long-running applications such as these, involving archival and copyStream operations, are subject to failures, pauses, and resumption, they could publish notifications. These notifications could be granular, raised for every segment in the stream, with the boundaries at which they are sent corresponding to the segments of the source stream.
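A sketch of such per-segment notifications, assuming a Pravega-style client; the notification stream, its name, and the message format are illustrative assumptions:

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class CopyProgressNotifier implements AutoCloseable {
    private final EventStreamClientFactory factory;
    private final EventStreamWriter<String> writer;

    public CopyProgressNotifier(ClientConfig config, String scope, String notificationStream) {
        this.factory = EventStreamClientFactory.withScope(scope, config);
        this.writer = factory.createEventWriter(notificationStream,
                new UTF8StringSerializer(), EventWriterConfig.builder().build());
    }

    // Called once per copied source segment; a paused or failed job can
    // scan these notifications on resume to find the last completed
    // segment boundary.
    public void segmentCopied(SegmentRange range) {
        writer.writeEvent("copied segment " + range.getSegmentId()
                + " of " + range.getStreamName());
        writer.flush();
    }

    @Override
    public void close() {
        writer.close();
        factory.close();
    }
}
```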