Cluster computing: Use of cache by stream store clients

The copyStream operation on the stream store involves the same operations that are invoked when a client requests to read from a stream and write to a stream. Each event is actually copied from the source stream to the destination stream. The copy of all events is fault-tolerant, with pause and resume capability, just as it is for a single event. The stream store takes a snapshot of the stream and proceeds to copy it without waiting for new events at the tail of the source stream. Since the head and tail streamcuts of the stream are clearly demarcated, a batch client can be used to iterate over the events in sequence.

Each event read is immediately written to the destination stream before proceeding to the next one. Prior to the write, the last event in the destination is checked, and if it is already the same event as the one about to be copied, the write is skipped. The source stream may contain duplicate events, but during iteration each event has a sequence number, and these are distinct even for duplicates. The sequence number, or the time noted for the previous write, can serve as a watermark that is moved progressively forward. This detail is internal to the implementation of the stream store; the only requirements are that the selection of events is strictly progressive and that each destination write happens exactly once. The stream store is particularly capable of doing this.
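The copy loop described above can be sketched with an in-memory model. This is only an illustration under stated assumptions: `Event`, `CopyState` and `copy_stream` are hypothetical names, not the stream store's API, and plain Python lists stand in for the source and destination streams.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    seq: int       # sequence number, distinct even when payloads repeat
    payload: str


@dataclass
class CopyState:
    tail: Optional[int] = None  # tail streamcut, fixed when the copy starts
    watermark: int = -1         # sequence number of the last event copied


def copy_stream(source, destination, state, max_events=None):
    """Copy events up to the snapshot tail; return how many were written.

    `max_events` simulates a pause. Calling again with the same `state`
    resumes from the watermark.
    """
    if state.tail is None:
        # Snapshot: events appended to the source later are not copied.
        state.tail = source[-1].seq if source else -1
    written = 0
    for event in source:
        if event.seq > state.tail:
            break                        # beyond the snapshot
        if max_events is not None and written >= max_events:
            break                        # paused; the watermark marks the resume point
        if event.seq <= state.watermark:
            continue                     # already copied in an earlier run
        if destination and destination[-1].seq == event.seq:
            state.watermark = event.seq  # write landed before an interruption
            continue                     # skip the duplicate write
        destination.append(event)
        state.watermark = event.seq      # advance only after the write
        written += 1
    return written
```

Checking the last destination event before each write is what covers the case where a write succeeded but the watermark was not yet advanced.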

The copyStream operation can be repeated over and over again to produce clones, but if a copy already exists under the destination name, repeating the operation is idempotent. If additional copies are needed, the existing stream has to be renamed or a different name has to be specified for the new stream. There is a start and an end, and each event is read and written only once, which guarantees progress, so the copy operation is guaranteed to complete, and with added robustness.
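A minimal sketch of this naming behavior follows, assuming a hypothetical in-memory `StreamStore` keyed by stream name. The class and its methods are illustrative and do not correspond to a real client library.

```python
class StreamStore:
    """Hypothetical in-memory store: stream name -> list of events."""

    def __init__(self):
        self.streams = {}

    def create(self, name, events=()):
        self.streams[name] = list(events)

    def copy_stream(self, source, destination):
        """Copy `source` to `destination`; return True if a copy was made.

        If the destination already exists the call changes nothing,
        so repeating the same request is idempotent.
        """
        if destination in self.streams:
            return False
        self.streams[destination] = list(self.streams[source])
        return True
```

A second copy then requires a different destination name, for example `store.copy_stream("orders", "orders-copy-2")`.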

The semantics of copyStream are sufficient for the append operation as well. In this case, the append occurs only after the last event in the destination stream. The current segment in the destination stream is closed and all the readers and writers are taken offline prior to the copy operation; when the copy is complete, the last segment is sealed again.
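The append sequence might look like the following sketch. It assumes a simplified model in which a stream is a list of segments and sealing a segment is what keeps writers out. `Segment`, `append_stream` and `events_of` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    events: list = field(default_factory=list)
    sealed: bool = False

    def write(self, event):
        if self.sealed:
            raise RuntimeError("segment is sealed")
        self.events.append(event)


def append_stream(source_events, destination):
    """Append `source_events` after the last event of `destination`.

    `destination` is a list of Segment objects whose last entry is the
    current segment.
    """
    if destination:
        destination[-1].sealed = True  # close the current segment first
    tail = Segment()
    for event in source_events:
        tail.write(event)              # copied events go after the old tail
    tail.sealed = True                 # seal the last segment when done
    destination.append(tail)
    return destination


def events_of(stream):
    """Flatten a list of segments into one ordered list of events."""
    return [event for segment in stream for event in segment.events]
```

With an empty `destination` the same function behaves as a plain copy.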

The copyStream operation can also work with a regular reader and writer, because each event read carries envelope information such as its position, which does away with the limitations of the historical reader. Those limitations include the need to specify a start and an end to the historical reader. A regular reader and writer are instead better able to include the latest events in the copy operation. The regular reader can stop after a threshold of inactivity or at a limit imposed by the writer. Such a copy therefore runs for a duration rather than over a fixed range of events.
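The two stopping conditions can be illustrated with Python's standard `queue.Queue` standing in for a regular reader. The function name, the `(position, payload)` tuple shape and the default timeout are assumptions made for this sketch.

```python
import queue


def copy_with_regular_reader(source, destination, idle_timeout=0.05, max_events=None):
    """Copy (position, payload) pairs from `source` until the reader goes idle.

    `source` is a queue.Queue standing in for a regular reader. The copy
    stops after `idle_timeout` seconds without a new event, or once
    `max_events` have been written (the limit imposed by the writer).
    """
    copied = 0
    while max_events is None or copied < max_events:
        try:
            position, payload = source.get(timeout=idle_timeout)
        except queue.Empty:
            break                        # inactivity threshold reached
        destination.append((position, payload))
        copied += 1
    return copied
```

Unlike a copy between fixed streamcuts, events that arrive while the loop is running are still picked up.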

The notion of copying seems inappropriate for a stream, given the emphasis on continuity, but a stream can be written to a file, and file copy is universally accepted. Therefore, one way to copy a stream is to persist it to a file, copy the file, and import it back into another stream. This mechanism is independent of the stream store and loses the authenticity of the streams.
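A rough sketch of the file-based route follows, assuming events are JSON-serializable and stored one per line. The file names and helper functions are hypothetical, and nothing here touches the stream store.

```python
import json
import shutil
import tempfile
from pathlib import Path


def export_stream(events, path):
    """Persist events to a file, one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def import_stream(path):
    """Read events back in the order they were written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def copy_via_file(events):
    """Export, copy the file with an ordinary file copy, then import."""
    with tempfile.TemporaryDirectory() as workdir:
        exported = Path(workdir) / "source.jsonl"
        copied = Path(workdir) / "copy.jsonl"
        export_stream(events, exported)
        shutil.copyfile(exported, copied)
        return import_stream(copied)
```

Only the payloads and their order survive the round trip in this sketch. Positions and other metadata assigned by the store are not carried over.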

Another approach is to leverage the tier2 replication strategy to create copies automatically and then promote them to first-class streams of the stream store by adding metadata.

The copyStream functionality could also be implemented directly in a cache or message queue broker that is layered over the stream store. For example, a message broker can queue the segments in the order in which they need to be stitched and, using the store operations, transfer those segments between source and destination. This technique works very well with the methods available from the stream store, and the client is able to take the compute out of the store.
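The broker-side variant could be sketched as below, assuming the store exposes segment-level read and append operations. `StreamStore`, `CopyBroker` and their methods are hypothetical stand-ins for whatever the actual store and broker provide.

```python
from collections import deque


class StreamStore:
    """Hypothetical store: stream name -> list of segments (lists of events)."""

    def __init__(self):
        self.streams = {}

    def read_segment(self, stream, index):
        return list(self.streams[stream][index])

    def append_segment(self, stream, segment):
        self.streams.setdefault(stream, []).append(segment)


class CopyBroker:
    """Queues segment transfers in stitch order and runs them outside the store."""

    def __init__(self, store):
        self.store = store
        self.pending = deque()

    def enqueue_copy(self, source, destination):
        # One work item per segment, in the order they must be stitched.
        for index in range(len(self.store.streams[source])):
            self.pending.append((source, index, destination))

    def drain(self, limit=None):
        """Transfer queued segments; `limit` allows the work to be paused."""
        moved = 0
        while self.pending and (limit is None or moved < limit):
            source, index, destination = self.pending.popleft()
            segment = self.store.read_segment(source, index)
            self.store.append_segment(destination, segment)
            moved += 1
        return moved
```

Because the queue holds the remaining work, the broker can stop after any segment and continue later without the store tracking the progress of the copy.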

There are tradeoffs between performing the copyStream operation at each of these levels. Performance increases significantly as we move down the stack from the application to the stream store. Flexibility, access control, pause and resume, and other functionality are more readily available higher up the stack than in the stream store.

The ability to append a stream to an existing stream is the same as a copy operation to an empty stream. Therefore, comparisons between the various usages of append, copy and stitching are left out of this discussion.

There are a number of applications where one strategy might work better than another. The choice is left to the discretion of the callers, with the minimum required of the stream store described here.

Saturday, June 27, 2020
