Tuesday, June 23, 2020

Historical stream readers and exporting events from a stream

The historical readers don't give the EventPointer record because they iterate over the events in a segment. It is not advisable to reconstruct EventPointer records from the segmentId and the sequence number or the event size used as an offset. The events in any one segment read by the historical reader will be in sequence, but their EventPointer records are known only to the stream store and are not available to the SegmentIterator.
The SegmentIterator is also issued one per segment by the historical reader factory, which allows multiple workers to read each segment independently. This means the sequence will not continue across workers, which argues against the use of EventPointer records for historical reads, although it is possible to traverse the segments sequentially with a single worker. The regular stream readers don't suffer from this disadvantage, as they read the next event from start to finish, and these boundaries can be adjusted.
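To make the contrast concrete, here is a minimal sketch of a regular reader, assuming the Pravega Java client API of this period; the controller URI, scope, stream, and reader-group names are all placeholders. Unlike the batch client's SegmentIterator, the regular reader hands back an EventPointer with every event it reads:

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.JavaSerializer;
import java.net.URI;

public class RegularReaderExample {
    public static void main(String[] args) throws Exception {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();
        // A regular reader group tracks positions on behalf of its readers.
        try (ReaderGroupManager rgm = ReaderGroupManager.withScope("myScope", config)) {
            rgm.createReaderGroup("myGroup", ReaderGroupConfig.builder()
                    .stream(Stream.of("myScope", "myStream")).build());
        }
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("myScope", config);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "myGroup", new JavaSerializer<>(),
                     ReaderConfig.builder().build())) {
            EventRead<String> read = reader.readNextEvent(2000);
            if (read.getEvent() != null) {
                // The regular reader exposes an EventPointer for each event;
                // the batch client's SegmentIterator offers no such record.
                System.out.println("pointer: " + read.getEventPointer());
            }
        }
    }
}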

The historical reader, also called a batch client because it iterates over a SegmentRange, also does not give the position of the events. The only indication that the returned events are in sequence is that the iterator moves from one event to the next within a given segment, and across the segments in its segment range, so long as those segments are iterated one by one by the same worker that instantiated the iterator.
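As a minimal sketch of this historical read path, again assuming the Pravega Java client API with placeholder names, the BatchClientFactory enumerates the SegmentRanges between two StreamCuts and issues one SegmentIterator per segment:

import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.JavaSerializer;
import java.net.URI;
import java.util.Iterator;

public class BatchReadExample {
    public static void main(String[] args) throws Exception {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();
        try (BatchClientFactory batch = BatchClientFactory.withScope("myScope", config)) {
            // Enumerate every segment range between the head and tail of the stream.
            Iterator<SegmentRange> ranges = batch.getSegments(
                    Stream.of("myScope", "myStream"),
                    StreamCut.UNBOUNDED, StreamCut.UNBOUNDED).getIterator();
            while (ranges.hasNext()) {
                SegmentRange range = ranges.next();
                // Events arrive in order within the segment, but without any
                // EventPointer or position record.
                try (SegmentIterator<String> events =
                             batch.readSegment(range, new JavaSerializer<>())) {
                    while (events.hasNext()) {
                        System.out.println(events.next());
                    }
                }
            }
        }
    }
}
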
A client application that wishes to parallelize this task simply has to provide non-overlapping partitioned segment ranges, each delimited by a head and a tail StreamCut. This parallelizes the read of the entire stream in terms of partitioned SegmentRanges. The interface for the historical reader is available via the BatchClientFactory, but the writer cannot be a historical writer; it has to be a regular writer that writes events one after the other. Therefore, the application reading the events from a segment range in sequence must acquire and hold the writer for the duration of its read.
As long as the historical readers follow the same order as the SegmentRanges in the source stream, the events are guaranteed to be written to the destination stream in the same order as they are read. This calls for the historical readers to acquire a lock on the writer for synchronization and to follow the same order of writes as the order in which they were assigned, as sketched below. The writer becomes a serializer of the different SegmentRanges to the destination stream, and the writes can be very fast when all the events for a SegmentRange are available in memory. Typically this is not feasible, because the number of events and their total size, even in a small segment range, may far exceed the available memory, so the events spill over to local disk after being read, which is another inelegant solution.
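A hypothetical sketch of this single-writer scheme follows, with placeholder scopes and stream names. The workers read their assigned ranges in parallel while the lone writer serializes them in assignment order; buffering whole ranges in memory is exactly the limitation noted above:

import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.JavaSerializer;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedCopyExample {
    public static void main(String[] args) throws Exception {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();
        try (BatchClientFactory batch = BatchClientFactory.withScope("srcScope", config);
             EventStreamClientFactory dst = EventStreamClientFactory.withScope("dstScope", config);
             EventStreamWriter<String> writer = dst.createEventWriter(
                     "dstStream", new JavaSerializer<>(), EventWriterConfig.builder().build())) {
            List<SegmentRange> ranges = new ArrayList<>();
            batch.getSegments(Stream.of("srcScope", "srcStream"),
                    StreamCut.UNBOUNDED, StreamCut.UNBOUNDED)
                 .getIterator().forEachRemaining(ranges::add);
            ExecutorService pool = Executors.newFixedThreadPool(4);
            // Workers read their assigned ranges independently and in parallel...
            List<Future<List<String>>> buffered = new ArrayList<>();
            for (SegmentRange range : ranges) {
                buffered.add(pool.submit(() -> {
                    List<String> events = new ArrayList<>();
                    try (SegmentIterator<String> it =
                                 batch.readSegment(range, new JavaSerializer<>())) {
                        it.forEachRemaining(events::add);
                    }
                    return events;
                }));
            }
            // ...while the single writer serializes the ranges in assignment order.
            for (Future<List<String>> future : buffered) {
                for (String event : future.get()) {
                    writer.writeEvent(event);
                }
            }
            writer.flush();
            pool.shutdown();
        }
    }
}
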
The stream store does not provide a way to stitch together the stream SegmentRanges read from the BatchClient. It is also not possible to append one stream to another inside the store. The StreamManager provides the ability to create, delete, and seal a stream, but no copy function yet. The good thing is that it doesn't have to: client applications can map the order in which SegmentRanges are read to the order in which they are written, without being restricted to one reader and one writer.
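For reference, a small sketch of that StreamManager surface, under the same API assumptions and with placeholder names; create, seal, and delete are available, but nothing that copies or appends one stream to another:

import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import java.net.URI;

public class StreamAdminExample {
    public static void main(String[] args) throws Exception {
        try (StreamManager manager = StreamManager.create(URI.create("tcp://localhost:9090"))) {
            manager.createScope("dstScope");
            manager.createStream("dstScope", "export", StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(1)).build());
            // ... events are written by a regular writer in the meantime ...
            manager.sealStream("dstScope", "export");   // no further appends allowed
            manager.deleteStream("dstScope", "export"); // a stream must be sealed before deletion
        }
    }
}
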
As long as the SegmentRanges are read and written directly to independent streams on the destination, there is no memory buffer pool or temporary disk required. Once the events from the SegmentRanges make it to the destination stream store via independent streams, those individual streams can be conflated into a single stream in a separate pass that runs local to the stream store, after which the independent streams can be discarded. Even the copying operation could be circumvented if the stream store provided the ability to stitch different streams together; until then, the copy remains cheap to do.
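A hypothetical sketch of this fan-out step, with illustrative names throughout: each SegmentRange is copied to its own destination stream, so no shared buffer or temporary disk is needed, and a later conflation pass (not shown) would read part-0, part-1, and so on, in order, into the final stream:

import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.JavaSerializer;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class FanOutCopyExample {
    public static void main(String[] args) throws Exception {
        URI controller = URI.create("tcp://localhost:9090");
        ClientConfig config = ClientConfig.builder().controllerURI(controller).build();
        try (StreamManager manager = StreamManager.create(controller);
             BatchClientFactory batch = BatchClientFactory.withScope("srcScope", config);
             EventStreamClientFactory dst = EventStreamClientFactory.withScope("dstScope", config)) {
            List<SegmentRange> ranges = new ArrayList<>();
            batch.getSegments(Stream.of("srcScope", "srcStream"),
                    StreamCut.UNBOUNDED, StreamCut.UNBOUNDED)
                 .getIterator().forEachRemaining(ranges::add);
            for (int i = 0; i < ranges.size(); i++) {
                String part = "part-" + i; // one independent stream per segment range
                manager.createStream("dstScope", part, StreamConfiguration.builder()
                        .scalingPolicy(ScalingPolicy.fixed(1)).build());
                try (EventStreamWriter<String> writer = dst.createEventWriter(
                             part, new JavaSerializer<>(), EventWriterConfig.builder().build());
                     SegmentIterator<String> events =
                             batch.readSegment(ranges.get(i), new JavaSerializer<>())) {
                    while (events.hasNext()) {
                        writer.writeEvent(events.next());
                    }
                    writer.flush();
                }
            }
        }
    }
}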
