Saturday, September 12, 2020

Sorted events

This article is a brief glimpse of a proposed feature for a stream store known for its limitless, continuous storage of device data traffic. Events arrive at the stream store sequenced by their timestamps, and the store guarantees that they are saved in the same order in which they arrive. Beyond that, the stream store makes no assumptions about the contents of the events in a stream and does not interpret them. This leads to a design where all events must be iterated sequentially, one after the other, in the original order in which they were preserved. Queries reading the stream must start from the beginning of the stream every time they repeat a traversal of a sequence of events, unless they mark the delimiters of that sequence with StreamCuts.
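
As a rough illustration, here is a minimal Java sketch of bounding a read with StreamCuts so that a repeated traversal does not have to restart from the head. It assumes the Pravega client API; the endpoint, scope, stream name, reader-group name, and serializer are placeholders, and UNBOUNDED stands in for cuts the application would have saved earlier.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;

public class BoundedRead {
    public static void main(String[] args) {
        URI controller = URI.create("tcp://localhost:9090");   // placeholder endpoint
        String scope = "examples";
        String streamName = "sensor-events";                   // placeholder stream

        // StreamCuts captured earlier by the application mark the delimiters of the
        // sub-sequence to revisit; UNBOUNDED stands in for real saved cuts here.
        StreamCut head = StreamCut.UNBOUNDED;
        StreamCut tail = StreamCut.UNBOUNDED;

        ReaderGroupConfig groupConfig = ReaderGroupConfig.builder()
                .stream(Stream.of(scope, streamName), head, tail)
                .build();

        try (ReaderGroupManager groups = ReaderGroupManager.withScope(scope, controller)) {
            groups.createReaderGroup("replayGroup", groupConfig);
        }

        ClientConfig clientConfig = ClientConfig.builder().controllerURI(controller).build();
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "replayGroup", new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            EventRead<String> read = reader.readNextEvent(2000);
            while (read.getEvent() != null) {
                System.out.println(read.getEvent());            // only events between the two cuts
                read = reader.readNextEvent(2000);
            }
        }
    }
}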

There are a few problems worth solving here. First, the timestamp of an event is not a user-friendly field and is inconvenient for comparison and sorting. Queries must maintain their own sequence numbers if they want to sort event references in arrays. Given a set of events, there is no lookup table that tells where they are located in a stream. Client applications that issue such queries against the store have to build cumbersome logic themselves or delegate it to stream-processing packages such as Flink. Even with Flink, the iteration cannot be avoided, since its syntax relies on processing events one by one, which is still O(N), and N tends to be large in a stream.
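
To make that burden concrete, here is a hypothetical helper showing what a client has to do today: number the events itself during a full O(N) pass and materialize the numbered list before any sorting can happen. The reader and event types follow the Pravega client API; everything else in this sketch is made up for illustration.

import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class NumberedScan {

    /**
     * Reads every remaining event, attaching the client's own sequence number.
     * This is the cumbersome part: sorting or random access requires first
     * materializing the whole (sequence number, event) list in memory.
     */
    public static List<Map.Entry<Long, String>> numberEvents(EventStreamReader<String> reader) {
        List<Map.Entry<Long, String>> numbered = new ArrayList<>();
        long seq = 0;
        EventRead<String> read = reader.readNextEvent(2000);
        while (read.getEvent() != null) {
            numbered.add(new AbstractMap.SimpleEntry<>(seq++, read.getEvent()));
            read = reader.readNextEvent(2000);
        }
        // Any ordering other than arrival order is now an in-memory sort.
        numbered.sort(Map.Entry.comparingByValue());
        return numbered;
    }

    private NumberedScan() { }
}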

Assuming that events are sequenced by numbers and that a client would like to skip ahead by increasing multiples of two, there are no ready-made links between events, as there would be in a skip list. Events lose their relevance as they age and are superseded by newer events. A certain portion towards the head of the stream marks the cold portion of the stream, where events cease to be as relevant or meaningful as those arriving at the tail end. By specifying a retention period, the head is folded towards the warm range of the stream, but the data is gone if it was in the truncated segments. If there were metadata, such as a table segment holding key-value pairs, that allowed events or even stream ranges to carry auxiliary skip-level-access information, the traversal could be made logarithmically faster.
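
A minimal sketch of the skip-level metadata argued for above, modeled as an ordered map from sequence number to an opaque serialized StreamCut. In the stream store this map would live in a table segment; the class, the power-of-two spacing policy, and the string encoding of cuts are all assumptions made for illustration.

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Skip-level index sketch: StreamCuts are recorded only at exponentially
 * spaced sequence numbers, in the spirit of a skip list. Positioning a
 * reader near event n then costs a logarithmic map lookup plus a short
 * forward scan from the nearest recorded cut, instead of a traversal
 * from the head of the stream.
 */
public class SkipLevelIndex {
    private final NavigableMap<Long, String> cutBySeq = new TreeMap<>();

    /** Record a cut only when the sequence number is a power of two. */
    public void offer(long seq, String serializedStreamCut) {
        if (seq > 0 && (seq & (seq - 1)) == 0) {
            cutBySeq.put(seq, serializedStreamCut);
        }
    }

    /** Nearest recorded cut at or before the requested sequence number. */
    public Map.Entry<Long, String> floor(long seq) {
        return cutBySeq.floorEntry(seq);
    }
}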

This kind of metadata, introducing newer and faster iteration semantics, allows movement between the cold, warm, and hot segment ranges of a stream in both forward and backward directions by readjusting the head to the start of the event to be read. The metadata is isolated and stored in a data structure different from the one used for the data itself, so there is little or no risk of corrupting the original events in the stream. A mere shuffling of StreamCuts is sufficient to jump back and forth quickly between different sequences of events.
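
Under the same assumptions as the sketches above, jumping between, say, a cold range near the head and a hot range near the tail reduces to building a new reader-group configuration from a different pair of cuts recovered from the index; the stored events themselves are never touched. A hypothetical helper using the Pravega ReaderGroupConfig API:

import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;

public final class RangeJump {

    /**
     * Builds a reader-group configuration bounded by two cuts recovered from the
     * skip-level metadata, for example a cold range near the head or a hot range
     * near the tail. Only metadata is exchanged; no stored event is rewritten.
     */
    public static ReaderGroupConfig between(String scope, String streamName,
                                            StreamCut from, StreamCut to) {
        return ReaderGroupConfig.builder()
                .stream(Stream.of(scope, streamName), from, to)
                .build();
    }

    private RangeJump() { }
}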

Since the events maintain their arrival order in the stream, the metadata created for them also tends to change incrementally, with the nice side effect that code working on an earlier range does not have to change.

There is already a batch reader that can read a historical segment range between a head and a tail StreamCut of a stream. By supplying different head and tail references from the metadata described above, it becomes possible to access events out of their original chronological order and with fewer reads.
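
A hedged sketch of that path using Pravega's batch client, which reads only the segment ranges lying between a start and an end StreamCut instead of iterating from the head. The endpoint, scope, stream name, and serializer are placeholders, and UNBOUNDED again stands in for cuts that would come from the skip-level metadata.

import io.pravega.client.BatchClientFactory;
import io.pravega.client.ClientConfig;
import io.pravega.client.batch.SegmentIterator;
import io.pravega.client.batch.SegmentRange;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.StreamCut;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;
import java.util.Iterator;

public class HistoricalRead {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))   // placeholder endpoint
                .build();

        // Head and tail cuts would come from the skip-level metadata; UNBOUNDED
        // here simply means "whole stream" and stands in for real saved cuts.
        StreamCut head = StreamCut.UNBOUNDED;
        StreamCut tail = StreamCut.UNBOUNDED;

        try (BatchClientFactory batch = BatchClientFactory.withScope("examples", config)) {
            Iterator<SegmentRange> ranges =
                    batch.getSegments(Stream.of("examples", "sensor-events"), head, tail)
                         .getIterator();
            while (ranges.hasNext()) {
                try (SegmentIterator<String> events =
                             batch.readSegment(ranges.next(), new UTF8StringSerializer())) {
                    while (events.hasNext()) {
                        System.out.println(events.next());   // events only from the chosen range
                    }
                }
            }
        }
    }
}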

