Wednesday, September 16, 2020

Metrics continued

We were discussing a set of features for the stream store that bring the ability to access events in sorted order with skip-level traversal. The events in a stream can be considered to be in some predetermined sequence, whether by offset or by timestamp, and these sequence numbers are in sorted order. Accessing any event, treated as if it were a batch bounded by a head StreamCut immediately before it and a tail StreamCut immediately after it, is then better than a linear traversal to read the event: access to any event in the historical set of events in the stream becomes O(log N). The skip-level access links, in the form of head and tail StreamCuts, can easily be built into the metadata on a catch-up basis after the events have accrued in the stream.
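As a rough sketch of how such a lookup could work, a binary search over a sorted metadata index of sequence numbers and their bounding StreamCuts is enough; StreamCutPair, index, and locate below are hypothetical names for illustration, not the store's actual API.

import java.util.List;

class SkipLevelLookup {
    // Hypothetical index entry: a sequence number and its bounding StreamCuts.
    record StreamCutPair(long sequenceNumber, String head, String tail) {}

    // Binary search over the sorted metadata index instead of a linear scan: O(log N).
    static StreamCutPair locate(List<StreamCutPair> index, long sequenceNumber) {
        int lo = 0, hi = index.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long seq = index.get(mid).sequenceNumber();
            if (seq == sequenceNumber) return index.get(mid);
            if (seq < sequenceNumber) lo = mid + 1; else hi = mid - 1;
        }
        return null; // sequence number not present in the index
    }
}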

In addition, we have the opportunity to collect the fields, and the possible values they take, from the events so that they can be leveraged in subsequent queries. This enrichment of the metadata from the events in the stream becomes useful for finding similar events.
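A minimal sketch of what that extraction could look like, assuming events arrive as flat key-value maps; FieldExtractor and fieldValues are illustrative names, not part of the store.

import java.util.*;

class FieldExtractor {
    // Field name -> set of values observed for it across the stream.
    private final Map<String, Set<String>> fieldValues = new HashMap<>();

    void onEvent(Map<String, String> event) {
        event.forEach((field, value) ->
                fieldValues.computeIfAbsent(field, f -> new HashSet<>()).add(value));
    }

    // Fields and values accumulated so far, available to later queries.
    Map<String, Set<String>> snapshot() {
        return Collections.unmodifiableMap(fieldValues);
    }
}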

The use of standard query operators on the events in a stream has been made possible by the Flink programming library, but the logic written with those operators is usually not aware of all the fields that have been extracted from the events. By closing the gap between field extraction and the use of the new fields in query logic, applications can not only improve existing logic but also write new logic.
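As an illustration of such query logic, the small Flink job below filters on a "source" field assumed to have been extracted earlier; the stand-in elements and the job name are assumptions for the sketch.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceFilterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a stream-store source; each element is an event's extracted "source" field.
        env.fromElements("sensor-1", "sensor-2", "sensor-1")
           .filter(source -> source.equals("sensor-1")) // a query operator over an extracted field
           .print();

        env.execute("filter-by-extracted-field");
    }
}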

The extraction of fields and their values provides an opportunity to discover not only the range of values that certain keys can take across all the events in the stream but also their distribution. Events are numerous, and there is no go-to source for statistics about events, especially similar-looking events. Two streams holding similar events may have them in very different order and with very different arrival times. If the stream store is unaware of the contents of the events, it can only tell the number of events and the size they occupy. But with some insight into the events, such as information about their source, a whole new set of metrics becomes available, helping with summary information, point-of-origin troubleshooting, contribution/spread calculation, and better resource allocation.
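A small sketch of the kind of distribution metric this enables, assuming a hypothetical Event type whose source field was extracted earlier.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SourceDistribution {
    // Hypothetical event shape for illustration.
    record Event(String source, int sizeInBytes) {}

    // Count of events per source: the contribution/spread of each origin.
    static Map<String, Long> bySource(List<Event> events) {
        return events.stream()
                .collect(Collectors.groupingBy(Event::source, Collectors.counting()));
    }
}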

Metrics provide handy information about the stream that would otherwise have to be derived by running offline analysis on logs. Summary statistics from the metrics can now be saved with the metadata.

There are a few cases where the metadata might be sparse or absent altogether; these include encrypted data and binary data. The bulk of the use cases for the stream store, however, deal with textual data, so the benefits from the features mentioned above are not marginal.


Client applications can do without such metadata support from the stream store, but it is cumbersome for them because they must inject intermittent events carrying the summary data they would like to publish for their events. These special events can be skipped during data reads from the stream, although, as with any events, they will still need to be iterated over.
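A sketch of how a reader might skip those marker events, assuming a hypothetical envelope with an isSummary flag.

import java.util.List;
import java.util.stream.Collectors;

class SummaryAwareReader {
    // Hypothetical envelope: a flag distinguishes summary markers from data events.
    record Envelope(boolean isSummary, String payload) {}

    static List<String> readData(List<Envelope> events) {
        return events.stream()
                .filter(e -> !e.isSummary()) // markers are iterated over but not returned
                .map(Envelope::payload)
                .collect(Collectors.toList());
    }
}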


Clients would find it inconvenient to write different types of events to the same stream unless they use different payloads within the same envelope or data type. On the other hand, it is easier for clients to have readers and writers handle a single type of event per stream. This favors a model where clients store processing-related information in a dedicated stream. The stream store, however, can create internal streams specifically for the purpose of metadata. The latter approach ties the setup and teardown of the internal streams to the data stream and becomes a managed approach, as sketched below.
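One possible shape of that managed approach, using a hypothetical naming convention that ties an internal metadata stream to its data stream.

class StreamManager {
    // Hypothetical naming convention for the internal metadata stream.
    static String metadataStreamFor(String dataStream) {
        return "_internal/" + dataStream + "/metadata";
    }

    void createStream(String dataStream) {
        create(dataStream);
        create(metadataStreamFor(dataStream)); // setup is tied to the data stream
    }

    void deleteStream(String dataStream) {
        delete(metadataStreamFor(dataStream)); // teardown is tied as well
        delete(dataStream);
    }

    private void create(String name) { /* store-specific creation elided */ }
    private void delete(String name) { /* store-specific deletion elided */ }
}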


The stream store does not force a centralized catalog; it maintains metadata per stream. Streams are isolated, and the readers and writers can scale. The metrics and statistics that become part of the metadata reflect holistic information about the stream. The metadata updates may lag the incoming data rate, but the difference is usually negligible. The metadata also survives a stream store restart, and the updates are typically incremental.
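A minimal sketch of such incremental updates, assuming constant-time accumulation per event with the counters periodically persisted so they survive a restart.

import java.util.concurrent.atomic.AtomicLong;

class StreamMetrics {
    private final AtomicLong eventCount = new AtomicLong();
    private final AtomicLong totalBytes = new AtomicLong();

    // Incremental update: constant work per event, no rescans of the stream.
    void onEvent(int sizeInBytes) {
        eventCount.incrementAndGet();
        totalBytes.addAndGet(sizeInBytes);
    }

    long eventCount() { return eventCount.get(); }
    long totalBytes() { return totalBytes.get(); }
}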

#metric

// Total size of events across all ranges, each range bounded by head/tail StreamCuts.
Integer totalSizeOfEvents = ranges.stream()
        .map(range -> getSizeOfEvents(range))
        .reduce(0, Integer::sum);
