Tuesday, June 30, 2020
Stream operations continued
Monday, June 29, 2020
StreamManager enhancements
Appending a stream to an existing stream is equivalent to a copy operation into an empty stream. Therefore, comparisons between the various usages of append, copy and stitching are left out of this discussion.
There are a number of applications where one strategy works better than another. The choice is left to the discretion of the callers; only the minimal support required from the stream store is described here.
Archiving is another file management capability that is commonly used for export or import. The stream store maintains a metadata file as well as the data, both of which can be persisted on tier 2 storage and exported or imported.
This functionality is not native to the stream store. Yet high level primitives on the stream manager interface alleviate the concerns of out-of-band access and possible tampering. They also make the store easier to program against.
Until now, the stream store has relied largely on open, close, read and write functionality as it tries to keep pace with the demands of the stream processing language. These operations serve mostly saving and analysis purposes, leaving much of the data transfer in the hands of the administrator.
Data import or export mostly belongs outside the analytical applications, but that does not leave it completely out of automations, particularly those that serve data pipelines. Modern applications increasingly target continuous data, and streams serve data pipelines very well with a little more enhancement to the programmability.
The stream manager is also the right interface to add the capabilities that these automation workflows require at the stream store level. Most of the existing methods on that interface target a stream and operations taken on the stream as a whole, and the automation workflows are served by equivalent commands on the same stream. The new capabilities can therefore be added cleanly without breaking compatibility with earlier releases.
One of the advantages of providing this capability on the stream manager is that applications can close the manager when they are done with it, allowing proper finalization in all cases. The stream manager interface already comes with a close method to facilitate this.
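As a rough illustration, a hedged sketch of what such an enhanced interface might look like follows. The exportStream and importStream methods are hypothetical additions for the archiving use case, not part of any released client library; only the whole-stream style of methods and the close method mirror the existing interface.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Hypothetical extension of a stream manager interface; the export/import
// methods are illustrative only and not part of any released client library.
public interface ArchivingStreamManager extends AutoCloseable {

    // existing style of whole-stream operations
    boolean createStream(String scope, String streamName);
    boolean deleteStream(String scope, String streamName);

    // assumed additions: archive the stream (data plus metadata) to tier 2
    // and restore it from an archive, so that automation workflows do not
    // need out-of-band access to the underlying files
    void exportStream(String scope, String streamName, OutputStream archive);
    void importStream(String scope, String streamName, InputStream archive);

    // the interface already supports proper finalization
    @Override
    void close();

    static ArchivingStreamManager create(URI controllerUri) {
        throw new UnsupportedOperationException("sketch only");
    }
}

A caller could then open the manager in a try-with-resources block so that close is invoked even when an export or import fails midway.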
Sunday, June 28, 2020
Application troubleshooting continued
Ephemeral streams
Stream processing, unlike batch processing, runs endlessly as and when new events arrive. Applications use state persistence, watermarks, side outputs, and metrics to monitor progress. These collections serve no purpose beyond the processing itself and become a waste of resources when there are several processors. If they are disposed of at the end of processing, they are ephemeral in nature.
There is no construct in Flink or the stream store client libraries to automatically clean up at the end of stream processing. The cleanup usually falls on the application, and many applications don't do it, leaving behind lots of stale collections. The problem is aggravated when data is copied from system to system or scope to scope.
The best way to tackle auto-closure of streams is to tie it to the constructor or destructor and invoke the cleanup from their respective resource providers.
If the resources cannot be scoped because they are shared, then workers are most likely attaching to or detaching from these resources. In such cases, the resources can be shut down globally, just like the workers, at the exit of the process. A sketch of both approaches follows.
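This is a minimal sketch, assuming a hypothetical EphemeralStreams helper that remembers the intermediate collections it created; the names and delete calls are illustrative, not part of any client library.

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that owns the ephemeral collections of a processor
// and disposes of them when its lifetime ends.
public final class EphemeralStreams implements AutoCloseable {

    private final List<String> created = new ArrayList<>();

    public String createIntermediateStream(String name) {
        created.add(name);             // remember what we created...
        return name;                   // ...so it can be disposed of later
    }

    @Override
    public void close() {
        created.forEach(this::delete); // cleanup tied to the destructor
    }

    private void delete(String name) {
        System.out.println("deleting ephemeral stream " + name);
    }

    public static void main(String[] args) {
        // scoped resources: cleanup is tied to the resource provider's lifetime
        try (EphemeralStreams scratch = new EphemeralStreams()) {
            scratch.createIntermediateStream("watermarks");
            scratch.createIntermediateStream("side-output");
            // ... stream processing would run here ...
        }

        // shared resources: clean up globally at process exit
        EphemeralStreams shared = new EphemeralStreams();
        Runtime.getRuntime().addShutdownHook(new Thread(shared::close));
    }
}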
User Interface
Streaming applications are a new breed. Unlike time series databases, which have popular dashboards and their own query languages, the dashboards for stream processing are constructed one at a time. Instead, dashboarding could evolve into something like Tableau or a visualization library that decouples the rendering from the continuous data source. Even metrics dashboards and time series data support a notion of windows and historical queries; they become inadequate for stream processing only because of the query language, which in some cases is proprietary. It would be much better if they could reuse the stream processing language and construct queries that can be fed to active applications already running in the Flink Runtime.
Each query can be written to an artifact, deployed and executed on the FlinkCluster in much the same way that a lambda runs in serverless computing, except that in this case it is hosted on the FlinkCluster. The query initiation is a one-time cost, and its execution does not bear that cost again and again. The execution proceeds continuously, producing windowed results that can be streamed to the dashboard. Making web requests and responses between the application and the dashboard is easy because the application has network connectivity.
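For example, a minimal sketch of such a query artifact using Flink's Table API (1.11 or later) is shown below. The 'events' table with its event_time attribute and the 'dashboard_sink' table are assumptions, registered elsewhere through connector DDL or a catalog; only the windowed aggregation itself is shown.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DashboardQueryJob {
    public static void main(String[] args) {
        // streaming-mode table environment; this class is the deployable artifact
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // a tumbling-window count whose results stream continuously to the sink
        tEnv.executeSql(
                "INSERT INTO dashboard_sink " +
                "SELECT TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start, " +
                "       COUNT(*) AS event_count " +
                "FROM events " +
                "GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)");
    }
}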
Query Adapters
Many dashboards from different analysis stacks, such as InfluxDB and SQL databases, come with their own query language. These queries are not all similar to stream processing, but they can be expressed using Apache Flink, so these dashboards can be supported if the queries are translated. The Flink language has support for working with tables and batches. The translation is easier with parametrization and templates for common queries, as sketched below.
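As an illustration only, a template-based translation might look like the following. The InfluxQL-style input shape, the template and the parameter names are assumptions, not an existing adapter.

import java.util.Map;

// Illustrative sketch of a template-driven query adapter: a common dashboard
// query shape is mapped onto a parametrized Flink SQL template.
public class QueryAdapter {

    private static final String WINDOWED_COUNT_TEMPLATE =
            "SELECT TUMBLE_START(%s, INTERVAL '%d' MINUTE) AS window_start, " +
            "COUNT(*) AS cnt FROM %s GROUP BY TUMBLE(%s, INTERVAL '%d' MINUTE)";

    // e.g. an InfluxQL-style "SELECT count(*) FROM events GROUP BY time(1m)"
    // is reduced to three parameters before being rendered as Flink SQL
    public static String translateWindowedCount(Map<String, String> params) {
        String table = params.get("table");
        String timeColumn = params.get("timeColumn");
        int minutes = Integer.parseInt(params.get("windowMinutes"));
        return String.format(WINDOWED_COUNT_TEMPLATE,
                timeColumn, minutes, table, timeColumn, minutes);
    }

    public static void main(String[] args) {
        System.out.println(translateWindowedCount(
                Map.of("table", "events",
                       "timeColumn", "event_time",
                       "windowMinutes", "1")));
    }
}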
Instead of query adapters, it is also possible to import and export data to external stacks that have dedicated dashboards. Development environments like Jupyter notebooks can likewise bring in other languages and dashboards if the data is made available. This import and export logic is usually encapsulated in data transformation operators, commonly called extract-transform-load (ETL) operators; a small sketch follows.
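This is a minimal ETL-style sketch with Flink's DataStream API; the in-memory source and the print sink are placeholders standing in for the stream store connector and the external stack's writer.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExportTransformLoad {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // extract: an in-memory source stands in for the stream store connector
        DataStream<String> events = env.fromElements("a,1", "b,2", "c,3");

        // transform: reshape each record into the format the external stack expects
        DataStream<String> exported = events.map(new MapFunction<String, String>() {
            @Override
            public String map(String line) {
                String[] parts = line.split(",");
                return "{\"key\":\"" + parts[0] + "\",\"value\":" + parts[1] + "}";
            }
        });

        // load: the print sink stands in for the external dashboard's store
        exported.print();

        env.execute("export-transform-load");
    }
}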
Saturday, June 27, 2020
Use of cache by stream store clients
Friday, June 26, 2020
Use of cache by stream store clients
Thursday, June 25, 2020
Use of cache by stream store clients (continued from previous article)
The stitching of segment ranges can be performed in the cache instead of the stream store, as per the design mentioned yesterday. Even a message queue broker or web-accessible storage can benefit from this way of conflating streams.
As long as the historical readers follow the same order as the segmentRanges in the source stream, the events are guaranteed to be written in the same order in the destination stream. This calls for the historical readers to send their payloads to a blocking queue, with the writer issuing its writes in the same order as the entries in the queue. The blocking queue backed by the cache becomes a serializer of the different segmentRanges to the destination stream, and the writes can be very fast when all the events for a segmentRange are available in memory. A sketch of this arrangement follows.
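This minimal sketch uses a single producer thread that walks the segmentRanges in source order; in practice several historical readers could prefetch ranges into the cache as long as they enqueue in that same order. Event, readSegmentRange and writeToDestination are placeholders for the actual reader and writer client calls, not an existing API.

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SegmentRangeSerializer {

    static class Event {
        final String payload;
        Event(String payload) { this.payload = payload; }
    }

    private static final Event POISON = new Event(null); // end-of-input marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
        List<String> segmentRanges = List.of("range-0", "range-1", "range-2");

        // the producer enqueues segmentRanges in source order, so the queue
        // preserves that order for the single writer
        Thread readers = new Thread(() -> {
            try {
                for (String range : segmentRanges) {
                    for (Event e : readSegmentRange(range)) {
                        queue.put(e);
                    }
                }
                queue.put(POISON);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });

        // the single writer drains the queue, so destination order == source order
        Thread writer = new Thread(() -> {
            try {
                Event e;
                while ((e = queue.take()) != POISON) {
                    writeToDestination(e);
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });

        readers.start();
        writer.start();
        readers.join();
        writer.join();
    }

    // stand-in for a historical reader fetching a whole segmentRange into memory
    private static List<Event> readSegmentRange(String range) {
        return List.of(new Event(range + "-event-0"), new Event(range + "-event-1"));
    }

    // stand-in for the writer appending to the destination stream
    private static void writeToDestination(Event e) {
        System.out.println("writing " + e.payload);
    }
}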
The stitching of streams into a single stream at the stream store differs considerably from the same logic in the clients. The stream store is not only the best place to do it but also the authoritative one. Clients can conflate a stream from a source to a destination regardless of whether the destination is the same store from which the stream was read, but it is the stream store that can maintain the integrity of the streams as they are stitched together. The resulting stream is guaranteed to be the true one, as opposed to the result produced by any other actor.
The conflation of streams at the stream store is an efficient operation because the events in the original streams become part of the conflated stream when the original streams do not need to retain their identity. The events are not copied one by one in this case. The entire segmentRange of each stream to be conflated simply has its metadata replaced with that of the new stream. Rewriting the metadata across all the participating streams in the conflation is then a low-cost, very fast operation.
The stream store does not need anything more complicated than a copyStream operation whose result can then be appended to a stream builder. Each stream already knows how to append an event at its tail. If a collection of events cannot be appended directly, a copyStream operation enables the events to be duplicated, and since the copy does not have an identity, it can be stitched with the builder stream by rewriting the metadata. This could also be done on an event-by-event basis, but that is already exposed via the reader and writer interfaces. Instead, the copyStream operation on the stream manager can handle a collection of events at a time, as sketched below.
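This is a hedged sketch of what the pair of operations might look like on the store side; copyStream, StreamBuilder, SegmentRange and the metadata rewrite shown here are hypothetical names for illustration, not an existing API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of store-side conflation: copyStream produces an
// identity-less copy of the segmentRanges, and the builder stitches copies
// together by rewriting their metadata to point at the destination stream.
public class StreamBuilderSketch {

    static class SegmentRange {
        String owningStream;          // metadata: which stream the range belongs to
        final List<String> events;    // the event payloads themselves are not moved

        SegmentRange(String owningStream, List<String> events) {
            this.owningStream = owningStream;
            this.events = events;
        }
    }

    // copyStream: duplicate the ranges without assigning them a stream identity
    static List<SegmentRange> copyStream(List<SegmentRange> source) {
        List<SegmentRange> copy = new ArrayList<>();
        for (SegmentRange r : source) {
            copy.add(new SegmentRange(null, r.events)); // no identity yet
        }
        return copy;
    }

    // the builder appends ranges to the destination by metadata rewrite only
    static class StreamBuilder {
        private final String destination;
        private final List<SegmentRange> stitched = new ArrayList<>();

        StreamBuilder(String destination) { this.destination = destination; }

        StreamBuilder append(List<SegmentRange> ranges) {
            for (SegmentRange r : ranges) {
                r.owningStream = destination;  // low-cost rewrite, no event copies
                stitched.add(r);
            }
            return this;
        }
    }

    public static void main(String[] args) {
        List<SegmentRange> sourceA = List.of(new SegmentRange("a", List.of("e1", "e2")));
        List<SegmentRange> sourceB = List.of(new SegmentRange("b", List.of("e3")));

        new StreamBuilder("conflated")
                .append(copyStream(sourceA))
                .append(copyStream(sourceB));
    }
}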