Query languages and query operators have made writing business logic easy and largely independent of the data source. This suffices for the most part, but there are a few cases where the status quo is just not enough. With real-time processing needs and high-priority queries, the size of the data, the complexity of the computation and the latency of the response all begin to become concerns.
Databases have a long history of encountering and mitigating slow query responses. However, relational databases pose a significantly different set of considerations from NoSQL storage, primarily because their layered and interconnected data calls for scale-up rather than scale-out technologies. Each has its own performance tuning considerations.
Stream storage is no different: it also suffers from performance issues with disparate queries ranging from small to big. The combined effect of append-only data and the need to evaluate a stream in windows makes iteration difficult. The processing of the streams is also performed outside the storage, which adds significant round-trip time.
The Apache stack has significantly improved its stream processing offerings. Apache Kafka and Flink both support stateful processing. They can persist state so that processing picks up where it left off. This persistence also helps with fault tolerance, protecting against failures that would otherwise cause data loss. Flink additionally provides a checkpointing mechanism that captures a consistent snapshot of the state and can persist the local state to a remote store. Stream processing applications often take their incoming events from an event log. The event log stores and distributes event streams, which are written to a durable append-only log on tier 2 storage where they remain ordered by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and resetting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only to fault tolerance but also to reentrant processing, and it improves robustness by making corrections possible. It has become the norm for event-driven applications, data pipeline applications and data analytics applications.
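The recovery idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Flink's API: the function name run, the JSON checkpoint file and the fail_at parameter are all hypothetical. The point it shows is that the state and the read position on the event log are saved together, so a restart resumes from the saved offset and replays only the events after it.

```python
import json
import os


def run(event_log, checkpoint_path, checkpoint_every=3, fail_at=None):
    """Sum values per key, checkpointing state and log offset together.

    event_log is a list of (key, value) pairs standing in for a durable,
    append-only log. All names here are hypothetical, for illustration only.
    """
    state, offset = {}, 0

    # On start, restore the last snapshot if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            snapshot = json.load(f)
        state, offset = snapshot["state"], snapshot["offset"]

    # Resume reading the log at the position recorded in the checkpoint.
    for position in range(offset, len(event_log)):
        if fail_at is not None and position == fail_at:
            raise RuntimeError("simulated crash")

        key, value = event_log[position]
        state[key] = state.get(key, 0) + value

        if (position + 1) % checkpoint_every == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write never leaves a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"state": state, "offset": position + 1}, f)
            os.replace(tmp_path, checkpoint_path)

    return state
```

If the first run crashes partway through, calling run again with the same checkpoint path gives the same totals as an uninterrupted run, because the events processed after the last checkpoint are read again from the log rather than lost or double counted.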
Persisting streams for intermediate executions helps with reusability and improves the pipelining of operations, so the query operators stay small and can be executed by independent actors. If we have the equivalent of lambda processing on persisted streams, pipelining can significantly improve performance in cases where the logic was previously monolithic and slow to progress from window to window. There is no distinct rule of thumb, but fine-grained operators have proven effective since each can be studied on its own.
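As a rough sketch of what small operators over persisted intermediate streams might look like, the following plain Python uses a dictionary as a stand-in for stream storage. Everything here is an assumption made for illustration: the operator names (parse, keep_hot, count_by_sensor), the run_stage helper, the stream names and the record layout are all hypothetical.

```python
def parse(events):
    """Operator 1: turn raw 'sensor,reading' lines into records."""
    for raw in events:
        sensor, reading = raw["line"].split(",")
        yield {"sensor": sensor, "reading": float(reading)}


def keep_hot(events, threshold=30.0):
    """Operator 2: keep only readings above the threshold."""
    for event in events:
        if event["reading"] > threshold:
            yield event


def count_by_sensor(events):
    """Operator 3: count the remaining events per sensor."""
    counts = {}
    for event in events:
        counts[event["sensor"]] = counts.get(event["sensor"], 0) + 1
    for sensor, count in sorted(counts.items()):
        yield {"sensor": sensor, "count": count}


def run_stage(store, source, sink, operator):
    """Read one persisted stream, apply one operator, persist the result.

    store is a dict mapping stream names to lists of events; it stands in
    for stream storage in this sketch.
    """
    store[sink] = list(operator(store[source]))
    return store[sink]


store = {"raw": [{"line": "s1,31.5"}, {"line": "s2,22.0"},
                 {"line": "s1,35.0"}, {"line": "s2,40.1"}]}
run_stage(store, "raw", "parsed", parse)
run_stage(store, "parsed", "hot", keep_hot)
run_stage(store, "hot", "counts", count_by_sensor)
```

Because each stage writes its output under its own name, a stage can be rerun, inspected or handed to a different worker without repeating the stages before it.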
Streams that articulate the intermediate result also help determine what goes into each stage of the pipeline. Watermarks and savepoints are similarly helpful. This kind of persistence proves to be a win-win for both parallelization and subsequent processing, even though disk access used to be costly in dedicated systems. There is no limit to the number of operators, or to their scale, that can be applied to streams, so proper planning reduces the effort needed to choreograph a bulky search operation.
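To make the role of watermarks concrete, here is a minimal sketch in plain Python of tumbling windows that are emitted once a watermark passes their end. It is not Flink's implementation; the function name tumbling_window_sums and its parameters are hypothetical, and the watermark is assumed to trail the largest timestamp seen so far by a fixed allowance for out-of-order events.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size, max_out_of_orderness):
    """Yield (window_start, total) once the watermark passes a window's end.

    events is an iterable of (timestamp, value) pairs that may arrive
    slightly out of order. Names and parameters are hypothetical.
    """
    open_windows = defaultdict(int)
    watermark = float("-inf")

    for timestamp, value in events:
        window_start = timestamp - timestamp % window_size

        # An event whose window has already been emitted is too late: drop it.
        if window_start + window_size <= watermark:
            continue

        open_windows[window_start] += value

        # The watermark trails the largest timestamp seen so far.
        watermark = max(watermark, timestamp - max_out_of_orderness)

        # Emit every window that the watermark has now passed.
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                yield start, open_windows.pop(start)

    # End of input: flush the windows that are still open.
    for start in sorted(open_windows):
        yield start, open_windows.pop(start)
```

A window stays open for a while after its nominal end so that slightly late events still count toward it, and each emitted result could itself be appended to an intermediate stream for the next operator to read.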
These are some of the considerations for the performance improvement of stream processing.