Tuesday, August 13, 2019

Summary of the book "Stream processing with Apache Flink"
The use of windows helps process the events in sequential nature. The order is maintained with the help of virtual time and this helps with the distributed processing as well. It is a significant win over traditional batch processing because the events are continuously ingested and the latency for the results is low.
The stream processing is facilitated with the help of DataStream API instead of the DataSet API which is more suitable for batch processing.  Both the APIs operate on the same runtime that handles the distributed streaming data flow. A streaming data flow consists of one or more transformations between a source and a sink.
The difference between Kafka and Flink is that Kafka is better suited for data pipelines while Flink is better suited for streaming applications. Kafka tends to reduce the batch size for the processing. Flink tends to run an operator on a continuous basis, processing one record at a time. Both are capable of running in standalone mode.  Kafka may be deployed in a high availability cluster mode whereas Flink can scale up in a standalone mode because it allows iterative processing to take place on the same node. Flink manages its memory requirement better because it can perform processing on a single node without requiring cluster involvement. It has a dedicated master node for co-ordination.
Both systems are able to execute with stateful processing. They can persist the states to allow the processing to pick up where it left off. The states also help with fault tolerance. This persistence of state protects against failures including data loss. The consistency of the states can also be independently validated with a checkpointing mechanism also available from Flink. The checkpointing can persist the local state to a remote store.  Stream processing applications often take in the incoming events from an event log.  This event log therefore stores and distributes event streams which are written to durable append only log on tier 2 storage where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint. It will adjust the read position on the event log to match the state from the checkpoint. Stateful stream processing is therefore not only suited for fault tolerance but also reentrant processing and improved robustness with the ability to make corrections. Stateful stream processing has become the norm for event-driven applications, data pipeline applications and data analytics application.

No comments:

Post a Comment