Monday, August 12, 2019

Today we continue reading the book "Stream Processing with Apache Flink". We discussed that Apache Flink is an open-source stream processing framework featuring 1) low latency, 2) high throughput, 3) stateful processing, and 4) distributed processing. Flink is a stream processor that can scale out. It can perform arbitrary processing on windows of events while storing and accessing intermediary data. This persistence of state protects against failures, including data loss. The consistency of the state is maintained with a checkpointing mechanism, which can persist the local state to a remote store.

Stream processing applications often take in their incoming events from an event log. The event log stores and distributes event streams, which are written to a durable append-only log on tier-2 storage where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only for fault tolerance but also for reentrant processing and improved robustness, with the ability to make corrections.
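To make the recovery idea concrete, here is a minimal sketch in plain Python (not Flink's actual API): a stateful consumer checkpoints its state together with its read position in an append-only event log, and recovers by restoring the snapshot and rewinding the log. The `StatefulCounter` class and the sample log are hypothetical, chosen only to illustrate the mechanism.

```python
# Illustrative sketch, not Flink code: checkpoint state + log position,
# then recover by restoring both and replaying the log from that position.

event_log = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]  # durable, append-only, time-ordered

class StatefulCounter:
    def __init__(self):
        self.state = {}       # running sum per key (the "local state")
        self.position = 0     # read position in the event log
        self.checkpoint = None

    def take_checkpoint(self):
        # Persist a consistent snapshot of state plus log position
        # (in Flink this would go to a remote store, e.g. a DFS).
        self.checkpoint = (dict(self.state), self.position)

    def process(self, count):
        for key, value in event_log[self.position:self.position + count]:
            self.state[key] = self.state.get(key, 0) + value
            self.position += 1

    def recover(self):
        # Restore state and rewind the read position to the checkpoint.
        state, position = self.checkpoint
        self.state, self.position = dict(state), position

counter = StatefulCounter()
counter.process(2)
counter.take_checkpoint()   # snapshot after two events
counter.process(2)          # more progress...
counter.recover()           # ...then a simulated failure and recovery
counter.process(2)          # reprocess from the checkpointed position
print(counter.state)        # {'a': 4, 'b': 6}
```

Note that after recovery the final state is identical to a failure-free run, which is the point of pairing the state snapshot with the log read position.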
The use of windows helps process the events sequentially. Order is maintained with the help of event time, which also helps with distributed processing. This is a significant win over traditional batch processing because events are continuously ingested and the latency for the results is low.
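A short sketch of the windowing idea, again in plain Python rather than Flink's window API: timestamped events are grouped into fixed-size tumbling windows keyed by event time, so each window's results can be emitted as soon as it is complete instead of waiting for a whole batch. The function name and sample data are illustrative assumptions.

```python
# Illustrative sketch: assign events to tumbling event-time windows.

def tumbling_windows(events, size):
    """events: iterable of (timestamp, value); size: window length."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size          # window this event falls into
        windows.setdefault(start, []).append(value)
    # Emit (window_start, window_end, values) in time order.
    return [(start, start + size, vals) for start, vals in sorted(windows.items())]

events = [(1, "a"), (4, "b"), (6, "c"), (11, "d")]
print(tumbling_windows(events, 5))
# [(0, 5, ['a', 'b']), (5, 10, ['c']), (10, 15, ['d'])]
```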
Stream processing is facilitated with the help of the DataStream API, as opposed to the DataSet API, which is more suitable for batch processing. Both APIs operate on the same runtime, which handles the distributed streaming dataflow. A streaming dataflow consists of one or more transformations between a source and a sink.
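The source-transformations-sink shape of a dataflow can be sketched with plain Python generators (this is not the DataStream API itself, just an analogy for how a pipeline streams records through a chain of transformations without materializing intermediate results):

```python
# Illustrative sketch: a streaming dataflow as source -> transform -> sink.

def source():
    # A source yields events one at a time (e.g. from a socket or event log).
    for n in [1, 2, 3, 4, 5]:
        yield n

def transform(stream):
    # A transformation filters and maps the stream lazily.
    for n in stream:
        if n % 2 == 1:
            yield n * 10

def sink(stream):
    # A sink consumes the stream (e.g. writing to a file or database).
    return list(stream)

result = sink(transform(source()))
print(result)  # [10, 30, 50]
```

Because each stage pulls from the previous one lazily, records flow through the pipeline continuously, which mirrors how a streaming runtime keeps latency low.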

