Sunday, August 11, 2019

Today we start reading the book "Stream Processing with Apache Flink". Apache Flink is an open-source stream processing framework. Its salient features include:
1) low latency
2) high throughput
3) stateful processing
4) distributed execution
Earlier frameworks delivered the first two separately: Apache Storm emphasized low latency, while Apache Spark emphasized high throughput. Apache Flink brings both, with the ability to process a large number of parallel streams. Although high throughput has traditionally been achieved by multiplexing different streams over a publish/subscribe bus, Flink stands out as a stream processor in which events are handled in near-real time, unlike batch processing. Latency on a batch processing system can be on the order of hours, whereas on a stream processor such as Flink it can be under a second. In fact, Flink can treat batch processing as a special case of stream processing.
Similarly, data at rest has traditionally implied a database-centric architecture. Cloud computing trends, on the other hand, have increased the need to analyze data in transit. A stream processor is well suited to this purpose, even for archivable data such as logs.
Apache Flink is also a distributed stream processor, with the capability to scale out. As it does so, it can perform arbitrary processing on windows of events, with storage of and access to intermediate data. This persistence of state protects against failures, including data loss. The consistency of the state can also be validated independently through Flink's checkpointing mechanism, which can persist local state to a remote store. Stream processing applications often take their incoming events from an event log. The event log stores and distributes event streams, which are written to a durable, append-only log on secondary storage where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only for fault tolerance but also for reprocessing and improved robustness, with the ability to make corrections. It has become the norm for event-driven applications, data pipelines, and data analytics applications.
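The checkpoint-and-replay recovery described above can be sketched in plain Python. This is only an illustration of the mechanism, not Flink's actual API: the RunningCount operator, the checkpoint file layout, and the in-memory event log are all invented for the example.

```python
import json
import os
import tempfile

class RunningCount:
    """A stateful operator: counts events per key."""
    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

def checkpoint(path, offset, state):
    # Persist the operator state together with the event-log offset
    # it reflects (Flink similarly snapshots state plus read positions).
    with open(path, "w") as f:
        json.dump({"offset": offset, "counts": state.counts}, f)

def restore(path):
    # Rebuild the operator from the last checkpoint; reading the event
    # log should resume at the returned offset.
    with open(path) as f:
        snapshot = json.load(f)
    op = RunningCount()
    op.counts = snapshot["counts"]
    return op, snapshot["offset"]

# An append-only event log, sequential by time.
event_log = ["a", "b", "a", "c", "a", "b"]
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")

# Process the first four events, then take a checkpoint.
op = RunningCount()
for event in event_log[:4]:
    op.process(event)
checkpoint(ckpt, 4, op)

# Simulated failure: the operator's in-memory state is lost. Recover
# from the checkpoint and replay the log from the saved read position.
op, offset = restore(ckpt)
for event in event_log[offset:]:
    op.process(event)

print(op.counts)  # → {'a': 3, 'b': 2, 'c': 1}
```

Because the checkpoint records the read position alongside the state, replaying the log from that offset yields the same result as if the failure had never happened, which is the property that also makes corrections and reprocessing possible.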
