Wednesday, August 14, 2019

We were discussing Apache Kafka and Flink.
Both systems support stateful processing. They can persist state so that processing picks up where it left off, and this persistence also provides fault tolerance by protecting against failures and data loss. The consistency of the state can be independently guaranteed with the checkpointing mechanism available in Flink, which can persist local state to a remote store. Stream processing applications often consume incoming events from an event log. The event log stores and distributes event streams, which are written to a durable, append-only log on tier-2 storage where they remain ordered by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match that checkpoint. Stateful stream processing is therefore suited not only to fault tolerance but also to reprocessing events and making corrections, which improves robustness. It has become the norm for event-driven applications, data pipelines and data analytics applications.
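As a minimal sketch of how this looks in a Flink 1.x job skeleton (the checkpoint interval and the HDFS path below are placeholders, not from the discussion above), checkpointing can be enabled on the execution environment:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 60 seconds.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Persist the local state to a remote, durable store (path is illustrative).
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // ... sources, stateful operators and sinks would be defined here ...

        env.execute("checkpointed-stateful-job");
    }
}

On recovery, Flink restores each operator's state from the latest completed checkpoint and rewinds the source's read position on the event log accordingly.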
A dedicated stream store such as Pravega helps improve analysis performance. This choice of storage guarantees durable, elastic, append-only streams with strong consistency. A stream consists of a sequence of events, and a routing key attached to each event groups related events together. A reader maintains a position marking where it is in the stream. A stream is composed of stream segments; events that share a routing key are written to the same segment.
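As a minimal sketch, assuming the Pravega Java client is available and that a scope named "examples" and a stream named "sensor-events" already exist (all names and the controller URI below are placeholders), a writer publishes events under a routing key roughly like this:

import java.net.URI;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class WriterSketch {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "sensor-events", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            // Events with the same routing key land in the same stream segment
            // and are therefore ordered relative to one another.
            writer.writeEvent("sensor-42", "{\"temperature\": 21.5}").join();
        }
    }
}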
Readers in a reader group directly access the stream segments assigned to them. An event written to a stream goes to exactly one stream segment, and its position and existence are strongly consistent. Checkpoints in a reader group let each reader persist its state and resume from it. A state synchronizer can be used to read and make changes to shared state consistently across a distributed application. ZooKeeper provides the coordination mechanism in a Pravega cluster. Apache BookKeeper maintains a write quorum and implements Tier 1 storage within the cluster, while Tier 2 storage of files or objects is maintained external to the cluster. Internal components can be secured with Transport Layer Security (TLS).
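Continuing the same sketch (group and reader names are again placeholders, and the calls reflect the Pravega Java client as of this writing), a reader group over that stream could be created and consumed as follows:

import java.net.URI;
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class ReaderSketch {
    public static void main(String[] args) throws Exception {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        // Create a reader group over the stream; its segments are distributed
        // among the readers that join the group.
        try (ReaderGroupManager groupManager =
                     ReaderGroupManager.withScope("examples", config)) {
            groupManager.createReaderGroup("analysis-group",
                    ReaderGroupConfig.builder()
                            .stream(Stream.of("examples", "sensor-events"))
                            .build());
        }
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", config);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "analysis-group",
                     new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            EventRead<String> eventRead;
            // readNextEvent blocks up to the timeout; a null event means no
            // event arrived in that window, so this simple loop stops there.
            while ((eventRead = reader.readNextEvent(2000)) != null
                    && eventRead.getEvent() != null) {
                System.out.println("read: " + eventRead.getEvent());
            }
        }
    }
}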
A Java client library is provided for writing to Pravega over a custom TCP wire protocol, and a streaming application that reads and writes events may use it directly. Pravega's security documentation explains role-based access control along with an authentication/authorization API in which the auth parameters are sent in the HTTP Authorization header. The REST server is hosted by the Pravega Controller.
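As an illustrative sketch of the latter, assuming the Controller's REST server is reachable on port 9091 with basic auth enabled (the credentials and the /v1/scopes path below are placeholders rather than confirmed defaults), a client sends the auth parameters in the Authorization header:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RestAuthSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials, base64-encoded for HTTP basic auth.
        String credentials = Base64.getEncoder()
                .encodeToString("admin:secret".getBytes(StandardCharsets.UTF_8));
        URL url = new URL("http://localhost:9091/v1/scopes");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // The auth parameters travel in the HTTP Authorization header.
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            in.lines().forEach(System.out::println);
        }
    }
}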
