Introduction:
This article is a continuation of the series of articles
starting with the description of SignalR service. In this article, we continue with our
study of Azure Stream Analytics from the last article. We were
comparing Apache Flink and Kafka with The Azure Stream Analytics and were
observing the utilization of Kubernetes to leverage containers for the clusters
and running jobs for analysis. One of the most interesting applications of
stream processing is the support for watermarks and we explore this comparison
next.
Flink provides
three different types of processing based on timestamps which are independent
of the above two methods. There can be three different types of timestamps corresponding
to: processing time, event time and ingestion time.
Out
of these only the event time guarantees completely consistent and deterministic
results. All three processing types can be set on
the StreamExecutionEnvironment prior to the execution of
queries.
Event
time also support watermarks. Watermarks is the mechanism in Flink to
measure progress in event time. They are simply inlined with the
events. As a processor advances its timestamp, it introduces a watermark for
the downstream operators to process. In the case of distributed systems where
an operator might get inputs from more than one streams, the watermark on the
outgoing stream is determined from the minimum of the watermarks from the
invoking streams. As the input streams update their event times, so does the
operator. Flink also provides a way to coalesce events within the
window.
Flink-connector
has an EventTimeOrderingOperator. This uses watermark and managed
state to buffer elements which helps to order elements by event
time. This class extends the AbstractStreamOperator and
implements the OneInputStreamOperator. The last seen watermark is
initialized to min value. It uses a timer service
and mapState stashed in the runtime Context. It processes each
stream record one by one. If the event does not have a timestamp, it
simply forwards. If the event has a timestamp, it buffers all the events
between the current and the next watermark.
When
the event Timer fires due to watermark progression, it polls all the event
time stamp that are less than or equal to the current watermark. If
the timestamps are empty, the queued state is cleared otherwise the
next watermark is registered. The sorted list of timestamps from
buffered events is maintained in a priority queue.
AscendingTimestampExtractor is
a timestamp assigner and watermark generator for streams where
timestamps are monotonously ascending. The timestamps continuously increase for
data such as log files. The local watermarks are easily assigned because
they follow the strictly increasing timestamps which are
periodic.
Microsoft
Azure Stream analytics also follows a timeline for events. There are two
choices – arrival time and application/event time. It bases its watermark on the largest event
time the service has seen minus the out of order tolerance window size. If
there are no incoming event, the watermark is the current estimated arrival
time minus the late arrival tolerance window.
This can only be estimated because the real arrival time is on the data forwarders
such as Event Hubs. The design serves two additional purposes other than
generating watermarks – the system generates results in a timely fashion with
or without incoming events and the system behavior needs to be repeatable.
Since the data forwarder guarantees continuously increasing streaming data, the
service disregards configurations for out-of-order tolerance and late arrival
tolerance when analytics applications choose arrival time as event time.
No comments:
Post a Comment