Cluster computing

Saturday, August 10, 2019

Queries on streams also differ depending on the stack involved. For example, when a Flink application runs standalone and a query has been submitted, the user may receive no output for a long while. When the data set is large, the user cannot easily tell whether the delay comes from the time needed to process the data or from incorrectly written logic. The native web interface of Apache Flink provides some support here: it lets the user watch the watermarks, which indicate whether the job is making progress in event time. If no watermarks appear, it is likely that the event-time windows never elapsed. Flink also offers savepoints as a way to capture progress. Users can create, own, and delete savepoints, each of which represents the execution state of a streaming job and points to actual files on storage. Checkpoints, by contrast, are created and deleted by Flink without user intervention.
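To make the link between watermarks and window output concrete, here is a minimal sketch in plain Python. It does not use the Flink API. The names `Event`, `tumbling_window_counts`, and `max_out_of_orderness` are hypothetical, and the watermark rule (largest timestamp seen minus a fixed bound) is an assumed simplification of a bounded-out-of-orderness strategy.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int  # event time in seconds (assumed unit)
    value: str


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Yield (window_start, count) for each window the watermark has passed."""
    open_windows = {}          # window_start -> number of events buffered so far
    watermark = float("-inf")  # no progress in event time yet
    for event in events:
        start = event.timestamp - event.timestamp % window_size
        # Drop late events whose window has already fired (simplification).
        if start + window_size <= watermark:
            continue
        open_windows[start] = open_windows.get(start, 0) + 1
        # The watermark trails the largest timestamp seen by a fixed bound.
        watermark = max(watermark, event.timestamp - max_out_of_orderness)
        # A window produces output only once the watermark reaches its end.
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for window_start in ready:
            yield window_start, open_windows.pop(window_start)
```

With a window of 10 seconds and a bound of 2 seconds, events at timestamps 1, 5, and 9 produce no output at all, because the watermark only reaches 7. The first window is emitted only after an event with timestamp 12 or later arrives. This is the situation described above: a job that looks silent may simply have a watermark that has not advanced far enough.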
On the other hand, streaming queries on some other stacks generally follow a five-step procedure:
1) Define the events in terms of their payload, which is the data values of the event, and their shape, which is the lifetime of the event along the time axis.
2) Define the input streams of events as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.
3) Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query.
4) Bind the query to a consumer. This could be a console. For example, a query that counts the events in three-minute tumbling windows:
var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();
5) Run the query and evaluate it based on time.
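The five steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the API of any particular streaming engine. The names `Event`, `input_stream`, `tumbling_count`, and `run` are hypothetical, and each event is assigned to a window by its start time only.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


# Step 1: an event has a payload (its data) and a shape (its lifetime).
@dataclass(frozen=True)
class Event:
    payload: float
    start: datetime
    end: datetime


# Step 2: the input stream is a simple enumerable over a time interval.
def input_stream(origin, count, spacing):
    for i in range(count):
        start = origin + i * spacing
        yield Event(payload=float(i), start=start, end=start + timedelta(seconds=1))


# Step 3: the query maps the input stream to an output stream of
# (window_start, event_count) pairs over tumbling windows.
def tumbling_count(stream, origin, window):
    counts = Counter()
    for event in stream:
        index = (event.start - origin) // window  # which window the event falls in
        counts[origin + index * window] += 1
    return sorted(counts.items())


# Step 4: bind the query to a consumer; the console (print) is the default.
def run(query_result, consumer=print):
    for window_start, n in query_result:
        consumer(f"{window_start.isoformat()} count={n}")


# Step 5: run the query over one event per minute, in three-minute windows.
origin = datetime(2019, 8, 10)
result = tumbling_count(
    input_stream(origin, 7, timedelta(minutes=1)), origin, timedelta(minutes=3)
)
run(result)
```

Seven events spaced one minute apart fall into three windows holding 3, 3, and 1 events. Passing a different `consumer`, such as the `append` method of a list, redirects the output without changing the query.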
Such a windowed count is probably most succinct in SQL, where a windowing function can be written as:
SELECT COUNT(*)
OVER ( PARTITION BY hash(u.timestamp DIV (60*60*24)) partitions 3 ) u1
FROM graphupdate u;
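As a rough reading of the partitioning expression in that statement, the sketch below models only the day bucketing (`timestamp DIV (60*60*24)`) and the per-row count over each bucket. The `hash(...) partitions 3` clause is dialect-specific and is not modeled here. The function name `count_over_day_partition` is hypothetical, and timestamps are assumed to be in seconds.

```python
from collections import Counter

SECONDS_PER_DAY = 60 * 60 * 24


def count_over_day_partition(timestamps):
    """For each row, return the number of rows that share its day bucket."""
    buckets = [ts // SECONDS_PER_DAY for ts in timestamps]  # integer division, like DIV
    per_bucket = Counter(buckets)
    # Like COUNT(*) OVER (PARTITION BY ...), every input row keeps its own output row.
    return [per_bucket[b] for b in buckets]
```

Unlike a `GROUP BY`, the window form returns one value per input row, so rows in the same day all carry the same count.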
