Transparency in user query execution:
Streaming queries are a new breed. Most platforms like Flink require the query logic to be packaged in a module prior to execution. Although a user interface is provided, much of the execution and its errors remain hidden. Consequently, the user has very limited tools for tracking progress, debugging and troubleshooting in general.
For example, when the Flink application is standalone and a query has been submitted, the user may receive no output for a long while. When the data set is large, the user cannot tell whether the delay comes from the time needed to process the data or from incorrectly written logic. The native web interface for Apache Flink provides some support in this regard. It lets the user watch the watermarks, which indicate whether any progress is being made. If there are no watermarks, it is likely that the event time windows never elapsed.
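The watermark check described above can also be done outside the web interface. The following is a minimal sketch, not a Flink API: it assumes the per-subtask watermarks have already been fetched as a list of `{"id": ..., "value": ...}` entries (the shape Flink's metrics endpoints generally use, which should be verified against the Flink version in use), and the function names `parse_watermarks` and `diagnose` are hypothetical. It relies on Flink reporting Long.MIN_VALUE for a subtask that has not yet received a watermark.

```python
from typing import Dict, Iterable, Mapping

# Flink uses Long.MIN_VALUE as the watermark of a subtask that has not received one yet.
NO_WATERMARK = -(2 ** 63)


def parse_watermarks(metrics: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Turn [{"id": ..., "value": ...}, ...] into {id: watermark in epoch millis}.

    The input shape is an assumption about how the metrics were fetched.
    """
    return {m["id"]: int(m["value"]) for m in metrics}


def diagnose(previous: Mapping[str, int], current: Mapping[str, int]) -> str:
    """Classify progress between two samples taken some time apart."""
    # No subtask has ever seen a watermark: event time windows cannot fire.
    if not current or all(v == NO_WATERMARK for v in current.values()):
        return "no-watermarks"
    # At least one subtask moved forward since the previous sample.
    advanced = any(
        wm > previous.get(subtask, NO_WATERMARK) for subtask, wm in current.items()
    )
    return "advancing" if advanced else "stalled"


if __name__ == "__main__":
    first = parse_watermarks([{"id": "0.currentInputWatermark", "value": "1700000000000"}])
    second = parse_watermarks([{"id": "0.currentInputWatermark", "value": "1700000060000"}])
    print(diagnose(first, second))
```

A result of "no-watermarks" points toward the query logic or the timestamp assignment, while "advancing" suggests the job is simply still working through a large data set.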
Similarly, if the logic requires extract-transform-load of data, resource consumption is likely to rise and overall performance to suffer. This can manifest as a myriad of symptoms such as error messages and failed executions.
The error messages themselves usually suffer from two problems. First, they are not descriptive enough for the user to take immediate corrective action. Second, they generally don’t differentiate between user error and operational error. For example, “an insufficient number of network buffers” does not immediately suggest that parallelism must be reduced. Similarly, a NotSerializableException does not indicate whether the user’s query logic must be changed or whether the data is simply bad.
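One way to close this gap is to annotate raw error messages with a category and a suggested action before showing them to the user. The sketch below is illustrative only: the `Triage` type, the `KNOWN_ERRORS` table and the `triage` function are hypothetical, and both the user/operational labels and the hints are assumptions that a real deployment would have to curate.

```python
from typing import NamedTuple, Optional


class Triage(NamedTuple):
    category: str  # "user" or "operational" (labels assumed for illustration)
    hint: str


# Hypothetical lookup table: message fragment -> suggested triage.
KNOWN_ERRORS = {
    "insufficient number of network buffers": Triage(
        "operational",
        "Reduce the job's parallelism or give the task managers more network memory.",
    ),
    "NotSerializableException": Triage(
        "user",
        "Check the query logic for fields or captured objects that cannot be serialized.",
    ),
}


def triage(message: str) -> Optional[Triage]:
    """Return the first matching triage entry, or None for an unknown message."""
    lowered = message.lower()
    for fragment, result in KNOWN_ERRORS.items():
        if fragment.lower() in lowered:
            return result
    return None


if __name__ == "__main__":
    print(triage("java.io.IOException: Insufficient number of network buffers"))
```

Unknown messages return None on purpose, so that the raw error is passed through rather than mislabeled.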
The absence of a progress bar on the UI and the requirement that the user follow Flink conventions only make troubleshooting more difficult. The user has Flink constructs such as savepoints to interpret progress. Users can create, own or delete savepoints, which represent the execution state of a streaming job. These savepoints point to actual files on storage. If access to the savepoints becomes restricted or unavailable for some reason, troubleshooting is impaired. Contrast this with checkpoints, which Flink creates and deletes without user intervention. While checkpoints are focused on recovery, much more lightweight than savepoints, and bound to the job lifetime, they can become equally effective diagnostic mechanisms.
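Because savepoints are files on storage, their availability can be checked before troubleshooting depends on them. The following is a minimal sketch that assumes the savepoint lives on a local or mounted filesystem; remote object stores would need their own client. The function name `check_savepoint` and its status strings are hypothetical. The check looks for the `_metadata` file that Flink writes into a savepoint directory.

```python
import os
import tempfile


def check_savepoint(path: str) -> str:
    """Report whether a savepoint directory looks usable from this machine."""
    if not os.path.isdir(path):
        return "missing"
    # Listing the directory requires read and execute permission on it.
    if not os.access(path, os.R_OK | os.X_OK):
        return "unreadable"
    metadata = os.path.join(path, "_metadata")
    if not os.path.isfile(metadata):
        return "no-metadata"
    if not os.access(metadata, os.R_OK):
        return "unreadable"
    return "ok"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as savepoint_dir:
        print(check_savepoint(savepoint_dir))  # directory exists but has no _metadata yet
        with open(os.path.join(savepoint_dir, "_metadata"), "wb") as handle:
            handle.write(b"placeholder")
        print(check_savepoint(savepoint_dir))
```

This only verifies presence and permissions; it does not validate that the savepoint contents can actually be restored.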