Cluster computing: Application troubleshooting continued

Sunday, June 28, 2020

Application troubleshooting continued

Application troubleshooting continued:

Ephemeral streams

Stream processing unlike batch processing runs endlessly as and when new events arrive. Applications use state persistence, watermarks, side outputs, and metrics to monitor progress. These collections serve no other purpose beyond the processing and are a waste of resources when there are several processors. If they are disposed off at the end of processing, they become ephemeral in nature.

There is no construct in Flink or stream store client libraries to automatically cleanup at the end of the stream processing. This usually falls on the application to do the cleanup. There are a number of applications that don’t. And these leave behind lots of stale collections. It aggravates when data is copied from system to system or scope to scope.

The best way to tackle auto closure of streams is to tie it to the constructor or destructor and invoking the cleanup from their respective resource providers.

If the resources cannot be scoped because they are shared, then workers are most likely attaching or detaching to these resources. In such cases they can be globally shut down just like the workers at the exit of the process.

User Interface

The streaming applications are a new breed. Unlike time series database that have popular dashboards and their own query language, the dashboards for stream processing are constructed one at a time. Instead it could evolve into a Tableau or visualization library that decouples the rendering from the continuous data source. Even the metrics dashboard or time series data support a notion of windows and historical queries. They become inadequate for stream processing only because of the query language. In some cases, they are proprietary. It would be much better if they can reuse the stream processing language and construct queries that can be fed to active applications already running in the Flink Runtime.

Each query can be written to an artifact, deployed and executed on the FlinkCluster in much the same way as lambda processing does serverless computing except that in this case it is hosted on the FlinkCluster. Also, the query initiation is a one-time cost and its execution will not bear that cost again and again. The execution will result in a continuous manner with windowed results which can be streamed to the dashboard. Making web requests and responses from the application to the dashboard and vice versa is easy because the application has network connectivity.

Query Adapters

Many dashboards from different analysis stacks such as influxdb, SQL, come with their own query language. These queries are not all similar to stream processing but they can be expressed using Apache Flink. These dashboards can be supported if the queries can be translated. The Flink language has the support for within with tables and batches. The querying translation is easier with parametrization and templates for common queries.

Instead of query adapters, it is also possible to import and export data to external stacks that have dedicated dashboards. Also, development environments like Jupyter notebook are also able to promote other languages and dashboards if the data is made available. This data import and export logic is usually encapsulated in data transformation operators that are usually called extract-transform-load operators.

Cluster computing

Sunday, June 28, 2020

Application troubleshooting continued

No comments:

Post a Comment