Wednesday, June 17, 2020

Application troubleshooting continued

The metrics exposed by Flink include Counters, Gauges, Histograms and Meters. A counter is used to count something and can be incremented or decremented by a given step. It is registered on a metric group.
A gauge provides a value of any type on demand. Since it only returns a value, whatever its type, it too can be registered on the metric group. The reporters will turn the exposed object into a String.
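As a minimal sketch, a counter and a gauge can be registered in a rich function's open() method as shown below; the class and metric names here (CountingMapper, recordsSeen, lastRecordLength) are illustrative rather than taken from any particular application.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;

public class CountingMapper extends RichMapFunction<String, String> {
    private transient Counter recordsSeen;   // incremented for every element
    private transient long lastRecordLength; // exposed on demand through a gauge

    @Override
    public void open(Configuration parameters) {
        // Both metrics are registered on this task's metric group.
        recordsSeen = getRuntimeContext().getMetricGroup().counter("recordsSeen");
        getRuntimeContext().getMetricGroup()
                .gauge("lastRecordLength", (Gauge<Long>) () -> lastRecordLength);
    }

    @Override
    public String map(String value) {
        recordsSeen.inc();                  // default step of 1; inc(n)/dec(n) also available
        lastRecordLength = value.length();  // the gauge reports this value when polled
        return value;
    }
}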

A histogram measures the distribution of values. There is no default implementation of one in Flink, but a wrapper is available from the flink-metrics-dropwizard dependency. A meter measures an average throughput and can be registered on the metric group. Every metric is assigned an identifier and a set of key-value pairs under which the metric will be reported.
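A meter and a histogram can be registered in much the same way. The sketch below assumes the flink-metrics-dropwizard dependency is on the classpath and again uses illustrative metric names.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;

import com.codahale.metrics.SlidingWindowReservoir;

public class ThroughputMapper extends RichMapFunction<String, String> {
    private transient Meter throughput;
    private transient Histogram valueSizes;

    @Override
    public void open(Configuration parameters) {
        // Meter: events per second, averaged over the last 60 seconds.
        throughput = getRuntimeContext().getMetricGroup()
                .meter("recordsPerSecond", new MeterView(60));

        // Histogram: Flink has no built-in implementation, so a Dropwizard
        // histogram is wrapped and registered instead.
        valueSizes = getRuntimeContext().getMetricGroup()
                .histogram("valueSizes", new DropwizardHistogramWrapper(
                        new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))));
    }

    @Override
    public String map(String value) {
        throughput.markEvent();            // contributes to the throughput rate
        valueSizes.update(value.length()); // contributes to the distribution
        return value;
    }
}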

Each of the connectors, such as the Kafka and Kinesis connectors, has its own metrics. The same can be emulated in the Pravega connector as well, for example along the lines of the sketch below.
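As a rough illustration of what such connector metrics could look like, the following sketch registers a couple of counters under a dedicated metric group. The group and metric names ("PravegaReader", "eventsRead", "readErrors") are hypothetical and are not taken from the actual connectors.

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

public final class ConnectorMetrics {
    final Counter eventsRead;  // hypothetical: events successfully read
    final Counter readErrors;  // hypothetical: read attempts that failed

    ConnectorMetrics(RuntimeContext context) {
        // Nest the connector's metrics under their own group, as the Kafka
        // and Kinesis connectors do for theirs.
        MetricGroup group = context.getMetricGroup().addGroup("PravegaReader");
        eventsRead = group.counter("eventsRead");
        readErrors = group.counter("readErrors");
    }
}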

Long-running jobs can make use of metrics on connectors, checkpoints, system resources, and latency.
If we choose the metrics to suit typical troubleshooting scenarios, such as long-running applications, it becomes easier to investigate a case. Other forms of diagnostics such as logs and events may not always be available, or may not support the kind of querying that is easy from a metrics dashboard, so having these metrics comes in handy.
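Of these, latency metrics are only emitted when latency tracking is enabled on the job. A minimal sketch of turning it on, assuming an otherwise default setup and a placeholder pipeline:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LatencyTrackingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Emit latency markers every 1000 ms so latency metrics are reported
        // alongside the connector, checkpoint and system-resource metrics.
        env.getConfig().setLatencyTrackingInterval(1000L);

        // Placeholder pipeline; a real job would read from its actual sources.
        env.fromElements(1, 2, 3).print();

        env.execute("latency-tracked job");
    }
}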

The metrics are also available to view via the REST API. This makes it convenient to use them dynamically from a variety of applications, whether for one-time or recurring analysis. These APIs are available from both the taskManager and the jobManager, and each job also has its own set of metrics. The requested metrics can be aggregated over a subset of all entities simply by naming that subset in the query parameters.

The taskManagers and the jobManager can both list their metrics. When the metrics are aggregated, the JSON response is usually an array of metric objects, as in the query sketched below.
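A minimal sketch of such a query, assuming the REST endpoint is reachable on the default localhost:8081 and using the standard JVM heap-usage metric as an example:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MetricsRestQuery {
    public static void main(String[] args) throws Exception {
        // Fetch the heap usage metric aggregated across all taskManagers.
        String url = "http://localhost:8081/taskmanagers/metrics"
                + "?get=Status.JVM.Memory.Heap.Used&agg=max,avg";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The aggregated response is a JSON array, roughly of the shape:
        // [{"id":"Status.JVM.Memory.Heap.Used","max":...,"avg":...}]
        System.out.println(response.body());
    }
}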

Metrics gathered for each task or operator can be visualized in the dashboard. Task metrics are listed with a subtask index and a metric name. Operator metrics are listed with the name of the operator in addition to the subtask index and metric name.

There is no restriction on the number of metrics that can be visualized on a dashboard, but typically we need one metric to form a hypothesis and another to test it with.

 
