Cluster computing: Application troubleshooting continued

Tuesday, June 16, 2020

Application troubleshooting continued

Flink exposes four types of metrics: Counters, Gauges, Histograms, and Meters. A Counter is used to count something; it can be incremented or decremented by one with inc() and dec(), or by a given step with inc(long n) and dec(long n). It is created and registered by calling counter(String name) on a metric group.
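The counter behaviour described above can be sketched without a Flink dependency. SimpleCounter below is a simplified stand-in written for this post, not Flink's own class, and the metric name recordsIn is made up for illustration.

```java
// Simplified stand-in for a Flink-style counter; NOT Flink's own class.
final class CounterSketch {

    /** Minimal counter: a long that moves up or down by one or by a given step. */
    static final class SimpleCounter {
        private long count;

        void inc() { count++; }
        void inc(long n) { count += n; }
        void dec() { count--; }
        void dec(long n) { count -= n; }
        long getCount() { return count; }
    }

    public static void main(String[] args) {
        SimpleCounter recordsIn = new SimpleCounter(); // hypothetical metric name
        recordsIn.inc();    // one record seen
        recordsIn.inc(10);  // a batch of ten records
        recordsIn.dec(3);   // three records retracted
        System.out.println("recordsIn = " + recordsIn.getCount()); // prints "recordsIn = 8"
    }
}
```

In a real job the counter would normally be obtained from the operator's metric group, for example with getRuntimeContext().getMetricGroup().counter("recordsIn") inside a rich function, rather than constructed directly.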

A Gauge provides a value of any type on demand. Because it only has to return a value, it can be registered on the metric group for any type. Reporters turn the exposed object into a String, so the type should have a meaningful toString() implementation.
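A minimal sketch of the gauge idea follows. SimpleGauge and report are hypothetical names written for this post; they only imitate the shape of a gauge and of what a reporter does with it, and queueDepth is an invented piece of state.

```java
// Simplified stand-in for a Flink-style gauge; NOT Flink's own interface.
final class GaugeSketch {

    /** Minimal gauge: returns a value of type T whenever it is asked. */
    interface SimpleGauge<T> {
        T getValue();
    }

    /** What a reporter does in spirit: read the current value and stringify it. */
    static String report(String name, SimpleGauge<?> gauge) {
        return name + "=" + String.valueOf(gauge.getValue());
    }

    public static void main(String[] args) {
        int[] queueDepth = {0};                           // hypothetical observed state
        SimpleGauge<Integer> depth = () -> queueDepth[0]; // evaluated on demand
        queueDepth[0] = 42;                               // state changes after registration
        System.out.println(report("queueDepth", depth));  // prints "queueDepth=42"
    }
}
```

The value is read only when the reporter asks for it, which is why the gauge reflects the change made after it was created.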

A Histogram measures the distribution of long values. Flink provides no default implementation, but the flink-metrics-dropwizard dependency offers a wrapper that allows Codahale/DropWizard histograms to be used. A Meter measures average throughput; events are recorded with markEvent(), and the meter is registered on the metric group like the other types. Every metric is assigned an identifier and a set of key-value pairs under which it is reported.
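To illustrate what "average throughput" means for a meter, here is a small sketch. SimpleMeter is a stand-in written for this post, not Flink's implementation. Unlike Flink's getRate(), which takes no argument, this version accepts the current time explicitly so that the result is deterministic.

```java
// Simplified stand-in for a Flink-style meter; NOT Flink's own class.
final class MeterSketch {

    /** Minimal meter: counts events and reports events per second since start. */
    static final class SimpleMeter {
        private final long startMillis;
        private long count;

        SimpleMeter(long startMillis) { this.startMillis = startMillis; }

        void markEvent() { count++; }
        void markEvent(long n) { count += n; }
        long getCount() { return count; }

        /** Average throughput up to nowMillis; 0 if no time has elapsed. */
        double getRate(long nowMillis) {
            long elapsed = nowMillis - startMillis;
            return elapsed <= 0 ? 0.0 : count * 1000.0 / elapsed;
        }
    }

    public static void main(String[] args) {
        SimpleMeter eventsPerSecond = new SimpleMeter(0L); // hypothetical metric name
        eventsPerSecond.markEvent(500);
        eventsPerSecond.markEvent();
        // 501 events over 2000 ms
        System.out.println(eventsPerSecond.getRate(2_000L)); // prints 250.5
    }
}
```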

The system-level metrics for the JobManager include the following:

• numRegisteredTaskManagers

• numRunningJobs

• taskSlotsAvailable

• taskSlotsTotal

All of these are available as gauges.

The RocksDB native metrics are disabled by default and have to be enabled individually in the Flink configuration.

The checkpoint metrics include:

• lastCheckpointDuration

• lastCheckpointSize

• lastCheckpointExternalPath

• lastCheckpointRestoreTimestamp

• lastCheckpointAlignmentBuffered

• numberOfInProgressCheckpoints

• numberOfCompletedCheckpoints

• numberOfFailedCheckpoints

• totalNumberOfCheckpoints

These are available only on the JobManager.
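As one possible way to use these values when troubleshooting, the sketch below derives two simple health signals from them. The helper names, the sample numbers, and the idea of comparing the duration against the checkpoint interval are assumptions made for illustration; they are not part of Flink. The duration is assumed to be in milliseconds.

```java
// Hypothetical helpers for reading checkpoint metrics; not part of Flink.
final class CheckpointHealthSketch {

    /** Fraction of checkpoints that failed; 0 when none have been triggered. */
    static double failureRatio(long numberOfFailedCheckpoints, long totalNumberOfCheckpoints) {
        if (totalNumberOfCheckpoints <= 0) {
            return 0.0;
        }
        return (double) numberOfFailedCheckpoints / totalNumberOfCheckpoints;
    }

    /** True when the last checkpoint took longer than the configured interval. */
    static boolean slowerThanInterval(long lastCheckpointDurationMillis, long intervalMillis) {
        return lastCheckpointDurationMillis > intervalMillis;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println("failure ratio: " + failureRatio(3, 120));
        System.out.println("slower than interval: " + slowerThanInterval(75_000, 60_000));
    }
}
```

A rising failure ratio, or a last checkpoint that takes longer than the interval between checkpoints, is usually a reason to look more closely at state size and backpressure.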

Each connector, such as the Kafka and Kinesis connectors, has its own metrics. The same approach can be emulated in the Pravega connector as well.

Long-running jobs can make use of the metrics on connectors, checkpoints, system resources, and latency.
