Sunday, November 24, 2019

Flink Savepoint versus checkpoint
Frequently Apache Flink software users find the tasks they submit to the Flink runtime to run long. This happens when the data that these tasks process is very large. There are usually more than one tasks that perform the same operation by dividing the data among themselves. These tasks do not perform in batch operations. They are most likely performing stream-based operations handling a window at a time. The state of the Flink application can be persisted with savepoints. This is something that the application can configure the tasks before they run.  Tasks are usually run as part of Flink ‘jobs’ depending on the parallelism set by the application. The savepoint is usually stored as a set of files under the <base>/.flink/savepoint folder
Savepoints can be manually generated from running link jobs with the following command:
# ./bin/flink savepoint <jobID>
The checkpoints are system generated. The do not depend on the user. The checkpoints are also written to the file system much the same way as savepoints. Although they use different names and are meant to be triggered by user versus system, they share the system method to persist state for any jobs as long as those jobs are enumerable.
The list of running jobs can be seen from the command line with
# ./bin/flink list –r
where the –r option is an option to list the running jobs
Since the code to persist savepoint and checkpoints are the same, the persistence of the state needs to follow the policies of the system which include a limit on the number of states an application can persist.
This implies that the manual savepointing will fail if the limits are violated.  This results in all savepoint triggers to fail. The configuration of savepointing and checkpointing is described by the Apache Flink documentation. Checkpointing may also be possible with the reader group for the stream store. It's limits may need to be honored as well.
User has syntax from Flink such as savepoints to interpret progress.  Users can create, own or delete savepoints which represents the execution state of a streaming job. These savepoints point to actual files on the storage. Flink provides checkpointing which it creates and deletes without user intervention.
While checkpoints are focused on recovery, much more lightweight than savepoints, and bound to the job lifetime, they can become equally efficient diagnostic mechanisms
The detection of failed savepoints is clear from both the jobManager logs as well as the taskManager logs. Shell execution of the flink command for savepoints is run from the jobManager where the flink binary is available. The logs are conveniently available in the log directory.

No comments:

Post a Comment