How Do Flink Batch Jobs Recover Progress During JobMaster Failover?

Written by Li Junrui, an R&D engineer at Aliyun, this article introduces progress recovery for batch jobs, a feature added in Flink 1.20.

Background

Before Flink 1.20, a JobMaster (JM) failure that terminated the process led to one of two outcomes:

If high availability (HA) was disabled, the job failed.
If HA was enabled, the JM was automatically restarted (a JM failover). Streaming jobs then recovered from the last successful checkpoint. Batch jobs, however, have no checkpointing mechanism, so they lost all previous progress and had to start from the beginning. For long-running batch jobs, this was a significant setback.

To address this issue, we introduced a progress recovery feature for batch jobs after JM failover in Flink 1.20. This feature aims to enable batch jobs to recover as much progress as possible after a JM failover, thus avoiding the need to re-run finished tasks.

Solutions

To achieve this goal, Flink needs to persist the state of the JM to external storage. This way, after a JM failover, Flink can use the persisted state to restore the job to its previous execution progress.

We introduced an event-based JM state recovery mechanism in Flink. During job execution, the JM's state change events are written to external persistent storage so that the execution progress can still be retrieved after a JM failover. However, the recorded progress may not match the progress that is actually recoverable. For instance, if certain TaskManagers (TMs) are unexpectedly lost during execution, the intermediate results they held may become inaccessible. Therefore, Flink also needs to obtain information about the intermediate result data from the TMs and the Remote Shuffle Service (RSS) to correct the recovered execution progress.

The overall process of this functionality consists of the following stages:

(1) During Job Execution
A JobEventStore will receive and write the JM's state change events to an external file system during job execution. The state change events that need to be recorded can be categorized as follows:

Adaptive Execution Plan Optimization: Flink adaptively optimizes the execution plan of batch jobs based on upstream execution results. If every recovery process relies on upstream results to reconstruct the execution plan, it will incur significant overhead. Therefore, recording these optimization results is crucial for task scheduling and fault tolerance.
Finished Task Information: The details of finished tasks are stored to avoid re-running them.
OperatorCoordinator State: The OperatorCoordinator is responsible for coordinating operators and facilitating communication between them. Its state is closely related to data consistency. For example, the SourceCoordinator contains state information regarding which source splits have been processed. Rebuilding the state of this component helps ensure data consistency.
ShuffleMaster State: Flink currently supports RSS, and the ShuffleMaster in RSS may hold critical information, such as metadata for shuffle data. To allow the new JM to reuse these intermediate results, it is essential to restore the state of the ShuffleMaster.
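
The recording step can be pictured as an append-only event log. The following is a minimal sketch in Python, not Flink's actual JobEventStore: the class `FileJobEventStore`, the event kinds, and the JSON-lines file format are all assumptions made for illustration, and a local temporary file stands in for the external file system.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass
class JobEvent:
    kind: str      # hypothetical label, e.g. "TASK_FINISHED"
    payload: dict  # event details; the shape is an assumption


class FileJobEventStore:
    """Toy append-only event log; a stand-in, not Flink's class."""

    def __init__(self, path):
        self.path = path

    def write(self, event):
        # Append one JSON line per event and force it to disk,
        # so the event survives a crash of the writing process.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(event)) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # A new JM reads the events back in the order they were written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [JobEvent(**json.loads(line)) for line in f if line.strip()]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "job-events.log")
    store = FileJobEventStore(path)
    store.write(JobEvent("PLAN_OPTIMIZED", {"vertex": "Map", "parallelism": 4}))
    store.write(JobEvent("TASK_FINISHED", {"vertex": "Source", "subtask": 0}))
    store.write(JobEvent("COORDINATOR_STATE",
                         {"vertex": "Source", "processed_splits": ["split-0"]}))
    # Simulate a failover: a fresh store object reads the same file.
    recovered = FileJobEventStore(path).replay()
    return [e.kind for e in recovered]
```

Because events are only ever appended, replaying the file from the start reproduces the same sequence of state changes that the failed JM observed.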

(2) During JM Failover
In Flink batch jobs, intermediate results are stored on the TMs and RSS during execution. When a JM failover occurs, both the TMs and RSS retain the intermediate results associated with the job and continuously attempt to reconnect to the JM. Once a new JM is set up, the TMs and RSS reestablish their connections with it and report the intermediate result data they hold.

(3) Job Execution Progress Recovery After JM Failover

Once the JM restarts, it will reestablish connections with TM and RSS, using the events recorded in the JobEventStore and the intermediate results retained by TM and RSS to rebuild the execution progress of the job.

First, the JM will use the events stored in the JobEventStore to restore the execution state of each job vertex in the job. Then, based on the state of the OperatorCoordinator, the JM will recover the unprocessed source data splits to avoid data loss or duplication.
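
As a rough illustration of this first step, the sketch below replays a list of events to rebuild the set of finished subtasks per vertex and to work out which source splits are still unprocessed. The event kinds, the payload fields, and the rule that the latest coordinator snapshot replaces earlier ones are assumptions for this example, not Flink's internal format.

```python
from collections import defaultdict


def restore_progress(events, all_splits):
    """Replay (kind, payload) events; return finished subtasks and remaining splits."""
    finished = defaultdict(set)  # vertex name -> finished subtask indices
    processed = set()            # splits recorded as processed by the coordinator
    for kind, payload in events:
        if kind == "TASK_FINISHED":
            finished[payload["vertex"]].add(payload["subtask"])
        elif kind == "COORDINATOR_STATE":
            # Assumption: each snapshot is complete, so the latest one wins.
            processed = set(payload["processed_splits"])
    # Keep the original split order so reassignment is deterministic.
    remaining = [s for s in all_splits if s not in processed]
    return dict(finished), remaining


events = [
    ("TASK_FINISHED", {"vertex": "Source", "subtask": 0}),
    ("COORDINATOR_STATE", {"processed_splits": ["s0"]}),
    ("COORDINATOR_STATE", {"processed_splits": ["s0", "s1"]}),
]
finished, remaining = restore_progress(events, ["s0", "s1", "s2"])
print(finished)   # {'Source': {0}}
print(remaining)  # ['s2']
```

Only `s2` is handed out again, so `s0` and `s1` are neither lost nor read twice.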

Subsequently, the JM will further adjust the execution progress based on the available intermediate results reported by TM and RSS. If any intermediate result partition is lost but is still needed by downstream tasks, the producer task will be reset and re-executed.
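
The adjustment rule in this step can be sketched as a small set computation. The function and parameter names below are hypothetical, and each task is assumed to produce exactly one partition to keep the example short.

```python
def tasks_to_reset(finished_tasks, produces, reported_partitions, needed_partitions):
    """Return finished producer tasks that must be re-executed.

    finished_tasks:      task ids restored as finished from the event log
    produces:            task id -> id of the partition that task produced
    reported_partitions: partition ids reported by the TMs and RSS after failover
    needed_partitions:   partition ids that unfinished downstream tasks still read
    """
    reset = set()
    for task in finished_tasks:
        partition = produces[task]
        lost = partition not in reported_partitions
        if lost and partition in needed_partitions:
            reset.add(task)
    return sorted(reset)


produces = {"Source-0": "p0", "Source-1": "p1", "Source-2": "p2"}
finished_tasks = {"Source-0", "Source-1", "Source-2"}
reported = {"p0"}        # p1 and p2 were on a TM that was lost
needed = {"p0", "p1"}    # all consumers of p2 have already finished
print(tasks_to_reset(finished_tasks, produces, reported, needed))  # ['Source-1']
```

In this example `Source-2` is not reset even though its partition is gone, because no downstream task still needs it.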

Finally, the job will continue executing from the recovered progress.

Example

Here is an example of progress recovery after a JM failover.

The topology of this batch job is Source -> Map -> Sink. When the job progress reaches the Map vertex, the machine hosting the JM goes offline due to maintenance, resulting in a JM failover.

Subsequently, the HA service will automatically start a new JM process, and the job will enter the RECONCILING state, indicating that the job is in the process of recovering its execution progress.

Once the job recovery is complete, it will enter the RUNNING state.

After accessing the job details page, you can see that the job has recovered to the progress it was at before the JM failover.

How to Enable

To use the state recovery feature for Flink batch jobs, users need to:

Ensure that cluster high availability is enabled: Flink currently supports two kinds of high availability services, ZooKeeper and Kubernetes. For more details, please refer to the Flink documentation on high availability.
Configure execution.batch.job-recovery.enabled: true
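
A small pre-flight check for these two requirements might look like the sketch below. Only `execution.batch.job-recovery.enabled` comes from this article; the key `high-availability.type` and its default of `NONE`, as well as the other HA keys, are taken from Flink's general configuration options as an assumption and should be verified against the documentation of the version in use. The host and path values are placeholders.

```python
RECOVERY_FLAG = "execution.batch.job-recovery.enabled"
HA_TYPE_KEY = "high-availability.type"  # assumed key name, verify for your version


def check_job_recovery_config(conf):
    """Return a list of problems that would prevent batch job recovery."""
    problems = []
    ha_type = str(conf.get(HA_TYPE_KEY, "NONE")).lower()
    if ha_type == "none":
        problems.append("high availability is not enabled")
    if str(conf.get(RECOVERY_FLAG, "false")).lower() != "true":
        problems.append(RECOVERY_FLAG + " is not set to true")
    return problems


# Example flat configuration; hosts and paths are placeholders.
example_conf = {
    HA_TYPE_KEY: "zookeeper",
    "high-availability.zookeeper.quorum": "zk-host:2181",
    "high-availability.storageDir": "hdfs:///flink/ha",
    RECOVERY_FLAG: "true",
}

print(check_job_recovery_config(example_conf))  # []
```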

Additionally, all new sources (those built on the new Source API) support progress recovery for batch jobs. To achieve the best recovery results, the source's SplitEnumerator needs to implement the SupportsBatchSnapshot interface. Otherwise, the tasks of that source will have to restart after a JM failover unless the entire source stage has finished. Currently, FileSource and HiveSource already implement this interface. For more details, please refer to the Flink documentation.
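
The effect of SupportsBatchSnapshot on source tasks can be summarized as a decision rule. The sketch below is one reading of the paragraph above, written as a plain Python function with hypothetical names; it does not call any Flink API.

```python
def source_tasks_to_rerun(source_tasks, finished, supports_batch_snapshot):
    """Decide which tasks of one source must run again after a JM failover."""
    unfinished = [t for t in source_tasks if t not in finished]
    if not unfinished:
        # The entire source stage has finished: nothing needs to run again.
        return []
    if supports_batch_snapshot:
        # The enumerator state was recorded, so finished tasks keep their progress.
        return unfinished
    # Without a recorded enumerator state, every task of this source restarts.
    return list(source_tasks)


tasks = ["src-0", "src-1", "src-2"]
print(source_tasks_to_rerun(tasks, {"src-0"}, True))   # ['src-1', 'src-2']
print(source_tasks_to_rerun(tasks, {"src-0"}, False))  # ['src-0', 'src-1', 'src-2']
print(source_tasks_to_rerun(tasks, set(tasks), False)) # []
```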

Because clusters and jobs differ, users can refer to the Flink documentation for advanced configuration tuning.
