On Flink's Fault-Tolerance Mechanism

Flink's fault-tolerance mechanism is mainly about saving and recovering state. It involves the state backends, checkpoints and savepoints, and error recovery for jobs and tasks.

The Flink state backend is the container that stores checkpoint data. It comes in three variants: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend. State itself is divided into operator state and keyed state.
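
As an illustration of how a state backend might be configured, here is a minimal sketch assuming a Flink 1.x DataStream job; the HDFS path is a placeholder.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FsStateBackend: keeps working state on the TaskManager heap and
        // writes checkpoint data to a file system such as HDFS.
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // Alternatively, RocksDBStateBackend (from the flink-statebackend-rocksdb
        // dependency) keeps working state in embedded RocksDB instances on local
        // disk and also checkpoints to a file system.
        // env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // ... define sources, transformations and sinks, then call env.execute(...)
    }
}
```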

Flink state saving and recovery rely mainly on the checkpoint and savepoint mechanisms; the differences between them are described below.

The concept of a snapshot comes from photography, where it originally referred to a photo developed quickly at a studio. In computing, a snapshot is a record of the state of data at a certain moment. A Flink snapshot is a globally consistent record of job state. A complete snapshot includes the state of the source operators (such as the offsets of Kafka partitions), the cached data of stateful operators, and the state of the sink operators (batch-cached data, transaction data, etc.).

A checkpoint is a snapshot that Flink generates automatically for failure recovery. Checkpoints are distributed, asynchronous, and incremental.
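
A minimal sketch of enabling periodic checkpoints, assuming the Flink 1.x DataStream API; the interval, pause, and timeout values are illustrative only.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (value chosen only for illustration).
        env.enableCheckpointing(60_000);

        CheckpointConfig config = env.getCheckpointConfig();
        // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
        config.setMinPauseBetweenCheckpoints(30_000);
        // Abort a checkpoint that takes longer than 10 minutes.
        config.setCheckpointTimeout(600_000);
        // Retain the latest completed checkpoint when the job is cancelled,
        // so the job can later be restored from it.
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}
```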

A savepoint is triggered manually by the user and saves the complete state of a job. Common usage scenarios include job upgrades, rescaling job parallelism, and cluster migration.

Flink implements lightweight distributed asynchronous snapshots by using barriers as the checkpoint signal and flowing them through the stream just like normal business data. The barriers cut the data stream into micro-batches, and the state of each batch is saved as a checkpoint snapshot. When a barrier passes through a node of the flow graph, Flink performs a checkpoint and saves that node's state data.

Checkpoint n contains the state of each operator; that state reflects all events before checkpoint n and none of the events after it.

To address results being lost or duplicated when a user's job fails, Flink provides the following processing semantics:

① At least once: no data is lost, but results may be duplicated.

② Exactly once: the checkpoint barrier alignment mechanism guarantees that each record affects the results exactly once.
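
As a sketch of how these two semantics map onto the checkpoint configuration (assuming the Flink 1.x CheckpointingMode enum; the interval is illustrative):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingModeExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once (the default): barriers are aligned before the operator
        // snapshot, so each record is reflected in the state exactly once.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // At-least-once: skips barrier alignment for lower latency, at the cost
        // of possibly replaying (duplicating) some records after recovery.
        // env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
    }
}
```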

Flink provides the following job restart strategies:

① FailureRateRestartStrategy: allows at most a given number of failures within a specified time interval, and also supports a restart delay.

② FixedDelayRestartStrategy: allows a fixed number of failures, with a configurable delay between restart attempts.

③ NoRestartStrategy: the job is not restarted, that is, it fails directly.

④ ThrowingRestartStrategy: does not restart the job; it simply throws an exception.

The job restart strategy can be set on the execution environment via env.setRestartStrategy, as sketched below.
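
A minimal sketch of selecting these strategies programmatically, assuming the Flink 1.x RestartStrategies API; the attempt counts and delays are illustrative values.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Failure-rate strategy: at most 3 failures within a 5-minute window,
        // waiting 10 seconds before each restart.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));

        // Fixed-delay strategy: at most 3 restart attempts, 10 seconds apart.
        // env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // No restart: the job fails immediately on the first task failure.
        // env.setRestartStrategy(RestartStrategies.noRestart());
    }
}
```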

The parent interface of the strategies above is RestartStrategy, whose key method is restart().

Flink also provides failover strategies that decide which tasks to restart after a failure:

① RestartAllStrategy: restarts all tasks; this is the default strategy.

② RestartIndividualStrategy: restarts only the failed task. If that task has no source, data loss may occur.

③ NoOpFailoverStrategy: does not restart any task.

The parent interface of the strategies above is FailoverStrategy; its key points are the Factory's create method and the onTaskFailure method.
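
The failover strategy is normally chosen through cluster configuration rather than the job API. The sketch below is an assumption-laden illustration using the jobmanager.execution.failover-strategy configuration key; its accepted values depend on the Flink version, and "full" corresponds to restarting all tasks, the default behaviour described above.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailoverStrategyConfigExample {
    public static void main(String[] args) {
        // "full" restarts all tasks on a failure (the default described above);
        // other values such as "region" are available depending on the Flink version.
        Configuration conf = new Configuration();
        conf.setString(JobManagerOptions.EXECUTION_FAILOVER_STRATEGY, "full");

        // In a cluster deployment this key is usually set in flink-conf.yaml;
        // a local environment is used here only to keep the sketch self-contained.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        // ... build and execute the job as usual ...
    }
}
```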

Producing a reliable, globally consistent snapshot is one of the hard problems in distributed systems. The traditional approach relies on a global clock, but it suffers from reliability problems such as a single point of failure and inconsistent data. To solve this, the Chandy-Lamport algorithm propagates markers instead of relying on a global clock. It proceeds as follows:

① Process Pi records its own state, generates a marker (identifying information distinct from normal messages), and sends it to the other processes in the system through its output channels.

② Process Pi then starts recording all messages received on its input channels.

③ When process Pj receives the marker on input channel Ckj: if Pj has not yet recorded its own state, it records its state and forwards a marker on each of its output channels; otherwise, Pj records the state of channel Ckj, i.e. the messages received on that channel before the marker.

④ The snapshot is complete once every process has received the marker and recorded its own state together with the state of its channels (including in-flight messages).
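
To make the marker-passing rules above concrete, here is a toy single-process sketch (not Flink code); the process, channel, and message types are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy illustration of one process's marker handling in the Chandy-Lamport algorithm. */
public class ChandyLamportProcess {
    private final String processId;
    private String currentState = "initial";      // live state (placeholder)
    private String recordedState;                 // snapshot of this process's own state
    private final Set<String> markerSeenOn = new HashSet<>();                 // input channels whose marker has arrived
    private final Map<String, List<String>> channelState = new HashMap<>();   // recorded in-flight messages per channel

    public ChandyLamportProcess(String processId) {
        this.processId = processId;
    }

    /** Steps ① and ③: record own state and emit a marker on every output channel. */
    public void startSnapshot() {
        if (recordedState == null) {
            recordedState = currentState;
            sendMarkerToAllOutputChannels();
        }
    }

    /** Step ③: a marker arrived on the given input channel. */
    public void onMarker(String channel) {
        startSnapshot();            // record own state if not done yet
        markerSeenOn.add(channel);  // stop recording this channel; its state is now fixed
    }

    /** Step ②: a normal message arrived; record it if the channel is still being recorded. */
    public void onMessage(String channel, String message) {
        if (recordedState != null && !markerSeenOn.contains(channel)) {
            channelState.computeIfAbsent(channel, c -> new ArrayList<>()).add(message);
        }
        currentState = currentState + "|" + message;  // apply the message to the live state
    }

    private void sendMarkerToAllOutputChannels() {
        // In a real system this would send a marker message on every outgoing channel.
        System.out.println(processId + " sends marker on all output channels");
    }
}
```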

Flink's distributed asynchronous snapshot implements the Chandy-Lamport algorithm. Its core idea is to inject barriers at the sources, playing the role of the markers in Chandy-Lamport, and to achieve snapshot backup and exactly-once semantics by controlling barrier alignment. The checkpoint process runs as follows:

① The checkpoint coordinator triggers a checkpoint on all source nodes.

② The source tasks broadcast the barriers downstream.

③ When a source task finishes backing up its own state, it notifies the checkpoint coordinator of the address of the backed-up data.

④ The map and sink tasks collect barrier n from their upstream inputs and then take local snapshots. The following example is the incremental checkpoint process of RocksDB: RocksDB first flushes its data completely to disk (represented by the large red triangle), and Flink then selects the files that have not yet been uploaded for persistent backup (the small purple triangles); see the configuration sketch after these steps for how incremental checkpoints are enabled.

⑤ After completing their checkpoints, the map and sink tasks return their state handles (the addresses of the backed-up state) to the checkpoint coordinator.

⑥ Once the checkpoint coordinator has received the state handles from all tasks, it considers the checkpoint complete and backs up the checkpoint metadata (including the backup addresses of the checkpointed state data) to persistent storage.
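
A minimal sketch of enabling the incremental RocksDB checkpoints mentioned in step ④, assuming Flink 1.x and the flink-statebackend-rocksdb dependency; the checkpoint URI is a placeholder.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The second constructor argument enables incremental checkpoints:
        // only RocksDB files not yet uploaded are persisted at each checkpoint.
        env.setStateBackend(
                new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints", true));

        // Checkpoints are still triggered periodically by the checkpoint coordinator.
        env.enableCheckpointing(60_000);
    }
}
```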