
PolarDB: WAL parallel replay

Last Updated: Jan 27, 2026

This topic describes the write-ahead log (WAL) parallel replay feature of PolarDB for PostgreSQL.

Applicability

This feature is available for the following versions of PolarDB for PostgreSQL:

  • PostgreSQL 18 with minor engine version 2.0.18.0.1.0 or later.

  • PostgreSQL 17 with minor engine version 2.0.17.2.1.0 or later.

  • PostgreSQL 16 with minor engine version 2.0.16.3.1.1 or later.

  • PostgreSQL 15 with minor engine version 2.0.15.7.1.1 or later.

  • PostgreSQL 14 with minor engine version 2.0.14.5.1.0 or later.

  • PostgreSQL 11 with minor engine version 2.0.11.9.17.0 or later.

Note

You can view the minor engine version in the PolarDB console or by running the SHOW polardb_version; statement. If your cluster's minor engine version does not meet the requirements, you must upgrade the minor engine version.

Background information

A PolarDB for PostgreSQL cluster uses a one-writer, multiple-reader architecture. On a running read-only node (replica node), the LogIndex background worker (BGW) process and the backend processes use LogIndex data to replay WAL records in their respective buffers. This method achieves parallel replay of WAL records.

WAL replay is critical for the high availability (HA) of PolarDB clusters. Applying the parallel replay method to the standard log replay path is an effective optimization.

Parallel WAL replay offers advantages in at least three scenarios:

  • The crash recovery process for the primary database, read-only nodes, and secondary databases.

  • The continuous replay of WAL records by the LogIndex BGW process on a read-only node.

  • The continuous replay of WAL records by the Startup process on a secondary database.

Terms

  • Block: A data block.

  • WAL: Write-Ahead Logging.

  • Task Node: A subtask execution node in the parallel execution framework that can receive and execute one subtask.

  • Task Tag: A classification identifier for a subtask. Subtasks with the same tag must be executed in sequence.

  • Hold List: A linked list that each child process in the parallel execution framework uses to schedule and execute replay subtasks.

How it works

  • Overview

    A single WAL record may modify multiple data blocks. The WAL replay process can be defined as follows:

    1. Assume that the i-th WAL record has an LSN of LSN_i and modifies m data blocks. The list of data blocks modified by the i-th WAL record is represented as Block_i = [Block_{i,0}, Block_{i,1}, ..., Block_{i,m-1}].

    2. The smallest replay subtask is defined as Task_{i,j} = {LSN_i -> Block_{i,j}}. This subtask replays the i-th WAL record on the data block Block_{i,j}.

    3. Therefore, a WAL record that modifies m data blocks can be represented as a collection of m replay subtasks: TASK_{i,*} = [Task_{i,0}, Task_{i,1}, ..., Task_{i,m-1}].

    4. Furthermore, multiple WAL records can be represented as a series of replay subtask collections: TASK_{*,*} = [TASK_{0,*}, TASK_{1,*}, ..., TASK_{N,*}].

    In the collection of replay subtasks TASK_{*,*}, the execution of a subtask does not always depend on the result of the preceding subtask.

    Assume that the collection of replay subtasks is TASK_{*,*} = [TASK_{0,*}, TASK_{1,*}, TASK_{2,*}], where:

    • TASK_{0,*} = [Task_{0,0}, Task_{0,1}, Task_{0,2}]

    • TASK_{1,*} = [Task_{1,0}, Task_{1,1}]

    • TASK_{2,*} = [Task_{2,0}]

    Assume that Block_{0,0} = Block_{1,0}, Block_{0,1} = Block_{1,1}, and Block_{0,2} = Block_{2,0}.

    Then, there are three subtask sequences that can be replayed in parallel: [Task_{0,0}, Task_{1,0}], [Task_{0,1}, Task_{1,1}], and [Task_{0,2}, Task_{2,0}].

    In summary, many subtask sequences within the set of replay subtasks can be executed in parallel without affecting the consistency of the final result. PolarDB leverages this concept in its parallel task execution framework, which is applied to the WAL replay process.
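
    The grouping of subtasks by data block can be illustrated with a minimal, self-contained C sketch. The ReplaySubtask type, the sample LSNs, and the block numbers below are hypothetical and only mirror the Task_{i,j} notation used above; they are not taken from the PolarDB source code.

      #include <stdio.h>

      /* One replay subtask Task_{i,j}: apply the WAL record at lsn to block. */
      typedef struct ReplaySubtask {
          unsigned long lsn;    /* LSN_i of the WAL record         */
          int           block;  /* Block_{i,j} the record modifies */
      } ReplaySubtask;

      int main(void)
      {
          /* TASK_{*,*} for the example above: record 0 touches blocks 10, 11,
           * and 12; record 1 touches 10 and 11; record 2 touches 12. */
          ReplaySubtask tasks[] = {
              {100, 10}, {100, 11}, {100, 12},   /* TASK_{0,*} */
              {200, 10}, {200, 11},              /* TASK_{1,*} */
              {300, 12},                         /* TASK_{2,*} */
          };
          int ntasks = (int) (sizeof(tasks) / sizeof(tasks[0]));
          int blocks[] = {10, 11, 12};

          /* Subtasks that modify the same block must keep their LSN order;
           * the three chains printed here are independent of one another and
           * can therefore be replayed in parallel. */
          for (int b = 0; b < 3; b++) {
              printf("chain for block %d:", blocks[b]);
              for (int t = 0; t < ntasks; t++)
                  if (tasks[t].block == blocks[b])
                      printf(" (LSN %lu -> block %d)", tasks[t].lsn, tasks[t].block);
              printf("\n");
          }
          return 0;
      }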

  • Parallel task execution framework

    A segment of shared memory is divided equally among the concurrent processes. Each share acts as a circular queue and is allocated to one process. The depth of each circular queue can be configured by a parameter.
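
    The layout below is a minimal C sketch of that memory organization, assuming a fixed queue depth and worker count. TaskQueue, TaskNode, and enqueue are illustrative names, not the actual PolarDB structures or API.

      #include <stdbool.h>
      #include <stdio.h>

      #define QUEUE_DEPTH 16        /* stands in for the configurable queue depth */
      #define NUM_WORKERS 4         /* stands in for the configured concurrency   */

      typedef struct TaskNode { unsigned long lsn; int block; int state; } TaskNode;

      /* One circular queue per process; the shared-memory segment would hold
       * NUM_WORKERS of these side by side. */
      typedef struct TaskQueue {
          TaskNode nodes[QUEUE_DEPTH];
          int      head;            /* next slot the worker consumes  */
          int      tail;            /* next slot the dispatcher fills */
      } TaskQueue;

      static TaskQueue queues[NUM_WORKERS];   /* would live in shared memory */

      /* Dispatcher side: append a task to one worker's queue if it has room. */
      static bool enqueue(int worker, TaskNode task)
      {
          TaskQueue *q = &queues[worker];
          int next = (q->tail + 1) % QUEUE_DEPTH;
          if (next == q->head)
              return false;         /* queue full: pick another worker */
          q->nodes[q->tail] = task;
          q->tail = next;
          return true;
      }

      int main(void)
      {
          TaskNode t = { 100, 10, 0 };
          printf("dispatched to worker 0: %s\n", enqueue(0, t) ? "yes" : "no");
          return 0;
      }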

    • Dispatcher process.

      • Controls concurrent scheduling by dispatching tasks to specified processes.

      • Removes completed tasks from the queue.

    • Process group.

      Each process in the group retrieves a task from its corresponding circular queue and executes it based on the task's state.

    • Tasks

      A circular queue consists of Task Nodes. Each Task Node has one of five states: Idle, Running, Hold, Finished, or Removed.

      • Idle: The Task Node is not assigned a task.

      • Running: The Task Node is assigned a task and is either waiting for execution or is being executed.

      • Hold: The task in the Task Node has a dependency on a preceding task and must wait for that task to finish.

      • Finished: All processes in the process group have completed the task.

      • Removed: When the Dispatcher process finds that a task and all of its prerequisite tasks are in the Finished state, it changes the task's state to Removed. This state indicates that the Dispatcher process has deleted the task and its prerequisites from its management structure, which ensures that the Dispatcher process handles the results of dependent tasks in the correct order.

      In the Task Node state machine described above, some transitions are performed by the Dispatcher process, and the others are performed by the parallel replay process group.
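
      The five states can be captured in a small C enum. The identifier names below are illustrative and are not taken from the PolarDB source code; the comments only restate the transitions described above.

        /* Possible states of a Task Node in a circular queue. */
        typedef enum TaskNodeState {
            TASK_IDLE,      /* no task is assigned to the node                   */
            TASK_RUNNING,   /* assigned; waiting to run or currently running     */
            TASK_HOLD,      /* depends on an earlier task with the same Task Tag */
            TASK_FINISHED,  /* the process group has completed the task          */
            TASK_REMOVED    /* the Dispatcher has discarded the finished task    */
        } TaskNodeState;

        /* Per the description above: the Dispatcher process assigns tasks
         * (marking them Running or Hold) and turns Finished tasks into Removed;
         * the parallel replay process group executes Running tasks and marks
         * them Finished. */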

    • Dispatcher process

      The Dispatcher process uses three key data structures: Task HashMap, Task Running Queue, and Task Idle Nodes.

      • Task HashMap: Records the hash mapping between Task Tags and their corresponding task execution lists.

        • Each task has a specific Task Tag. If two tasks have a dependency, they share the same Task Tag.

        • When a task is dispatched, if the task has a prerequisite, its state is marked as Hold. The task must wait for its prerequisite to be executed.

      • Task Running Queue: Records the tasks that are currently being executed.

      • Task Idle Nodes: Records the Task Nodes that are currently in the Idle state for different processes in the process group.

      The Dispatcher uses the following scheduling policies:

      • If a task with the same Task Tag as the new task is already running, the new task is preferably assigned to the process that is handling the last task in that Task Tag's linked list. This policy executes dependent tasks on the same process to reduce the overhead of inter-process synchronization.

      • If the preferred process's queue is full, or if no task with the same Task Tag is running, a process is selected in sequence from the process group. An Idle Task Node is then retrieved from that process's queue to schedule the task. This policy distributes tasks as evenly as possible across all processes.

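      The two scheduling policies above can be sketched in C as follows. The WorkerInfo bookkeeping and the choose_worker function are hypothetical simplifications, assuming the Dispatcher tracks, per process, the Task Tag it is working on, whether its queue is full, and whether it has an Idle Task Node.

        #include <stdbool.h>
        #include <stdio.h>

        #define NUM_WORKERS 4

        /* Simplified per-process bookkeeping kept by the Dispatcher. */
        typedef struct WorkerInfo {
            int  last_tag;     /* Task Tag of the last task dispatched to this
                                  process, or -1 if none of its tasks is running */
            bool queue_full;   /* the circular queue has no free slot            */
            bool has_idle;     /* at least one Task Node is in the Idle state    */
        } WorkerInfo;

        /* Pick the process that a new task with the given Task Tag goes to. */
        static int choose_worker(const WorkerInfo w[NUM_WORKERS], int task_tag)
        {
            /* Policy 1: keep tasks with the same Task Tag on the same process
             * to avoid synchronization between dependent tasks. */
            for (int i = 0; i < NUM_WORKERS; i++)
                if (w[i].last_tag == task_tag && !w[i].queue_full)
                    return i;

            /* Policy 2: otherwise walk the process group in order and take the
             * first process that still has an Idle Task Node. */
            for (int i = 0; i < NUM_WORKERS; i++)
                if (w[i].has_idle)
                    return i;

            return -1;   /* no capacity: the Dispatcher retries later */
        }

        int main(void)
        {
            WorkerInfo w[NUM_WORKERS] = {
                { 7, false, true }, { -1, false, true },
                { -1, true, false }, { -1, false, true },
            };
            printf("tag 7 -> worker %d, tag 9 -> worker %d\n",
                   choose_worker(w, 7), choose_worker(w, 9));
            return 0;
        }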

    • Process Group

      The parallel execution framework runs tasks of the same type, which share the same Task Node data structure. During process group initialization, SchedContext is configured to specify the function pointers that execute the specific type of task:

      • TaskStartup: Performs initialization before a process executes tasks.

      • TaskHandler: Executes a specific task based on the incoming Task Node.

      • TaskCleanup: Performs cleanup before a process exits.
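
      A plausible shape for this callback table is sketched below; the field names and signatures are assumptions made for illustration and do not reflect the actual SchedContext definition in PolarDB.

        #include <stdbool.h>

        typedef struct TaskNode TaskNode;   /* defined by the framework */

        /* Hypothetical callback table configured at process group startup. */
        typedef struct SchedContext {
            void (*task_startup)(void *arg);                 /* before the worker
                                                                starts consuming */
            bool (*task_handler)(TaskNode *node, void *arg); /* run one subtask;
                                                                false on failure */
            void (*task_cleanup)(void *arg);                 /* before the worker
                                                                exits            */
            void  *arg;                                      /* per-worker state */
        } SchedContext;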

      A process in the process group retrieves a Task Node from the circular queue. If the Task Node's state is Hold, the process inserts it at the tail of the Hold List. If the state is Running, the process calls TaskHandler to execute the task. If TaskHandler fails, the system sets a retry count for the Task Node (the default is 3) and inserts the Task Node at the head of the Hold List.

      The process then searches the Hold List from the head for an executable task. If a task's state is Running and its wait count is 0, the process executes the task. If a task's state is Running but its wait count is greater than 0, the process decrements the wait count by 1.
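
      The following C sketch mirrors the worker behavior described in the previous two paragraphs. The TaskNode layout, the Hold List helpers, and task_handler are placeholders declared only to make the control flow concrete; they are not the PolarDB implementation.

        #include <stdbool.h>
        #include <stddef.h>

        enum { TASK_IDLE, TASK_RUNNING, TASK_HOLD, TASK_FINISHED, TASK_REMOVED };

        typedef struct TaskNode {
            int              state;        /* one of the states above              */
            int              wait_count;   /* prerequisites that have not finished */
            int              retries_left; /* set to 3 when a replay attempt fails */
            struct TaskNode *next;         /* Hold List linkage                    */
        } TaskNode;

        /* Placeholders for the worker's circular queue and Hold List. */
        extern TaskNode *dequeue_from_ring(void);
        extern void      hold_list_push_tail(TaskNode *node);
        extern void      hold_list_push_head(TaskNode *node);
        extern bool      task_handler(TaskNode *node);   /* replay one subtask */

        /* Consume one Task Node from the circular queue. */
        static void consume_one(void)
        {
            TaskNode *node = dequeue_from_ring();
            if (node == NULL)
                return;
            if (node->state == TASK_HOLD) {
                hold_list_push_tail(node);          /* dependency not satisfied yet */
            } else if (node->state == TASK_RUNNING) {
                if (task_handler(node)) {
                    node->state = TASK_FINISHED;
                } else {
                    node->retries_left = 3;         /* default retry count */
                    hold_list_push_head(node);
                }
            }
        }

        /* Scan the Hold List from the head for a task that can run now. */
        static void scan_hold_list(TaskNode *head)
        {
            for (TaskNode *node = head; node != NULL; node = node->next) {
                if (node->state != TASK_RUNNING)
                    continue;
                if (node->wait_count == 0) {
                    if (task_handler(node))
                        node->state = TASK_FINISHED;
                    break;
                }
                node->wait_count--;                 /* one fewer task to wait for */
            }
        }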

  • WAL parallel replay

    LogIndex data records the mapping between Write-Ahead Logging (WAL) records and the data blocks they modify. It also supports retrieval by LSN. During the continuous replay of WAL records on a standby node, PolarDB uses a parallel execution framework. This framework uses LogIndex data to parallelize WAL replay tasks, which accelerates data synchronization on the standby node.

    Workflow

    • Startup process: Parses WAL records and builds LogIndex data without replaying the WAL records.

    • LogIndex BGW replay process: Acts as the Dispatcher process in the parallel execution framework. This process uses LSNs to retrieve LogIndex data, builds replay subtasks, and assigns them to the parallel replay process group.

    • Processes in the parallel replay process group: Execute replay subtasks and replay a single WAL record on a data block.

    • Backend process: When reading a data block, the process uses the PageTag to retrieve LogIndex data. It obtains the linked list of LSNs for the WAL records that modified the block and then replays the entire WAL record chain on the data block.

    • The Dispatcher process uses LSNs to retrieve LogIndex data. It enumerates PageTags and their corresponding LSNs in their LogIndex insertion order to build {LSN -> PageTag} mappings, which serve as Task Nodes.

    • The PageTag serves as the Task Tag for the Task Node.

    • The Dispatcher process dispatches the enumerated Task Nodes to the child processes in the parallel execution framework's process group for replay.
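
    A condensed C sketch of this dispatch loop is shown below. PageTag is simplified to a plain struct, and logindex_next and dispatch_task are placeholder helpers, not PolarDB functions; they only make the {LSN -> PageTag} enumeration described above concrete.

      #include <stdbool.h>

      /* Simplified identifier of one data block. */
      typedef struct PageTag {
          unsigned int spc, db, rel, fork, blkno;
      } PageTag;

      /* One replay subtask: replay the WAL record at lsn on the block tag.
       * The PageTag doubles as the Task Tag, so subtasks on the same block
       * keep their LSN order. */
      typedef struct ReplayTaskNode {
          unsigned long lsn;
          PageTag       tag;
      } ReplayTaskNode;

      /* Placeholder: iterate LogIndex data between two LSNs in insertion
       * order, yielding one (LSN, PageTag) pair per modified block. */
      extern bool logindex_next(unsigned long from_lsn, unsigned long to_lsn,
                                ReplayTaskNode *out);

      /* Placeholder: hand a Task Node to the parallel replay process group. */
      extern void dispatch_task(const ReplayTaskNode *task);

      static void dispatch_replay_tasks(unsigned long from_lsn,
                                        unsigned long to_lsn)
      {
          ReplayTaskNode task;
          while (logindex_next(from_lsn, to_lsn, &task))
              dispatch_task(&task);
      }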

Usage guide

To enable the WAL parallel replay feature, add the following parameter to the postgresql.conf file on the standby node.

polar_enable_parallel_replay_standby_mode = ON