
PolarDB: WAL parallel replay

Last Updated: Jun 18, 2024

This topic describes the Write-Ahead Logging (WAL) parallel replay feature of PolarDB for PostgreSQL (Compatible with Oracle).

Prerequisites

The PolarDB for PostgreSQL (Compatible with Oracle) cluster runs the following engine:

Oracle 2.0 (revision version 2.0.14.1.0 or later)

Note

You can execute the following statement to query the revision version of your PolarDB for PostgreSQL (Compatible with Oracle) cluster:

SHOW polar_version; 

Background information

A PolarDB for PostgreSQL (Compatible with Oracle) cluster consists of a primary node and one or more read-only nodes. While a read-only node handles read requests, a background worker process and a backend process use LogIndex to replay WAL records in different buffers, which achieves parallel replay of WAL records.

Because the replay of WAL records is essential to the high availability of PolarDB clusters, applying this parallel replay method to the standard WAL replay process further optimizes the system.

Parallel replay of WAL records is useful in the following scenarios:

  • The crash recovery of the primary node, read-only nodes, and secondary nodes.

  • The continuous replay of WAL records by a background worker process using LogIndex data on a read-only node.

  • The continuous replay of WAL records by the startup process running on a secondary node.

Terms

  • Block: data block.

  • WAL: write-ahead logging.

  • Task node: the node on which a subtask is executed. A task node receives and executes one subtask at a time.

  • Task tag: the identifier of a subtask type. Subtasks with the same task tag must be executed in a specific order.

  • Task hold list: the list used by each child process to schedule replay subtasks in the parallel execution framework.

How parallel replay of WAL records works

  • Overview

    A WAL record may involve changes to multiple data blocks. Assume that the i-th WAL entry (whose LSN is LSN_i) records changes to m data blocks. Several concepts are defined as follows:

    1. The list of data blocks modified by the i-th WAL entry is defined as: Block_i = [Block_{i,0}, Block_{i,1}, ..., Block_{i,m-1}].

    2. The minimum replay subtask is defined as: Task_{i,j} = LSN_i -> Block_{i,j}, which denotes replaying the i-th WAL record on Block_{i,j}.

    3. The i-th WAL entry, which records changes to m data blocks, can therefore be represented as a group of m replay subtasks: Task_{i,*} = [Task_{i,0}, Task_{i,1}, ..., Task_{i,m-1}].

    4. Multiple WAL entries are represented as a collection of replay subtask groups: Task_{*,*} = [Task_{0,*}, Task_{1,*}, ..., Task_{N,*}].

    In Task_{*,*}, a subtask does not always depend on the subtask that precedes it: only subtasks that modify the same data block must be replayed in LSN order.

    Assume that the collection of subtask groups is Task_{*,*} = [Task_{0,*}, Task_{1,*}, Task_{2,*}], where:

    • Task_{0,*} = [Task_{0,0}, Task_{0,1}, Task_{0,2}]

    • Task_{1,*} = [Task_{1,0}, Task_{1,1}]

    • Task_{2,*} = [Task_{2,0}]

    and Block_{0,0} = Block_{1,0}, Block_{0,1} = Block_{1,1}, and Block_{0,2} = Block_{2,0}.

    Three pairs of replay subtasks can then be run in parallel with one another, while the two subtasks within each pair are replayed in order: [Task_{0,0}, Task_{1,0}], [Task_{0,1}, Task_{1,1}], and [Task_{0,2}, Task_{2,0}].
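    The grouping rule can be made concrete with a short sketch. The following self-contained C program (illustrative only, not PolarDB source code) partitions the subtasks from the example above by target block and prints the three independent, internally ordered streams:

    /*
     * Illustrative sketch, not PolarDB source code: partition replay
     * subtasks by target block. Subtasks on the same block must run in
     * LSN order; subtasks on different blocks can run in parallel.
     */
    #include <stdio.h>

    typedef struct {
        int lsn;    /* index i of the WAL record     */
        int block;  /* ID of the modified data block */
    } Subtask;

    int main(void)
    {
        /* Task_{0,*}, Task_{1,*}, Task_{2,*} from the example above; block
         * 0 = Block_{0,0}, block 1 = Block_{0,1}, block 2 = Block_{0,2}. */
        Subtask tasks[] = {
            {0, 0}, {0, 1}, {0, 2},
            {1, 0}, {1, 1},
            {2, 2},
        };
        int n = (int) (sizeof(tasks) / sizeof(tasks[0]));

        /* One ordered stream per block; the streams are independent. */
        for (int b = 0; b <= 2; b++) {
            printf("block %d:", b);
            for (int i = 0; i < n; i++)
                if (tasks[i].block == b)
                    printf(" Task(lsn=%d)", tasks[i].lsn);
            printf("\n");
        }
        return 0;
    }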

    In summary, subtasks from many different WAL records can be executed in parallel without affecting the consistency of the final result. Based on this idea, PolarDB introduces a framework for parallel task execution and incorporates it into the WAL replay process.

  • Framework for parallel task execution

    A region of shared memory is evenly divided based on the number of concurrent processes. Each segment is a circular queue allocated to one process, and the depth of the circular queues is configured with parameters.

    Figure: Shared memory allocation
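    As a rough illustration, the per-process circular queues could be laid out as follows. This is a minimal sketch with assumed names and a fixed depth; the real queues live in shared memory and require synchronization:

    /*
     * Minimal sketch, assuming a fixed queue depth: one contiguous region
     * is split into per-process ring queues. Not PolarDB source code.
     */
    #include <stdlib.h>

    #define QUEUE_DEPTH 16        /* assumed parameter-controlled depth */

    typedef struct {
        int head;                 /* next slot the worker consumes  */
        int tail;                 /* next slot the dispatcher fills */
        int slots[QUEUE_DEPTH];   /* task node IDs                  */
    } RingQueue;

    /* Carve one queue per concurrent process out of a single block. */
    RingQueue *alloc_queues(int nprocs)
    {
        return calloc((size_t) nprocs, sizeof(RingQueue));
    }

    int queue_push(RingQueue *q, int task_id)   /* dispatcher side */
    {
        int next = (q->tail + 1) % QUEUE_DEPTH;
        if (next == q->head)
            return 0;             /* full: dispatcher picks another process */
        q->slots[q->tail] = task_id;
        q->tail = next;
        return 1;
    }

    int queue_pop(RingQueue *q, int *task_id)   /* worker side */
    {
        if (q->head == q->tail)
            return 0;             /* empty */
        *task_id = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_DEPTH;
        return 1;
    }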

    • Dispatcher process

      • Controls concurrent scheduling by distributing tasks to specified processes.

      • Removes the tasks that have been executed from the circular queue.

    • Processes pool

      Each process in the pool retrieves task nodes from its own circular queue and decides whether to execute each task based on the state of its task node.

      Figure: Processes pool

    • Task

      A circular queue consists of task nodes. A task node can be in five states: idle, running, hold, finished, and removed.

      • idle: no task is assigned to the task node.

      • running: the task node has been assigned a task, which is either being executed or to be executed.

      • hold: the task node has been assigned a task that depends on another task and must wait for that task to complete.

      • finished: the task on the task node has been executed.

      • removed: the dispatcher process has removed the task and its prerequisite tasks from the circular queue. When the task on a task node is finished, all of its prerequisite tasks must also have finished, so the whole chain can be safely removed. This mechanism ensures that the dispatcher process handles the execution results of tasks in dependency order.

      Figure: Task node state transitions

      In the preceding figure, the state transitions marked in black are performed by the dispatcher process, and those marked in orange are performed by the processes in the processes pool.
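      The five states map naturally onto an enum. The identifiers below are assumptions for illustration, not PolarDB's actual definitions:

      /* Sketch of the five task node states described above. */
      typedef enum TaskNodeState {
          TASK_NODE_IDLE,      /* no task assigned                               */
          TASK_NODE_RUNNING,   /* task assigned; executing or waiting to execute */
          TASK_NODE_HOLD,      /* task assigned but blocked on a prerequisite    */
          TASK_NODE_FINISHED,  /* task executed                                  */
          TASK_NODE_REMOVED    /* reclaimed by the dispatcher, along with its
                                * prerequisite tasks                             */
      } TaskNodeState;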

    • Dispatcher process

      The dispatcher process has three essential data structures: Task HashMap, Task Running Queue, and Task Idle Nodes.

      • Task HashMap: records the hash mappings between task tags and tasks.

        • Each task is assigned a task tag. Dependent tasks have the same task tags.

        • If a task on a task node has a prerequisite, the task node will be in the hold state until the prerequisite task is complete.

      • Task Running Queue: records the tasks that are being executed.

      • Task Idle Nodes: records the idle task nodes of each process in the processes pool.
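      The three structures could be sketched as follows; all shapes, sizes, and field names are assumptions for illustration, not PolarDB's definitions:

      /* Sketch of the dispatcher's bookkeeping structures. */
      #define NPROCS      8      /* assumed size of the processes pool */
      #define HASH_SLOTS  256    /* assumed hash table size            */

      typedef struct TaskEntry {
          unsigned long     tag;       /* task tag; dependent tasks share it  */
          int               node_id;   /* task node holding this task         */
          struct TaskEntry *next_dep;  /* later task with the same tag (held) */
      } TaskEntry;

      typedef struct Dispatcher {
          TaskEntry *task_hashmap[HASH_SLOTS];   /* Task HashMap: tag -> tasks */
          int        running_queue[NPROCS * 16]; /* Task Running Queue         */
          int        nrunning;
          int        idle_nodes[NPROCS];         /* Task Idle Nodes per process */
      } Dispatcher;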

      Dispatcher scheduling strategies:

      • Preferentially assigns a task to the process that is executing the task with the same task tag, that is, the new task's immediate prerequisite. Assigning dependent tasks to the same process reduces the synchronization overhead between processes.

      • If the queue of the preferred process is full, or if no process is running a task with the same task tag, the dispatcher traverses the processes pool in sequential order and assigns the task to a task node in the idle state. This way, tasks are evenly distributed across processes.

      Figure: Scheduling strategy
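      Taken together, the two rules amount to selection logic along these lines (a simplified, self-contained sketch; every name is assumed):

      /* Sketch of the two scheduling rules; not PolarDB source code.
       * running_tag[w]: tag of the task process w is executing (0 = none).
       * free_nodes[w]:  idle task nodes left in process w's queue. */
      #define NPROCS 4

      static unsigned long running_tag[NPROCS];
      static int           free_nodes[NPROCS];

      int choose_process(unsigned long tag)
      {
          /* Rule 1: prefer the process already executing the immediate
           * prerequisite (same task tag), avoiding cross-process sync. */
          for (int w = 0; w < NPROCS; w++)
              if (running_tag[w] == tag && free_nodes[w] > 0)
                  return w;

          /* Rule 2: otherwise walk the pool in sequential order and take
           * the next process with an idle task node, balancing the load. */
          static int cursor = 0;
          for (int i = 0; i < NPROCS; i++) {
              int w = (cursor + i) % NPROCS;
              if (free_nodes[w] > 0) {
                  cursor = (w + 1) % NPROCS;
                  return w;
              }
          }
          return -1;   /* all queues are full: the dispatcher must wait */
      }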

    • Processes pool

      All tasks share the same task node data structure and the same parallel execution mechanism. Task-specific behavior is supplied through SchedContext, which is configured during the initialization of the processes pool and specifies the following function pointers for task execution:

      • TaskStartup: the initialization action before the process can execute assigned tasks.

      • TaskHandler: executes the assigned task.

      • TaskCleanup: the cleanup action before the process exits.

      Figure: Processes pool (1)
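      In outline, SchedContext is a table of callbacks. The field names follow the description above, but the definitions are an assumed sketch, not PolarDB's actual code:

      /* Sketch of a SchedContext-style callback table and a worker's
       * life cycle around it. */
      typedef struct SchedContext {
          void (*TaskStartup)(void *arg);   /* init before executing tasks */
          void (*TaskHandler)(void *task);  /* execute one assigned task   */
          void (*TaskCleanup)(void *arg);   /* cleanup before process exit */
      } SchedContext;

      void worker_main(SchedContext *ctx, void *arg)
      {
          ctx->TaskStartup(arg);
          /* ... loop: fetch task nodes from the circular queue and call
           *     ctx->TaskHandler(task) for nodes in the running state ... */
          ctx->TaskCleanup(arg);
      }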

      A process in the processes pool retrieves a task node from its circular queue. If the task node is in the hold state, the process appends it to the tail of the hold list. If the task node is in the running state, the process calls TaskHandler to execute the task. If the handler fails, the process sets the hold count of the task node to 3 (the default) and inserts the node at the head of the hold list.

      Figure: Processes pool (2)

      The process scans the hold list from head to tail. When it finds a task node in the running state, the process checks the node's hold count: if the hold count is 0, the task is executed; if it is greater than 0, the hold count is reduced by 1. The following figure shows how the hold list is scanned.

      Figure: Processes pool (3)
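      The hold-list handling might look like this in outline (a simplified sketch with assumed names; execute_task is a stub standing in for replaying one subtask):

      /* Sketch of the hold list scan; not PolarDB source code. A failed
       * task re-enters at the head with hold_count = 3, so it is skipped
       * for the next three scans before being retried. */
      #include <stddef.h>

      typedef struct HoldNode {
          int              task_id;
          int              running;      /* 1 = running state, 0 = hold state */
          int              hold_count;   /* scans to skip before retrying     */
          struct HoldNode *next;
      } HoldNode;

      static void execute_task(int task_id)   /* stub: replay one subtask */
      {
          (void) task_id;
      }

      void scan_hold_list(HoldNode *head)
      {
          for (HoldNode *n = head; n != NULL; n = n->next) {
              if (!n->running)
                  continue;            /* still held by a prerequisite    */
              if (n->hold_count > 0)
                  n->hold_count--;     /* delay the retry a little longer */
              else
                  execute_task(n->task_id);
          }
      }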

  • Parallel replay of WAL records

    LogIndex records the mappings between WAL entries and the data blocks that they modify, and LogIndex data can be retrieved by LSN. PolarDB applies the parallel task execution framework to the continuous replay of WAL records on read-only nodes. Combined with LogIndex data, parallel replay of WAL records accelerates data synchronization between the primary node and read-only nodes.

    Workflow

    • The startup process parses the WAL and then builds LogIndex data without replaying the WAL.

    • The LogIndex BG Writer process, as the dispatcher process in the parallel replay framework, uses LSNs to retrieve LogIndex data, builds replay subtasks, and assigns them to the processes pool.

    • A parallel worker process executes a replay subtask and replays a single WAL record on a data block.

    • The backend process, while reading a data block, uses the page tag to retrieve LogIndex data and replays all WAL records mapped to this data block according to the LogIndex data. The following figure shows the workflow of parallel replay.

      Figure: Parallel replay workflow

    • Dispatcher process: uses LSNs to retrieve LogIndex data, sequentially lists page tags and their respective LSNs, and constructs {LSN -> Page tag} mappings as task nodes.

    • The page tag serves as the task tag: task nodes whose tasks target the same data block share the same task tag.

    • Each task node is distributed to a parallel worker process in the processes pool for replay.
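    As a rough illustration, a replay task node pairs an LSN with a page tag, and the page tag doubles as the task tag so that replays of the same data block stay ordered on one worker. The types below are assumptions for illustration, not PolarDB's definitions:

    /* Sketch of a {LSN -> page tag} replay task node; illustrative only. */
    #include <stdint.h>

    typedef struct BufferTag {   /* identifies one data block (assumed shape) */
        uint32_t rel_id;         /* relation the block belongs to             */
        uint32_t block_no;       /* block number within the relation          */
    } BufferTag;

    typedef struct ReplayTaskNode {
        uint64_t  lsn;           /* which WAL record to replay                */
        BufferTag page;          /* which block to replay it on; also used as
                                  * the task tag, so all replays of one block
                                  * are serialized on a single worker process */
    } ReplayTaskNode;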

Configuration

To enable parallel replay of WAL records, set the following parameter in the postgresql.conf file of the read-only node:

polar_enable_parallel_replay_standby_mode = ON