
PolarDB:Online promote

Last Updated:May 28, 2024

The online promote feature in PolarDB for PostgreSQL is used to promote a read-only node to the primary node.

Prerequisites

The online promote feature is supported by PolarDB for PostgreSQL clusters running one of the following versions:

  • PostgreSQL 14 (revision version 14.5.1.0 or later)

  • PostgreSQL 11 (revision version 1.1.1 or later)

Note

You can execute the following statement to view the revision version of your PolarDB for PostgreSQL cluster:

  • PostgreSQL 14

    select version();
  • PostgreSQL 11

    show polar_version;

Background information

PolarDB for PostgreSQL adopts an architecture that consists of one primary node and multiple read-only nodes based on shared storage. This architecture differs from the primary/secondary architecture of a traditional database in the following ways:

  • Secondary node: the replica node used by a traditional database. It has separate storage and synchronizes data with the primary node by replaying complete WAL logs.

  • Read-only node: the replica node used by PolarDB for PostgreSQL. It shares storage with the primary node and synchronizes data with the primary node by replaying WAL meta logs.

In a traditional database, the promote operation promotes the secondary node to the primary node without restarting the database or interrupting data reads and writes. This ensures high availability of the database and reduces the recovery time objective (RTO).

PolarDB for PostgreSQL also requires the ability to promote a read-only node to the primary node. Because read-only nodes are different from secondary nodes on traditional databases, PolarDB for PostgreSQL provides the online promote feature to promote a read-only node to the primary node.

Usage

You can use the pg_ctl utility to promote a read-only node:

pg_ctl promote -D [datadir]

How it works

The following sections describe how the online promote feature works.

  • Trigger mechanism

    PolarDB for PostgreSQL triggers the promote operation in the same ways in which a traditional database promotes a secondary node. The promote operation can be triggered in either of the following ways:

    • Call the promote command in the pg_ctl utility to signal the postmaster process, which in turn coordinates with other processes to seamlessly execute the promote operation.

    • Define the path of a trigger file in the recovery.conf file. An external component can then trigger the promote operation by creating the trigger file.

    Note

    The process of promoting a read-only node in PolarDB for PostgreSQL differs from the process of promoting a secondary node in a traditional database in the following key aspects:

    • After the read-only node is promoted to the primary node, you must remount the shared storage in read/write mode.

    • The read-only node maintains some important control information in memory, whereas the primary node persists this information on the shared storage. During the online promote process, this control information must also be persisted to the shared storage.

    • The read-only node retrieves data into its memory by replaying logs. During the online promote process, it is essential to determine which of this data can be written to the shared storage.

    • When the read-only node replays WAL logs in the memory, it uses buffer eviction methods and flushing control rules that differ from those of the primary node.

    • The subprocesses on the read-only node are implemented differently for the online promote process.
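The two trigger paths described above can be sketched as follows. This is a minimal illustration in Python, not PolarDB's actual C implementation; the `promotion_requested` helper and its arguments are hypothetical.

```python
import os

def promotion_requested(datadir, trigger_file=None):
    """Check both promote trigger paths (illustrative helper)."""
    # Path 1: `pg_ctl promote` creates a "promote" file in the data
    # directory and signals the postmaster.
    if os.path.exists(os.path.join(datadir, "promote")):
        return True
    # Path 2: an external component creates the trigger file whose
    # path is configured in recovery.conf.
    if trigger_file is not None and os.path.exists(trigger_file):
        return True
    return False
```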

  • Postmaster process

    1. After the postmaster process discovers the trigger file or receives the promote command, it starts the online promote process.

    2. Send a SIGTERM signal to all current backend processes.

      Note

      Read-only nodes can still serve data reads after the online promote process starts. However, the data may not be the latest. To prevent stale data from being read from the new primary node during the switchover, all backend sessions are torn down, and data reads and writes resume only after the Startup process completes.

    3. Remount the shared storage in read/write mode.

      Note

      This step requires support from the underlying storage.

    4. Send the SIGUSR2 signal to the Startup process to notify it to stop log replay and handle the online promote operations.

    5. Send the SIGUSR2 signal to the Polar worker process to notify it to stop parsing LogIndex data, because such data is needed only while the node runs as a read-only node.

    6. Send the SIGUSR2 signal to the LogIndex background worker (BGW) process to notify it to handle the online promote operations.

    The following figure shows the Postmaster process.
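The ordering of steps 2-6 above can be sketched as a simple dispatch routine. This is an illustrative simulation only; the signal names match the steps, but the function and targets are hypothetical.

```python
def online_promote_dispatch(backend_pids):
    """Return the ordered actions of steps 2-6 as (action, target) pairs."""
    actions = []
    # Step 2: tear down all current backend sessions.
    for pid in backend_pids:
        actions.append(("SIGTERM", pid))
    # Step 3: remount the shared storage in read/write mode.
    actions.append(("remount_rw", "shared_storage"))
    # Steps 4-6: notify the Startup, Polar worker, and LogIndex BGW
    # processes to handle the online promote operations.
    for proc in ("startup", "polar_worker", "logindex_bgw"):
        actions.append(("SIGUSR2", proc))
    return actions
```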

  • Startup process

    1. Replay all WAL logs generated by the old primary node and generate LogIndex data.

    2. Confirm that the last checkpoint of the old primary node is also performed on the read-only node. This ensures that the data written to the read-only node for the last checkpoint is saved to the storage.

    3. Wait for the LogIndex BGW process to enter the POLAR_BG_WAITING_RESET state.

    4. Copy the local data (such as clogs) of the read-only node to the shared storage.

    5. Reset the WAL meta queue memory space, reload slot information from the shared storage, and set the LSN of the LogIndex BGW process to the minimum value among the current consistency LSNs, so that the LogIndex BGW process starts a new replay from this point.

    6. Set the node role to primary and set the status of the LogIndex BGW process to POLAR_BG_ONLINE_PROMOTE. Then, the cluster can provide data reads and writes.

    The following figure shows the Startup process.
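The six Startup-process steps can be summarized in a short sketch. This is a simplified simulation under assumed data structures (a WAL queue, a BGW state dictionary); none of these names are actual PolarDB symbols.

```python
def startup_online_promote(pending_wal, bgw, local_files, consistency_lsns):
    """Walk through the six Startup-process steps (simplified)."""
    # Step 1: replay all remaining WAL of the old primary node and
    # generate LogIndex data (modeled here as draining a queue).
    replayed = list(pending_wal)
    pending_wal.clear()
    # Steps 2-3: the last checkpoint is assumed flushed; wait until the
    # LogIndex BGW process reaches the POLAR_BG_WAITING_RESET state.
    assert bgw["state"] == "POLAR_BG_WAITING_RESET"
    # Step 4: copy local data such as clogs to the shared storage.
    shared_storage = dict(local_files)
    # Step 5: restart background replay from the minimum consistency LSN.
    bgw["replay_lsn"] = min(consistency_lsns)
    # Step 6: become primary; the BGW now replays in promote mode.
    bgw["state"] = "POLAR_BG_ONLINE_PROMOTE"
    return "primary", replayed, shared_storage
```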

  • LogIndex BGW process

    The LogIndex BGW process has its own state machine and runs based on it throughout its lifecycle. The LogIndex BGW process can be in one of the following states:

    • POLAR_BG_WAITING_RESET: The LogIndex BGW process is in the reset state and notifies other processes that the state machine has changed.

    • POLAR_BG_ONLINE_PROMOTE: The LogIndex BGW process reads LogIndex data, organizes and distributes replay tasks, and uses the parallel replay process group to replay WAL logs. The process must replay all LogIndex data before it changes the state. It also determines the LSN for background replay.

    • POLAR_BG_REDO_NOT_START: Indicates that the replay task has ended.

    • POLAR_BG_RO_BUF_REPLAYING: While the node runs as a read-only node, the LogIndex BGW process reads LogIndex data and replays WAL logs in a sequential manner. Each time a replay task is complete, the process determines the LSN for background replay.

    • POLAR_BG_PARALLEL_REPLAYING: The LogIndex BGW process reads LogIndex data, organizes and distributes replay tasks, and uses the parallel replay process group to replay WAL logs. Each time a replay task is complete, the process determines the LSN for background replay.

    The following figure shows the states of the LogIndex BGW process.

    After the LogIndex BGW process receives the SIGUSR2 signal from the Postmaster process, it performs the following online promote operations:

    1. Save all LogIndex data to the storage and change the status to POLAR_BG_WAITING_RESET.

    2. Wait for the Startup process to change the state of the LogIndex BGW process to POLAR_BG_ONLINE_PROMOTE.

      • Before the read-only node is promoted, the LogIndex BGW process replays only the pages in the buffer pool.

      • While the read-only node is being promoted, the LogIndex BGW process replays all WAL logs in a sequential manner and calls MarkBufferDirty to mark the replayed pages as dirty, waiting for them to be flushed.

      • When the replay task is complete, the LogIndex BGW process determines the LSN for background replay and then changes the state to POLAR_BG_REDO_NOT_START.
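During an online promote, the LogIndex BGW process thus walks a fixed path through its states. The following sketch models that path; it is an illustration of the sequence described above, not PolarDB's actual state-machine code.

```python
# State path of the LogIndex BGW process during an online promote.
PROMOTE_PATH = [
    "POLAR_BG_RO_BUF_REPLAYING",   # normal read-only replay
    "POLAR_BG_WAITING_RESET",      # LogIndex data saved after SIGUSR2
    "POLAR_BG_ONLINE_PROMOTE",     # set by the Startup process
    "POLAR_BG_REDO_NOT_START",     # all LogIndex data replayed
]

def run_promote_state_machine(state="POLAR_BG_RO_BUF_REPLAYING"):
    """Advance the LogIndex BGW process along the promote path."""
    visited = [state]
    while state != "POLAR_BG_REDO_NOT_START":
        state = PROMOTE_PATH[PROMOTE_PATH.index(state) + 1]
        visited.append(state)
    return visited
```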

  • Flushing control

    Each dirty page has an oldest LSN, and dirty pages are ordered by their oldest LSNs in the FlushList. This LSN is used to determine the consistency LSN.

    After the read-only node is promoted, if the current WAL LSN were simply used as the oldest LSN of a dirty buffer, the consistency LSN could advance before the data of LSNs lower than that oldest LSN is saved to the storage.

    The online promote feature must solve two issues:

    • How to set the oldest LSN for dirty pages during WAL replay on the old primary node?

    • How to set the oldest LSN for dirty pages generated by the new primary node?

    Note

    During the online promote process, PolarDB for PostgreSQL sets the oldest LSN of the dirty pages generated in the preceding two cases to the LSN determined by the LogIndex BGW process. Only after all buffers marked with the same oldest LSN are saved to the storage does the process determine a new LSN.
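The flushing-control rule can be illustrated with a toy model: the consistency LSN is bounded by the smallest oldest LSN of any dirty page still in the FlushList, and it can advance only once every buffer marked with that oldest LSN has been flushed. This is a sketch under assumed names, not the actual implementation.

```python
def consistency_lsn(flush_list):
    """The consistency LSN cannot advance past the smallest oldest LSN
    of any dirty page still in the FlushList (toy model)."""
    return min(flush_list) if flush_list else None

# Dirty pages with their oldest LSNs, ordered in the FlushList.
flush_list = [100, 120, 150]
assert consistency_lsn(flush_list) == 100
# Only after every buffer marked with oldest LSN 100 is flushed
# can the consistency LSN advance.
flush_list.pop(0)
assert consistency_lsn(flush_list) == 120
```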