PostgreSQL CDC connector (public preview) - Realtime Compute for Apache Flink

The PostgreSQL change data capture (CDC) connector is used to read existing data and changed data from a PostgreSQL database. The process is divided into two phases based on the data type: full scan phase and incremental capture phase. If a failure occurs, the connector recovers data in an exactly-once manner to ensure that no data is duplicated or lost. This topic describes how to use the PostgreSQL CDC connector.

Background information

The following table describes the capabilities of the PostgreSQL CDC connector.

Item	Description
Table type	Source table Note You can use the Java Database Connectivity (JDBC) connector to create a sink or dimension table.
Running mode	Streaming mode
Data format	N/A
Monitoring metrics	currentFetchEventTimeLag: the interval between data generation and data fetching to the source operator. currentEmitEventTimeLag: the interval between data generation and data departure from the source operator. sourceIdleTime: the duration from the most recent data generation time in the CDC source to the current time. Note The currentFetchEventTimeLag and currentEmitEventTimeLag metrics are valid only in the incremental capture phase. The metric values are 0 in the full scan phase. For more information about the metrics, see Metrics.
API type	SQL API
Data updates or deletion in a sink table	N/A

Features

The PostgreSQL CDC supports the incremental snapshot reading feature in Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.6 or later. When the connector is run, it first initiates the full scan phase to read the snapshot of a PostgreSQL database. After the full scan is completed, the connector automatically switches to the incremental capture phase to read write-ahead log (WAL) files of the database. If a failure occurs, the connector recovers data in an exactly-once manner to prevent data duplication or loss.

You can use the connector to create a source table, which provides the following advantages:

Supports reading full and incremental data to facilitate unified stream and batch processing.
Supports parallel reading in the full scan phase, which improves reading performance.
Supports automatic and seamless switching from the full scan phase to the incremental capture phase to reduce resource consumption.
Supports data resumption in the full scan phase to enhance stability.
Supports lock-free reading in the full scan phase to ensure normal business operation.

Prerequisites

The PostgreSQL CDC connector uses the logical replication of PostgreSQL to read data changes from ApsaraDB RDS for PostgreSQL, Amazon RDS for PostgreSQL, and self-managed PostgreSQL databases.

Important

Related configurations may be different in ApsaraDB RDS for PostgreSQL, Amazon RDS for PostgreSQL, and self-managed PostgreSQL databases. Make sure that you complete the required configurations. For more information, see Configure a PostgreSQL database.

After you complete the related configurations, make sure that the following conditions are met:

The wal_level parameter is set to logical, which specifies that the information required for logical decoding is included in WAL files.
The REPLICA IDENTITY parameter of the subscribed table is set to FULL to ensure data consistency.
Note
REPLICA IDENTITY is a PostgreSQL-specific table-level parameter. The parameter specifies whether to include the previous values of the involved table columns in the INSERT and UPDATE events that are generated by the logical decoding plug-in. For more information, see REPLICA IDENTITY.
The max_wal_senders and max_replication_slots parameters are set to a value greater than the number of occupied replication slots in the monitored database and the number of required replication slots for the Flink deployment.
Your database account is assigned a superuser role or granted the LOGIN and REPLICATION permissions. Your database account is also granted the SELECT permission on the subscribed table for full data query.

Limits

The incremental snapshot reading feature of the PostgreSQL CDC connector requires Realtime Compute for Apache Flink that uses VVR 8.0.6 or later.
You must manually manage replication slots to avoid disk space waste.
Replication slots are used to retain WAL segments. If a Flink job is restarted, the job can resume from the most recent checkpoint based on the WAL segments, thereby preventing data loss. If you confirm that a Flink job no longer requires a restart, we recommend that you manually delete the corresponding replication slots to free up the occupied resources. In addition, if the progress marker of a replication slot remains stagnant for a long period of time, PostgreSQL retains the WAL records after the marked position. The retention of unnecessary WAL records may result in a significant increase in disk space usage.

If you enable the incremental snapshot reading feature of the PostgreSQL CDC connector, you must enable checkpointing for the connector and define a primary key in the source table. When the incremental snapshot reading feature is enabled, multiple temporary replication slots are created to achieve parallel reading in the full scan phase.
The number of required replication slots in the full scan phase is equal to Number of sources × Parallelism + 1. After the PostgreSQL CDC connector switches from the full scan phase to the incremental capture phase, the system automatically removes temporary replication slots and retains only one replication slot. If you want to limit the number of replication slots used in the full scan phase, you can reduce the parallelism. Take note that this also reduces the reading speed. If the downstream operator or storage supports idempotence, you can use the following configuration to skip log reading in the full scan phase: scan.incremental.snapshot.backfill.skip = true. In this case, the connector provides only at-least-once guarantees. If you use SQL to perform operations such as aggregation and association, we recommend that you do not skip log reading in the full scan phase.

If you disable the incremental snapshot reading feature, the PostgreSQL CDC connector cannot perform checkpoints during the full scan phase.

If the feature is disabled and checkpoints are triggered during the full scan phase, the deployment may perform a failover due to a checkpoint timeout. To prevent the issue, we recommend that you add the following configurations to the Other Configuration field in the Parameters section of the Configuration tab. For more information, see Management and operations for workspaces and namespaces.

execution.checkpointing.interval: 10min
execution.checkpointing.tolerable-failed-checkpoints: 100
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647

The following table describes the parameters in the preceding configurations.

Parameter	Description	Notes
execution.checkpointing.interval	The interval at which checkpoints are triggered.	Data type: DURATION. Example: 10min and 30s.
execution.checkpointing.tolerable-failed-checkpoints	The maximum number of checkpoint failures that are allowed.	The product of the value of this parameter and the checkpoint scheduling interval is the maximum duration of the full scan phase. Note If the scanned table is excessively large, we recommend that you specify a sufficiently large value.
restart-strategy	The restart policy.	Valid values: fixed-delay: The job is restarted with a fixed delay. failure-rate: The job is restarted based on the failure rate. exponential-delay: The job is restarted with an exponential delay. For more information, see Restart Strategies.
restart-strategy.fixed-delay.attempts	The maximum number of restart attempts when you set the restart-strategy parameter to fixed-delay.	N/A.

Syntax

CREATE TABLE postgrescdc_source (
  shipment_id INT,
  order_id INT,
  origin STRING,
  destination STRING,
  is_arrived BOOLEAN
) WITH (
  'connector' = 'postgres-cdc',
  'hostname' = '<yourHostname>',
  'port' = '5432',
  'username' = '<yourUserName>',
  'password' = '<yourPassWord>',
  'database-name' = '<yourDatabaseName>',
  'schema-name' = '<yourSchemaName>',
  'table-name' = '<yourTableName>'
);

Parameters in the WITH clause

Parameter	Description	Data type	Required	Default value	Notes
connector	The type of the connector.	STRING	Yes	No default value	Set the value to `postgres-cdc`.
hostname	The IP address or hostname of the PostgreSQL database.	STRING	Yes	No default value	N/A.
username	The username that is used to access the PostgreSQL database service.	STRING	Yes	No default value	N/A.
password	The password that is used to access the PostgreSQL database service.	STRING	Yes	No default value	N/A.
database-name	The name of the PostgreSQL database.	STRING	Yes	No default value	The name of the PostgreSQL database.
schema-name	The schema name of the PostgreSQL database.	STRING	Yes	No default value	If you want to read data from multiple schemas, set this parameter to a regular expression.
table-name	The name of the table in the PostgreSQL database.	STRING	Yes	No default value	If you want to read data from multiple tables, you can set this parameter to a regular expression.
port	The port that is used to access the PostgreSQL database service.	INTEGER	No	5432	N/A.
decoding.plugin.name	The name of the logical decoding plug-in that is installed in the PostgreSQL database service.	STRING	No	decoderbufs	Valid values: decoderbufs: This plug-in is supported and must be installed in PostgreSQL 9.6 or later. This is the default value. pgoutput: This plug-in is the built-in plug-in in PostgreSQL 10 or later. This is the recommended value.
slot.name	The name of the logical decoding slot.	STRING	Yes (applies to VVR 8.0.1 or later)	flink (applies to VVR versions earlier than 8.0.1) or N/A (applies to VVR 8.0.1 or later)	We recommend that you configure the `slot.name` parameter for each table to prevent the following error: `PSQLException: ERROR: replication slot "debezium" is active for PID 974`.
debezium.*	The Debezium properties.	STRING	No	No default value	This parameter is used to achieve fine-grained control over the Debezium client. Example: `'debezium.snapshot.mode' = 'never'`, which specifies that the connector does not generate snapshots. For more information, see Connector properties.
scan.incremental.snapshot.enabled	Specifies whether to enable the incremental snapshot reading feature.	BOOLEAN	No	false	Valid values: false (default): disables the incremental snapshot reading feature. true: enables the incremental snapshot reading feature. Note This experimental feature is supported only in Realtime Compute for Apache Flink that uses VVR 8.0.6 or later. For more information about the feature, see the Features, Prerequisites, and Limits sections of this topic.
scan.startup.mode	The startup mode that is used to consume data.	STRING	No	initial	Valid values: initial (default): performs a full scan on the monitored database during the first startup and then continues to read the most recent WAL files. latest-offset: skips the full scan on the monitored database during the first startup and reads from the end of the WAL files. In this case, only the most recent changes after the connector is started are recorded. snapshot: performs a full scan on the monitored database, reads the generated WAL files during the full scan phase, and then stops reading.
changelog-mode	The changelog mode that is used to encode streaming changes.	STRING	No	all	Valid values: ALL: encodes all types of changes, including INSERT, DELETE, UPDATE_BEFORE, and UPDATE_AFTER changes. UPSERT: encodes only upsert-type changes, including INSERT, DELETE, and UPDATE_AFTER changes.
heartbeat.interval.ms	The interval at which heartbeat packets are sent.	DURATION	No	30s	Unit: millisecond. The PostgreSQL CDC connector sends heartbeat packets to the monitored database to synchronize the progress offset of replication slots. If table changes are not frequent, configure this parameter to remove unnecessary WAL files at the earliest opportunity.
scan.incremental.snapshot.chunk.key-column	The column that is used to split tables into chunks in the full scan phase.	STRING	No	No default value	By default, the first column of the primary key is used.
scan.incremental.close-idle-reader.enabled	Specifies whether to close the idle readers at the end of the full scan phase.	BOOLEAN	No	false	This parameter is valid only if you set the `execution.checkpointing.checkpoints-after-tasks-finish.enabled` parameter to true.
scan.incremental.snapshot.backfill.skip	Specifies whether to skip log reading in the full scan phase.	BOOLEAN	No	false	Valid values: true: skips log reading. In this case, the connector reads logs from the low watermark in the incremental capture phase. If the downstream operator or storage supports idempotence, we recommend that you set this parameter to true. This reduces the number of required replication slots and provides at-least-one instead of exactly-once guarantees. false: does not skip log reading. In this case, the connector reads the logs between the low watermark and high watermark to ensure consistency. If you use SQL to perform operations such as aggregation and association, we recommend that you do not skip log reading in the full scan phase.

Data type mappings

The following table describes the mapping between data types in PostgreSQL and Realtime Compute for Apache Flink.

Data type in PostgreSQL	Data type in Realtime Compute for Apache Flink
SMALLINT	SMALLINT
INT2
SMALLSERIAL
SERIAL2
INTEGER	INT
SERIAL	INT
BIGINT	BIGINT
BIGSERIAL	BIGINT
REAL	FLOAT
FLOAT4	FLOAT
FLOAT8	DOUBLE
DOUBLE PRECISION	DOUBLE
NUMERIC(p, s)	DECIMAL(p, s)
DECIMAL(p, s)	DECIMAL(p, s)
BOOLEAN	BOOLEAN
DATE	DATE
TIME [(p)] [WITHOUT TIMEZONE]	TIME [(p)] [WITHOUT TIMEZONE]
TIMESTAMP [(p)] [WITHOUT TIMEZONE]	TIMESTAMP [(p)] [WITHOUT TIMEZONE]
CHAR(n)	STRING
CHARACTER(n)
VARCHAR(n)
CHARACTER VARYING(n)
TEXT
BYTEA	BYTES

Sample code

CREATE TABLE source (
  id INT NOT NULL,
  name STRING,
  description STRING,
  weight DECIMAL(10,3)
) WITH (
  'connector' = 'postgres-cdc',
  'hostname' = '<host name>',
  'port' = '<port>',
  'username' = '<user name>',
  'password' = '<password>',
  'database-name' = '<database name>',
  'schema-name' = '<schema name>',
  'table-name' = '<table name>'
);

SELECT * FROM source;

References

For information about the connectors that are supported by Realtime Compute for Apache Flink, see Supported connectors.
For more information about how to write data to a PolarDB for Oracle 1.0 result table, see PolarDB for Oracle 1.0 connector.
You can use the MySQL connector to read data from or write data to ApsaraDB RDS for MySQL, PolarDB for MySQL, and self-managed MySQL databases. For more information, see MySQL connector.