This topic describes how to use the INSERT ON CONFLICT statement in Hologres.
Scenarios
The INSERT ON CONFLICT
statement is suitable for scenarios in which data is written by executing SQL statements.
If you use the Data Integration service of DataWorks or Realtime Compute for Apache Flink to insert data into a table, you can configure a write conflict processing parameter to determine whether to ignore the data that has the same primary key values as existing data or update the existing data.
Use the Data Integration service of DataWorks to write data to Hologres.
The INSERT ON CONFLICT
statement is integrated into Data Integration. For more information about how the statement works, see Hologres Writer. If you need to use Data Integration, you must configure one of the following settings:
If you use a batch synchronization node, set the conflictMode parameter to Ignore or Replace.
If you use a real-time synchronization node, set the conflictMode parameter to Ignore or Replace.
Note
A primary key must be specified for the destination Hologres table. This way, data in the table can be updated during data synchronization.
Use Realtime Compute for Apache Flink to write data to Hologres.
If you use Realtime Compute for Apache Flink to write data to Hologres, the default write conflict policy is InsertOrIgnore
. InsertOrIgnore indicates that data whose primary key values appear for the first time is inserted and data whose primary key values already exist is ignored. You must make sure that a primary key is specified when you create the destination Hologres table. If you use the CREATE TABLE AS SELECT (CTAS)
syntax, the default write conflict policy is InsertOrUpdate
. InsertOrUpdate indicates that existing data in specified fields is replaced by new data.
Syntax
You can use the INSERT ON CONFLICT statement to specify a policy to deal with data whose primary key values already exist when you insert data into a table. The execution result of the INSERT ON CONFLICT statement is similar to that of the UPSERT statement. The INSERT ON CONFLICT
statement uses the following syntax:
INSERT INTO <table_name> [ AS <alias> ] [ ( <column_name> [, ...] ) ]
{ VALUES ( { <expression> } [, ...] ) [, ...] | <query> }
[ ON CONFLICT [ conflict_target ] conflict_action ]
where conflict_target is pk
ON CONSTRAINT constraint_name
and conflict_action is one of:
DO NOTHING
DO UPDATE SET { <column_name> = { <expression> } |
( <column_name> [, ...] ) = ( { <expression> } [, ...] ) |
} [, ...]
[ WHERE condition ]
The following table describes the parameters in the syntax.
Parameter | Description |
table_name | The name of the table into which you want to insert data. |
alias | The substitute name for the table into which you want to insert data. |
column_name | The name of the column in the table into which you want to insert data. |
DO NOTHING | The InsertOrIgnore mode is used. If you insert data whose primary key values already exist, the new data is ignored. |
DO UPDATE | The InsertOrUpdate mode is used. If you insert data whose primary key values already exist, the existing data is updated in one of the following methods: Data in all columns is updated. Only data in specified columns is updated. If you want to achieve the effect of the InsertOrReplace mode to insert the null value into the columns that are missing in the new data, manually insert the null value. For more information, see the "Examples" section in this topic. Important If a column contains a default value, the DO UPDATE parameter does not update the default value, and the performance is low. If you want to achieve the effect of the InsertOrReplace mode by executing the INSERT ON CONFLICT statement, you need to specify the null value in VALUES in the statement. If you use Realtime Compute for Apache Flink or Data Integration to write data to Hologres and specify InsertOrReplace, the null value is automatically inserted.
|
expression | The expression that is executed on the specified column. You can refer to the PostgreSQL documentation to configure the expression. You can use the common expression b=excluded.b or the simplified expression (a, b, c) = ROW (excluded.*) to insert or update data. The left side of the equation indicates the field to be updated. The right side of the equation indicates the inserted expression, which is the VALUES expression or the SELECT expression. excluded is the alias of the inserted expression instead of the alias of the source table. For example, in column_name = excluded.column_name , the column_name parameter specifies the name of the column into which data is inserted. If the column_name parameter specifies the Nth column in the destination table, excluded.column_name indicates the Nth column in the inserted expression. excluded.* specifies to insert data of all columns. You must make sure that the columns in the inserted expression are in the same order as the columns specified in the data definition language (DDL) statement of the destination table. |
How it works
The implementation principle of the INSERT ON CONFLICT statement is the same as that of the UPDATE statement. For more information about the UPDATE statement, see UPDATE. Storage modes are classified into row-oriented storage, column-oriented storage, and row-column hybrid storage. The processing method varies based on the storage mode that a table uses during updates. This leads to different update performance for tables that use different storage modes. INSERT ON CONFLICT supports the update modes InsertOrIgnore
, InsertOrReplace
, and InsertOrUpdate
to meet business requirements. The following table describes the differences among the update modes.
Update mode | Description |
InsertOrIgnore | Discards the data whose primary key values already exist. This mode is implemented by using the insert on conflict do nothing statement. |
InsertOrUpdate | Updates existing data if you insert data whose primary key values already exist. This update mode involves updates of an entire row and updates of specific columns of a row. If the written data does not cover all columns, the columns that are missing in the new data are not updated. This mode is implemented by using the insert on conflict do update statement. |
InsertOrReplace | Replaces existing data if you insert data whose primary key values already exist. If the written data does not cover all columns, the null value is inserted into the columns that are missing in the new data. This mode is implemented by using the insert on conflict do update statement and manual configuration of the null value. |
The update performance varies based on the update mode and storage mode of tables.
Update performance of column-oriented tables varies based on the update mode:
The destination table without a primary key has the highest performance.
If the destination table has a primary key, the update modes are listed in descending order by update performance as InsertOrIgnore > InsertOrReplace ≥ InsertOrUpdate (entire row) > InsertOrUpdate (some columns)
.
Update modes of row-oriented tables are listed in the following descending order by update performance:
InsertOrReplace = InsertOrUpdate (entire row) ≥ InsertOrUpdate (some columns) ≥ InsertOrIgnore
.
Errors and troubleshooting
Problem description
One of the following errors occurs when you execute the INSERT ON CONFLICT
statement to insert data into a table.
Cause 1: The source data contains duplicate rows.
Hologres is compatible with PostgreSQL and allows you to use the standard PostgreSQL syntax. If you use the standard PostgreSQL syntax to execute the INSERT ON CONFLICT
statement to insert data into a table, you must make sure that the source data does not contain duplicate rows. If the source data contains duplicate rows, one of the preceding error messages appears.
Note
Duplication in source data indicates that the source data contains duplicate rows. It does not indicate that the destination table contains the same row as the row to be inserted.
The source data that is specified in the INSERT ON CONFLICT
statement contains duplicate rows. Sample statement:
INSERT INTO test1 VALUES (1, 2, 3), (1, 2, 3)
ON CONFLICT (a)
DO UPDATE SET (a, b, c) = ROW (excluded.*);
Solution:
If the source data contains duplicate rows, run the following command to retain the last duplicate row.
set hg_experimental_affect_row_multiple_times_keep_last = on;
Cause 2: The source table contains duplicate data because the time to live (TTL) expires.
A TTL is configured for the source table. Some data whose TTL expires in the source table is not deleted because the TTL is not exact. This may cause duplicate data in the source table. In this case, if data with duplicate primary key values is written into the destination table, one of the preceding error messages appear.
Solution:
In Hologres V1.3.23 and later, you can use the following statement to fix this issue. After the statement is executed, the system deletes the duplicate primary key values in the table. The default cleanup policy is Keep Last, which indicates that the last row with the duplicate primary key value is retained, and other rows are deleted.
Note
In principle, a table does not contain duplicate primary key values. Therefore, this statement only clears duplicate primary key values that are caused by TTL expiration.
Hologres V1.3.23 and later support this statement. If the version of your Hologres instance is earlier than V1.3.23, upgrade your Hologres instance.
call public.hg_remove_duplicated_pk('<schema>.<table_name>');
Example: The tbl_1
table is the destination table. The tbl_2
table is the source table, and a TTL is configured for this source table. The TTL is set to 300 seconds
. Insert an entire row of data in the tbl_2
table into the tbl_1
table. If the TTL expires and the tbl_2
table contains data with duplicate primary key values, data fails to be written and an error occurs.
BEGIN;
CREATE TABLE tbl_1 (
a int NOT NULL PRIMARY KEY,
b int,
c int
);
CREATE TABLE tbl_2 (
d int NOT NULL PRIMARY KEY,
e int,
f int
);
CALL set_table_property('tbl_2', 'time_to_live_in_seconds', '300');
COMMIT;
INSERT INTO tbl_1 VALUES (1, 1, 1), (2, 3, 4);
INSERT INTO tbl_2 VALUES (1, 5, 6);
INSERT INTO tbl_2 VALUES (1, 3, 6);
INSERT INTO tbl_1 (a, b, c)
SELECT
d,e,f
FROM
tbl_2
ON CONFLICT (a)
DO UPDATE SET
(a,b,c) = ROW (excluded.*);
call public.hg_remove_duplicated_pk('tbl_2');