This topic describes how to use the INSERT ON CONFLICT statement in Hologres.
Scenarios
The INSERT ON CONFLICT
statement is suitable for scenarios in which data is written by executing SQL statements.
If you use the Data Integration service of DataWorks or Realtime Compute for Apache Flink to insert data into a table, you can configure a write conflict processing parameter to determine whether to ignore the data that has the same primary key values as existing data or update the existing data.
Use the Data Integration service of DataWorks to write data to Hologres.
The
INSERT ON CONFLICT
statement is integrated into Data Integration. For more information about how the statement works, see Hologres Writer. If you need to use Data Integration, you must configure one of the following settings:If you use a batch synchronization node, set the conflictMode parameter to Ignore or Replace.
If you use a real-time synchronization node, set the conflictMode parameter to Ignore or Replace.
NoteA primary key must be specified for the destination Hologres table. This way, data in the table can be updated during data synchronization.
Use Realtime Compute for Apache Flink to write data to Hologres.
If you use Realtime Compute for Apache Flink to write data to Hologres, the default write conflict policy is
InsertOrIgnore
. InsertOrIgnore indicates that data whose primary key values appear for the first time is inserted and data whose primary key values already exist is ignored. You must make sure that a primary key is specified when you create the destination Hologres table. If you use theCREATE TABLE AS SELECT (CTAS)
syntax, the default write conflict policy isInsertOrUpdate
. InsertOrUpdate indicates that existing data in specified fields is replaced by new data.
Syntax
You can use the INSERT ON CONFLICT statement to specify a policy to deal with data whose primary key values already exist when you insert data into a table. The execution result of the INSERT ON CONFLICT statement is similar to that of the UPSERT statement. The INSERT ON CONFLICT
statement uses the following syntax:
INSERT INTO <table_name> [ AS <alias> ] [ ( <column_name> [, ...] ) ]
{ VALUES ( { <expression> } [, ...] ) [, ...] | <query> }
[ ON CONFLICT [ conflict_target ] conflict_action ]
where conflict_target is pk
ON CONSTRAINT constraint_name
and conflict_action is one of:
DO NOTHING
DO UPDATE SET { <column_name> = { <expression> } |
( <column_name> [, ...] ) = ( { <expression> } [, ...] ) |
} [, ...]
[ WHERE condition ]
The following table describes the parameters in the preceding syntax.
Parameter | Description |
table_name | The name of the table into which you want to insert data. |
alias | The substitute name for the table into which you want to insert data. |
column_name | The name of the column in the table into which you want to insert data. |
DO NOTHING | The InsertOrIgnore mode is used. If you insert data whose primary key values already exist, the new data is ignored. |
DO UPDATE | The InsertOrUpdate mode is used. If you insert data whose primary key values already exist, the existing data is updated by using one of the following methods:
Important |
expression | The expression that is executed on the specified column. You can refer to the PostgreSQL documentation to configure the expression. You can use the common expression |
How it works
The implementation principle of the INSERT ON CONFLICT statement is the same as that of the UPDATE statement. For more information about the UPDATE statement, see UPDATE. Storage modes are classified into row-oriented storage, column-oriented storage, and row-column hybrid storage. The processing method varies based on the storage mode that a table uses during updates. This leads to different update performance for tables that use different storage modes. INSERT ON CONFLICT supports the update modes InsertOrIgnore
, InsertOrReplace
, and InsertOrUpdate
to meet business requirements. The following table describes the differences among the update modes.
Update mode | Description |
InsertOrIgnore | Discards the data whose primary key values already exist. This mode is implemented by using the |
InsertOrUpdate | Updates existing data if you insert data whose primary key values already exist. This update mode involves updates of an entire row and updates of specific columns of a row. If the written data does not cover all columns, the columns that are missing in the new data are not updated. This mode is implemented by using the |
InsertOrReplace | Replaces existing data if you insert data whose primary key values already exist. If the written data does not cover all columns, the null value is inserted into the columns that are missing in the new data. This mode is implemented by using the |
The update performance varies based on the update mode and storage mode of tables.
Update performance of column-oriented tables varies based on the update mode:
The destination table without a primary key has the highest performance.
If the destination table has a primary key, the update modes are listed in descending order by update performance as
InsertOrIgnore > InsertOrReplace ≥ InsertOrUpdate (entire row) > InsertOrUpdate (some columns)
.
Update modes of row-oriented tables are listed in the following descending order by update performance:
InsertOrReplace = InsertOrUpdate (entire row) ≥ InsertOrUpdate (specific columns) ≥ InsertOrIgnore
.
Limits
In the
INSERT ON CONFLICT
statement, you must specify all primary key columns.When you execute the INSERT ON CONFLICT statement by using the Hologres Query Engine (HQE), data is not written in order. Therefore, you cannot implement the Keep First or Keep Last policy. However, we recommend that you use the Keep Last policy if you want to deduplicate primary key values in the data source. Sample statement:
-- Retain the last row in duplicate rows. set hg_experimental_affect_row_multiple_times_keep_last = on;
Examples
This section describes how to use the
INSERT ON CONFLICT
statement.NoteHologres V2.1.17 and later support the Serverless Computing feature. The Serverless Computing feature is suitable for scenarios in which you want to import a large amount of data offline, run large-scale extract, transform, and load (ETL) jobs, or query a large amount of data from foreign tables. You can use the Serverless Computing feature to perform the preceding operations based on additional serverless computing resources. This can eliminate the need to reserve additional computing resources for the instances. This improves instance stability and reduces the occurrences of out of memory (OOM) errors. You are charged only for the additional serverless computing resources used by tasks. For more information about the Serverless Computing feature, see Overview of Serverless Computing. For more information about how to use the Serverless Computing feature, see User guide on Serverless Computing.
Create a table and insert data into the table.
begin ; create table test1 ( a int NOT NULL PRIMARY KEY, b int, c int ); commit ; insert into test1 values (1,2,3);
Examples in different scenarios:
NoteThe results in the following scenarios are independent of each other and have no sequential relationship. The table that is created in the preceding step is used in the following scenarios.
Scenario 1: Implement the InsertOrIgnore mode. Data is not updated if the inserted primary key value already exists.
INSERT INTO test1 (a, b, c) VALUES (1, 1, 1) ON CONFLICT (a) DO NOTHING; -- After the update, the test1 table contains the following data: a b c 1 2 3
Scenario 2: Implement the InsertOrUpdate mode to update an entire row of data by using one of the following methods:
Method 1: List all columns in the
SET..EXCLUDED
clause.INSERT INTO test1 (a, b, c) VALUES (1, 1, 1) ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b, c = EXCLUDED.c; -- After the update, the test1 table contains the following data: a b c 1 1 1
Method 2: Use
ROW(EXCLUDED.*)
to update all columns in a row.INSERT INTO test1 (a, b, c)VALUES (1, 1, 1) ON CONFLICT (a) DO UPDATE SET (a,b,c) = ROW(EXCLUDED.*); -- After the update, the test1 table contains the following data: a b c 1 1 1
Scenario 3: Implement the InsertOrUpdate mode to update specified columns. Only data in specified columns is updated.
-- To update specific columns, specify the columns in the DO UPDATE SET clause. INSERT INTO test1 (a, b, c) VALUES (1, 1, 1) ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b; -- Column c in the table is not updated. After the update, the test1 table contains the following data: a b c 1 1 3
Scenario 4: Implement the InsertOrReplace mode. Insert the null value into columns that are missing in the new data.
-- To implement the InsertOrReplace mode to insert the null value into the columns that are missing in the new data, manually insert the null value. INSERT INTO test1 (a, b,c) VALUES (1, 1,null) ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b,c = EXCLUDED.c; -- After the update, the test1 table contains the following data: a b c 1 1 \N
Scenario 5: Insert data from the test2 table into the test1 table.
-- Create a table named test2 and insert data into the test2 table. CREATE TABLE test2 ( d int NOT NULL PRIMARY KEY, e int, f int ); INSERT INTO test2 VALUES (1, 5, 6); -- Use the test2 table to overwrite the row in the test1 table that shares the same primary key value with the test2 table. INSERT INTO test1 (a, b, c) SELECT d,e,f FROM test2 ON CONFLICT (a) DO UPDATE SET (a,b,c) = ROW (excluded.*); -- After the update, the test1 table contains the following data: a b c 1 5 6 -- Use the test2 table to overwrite the row in the test1 table that shares the same primary key value with the test2 table. The column mapping is changed. The data in Column e of the test2 table is updated to the column c of the test1 table, and data in Column f of the test2 table is updated to the column b of the test1 table. INSERT INTO test1 (a, b, c) SELECT d,e,f FROM test2 ON CONFLICT (a) DO UPDATE SET (a,c,b) = ROW (excluded.*); -- After the update, the test1 table contains the following data: a b c 1 6 5
This section describes how to optimize the
INSERT ON CONFLICT
statement for row-oriented tables.Hologres optimizes the INSERT ON CONFLICT statement for row-oriented tables. We recommend that you specify columns in the DO UPDATE clause in the same order as those in the INSERT clause and update all columns.
INSERT INTO test1 (a, b, c) SELECT d,e,f FROM test2 ON CONFLICT (a) DO UPDATE SET a = excluded.a, b = excluded.b, c = excluded.c; INSERT INTO test1 (a, b, c) SELECT d,e,f FROM test2 ON CONFLICT (a) DO UPDATE SET(a,b,c) = ROW (excluded.*)
Errors and troubleshooting
Problem description
One of the following errors occurs when you execute the
INSERT ON CONFLICT
statement to insert data into a table.Error Message 1:
duplicate key value violates unique constraint
.Error Message 2:
Update row with Key (xxx)=(yyy) multiple times
.Error Message 3 (OOM errors):
Total memory used by all existing queries exceeded memory limitation
.
Cause 1: The source data contains duplicate rows.
Hologres is compatible with PostgreSQL and allows you to use the standard PostgreSQL syntax. If you use the standard PostgreSQL syntax to execute the
INSERT ON CONFLICT
statement to insert data into a table, you must make sure that the source data does not contain duplicate rows. If the source data contains duplicate rows, one of the preceding error messages appears.NoteDuplication in source data indicates that the source data contains duplicate rows. It does not indicate that the destination table contains the same row as the row to be inserted.
The source data that is specified in the
INSERT ON CONFLICT
statement contains duplicate rows. Sample statement:INSERT INTO test1 VALUES (1, 2, 3), (1, 2, 3) ON CONFLICT (a) DO UPDATE SET (a, b, c) = ROW (excluded.*);
Solution:
If the source data contains duplicate rows, run the following command to retain the last duplicate row.
set hg_experimental_affect_row_multiple_times_keep_last = on;
Cause 2: The source table contains duplicate data because the time to live (TTL) expires.
A TTL is configured for the source table. Some data whose TTL expires in the source table is not deleted because the TTL is not exact. This may cause duplicate data in the source table. In this case, if data with duplicate primary key values is written into the destination table, one of the preceding error messages appears.
Solution:
In Hologres V1.3.23 and later, you can use the following statement to fix this issue. After the statement is executed, the system deletes the duplicate primary key values in the table. The default cleanup policy is Keep Last, which indicates that the last row with the duplicate primary key value is retained, and other rows are deleted.
NoteIn principle, a table does not contain duplicate primary key values. Therefore, this statement only clears duplicate primary key values that are caused by TTL expiration.
Hologres V1.3.23 and later support this statement. If the version of your Hologres instance is earlier than V1.3.23, upgrade your Hologres instance.
call public.hg_remove_duplicated_pk('<schema>.<table_name>');
Example: The
tbl_1
table is the destination table. Thetbl_2
table is the source table, and a TTL is configured for the source table. The TTL is set to300 seconds
. Insert an entire row of data in thetbl_2
table into thetbl_1
table. If the TTL expires and thetbl_2
table contains data with duplicate primary key values, data fails to be written and an error occurs.BEGIN; CREATE TABLE tbl_1 ( a int NOT NULL PRIMARY KEY, b int, c int ); CREATE TABLE tbl_2 ( d int NOT NULL PRIMARY KEY, e int, f int ); CALL set_table_property('tbl_2', 'time_to_live_in_seconds', '300'); COMMIT; INSERT INTO tbl_1 VALUES (1, 1, 1), (2, 3, 4); INSERT INTO tbl_2 VALUES (1, 5, 6); -- Insert data into the tbl_2 table after 300 seconds. INSERT INTO tbl_2 VALUES (1, 3, 6); -- Overwrite the row in the tbl_1 table with the row in the tbl_2 table. An error is reported because the tbl_2 table contains data with duplicate primary key values. INSERT INTO tbl_1 (a, b, c) SELECT d,e,f FROM tbl_2 ON CONFLICT (a) DO UPDATE SET (a,b,c) = ROW (excluded.*); -- The following error message is returned: ERROR: internal error: Duplicate keys detected when building hash table. -- Configure the Grand Unified Configuration (GUC) parameter to delete data with duplicate primary key values in the tbl_2 table. The Keep Last policy is used. call public.hg_remove_duplicated_pk('tbl_2'); -- Insert data into the tbl_1 table again. The operation is successful.
Cause 3: The instance has insufficient memory resources to run a job in which you want to write a large amount of data.
Solutions:
We recommend that you use the Serverless Computing feature provided by Hologres to run the job in which you want to write a large amount of data. Hologres V2.1.17 and later support the Serverless Computing feature. The Serverless Computing feature is suitable for scenarios in which you want to import a large amount of data offline, run large-scale extract, transform, and load (ETL) jobs, or query a large amount of data from foreign tables. You can use the Serverless Computing feature to perform the preceding operations based on additional serverless computing resources. This can eliminate the need to reserve additional computing resources for the instances. This improves instance stability and reduces the occurrences of out of memory (OOM) errors. You are charged only for the additional serverless computing resources used by tasks. For more information about the Serverless Computing feature, see Overview of Serverless Computing. For more information about how to use the Serverless Computing feature, see User guide on Serverless Computing.
You can use the solutions provided in FAQ about OOM.