All Products
Search
Document Center

Hologres:INSERT ON CONFLICT(UPSERT)

Last Updated:Aug 22, 2024

This topic describes how to use the INSERT ON CONFLICT statement in Hologres.

Scenarios

The INSERT ON CONFLICT statement is suitable for scenarios in which data is written by executing SQL statements.

If you use the Data Integration service of DataWorks or Realtime Compute for Apache Flink to insert data into a table, you can configure a write conflict processing parameter to determine whether to ignore the data that has the same primary key values as existing data or update the existing data.

  • Use the Data Integration service of DataWorks to write data to Hologres.

    The INSERT ON CONFLICT statement is integrated into Data Integration. For more information about how the statement works, see Hologres Writer. If you need to use Data Integration, you must configure one of the following settings:

    • If you use a batch synchronization node, set the conflictMode parameter to Ignore or Replace.

    • If you use a real-time synchronization node, set the conflictMode parameter to Ignore or Replace.

    Note

    A primary key must be specified for the destination Hologres table. This way, data in the table can be updated during data synchronization.

  • Use Realtime Compute for Apache Flink to write data to Hologres.

    If you use Realtime Compute for Apache Flink to write data to Hologres, the default write conflict policy is InsertOrIgnore. InsertOrIgnore indicates that data whose primary key values appear for the first time is inserted and data whose primary key values already exist is ignored. You must make sure that a primary key is specified when you create the destination Hologres table. If you use the CREATE TABLE AS SELECT (CTAS) syntax, the default write conflict policy is InsertOrUpdate. InsertOrUpdate indicates that existing data in specified fields is replaced by new data.

Syntax

You can use the INSERT ON CONFLICT statement to specify a policy to deal with data whose primary key values already exist when you insert data into a table. The execution result of the INSERT ON CONFLICT statement is similar to that of the UPSERT statement. The INSERT ON CONFLICT statement uses the following syntax:

INSERT INTO <table_name> [ AS <alias> ] [ ( <column_name> [, ...] ) ]
    {  VALUES ( { <expression> } [, ...] ) [, ...] | <query> }
    [ ON CONFLICT [ conflict_target ] conflict_action ]

where conflict_target is pk

    ON CONSTRAINT constraint_name

and conflict_action is one of:

    DO NOTHING
    DO UPDATE SET { <column_name> = { <expression> } |
                    ( <column_name> [, ...] ) = ( { <expression> } [, ...] ) |
                  } [, ...]
              [ WHERE condition ]

The following table describes the parameters in the preceding syntax.

Parameter

Description

table_name

The name of the table into which you want to insert data.

alias

The substitute name for the table into which you want to insert data.

column_name

The name of the column in the table into which you want to insert data.

DO NOTHING

The InsertOrIgnore mode is used. If you insert data whose primary key values already exist, the new data is ignored.

DO UPDATE

The InsertOrUpdate mode is used. If you insert data whose primary key values already exist, the existing data is updated

by using one of the following methods:

  • Data in all columns is updated.

  • Only data in specified columns is updated.

  • If you want to achieve the effect of the InsertOrReplace mode to insert the null value into the columns that are missing in the new data, manually insert the null value. For more information, see the "Examples" section in this topic.

  • Important
    • If a column contains a default value, the DO UPDATE parameter does not update the default value, and the performance is low.

    • If you want to achieve the effect of the InsertOrReplace mode by executing the INSERT ON CONFLICT statement, you need to specify the null value in VALUES in the statement. If you use Realtime Compute for Apache Flink or Data Integration to write data to Hologres and specify InsertOrReplace, the null value is automatically inserted.

expression

The expression that is executed on the specified column. You can refer to the PostgreSQL documentation to configure the expression.

You can use the common expression b=excluded.b or the simplified expression (a, b, c) = ROW (excluded.*) to insert or update data. The left side of the equation indicates the field to be updated. The right side of the equation indicates the inserted expression, which is the VALUES expression or the SELECT expression. excluded is the alias of the inserted expression instead of the alias of the source table. For example, in column_name = excluded.column_name, the column_name parameter specifies the name of the column into which data is inserted. If the column_name parameter specifies the Nth column in the destination table, excluded.column_name indicates the Nth column in the inserted expression. excluded.* specifies to insert data of all columns. You must make sure that the columns in the inserted expression are in the same order as the columns specified in the data definition language (DDL) statement of the destination table.

How it works

The implementation principle of the INSERT ON CONFLICT statement is the same as that of the UPDATE statement. For more information about the UPDATE statement, see UPDATE. Storage modes are classified into row-oriented storage, column-oriented storage, and row-column hybrid storage. The processing method varies based on the storage mode that a table uses during updates. This leads to different update performance for tables that use different storage modes. INSERT ON CONFLICT supports the update modes InsertOrIgnore, InsertOrReplace, and InsertOrUpdate to meet business requirements. The following table describes the differences among the update modes.

Update mode

Description

InsertOrIgnore

Discards the data whose primary key values already exist. This mode is implemented by using the insert on conflict do nothing statement.

InsertOrUpdate

Updates existing data if you insert data whose primary key values already exist. This update mode involves updates of an entire row and updates of specific columns of a row. If the written data does not cover all columns, the columns that are missing in the new data are not updated. This mode is implemented by using the insert on conflict do update statement.

InsertOrReplace

Replaces existing data if you insert data whose primary key values already exist. If the written data does not cover all columns, the null value is inserted into the columns that are missing in the new data. This mode is implemented by using the insert on conflict do update statement and manual configuration of the null value.

The update performance varies based on the update mode and storage mode of tables.

  • Update performance of column-oriented tables varies based on the update mode:

    • The destination table without a primary key has the highest performance.

    • If the destination table has a primary key, the update modes are listed in descending order by update performance as InsertOrIgnore > InsertOrReplace ≥ InsertOrUpdate (entire row) > InsertOrUpdate (some columns).

  • Update modes of row-oriented tables are listed in the following descending order by update performance:

    InsertOrReplace = InsertOrUpdate (entire row) ≥ InsertOrUpdate (specific columns) ≥ InsertOrIgnore.

Limits

  • In the INSERT ON CONFLICT statement, you must specify all primary key columns.

  • When you execute the INSERT ON CONFLICT statement by using the Hologres Query Engine (HQE), data is not written in order. Therefore, you cannot implement the Keep First or Keep Last policy. However, we recommend that you use the Keep Last policy if you want to deduplicate primary key values in the data source. Sample statement:

    -- Retain the last row in duplicate rows.
    set hg_experimental_affect_row_multiple_times_keep_last = on;

Examples

  • This section describes how to use the INSERT ON CONFLICT statement.

    Note

    Hologres V2.1.17 and later support the Serverless Computing feature. The Serverless Computing feature is suitable for scenarios in which you want to import a large amount of data offline, run large-scale extract, transform, and load (ETL) jobs, or query a large amount of data from foreign tables. You can use the Serverless Computing feature to perform the preceding operations based on additional serverless computing resources. This can eliminate the need to reserve additional computing resources for the instances. This improves instance stability and reduces the occurrences of out of memory (OOM) errors. You are charged only for the additional serverless computing resources used by tasks. For more information about the Serverless Computing feature, see Overview of Serverless Computing. For more information about how to use the Serverless Computing feature, see User guide on Serverless Computing.

    1. Create a table and insert data into the table.

      begin ;
      create table test1 (
          a int NOT NULL PRIMARY KEY,
          b int,
          c int
      );
      commit ;
      
      insert into test1 values (1,2,3);
      
    2. Examples in different scenarios:

      Note

      The results in the following scenarios are independent of each other and have no sequential relationship. The table that is created in the preceding step is used in the following scenarios.

      • Scenario 1: Implement the InsertOrIgnore mode. Data is not updated if the inserted primary key value already exists.

        INSERT INTO test1 (a, b, c) VALUES (1, 1, 1)
        ON CONFLICT (a)
        DO NOTHING;
        
        -- After the update, the test1 table contains the following data:
        a	b	c
        1	2	3
      • Scenario 2: Implement the InsertOrUpdate mode to update an entire row of data by using one of the following methods:

        • Method 1: List all columns in the SET..EXCLUDED clause.

          INSERT INTO test1 (a, b, c) VALUES (1, 1, 1)
          ON CONFLICT (a)
          DO UPDATE SET b = EXCLUDED.b, c = EXCLUDED.c;
          
          -- After the update, the test1 table contains the following data:
          a	b	c
          1	1	1
        • Method 2: Use ROW(EXCLUDED.*) to update all columns in a row.

          INSERT INTO test1 (a, b, c)VALUES (1, 1, 1)
          ON CONFLICT (a)
          DO UPDATE SET (a,b,c) = ROW(EXCLUDED.*);
          
          -- After the update, the test1 table contains the following data:
          a	b	c
          1	1	1
      • Scenario 3: Implement the InsertOrUpdate mode to update specified columns. Only data in specified columns is updated.

        -- To update specific columns, specify the columns in the DO UPDATE SET clause.
        INSERT INTO test1 (a, b, c) VALUES (1, 1, 1)
        ON CONFLICT (a)
        DO UPDATE SET b = EXCLUDED.b;
        
        -- Column c in the table is not updated. After the update, the test1 table contains the following data:
        a	b	c
        1	1	3
      • Scenario 4: Implement the InsertOrReplace mode. Insert the null value into columns that are missing in the new data.

        -- To implement the InsertOrReplace mode to insert the null value into the columns that are missing in the new data, manually insert the null value. 
        INSERT INTO test1 (a, b,c) VALUES (1, 1,null)
        ON CONFLICT (a)
        DO UPDATE SET b = EXCLUDED.b,c = EXCLUDED.c;
        
        -- After the update, the test1 table contains the following data:
        a	b	c
        1	1	\N
      • Scenario 5: Insert data from the test2 table into the test1 table.

        -- Create a table named test2 and insert data into the test2 table.
        CREATE TABLE test2 (
            d int NOT NULL PRIMARY KEY,
            e int,
            f int
        );
        INSERT INTO test2 VALUES (1, 5, 6);
        
        
        -- Use the test2 table to overwrite the row in the test1 table that shares the same primary key value with the test2 table.
        INSERT INTO test1 (a, b, c) SELECT d,e,f FROM test2 ON CONFLICT (a) DO UPDATE SET
        (a,b,c) = ROW (excluded.*);
        
        -- After the update, the test1 table contains the following data:
        a	b	c
        1	5	6
        
        -- Use the test2 table to overwrite the row in the test1 table that shares the same primary key value with the test2 table. The column mapping is changed. The data in Column e of the test2 table is updated to the column c of the test1 table, and data in Column f of the test2 table is updated to the column b of the test1 table.
        INSERT INTO test1 (a, b, c) SELECT d,e,f FROM test2 ON CONFLICT (a) DO UPDATE SET
        (a,c,b) = ROW (excluded.*);
        -- After the update, the test1 table contains the following data:       
        a	b	c
        1	6	5
  • This section describes how to optimize the INSERT ON CONFLICT statement for row-oriented tables.

    Hologres optimizes the INSERT ON CONFLICT statement for row-oriented tables. We recommend that you specify columns in the DO UPDATE clause in the same order as those in the INSERT clause and update all columns.

    INSERT INTO test1 (a, b, c) 
    SELECT
        d,e,f
    FROM
        test2
    ON CONFLICT (a)
        DO UPDATE SET
            a = excluded.a,
            b = excluded.b,
            c = excluded.c;
            
    INSERT INTO test1 (a, b, c)
    SELECT d,e,f FROM test2
    ON CONFLICT (a)
    DO UPDATE SET(a,b,c) = ROW (excluded.*)

Errors and troubleshooting

  • Problem description

    One of the following errors occurs when you execute the INSERT ON CONFLICT statement to insert data into a table.

    • Error Message 1: duplicate key value violates unique constraint.

    • Error Message 2: Update row with Key (xxx)=(yyy) multiple times.

    • Error Message 3 (OOM errors): Total memory used by all existing queries exceeded memory limitation.

  • Cause 1: The source data contains duplicate rows.

    Hologres is compatible with PostgreSQL and allows you to use the standard PostgreSQL syntax. If you use the standard PostgreSQL syntax to execute the INSERT ON CONFLICT statement to insert data into a table, you must make sure that the source data does not contain duplicate rows. If the source data contains duplicate rows, one of the preceding error messages appears.

    Note

    Duplication in source data indicates that the source data contains duplicate rows. It does not indicate that the destination table contains the same row as the row to be inserted.

    The source data that is specified in the INSERT ON CONFLICT statement contains duplicate rows. Sample statement:

    INSERT INTO test1 VALUES (1, 2, 3), (1, 2, 3)
    ON CONFLICT (a) 
    DO UPDATE SET (a, b, c) = ROW (excluded.*);

    Solution:

    If the source data contains duplicate rows, run the following command to retain the last duplicate row.

    set hg_experimental_affect_row_multiple_times_keep_last = on;
  • Cause 2: The source table contains duplicate data because the time to live (TTL) expires.

    A TTL is configured for the source table. Some data whose TTL expires in the source table is not deleted because the TTL is not exact. This may cause duplicate data in the source table. In this case, if data with duplicate primary key values is written into the destination table, one of the preceding error messages appears.

    Solution:

    In Hologres V1.3.23 and later, you can use the following statement to fix this issue. After the statement is executed, the system deletes the duplicate primary key values in the table. The default cleanup policy is Keep Last, which indicates that the last row with the duplicate primary key value is retained, and other rows are deleted.

    Note
    • In principle, a table does not contain duplicate primary key values. Therefore, this statement only clears duplicate primary key values that are caused by TTL expiration.

    • Hologres V1.3.23 and later support this statement. If the version of your Hologres instance is earlier than V1.3.23, upgrade your Hologres instance.

    call public.hg_remove_duplicated_pk('<schema>.<table_name>');

    Example: The tbl_1 table is the destination table. The tbl_2 table is the source table, and a TTL is configured for the source table. The TTL is set to 300 seconds. Insert an entire row of data in the tbl_2 table into the tbl_1 table. If the TTL expires and the tbl_2 table contains data with duplicate primary key values, data fails to be written and an error occurs.

    
    BEGIN;
    CREATE TABLE tbl_1 (
      a int NOT NULL PRIMARY KEY,
      b int,
      c int
    );
    CREATE TABLE tbl_2 (
      d int NOT NULL PRIMARY KEY,
      e int,
      f int
    );
    CALL set_table_property('tbl_2', 'time_to_live_in_seconds', '300'); 
    COMMIT;
    
    INSERT INTO tbl_1 VALUES (1, 1, 1), (2, 3, 4);
    INSERT INTO tbl_2 VALUES (1, 5, 6);
    -- Insert data into the tbl_2 table after 300 seconds.
    INSERT INTO tbl_2 VALUES (1, 3, 6);
    
    -- Overwrite the row in the tbl_1 table with the row in the tbl_2 table. An error is reported because the tbl_2 table contains data with duplicate primary key values.
    INSERT INTO tbl_1 (a, b, c)
    SELECT
        d,e,f
    FROM
        tbl_2
    ON CONFLICT (a)
        DO UPDATE SET
            (a,b,c) = ROW (excluded.*);
    -- The following error message is returned: ERROR: internal error: Duplicate keys detected when building hash table.
    
    -- Configure the Grand Unified Configuration (GUC) parameter to delete data with duplicate primary key values in the tbl_2 table. The Keep Last policy is used. 
    call public.hg_remove_duplicated_pk('tbl_2');
    
    -- Insert data into the tbl_1 table again. The operation is successful.
  • Cause 3: The instance has insufficient memory resources to run a job in which you want to write a large amount of data.

    Solutions:

    • We recommend that you use the Serverless Computing feature provided by Hologres to run the job in which you want to write a large amount of data. Hologres V2.1.17 and later support the Serverless Computing feature. The Serverless Computing feature is suitable for scenarios in which you want to import a large amount of data offline, run large-scale extract, transform, and load (ETL) jobs, or query a large amount of data from foreign tables. You can use the Serverless Computing feature to perform the preceding operations based on additional serverless computing resources. This can eliminate the need to reserve additional computing resources for the instances. This improves instance stability and reduces the occurrences of out of memory (OOM) errors. You are charged only for the additional serverless computing resources used by tasks. For more information about the Serverless Computing feature, see Overview of Serverless Computing. For more information about how to use the Serverless Computing feature, see User guide on Serverless Computing.

    • You can use the solutions provided in FAQ about OOM.