All Products
Search
Document Center

MaxCompute:UPDATE and DELETE

Last Updated:Dec 04, 2024

MaxCompute allows you to execute the DELETE or UPDATE statement to delete or update data of specific rows in transactional tables.

Prerequisites

You are granted the Select and Update permissions on the transactional table on which you want to execute the DELETE or UPDATE statement. For more information, see MaxCompute permissions.

Description

Similar to traditional SQL statements, the DELETE and UPDATE statements in MaxCompute can be used to delete or update the data of specific rows in tables.

Each time you execute the DELETE or UPDATE statement, a delta file is automatically generated to store information about the delete or update operation. This file is invisible to users. The following items describe how the delta files are generated.

  • DELETE: A delta file contains the txnid and rowid fields, both of which have a value of the BIGINT type. The rowid field indicates the deleted row in the base files of the transactional table. The txnid field indicates the delete operation that is performed on the row.

    For example, a base file of the t1 table is f1 and the content of the base file is a, b, c, a, b. When you execute the delete from t1 where c1='a'; statement, a delta file named f1.delta is generated. If the value of the txnid field is t0, the content of the f1.delta file is ((0, t0), (3, t0)). This indicates that the rows whose IDs are 0 and 3 are deleted from the t0 transaction. If you execute another DELETE statement on the t1 table again, another delta file named f2.delta is generated. The file name is generated based on the f1 base file. When you query the data in the t1 table, the system filters the deleted data based on the f1, f1.delta, and f2.delta files and returns the data that is not deleted.

  • UPDATE: The logic of an UPDATE statement is converted into the logic of executing the DELETE and INSERT INTO statements.

The DELETE and UPDATE statements have the following benefits:

  • Reduce the amount of data to be written

    Before the DELETE and UPDATE statements are provided, MaxCompute allows you to execute only the INSERT INTO or INSERT OVERWRITE statement to update or delete data in tables. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE). If you want to execute an INSERT statement when you update a small amount of data in a table or a partition of the table, you must first execute a SELECT statement to read all data from the table and update the data. Then, you can execute the INSERT statement to insert all data into the table. This method is inefficient. However, if you use the DELETE or UPDATE statement in the preceding scenario, the system does not need to write all data in the table. This reduces the amount of data to be written.

    Note
    • If you use the pay-as-you-go billing method, you are not charged for the write operations that are performed by executing the DELETE, UPDATE, or INSERT OVERWRITE statement. However, when you execute the DELETE or UPDATE statement, MaxCompute must filter data by partition and read the data that you want to delete or update. You are charged for the read operations based on the pay-as-you-go billing method of SQL jobs. Therefore, compared with the INSERT OVERWRITE statement, the DELETE or UPDATE statement does not help you reduce costs.

    • If you use the subscription billing method, fewer resources are consumed to write data when you execute the DELETE or UPDATE statement. Compared with the INSERT OVERWRITE statement, the DELETE or UPDATE statement allows you to run more jobs when the same amount of resources is used.

  • Read the table with the latest data

    Before the DELETE and UPDATE statements are provided, MaxCompute allows you to use history tables to update multiple data entries in a table. If you use a history table, you must add auxiliary columns such as start_date and end_date in the table. These columns indicate the lifecycle of a data entry. To query the latest data of a table, the system must identify the latest data from large amounts of data based on the timestamps. This process is time-consuming. However, you can execute the DELETE or UPDATE statement to delete or update data. When you query the data in a table, the system reads the latest data of the table based on the base files and all delta files.

Important

After you execute the DELETE and UPDATE statements multiple times on a transactional table, the transactional table occupies a larger storage space. In this case, the costs of the storage and subsequent queries on the table increase. In addition, the efficiency of subsequent queries is reduced. To resolve these issues, we recommend that you execute the ALTER TABLE COMPACT statement to merge the base files with all delta files on a regular basis. For more information, see ALTER TABLE COMPACT.

If multiple jobs are run in parallel on a table, conflicts may occur. For more information, see ACID semantics.

Scenarios

You can execute the DELETE or UPDATE statement to delete or update a small amount of data in tables or partitions of the tables at a low frequency. For example, you can execute the statement to delete or update less than 5% of the data in a table or a partition of the table on the next day after the data is generated.

The DELETE or UPDATE statement is not applicable if you want to delete or update data at a high frequency or if you want to write data to tables in real time.

Limits

  • The DELETE statement, UPDATE statement, and a transactional table or Delta table on which the DELETE or UPDATE statement is executed have the following limits:

    • Before you execute the UPDATE, DELETE, or INSERT OVERWRITE statement on important data in transactional tables, you must execute the SELECT and INSERT statements to back up the data to other tables.

    • You cannot execute the UPDATE statement to change values in the primary key column of a Delta table.

Precautions

When you execute the DELETE or UPDATE statement to delete or update data in tables or partitions of the tables, take note of the following items:

  • In specific scenarios, you may want to execute the DELETE or UPDATE statement for a small amount of data in a table and infrequently perform read and other operations in subsequent procedures. To reduce the storage space that is occupied by the table, we recommend that you merge the base files with all delta files after you execute the DELETE or UPDATE statement for the table several times. For more information, see ALTER TABLE COMPACT.

  • In specific scenarios, you may want to delete or update more than 5% of the data in a table or a partition of the table at a low frequency and perform frequent read operations in subsequent procedures. We recommend that you execute the INSERT OVERWRITE or INSERT INTO statement in such scenarios. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE).

    For example, you want to perform delete or update operations for 10% of data 10 times each day. In this case, we recommend that you estimate the total cost and the consumption of the subsequent read performance if you execute the DELETE or UPDATE statement on the table. Then, compare the estimated result with that of executing the INSERT OVERWRITE or INSERT INTO statement. This helps you choose an efficient method.

  • Each time you execute the DELETE statement on a table, a delta file is automatically generated. As a result, the occupied storage space may not be reduced. If you want to execute the DELETE statement to delete data to reduce storage usage, you can merge the base files with all delta files. For more information, see ALTER TABLE COMPACT.

  • MaxCompute executes multiple DELETE and UPDATE statements in jobs at a time. Each statement consumes resources and incurs fees. We recommend that you delete or update a batch of data at a time. For example, if you run a Python script to generate and submit a large number of row-level update jobs, and each statement is executed for only one row or a small number of rows of data, each statement incurs fees that correspond to the amount of input data scanned by the SQL statement and consumes the related computing resources. When multiple statements are accumulated, the costs are significantly increased and the system efficiency is reduced. Sample statements:

    • We recommend that you execute the following statement:

      UPDATE table1 SET col1= (SELECT value1 FROM table2 WHERE table1.id = table2.id AND table1.region = table2.region);
    • We recommend that you do not execute the following statements:

      UPDATE table1 SET col1=1 WHERE id='2021063001'AND region='beijing';                  
      UPDATE table1 SET col1=2 WHERE id='2021063002'AND region='beijing';

DELETE

You can execute the DELETE statement to delete one or more rows that meet the specified conditions from a transactional table or Delta table.

  • Syntax

    delete from <table_name> [where <where_condition>];
  • Parameters

    • table_name: required. The name of the transactional table or Delta table on which you want to execute the DELETE statement.

    • where_condition: optional. A WHERE clause that is used to filter data based on conditions. For more information. see WHERE clause (where_condition). If you execute the DELETE statement on a table without a WHERE clause, all data in the table is deleted.

  • Examples

    • Example 1: Create a non-partitioned transactional table named acid_delete and insert data into the table. Then, execute the DELETE statement to delete the rows that meet the specified conditions from the table. Sample statements:

      -- Create a non-partitioned transactional table named acid_delete. 
      create table if not exists acid_delete(id bigint) tblproperties ("transactional"="true"); 
      
      -- Insert data into the table. 
      insert overwrite table acid_delete values(1),(2),(3),(2); 
      
      -- Query the table to check whether data is inserted. 
      select * from acid_delete; 
      -- The following result is returned:
      +------------+
      | id         |
      +------------+
      | 1          |
      | 2          |
      | 3          |
      | 2          |
      +------------+
      
      -- Delete the rows whose value of the id column is 2. If you execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. 
      delete from acid_delete where id = 2; 
      
      -- Query the table to check whether the table contains only the rows whose values of the id column are 1 and 3. 
      select * from acid_delete; 
      -- The following result is returned:
      +------------+
      | id         |
      +------------+
      | 1          |
      | 3          |
      +------------+
    • Example 2: Create a partitioned transactional table named acid_delete_pt and insert data into the table. Then, execute the DELETE statement to delete the rows that meet the specified conditions from the table. Sample statements:

      -- Create a partitioned transactional table named acid_delete_pt.  
      create table if not exists acid_delete_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");
      
      -- Add partitions to the table. 
      alter table acid_delete_pt add if not exists partition (ds= '2019');
      alter table acid_delete_pt add if not exists partition (ds= '2018');
      
      -- Insert data into the partitions. 
      insert overwrite table acid_delete_pt partition (ds='2019') values(1),(2),(3);
      insert overwrite table acid_delete_pt partition (ds='2018') values(1),(2),(3);
      
      -- Query the table to check whether data is inserted. 
      select * from acid_delete_pt;
      -- The following result is returned:
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 1          | 2018       |
      | 2          | 2018       |
      | 3          | 2018       |
      | 1          | 2019       |
      | 2          | 2019       |
      | 3          | 2019       |
      +------------+------------+
      
      -- Delete the rows whose values of the id and ds columns are 2 and 2019. If you execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. 
      delete from acid_delete_pt where ds='2019' and id = 2;
      
      -- Query the table to check whether the rows whose values of the id and ds columns are 2 and 2019 are deleted. 
      select * from acid_delete_pt;
      -- The following result is returned:
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 1          | 2018       |
      | 2          | 2018       |
      | 3          | 2018       |
      | 1          | 2019       |
      | 3          | 2019       |
      +------------+------------+
    • Example 3: Create a destination table named acid_delete_t and an associated table named acid_delete_s. Then, delete the rows that meet the specified conditions from the destination table based on the associated table. Sample statements:

      -- Create a destination table named acid_delete_t and an associated table named acid_delete_s. 
      create table if not exists acid_delete_t(id int,value1 int,value2 int) tblproperties ("transactional"="true");
      create table if not exists acid_delete_s(id int,value1 int,value2 int);
      
      -- Insert data into the tables. 
      insert overwrite table acid_delete_t values(2,20,21),(3,30,31),(4,40,41);
      insert overwrite table acid_delete_s values(1,100,101),(2,200,201),(3,300,301);
      
      -- Delete the rows in the acid_delete_t table whose value of the id column does not match that of the rows in the acid_delete_s table. If you want to execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. 
      delete from acid_delete_t where not exists (select * from acid_delete_s where acid_delete_t.id=acid_delete_s.id);
      
      -- Query the acid_delete_t table to check whether the table contains only the rows whose values of the id column are 2 and 3. 
      select * from acid_delete_t;
      
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      +------------+------------+------------+
    • Example 4: Create a Delta table named mf_dt, import data to the table, and then execute the DELETE statement to delete rows that meet the specified conditions from the table. Sample statements:

      -- Create a Delta table named mf_dt. 
      create table if not exists mf_dt (pk bigint not null primary key, 
                        val bigint not null) 
                        partitioned by (dd string, hh string) 
                        tblproperties ("transactional"="true");
      
      -- Insert data into the table.
      insert overwrite table mf_dt partition (dd='01', hh='02') values (1, 1), (2, 2), (3, 3);
      
      -- Query the table to check whether data is inserted.                 
      select * from mf_dt where dd='01' and hh='02';
      
      -- The following result is returned:
      +------------+------------+----+----+
      | pk         | val        | dd | hh |
      +------------+------------+----+----+
      | 1          | 1          | 01 | 02 |
      | 3          | 3          | 01 | 02 |
      | 2          | 2          | 01 | 02 |
      +------------+------------+----+----+
      
      -- Delete the row whose value of the val column is 2 from the dd='01' and hh='02' partition. 
      delete from mf_dt where val = 2  and dd='01' and hh='02';
      
      -- Query the result table. Only the rows whose values of the val column are 1 and 3 exist in the specified partition.
      select * from mf_dt where dd='01' and hh='02';
      
      -- The following result is returned:
      +------------+------------+----+----+
      | pk         | val        | dd | hh |
      +------------+------------+----+----+
      | 1          | 1          | 01 | 02 |
      | 3          | 3          | 01 | 02 |
      +------------+------------+----+----+

ALTER TABLE CLEAR COLUMN

You can execute the ALTER TABLE CLEAR COLUMN statement to clear data in columns that are no longer required in a common table from a disk and set the values in the columns to null. This can reduce storage costs.

  • Syntax

    ALTER TABLE <table_name> 
               [partition ( <pt_spec>[, <pt_spec>....] )] 
    					 CLEAR COLUMN column1[, column2, column3, ...]
    			                         [without touch];
  • Parameters

    • table_name: the name of the table on which you want to execute the ALTER TABLE CLEAR COLUMN statement.

    • column1 , column2 ... : the names of columns whose data needs to be cleared.

    • partition: the partition on which you want to execute the ALTER TABLE CLEAR COLUMN statement. If no partition is specified, the statement is executed on all partitions in the table.

    • pt_spec: the description of the partition. Configure this parameter in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format.

    • without touch: If you configure this parameter, the time when the data was last modified is not updated. Otherwise, the time is updated.

    Note

    By default, the without touch parameter is configured. You can also delete data from columns in the future even if the without touch parameter is not configured. If you do not configure the without touch parameter, the time when the data was last modified is updated.

  • Limits

    • The ALTER TABLE CLEAR COLUMN statement cannot be executed for columns that have the non-nullable property. You can remove the non-nullable property for columns.

      alter table <table_name> change column <old_col_name> null;
    • The ALTER TABLE CLEAR COLUMN statement cannot be executed on atomicity, consistency, isolation, durability (ACID) tables.

    • The ALTER TABLE CLEAR COLUMN statement cannot be executed on clustered tables.

    • The ALTER TABLE CLEAR COLUMN statement cannot be executed to clear data in the subcolumns within a column of the NESTED type.

    • The ALTER TABLE CLEAR COLUMN statement cannot be executed to clear data in all columns of a table. If you want to clear data in all columns of a table, you can execute the DROP TABLE statement, which provides better performance.

  • Precautions

    • The ALTER TABLE CLEAR COLUMN statement does not change the Archive property of a table.

    • You may fail to execute the ALTER TABLE CLEAR COLUMN statement to clear data in a column of the NESTED type.

      The failure occurs if you clear nested data that is in column-oriented storage mode but the mode is disabled for the data.

    • The execution of the ALTER TABLE CLEAR COLUMN statement depends on online storage services. If you have a large number of jobs to run, the statement may need to queue and require a long period of time to complete.

    • When you execute the ALTER TABLE CLEAR COLUMN statement, computing resources are consumed to perform data read and write operations. If you use the subscription billing method, your computing resources are occupied. If you use the pay-as-you-go billing method, you are charged for executing the ALTER TABLE CLEAR COLUMN statement based on the same billing rule as an SQL job. The ALTER TABLE CLEAR COLUMN statement for the pay-as-you-go billing method is in invitational preview. Users who use this billing method can execute the statement free of charge.

  • Examples

    • -- Create a table.
      create table if not exists mf_cc(key string, 
           value string, 
           a1 BIGINT , 
           a2 BIGINT , 
           a3 BIGINT , 
           a4 BIGINT)  
           partitioned by(ds string, hr string);
                       
      -- Add a partition to the table.
      alter table mf_cc add if not exists partition (ds='20230509', hr='1641');
      
      -- Insert data into the partition.
      insert into mf_cc partition (ds='20230509', hr='1641') 
           values("key","value",1,22,3,4);
           
      -- Query the partition to check whether data is inserted.     
      select * from mf_cc where ds='20230509' and  hr='1641';
      -- The following result is returned:
      +-----+-------+------------+------------+--------+------+---------+-----+
      | key | value | a1         | a2         | a3     | a4   | ds      | hr  |
      +-----+-------+------------+------------+--------+------+---------+-----+
      | key | value | 1          | 22         | 3      | 4    | 20230509| 1641|
      +-----+-------+------------+------------+--------+------+---------+-----+
      -- Clear data from columns.
      alter table mf_cc partition(ds='20230509', hr='1641') 
                        clear column key,a1 
          without touch;
      -- Query data from the columns.     
      select * from mf_cc where ds='20230509' and  hr='1641';
      -- The following result is returned. The values of the key and a1 columns become null.
      +-----+-------+------------+------------+--------+------+---------+-----+
      | key | value | a1         | a2         | a3     | a4   | ds      | hr  |
      +-----+-------+------------+------------+--------+------+---------+-----+
      | null| value | null       | 22         | 3      | 4    | 20230509| 1641|
      +-----+-------+------------+------------+--------+------+---------+-----+
    • The following figure shows the change of the total storage space of the lineitem table whose all columns are cleared one by one by executing the ALTER TABLE CLEAR COLUMN statement multiple times. The data storage format of the lineitem table is AliORC. The lineitem table contains 16 columns, which are of the BIGINT, DECIMAL, CHAR, DATE, and VARCHAR data types.image.png

      After the ALTER TABLE CLEAR COLUMN statement is repeatedly executed to change the values of all columns in the table to null in sequence, the storage space that is occupied by the table changes from 186,783,526 bytes to 236,715 bytes, reduced by 99.97%.

      Note
      • The size of the storage space that is released by executing the ALTER TABLE CLEAR COLUMN statement is related to the data types of columns and the values in the columns. In this example, the storage space of the DECIMAL-type l_extendedprice column is changed from 146,538,799 bytes to 111,138,117 bytes, reduced by 24.2%. This is significantly greater than the average percentage of the storage space that is released.

      • After the values of all columns are set to null, the storage space that is occupied by the table becomes 236,715 bytes, instead of 0. This is because the structures of files generated for the table still exist. Null values also occupy a small amount of storage space, and the system retains the footer information of the files.

UPDATE

You can execute the UPDATE statement to update the values of one or more columns of the rows that meet the specified conditions in transactional tables or Delta tables.

  • Syntax

    -- Method 1
    update <table_name> set <col1_name> = <value1> [, <col2_name> = <value2> ...] [WHERE <where_condition>];
    
    -- Method 2
    update <table_name> set (<col1_name> [, <col2_name> ...]) = (<value1> [, <value2> ...])[WHERE <where_condition>];
    
    -- Method 3
    UPDATE <table_name>
           SET <col1_name> = <value1> [ , <col2_name> = <value2> , ... ]
            [ FROM <additional_tables> ]
            [ WHERE <where_condition> ]
  • Parameters

    • table_name: required. The name of the transactional table on which you want to execute the UPDATE statement.

    • col1_name and col2_name: the columns that you want to update. You must specify at least one column. This parameter specifies the names of the columns that contain the row that meets specified filter conditions.

    • value1 and value2: the new values that you want to assign to the specified columns. You must update the value of at least one column. This parameter specifies the new values of the columns corresponding to the row that meets specified filter conditions.

    • where_condition: optional. A WHERE clause that is used to filter data based on conditions. For more information, see WHERE clause (where_condition). If you execute the UPDATE statement on a table without a WHERE clause, all data in the table is updated.

    • additional_tables: optional. A from clause.

      The UPDATE statement can be used with the from clause. The from clause makes the use of the UPDATE statement more convenient. The following table describes the UPDATE statements that are used with and without the from clause.

      Scenario

      Sample code

      Without the from clause

      update target set v =
        (select min(v) from src group by k where target.k = src.key)
        where target.k in (select k from src);                                 

      With the from clause

      update target set v = b.v
        from (select k, min(v) v from src group by k) b
        where target.k = b.k;

      The following conclusions are obtained based on the preceding sample code:

      • If you use multiple rows of data in the source table to update one row of data in the destination table, you must write aggregate operations to ensure the uniqueness of the data source. The code of aggregate operations is simpler and easier to understand when you use the from clause than that when you do not use the from clause.

      • If you need to only update the intersection data during join operations, you must use a from clause or a where clause. The where clause is more complex than the from clause.

  • Examples

    • Example 1: Create a non-partitioned table named acid_update and insert data into the table. Then, execute the UPDATE statement to update the columns of the rows that meet the specified conditions in the table. Sample statements:

      -- Create a non-partitioned table named acid_update.   
      create table if not exists acid_update(id bigint) tblproperties ("transactional"="true");
      
      -- Insert data into the table. 
      insert overwrite table acid_update values(1),(2),(3),(2);
      
      -- Query the table to check whether data is inserted. 
      select * from acid_update; 
      
      -- The following result is returned:
      +------------+
      | id         |
      +------------+
      | 1          |
      | 2          |
      | 3          |
      | 2          |
      +------------+
      
      -- Update the value 2 in the id column to 4. 
      update acid_update set id = 4 where id = 2; 
      
      -- Query the partition to check whether the value 2 is updated to 4 in the id column. 
      select * from acid_update; 
      
      -- The following result is returned:
      +------------+
      | id         |
      +------------+
      | 1          |
      | 3          |
      | 4          |
      | 4          |
      +------------+
    • Example 2: Create a partitioned table named acid_update and insert data into the table. Then, execute the UPDATE statement to update the columns of the rows that meet the specified conditions in the table. Sample statements:

      -- Create a partitioned table named acid_update_pt.  
      create table if not exists acid_update_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");
      
      -- Add a partition to the table. 
      alter table acid_update_pt add if not exists partition (ds= '2019');
      
      -- Insert data into the partition. 
      insert overwrite table acid_update_pt partition (ds='2019') values(1),(2),(3);
      
      -- Query the partition to check whether data is inserted.
      select * from acid_update_pt where ds = '2019';
      
      -- The following result is returned:
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 1          | 2019       |
      | 2          | 2019       |
      | 3          | 2019       |
      +------------+------------+
      
      -- Update the value 2 of the id column in the 2019 partition to 4. 
      update acid_update_pt set id = 4 where ds = '2019' and id = 2; 
      
      -- Query the partition to check whether the value 2 is updated to 4 in the id column. 
      select * from acid_update_pt where ds = '2019';
      
      -- The following result is returned:
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 4          | 2019       |
      | 1          | 2019       |
      | 3          | 2019       |
      +------------+------------+
    • Example 3: Create a transactional table named acid_update_t that you want to update and an associated table named acid_update_s. Then, update the values of multiple columns at a time in the acid_update_t table. Sample statements:

      -- Create a transactional table named acid_update_t that you want to update and an associated table named acid_update_s. 
      create table if not exists acid_update_t(id int,value1 int,value2 int) tblproperties ("transactional"="true");
      create table if not exists acid_update_s(id int,value1 int,value2 int);
      
      -- Insert data into the tables. 
      insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41);
      insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301);
      
      -- Method 1: Update the values of specific columns with constants. 
      update acid_update_t set (value1, value2) = (60,61);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 60         | 61         |
      | 3          | 60         | 61         |
      | 4          | 60         | 61         |
      +------------+------------+------------+
      
      -- Method 2: Use the data in the acid_update_s table to update all rows in the acid_update_t table. If specific rows in the acid_update_t table cannot be matched, null values are assigned to the rows. 
      update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      | 4          | NULL       | NULL       |
      +------------+------------+------------+
      
      -- Method 3: Use the data in the acid_update_s table to update the acid_update_t table. Add filter conditions for the acid_update_t table to update only the rows that intersect with those in the acid_update_s table. 
      update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id) where acid_update_t.id in (select id from acid_update_s);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      | 4          | NULL       | NULL       |
      +------------+------------+------------+
      
      -- Method 4: Use the aggregate results of the acid_update_t and acid_update_s tables to update the acid_update_t table. 
      update acid_update_t set (id, value1, value2) = (select id, max(value1),max(value2) from acid_update_s where acid_update_t.id = acid_update_s.id group by acid_update_s.id) where acid_update_t.id in (select id from acid_update_s);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      | 4          | NULL       | NULL       |
      +------------+------------+------------+
    • Example 4: Perform join operations on two tables. Sample statements:

      -- Create a destination table named acid_update_t and create a table named acid_update_s for joins.
      create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true");
      create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint);
      
      -- Insert data into these tables.
      insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41);
      insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301);
      
      -- Query data from the acid_update_t table.
      select * from acid_update_t;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      | 4          | 40         | 41         |
      +------------+------------+------------+
      
      -- Query data from the acid_update_s table.
      select * from acid_update_s;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 1          | 100        | 101        |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      +------------+------------+------------+
      
      -- Use the data in the acid_update_s table to update the acid_update_t table. Add filter conditions for the acid_update_t table to update only the intersection of the two tables.
      update acid_update_t set value1 = b.value1, value2 = b.value2
      from acid_update_s b where acid_update_t.id = b.id;
      
      -- Value 20 is updated to 200, value 21 is updated to 201, value 30 is updated to 300, and value 31 is updated to 301.
      select * from acid_update_t;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 4          | 40         | 41         |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      +------------+------------+------------+
                                  
    • Example 5: Perform complex join operations on more than two tables. Sample statements:

      -- Create a destination table named acid_update_t and create a table named acid_update_s for joins.
      create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true");
      create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint);
      create table if not exists acid_update_m(id bigint,value1 bigint,value2 bigint);
      
      -- Insert data into these tables.
      insert overwrite table acid_update_t
      values(2,20,21),(3,30,31),(4,40,41),(5,50,51);
      insert overwrite table acid_update_s
      values (1,100,101),(2,200,201),(3,300,301),(4,400,401),(5,500,501);
      insert overwrite table acid_update_m
      values(3,30,101),(4,400,201),(5,300,301);
      
      -- Query data from the acid_update_t table.
      select * from acid_update_t;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      | 4          | 40         | 41         |
      | 5          | 50         | 51         |
      +------------+------------+------------+
      
      -- Query data from the acid_update_s table.
      select * from acid_update_s;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 1          | 100        | 101        |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      | 4          | 400        | 401        |
      | 5          | 500        | 501        |
      +------------+------------+------------+
      
      -- Query data from the acid_update_m table.
      select * from acid_update_m;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 3          | 30         | 101        |
      | 4          | 400        | 201        |
      | 5          | 300        | 301        |
      +------------+------------+------------+
      
      -- Use the data in the acid_update_s table to update the acid_update_t table. Use the where clause to filter values in the acid_update_s, acid_update_t, and acid_update_m tables.
      update acid_update_t
      set value1 = acid_update_s.value1, value2 = acid_update_s.value2
      from acid_update_s
      where acid_update_t.id = acid_update_s.id
      and acid_update_s.id > 2
      and acid_update_t.value1 not in
      (select value1 from acid_update_m where id = acid_update_t.id)
      and acid_update_s.value1 not in
      (select value1 from acid_update_m where id = acid_update_s.id);
      
      -- View the update result. Only the row of data that contains the value of 5 in the id column of the acid_update_t table meets the condition. The value in the value1 column is updated to 500, and the value in the value2 column is updated to 501.
      select * from acid_update_t;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 5          | 500        | 501        |
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      | 4          | 40         | 41         |
      +------------+------------+------------+
    • Example 6: Create a Delta table named mf_dt, import data into the table, and then execute the UPDATE statement to delete rows that meet the specified conditions from the table. Sample statements:

      -- Create a Delta table named mf_dt. 
      create table if not exists mf_dt (pk bigint not null primary key, 
                        val bigint not null) 
                        partitioned by (dd string, hh string) 
                        tblproperties ("transactional"="true");
      
      -- Insert data into the table.
      insert overwrite table mf_dt partition (dd='01', hh='02') 
                       values (1, 1), (2, 2), (3, 3);
      
      -- Query the table to check whether data is inserted.                 
      select * from mf_dt where dd='01' and hh='02';
      -- The following result is returned:
      +------------+------------+----+----+
      | pk         | val        | dd | hh |
      +------------+------------+----+----+
      | 1          | 1          | 01 | 02 |
      | 3          | 3          | 01 | 02 |
      | 2          | 2          | 01 | 02 |
      +------------+------------+----+----+
      
      -- Update the value of the val column in the row whose value of the pk column is 3 in the dd='01' and hh='02' partition to 30. 
      -- Method 1
      update mf_dt set val = 30 where pk = 3 and dd='01' and hh='02';
      
      -- Method 2
      update mf_dt set val = delta.val 
                    from (select pk, val from values (3, 30) t (pk, val)) delta 
                    where delta.pk = mf_dt.pk and mf_dt.dd='01' and mf_dt.hh='02';
      
      -- View the update result. 
      select * from mf_dt where dd='01' and hh='02';
      -- The following result is returned. The value of the val column in the row whose value of the pk column is 3 is updated to 30. 
      +------------+------------+----+----+
      | pk         | val        | dd | hh |
      +------------+------------+----+----+
      | 1          | 1          | 01 | 02 |
      | 3          | 30         | 01 | 02 |
      | 2          | 2          | 01 | 02 |
      +------------+------------+----+----+

ALTER TABLE COMPACT

The base files and delta files of a transactional table occupy the physical storage. You cannot directly read such files. If you execute the UPDATE or DELETE statement on a transactional table, the data in the base files is not updated, but a delta file is generated for each operation. In this case, the more delete or update operations are performed, the more storage space the table occupies. The number of delta files increases. As a result, you are charged more for storage usage and subsequent queries.

If you execute the UPDATE or DELETE statement on a table or a partition of the table multiple times, a large number of delta files are generated. When the system reads data from the table, the system loads the delta files to identify the deleted and updated rows. A large number of delta files reduce the efficiency of reading data. In this case, you can merge the base files with the delta files to reduce storage usage and improve the read efficiency.

  • Syntax

    alter table <table_name> [partition (<partition_key> = '<partition_value>' [, ...])] compact {minor|major};
  • Parameters

    • table_name: required. The name of the transactional table for which you want to merge the base files and delta files.

    • partition_key: optional. The name of a partition key column in the partitioned transactional table.

    • partition_value: optional. The value of a partition key column in the partitioned transactional table.

    • major|minor: One of them must be specified. Differences between minor compaction and major compaction:

      • minor: merges each base file with all delta files that are generated based on the base file and deletes the delta files.

      • major: merges each base file with all delta files that are generated based on the base file, deletes the delta files, and merges small base files. If the size of a base file is less than 32 MB or a delta file is generated, the effect of merging files is equivalent to the effect of executing the INSERT OVERWRITE statement. However, if the size of a base file is greater than or equal to 32 MB and no delta files are generated, the data of the table is not overwritten.

  • Precaution

    The small files merged through the Compact operation will be deleted after 1 day. If you use the Backup and restoration function to recover historical data that depends on these small files, the recovery will fail due to their absence.

  • Examples

    • Example 1: Merge files of the acid_delete table. Sample statement:

      alter table acid_delete compact minor;

      The following result is returned:

      Summary:
      Nothing found to merge, set odps.merge.cross.paths=true if cross path merge is permitted.
      OK
    • Example 2: Merge files of the acid_update_pt table. Sample statement:

      alter table acid_update_pt partition (ds = '2019') compact major;

      The following result is returned:

      Summary:
      table name: acid_update_pt /ds=2019  instance count: 2  run time: 6
        before merge, file count:        8  file size: 2613  file physical size: 7839
         after merge, file count:        2  file size: 679  file physical size: 2037
      
      OK

FAQ

  • Issue 1:

    • Problem description: When I execute the UPDATE statement, the following error message is returned: ODPS-0010000:System internal error - fuxi job failed, caused by: Data Set should contain exactly one row.

    • Cause: The rows that you want to update do not match the rows that are queried by subqueries. In this case, the system cannot determine which rows need to be updated. Sample statement:

      update store set (s_county, s_manager) = (select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk) where s_store_sk in (select s_store_sk from store_delta);

      The select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk subquery is used to filter the data in the store_delta table. Then, the data that meets the specified condition is used to update the data of the store table. For example, three rows that have the s_store_sk column in the store table are [1, 2, 3]. If the rows that have the s_store_sk column in the store_delta table are [1, 1] and do not match the rows in the store table, the preceding error message is returned.

    • Solution: Make sure that the rows that you want to update exactly match the rows that are queried by subqueries.

  • Issue 2:

    • Problem description: When I run the compact command in DataWorks DataStudio, the following error message is returned: ODPS-0130161:[1,39] Parse exception - invalid token 'minor', expect one of 'StringLiteral','DoubleQuoteStringLiteral'.

    • Cause: The MaxCompute client version that corresponds to the exclusive resource group of DataWorks does not support the compact command.

    • Solution: Join the DataWorks DingTalk group and contact the technical support team to update the MaxCompute client version that corresponds to the exclusive resource group.