MaxCompute allows you to execute the DELETE
or UPDATE
statement to delete or update data of specific rows in transactional tables.
You can execute the statements on the following platforms:
Prerequisites
You are granted the Select and Update permissions on the transactional table on which you want to execute the DELETE
or UPDATE
statement. For more information, see MaxCompute permissions.
Description
Similar to traditional SQL statements, the DELETE
and UPDATE
statements in MaxCompute can be used to delete or update the data of specific rows in tables.
Each time you execute the DELETE
or UPDATE
statement, a delta file is automatically generated to store information about the delete or update operation. This file is invisible to users. The following items describe how the delta files are generated.
DELETE
: A delta file contains thetxnid
androwid
fields, both of which have a value of the BIGINT type. The rowid field indicates the deleted row in the base files of the transactional table. The txnid field indicates the delete operation that is performed on the row.For example, a base file of the t1 table is f1 and the content of the base file is
a, b, c, a, b
. When you execute thedelete from t1 where c1='a';
statement, a delta file namedf1.delta
is generated. If the value of thetxnid
field ist0
, the content of thef1.delta
file is((0, t0), (3, t0))
. This indicates that the rows whose IDs are 0 and 3 are deleted from the t0 transaction. If you execute anotherDELETE
statement on the t1 table again, another delta file namedf2.delta
is generated. The file name is generated based on the f1 base file. When you query the data in the t1 table, the system filters the deleted data based on the f1, f1.delta, and f2.delta files and returns the data that is not deleted.UPDATE
: The logic of anUPDATE
statement is converted into the logic of executing theDELETE
andINSERT INTO
statements.
The DELETE
and UPDATE
statements have the following benefits:
Reduce the amount of data to be written
Before the DELETE and UPDATE statements are provided, MaxCompute allows you to execute only the
INSERT INTO
orINSERT OVERWRITE
statement to update or delete data in tables. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE). If you want to execute anINSERT
statement when you update a small amount of data in a table or a partition of the table, you must first execute aSELECT
statement to read all data from the table and update the data. Then, you can execute theINSERT
statement to insert all data into the table. This method is inefficient. However, if you use theDELETE
orUPDATE
statement in the preceding scenario, the system does not need to write all data in the table. This reduces the amount of data to be written.NoteIf you use the pay-as-you-go billing method, you are not charged for the write operations that are performed by executing the
DELETE
,UPDATE
, orINSERT OVERWRITE
statement. However, when you execute theDELETE
orUPDATE
statement, MaxCompute must filter data by partition and read the data that you want to delete or update. You are charged for the read operations based on the pay-as-you-go billing method of SQL jobs. Therefore, compared with theINSERT OVERWRITE
statement, theDELETE
orUPDATE
statement does not help you reduce costs.If you use the subscription billing method, fewer resources are consumed to write data when you execute the
DELETE
orUPDATE
statement. Compared with theINSERT OVERWRITE
statement, the DELETE or UPDATE statement allows you to run more jobs when the same amount of resources is used.
Read the table with the latest data
Before the DELETE and UPDATE statements are provided, MaxCompute allows you to use history tables to update multiple data entries in a table. If you use a history table, you must add auxiliary columns such as
start_date
andend_date
in the table. These columns indicate the lifecycle of a data entry. To query the latest data of a table, the system must identify the latest data from large amounts of data based on the timestamps. This process is time-consuming. However, you can execute theDELETE
orUPDATE
statement to delete or update data. When you query the data in a table, the system reads the latest data of the table based on the base files and all delta files.
After you execute the DELETE
and UPDATE
statements multiple times on a transactional table, the transactional table occupies a larger storage space. In this case, the costs of the storage and subsequent queries on the table increase. In addition, the efficiency of subsequent queries is reduced. To resolve these issues, we recommend that you execute the ALTER TABLE COMPACT statement to merge the base files with all delta files on a regular basis. For more information, see ALTER TABLE COMPACT.
If multiple jobs are run in parallel on a table, conflicts may occur. For more information, see ACID semantics.
Scenarios
You can execute the DELETE
or UPDATE
statement to delete or update a small amount of data in tables or partitions of the tables at a low frequency. For example, you can execute the statement to delete or update less than 5% of the data in a table or a partition of the table on the next day after the data is generated.
The DELETE
or UPDATE
statement is not applicable if you want to delete or update data at a high frequency or if you want to write data to tables in real time.
Limits
The
DELETE
statement,UPDATE
statement, and a transactional table or Delta table on which the DELETE or UPDATE statement is executed have the following limits:Before you execute the
UPDATE
,DELETE
, orINSERT OVERWRITE
statement on important data in transactional tables, you must execute theSELECT
andINSERT
statements to back up the data to other tables.You cannot execute the
UPDATE
statement to change values in the primary key column of a Delta table.
Precautions
When you execute the DELETE
or UPDATE
statement to delete or update data in tables or partitions of the tables, take note of the following items:
In specific scenarios, you may want to execute the DELETE or UPDATE statement for a small amount of data in a table and infrequently perform read and other operations in subsequent procedures. To reduce the storage space that is occupied by the table, we recommend that you merge the base files with all delta files after you execute the
DELETE
orUPDATE
statement for the table several times. For more information, see ALTER TABLE COMPACT.In specific scenarios, you may want to delete or update more than 5% of the data in a table or a partition of the table at a low frequency and perform frequent read operations in subsequent procedures. We recommend that you execute the
INSERT OVERWRITE
orINSERT INTO
statement in such scenarios. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE).For example, you want to perform delete or update operations for 10% of data 10 times each day. In this case, we recommend that you estimate the total cost and the consumption of the subsequent read performance if you execute the
DELETE
orUPDATE
statement on the table. Then, compare the estimated result with that of executing theINSERT OVERWRITE
orINSERT INTO
statement. This helps you choose an efficient method.Each time you execute the DELETE statement on a table, a delta file is automatically generated. As a result, the occupied storage space may not be reduced. If you want to execute the
DELETE
statement to delete data to reduce storage usage, you can merge the base files with all delta files. For more information, see ALTER TABLE COMPACT.MaxCompute executes multiple
DELETE
andUPDATE
statements in jobs at a time. Each statement consumes resources and incurs fees. We recommend that you delete or update a batch of data at a time. For example, if you run a Python script to generate and submit a large number of row-level update jobs, and each statement is executed for only one row or a small number of rows of data, each statement incurs fees that correspond to the amount of input data scanned by the SQL statement and consumes the related computing resources. When multiple statements are accumulated, the costs are significantly increased and the system efficiency is reduced. Sample statements:We recommend that you execute the following statement:
UPDATE table1 SET col1= (SELECT value1 FROM table2 WHERE table1.id = table2.id AND table1.region = table2.region);
We recommend that you do not execute the following statements:
UPDATE table1 SET col1=1 WHERE id='2021063001'AND region='beijing'; UPDATE table1 SET col1=2 WHERE id='2021063002'AND region='beijing';
DELETE
You can execute the DELETE
statement to delete one or more rows that meet the specified conditions from a transactional table or Delta table.
Syntax
delete from <table_name> [where <where_condition>];
Parameters
table_name: required. The name of the transactional table or Delta table on which you want to execute the
DELETE
statement.where_condition: optional. A WHERE clause that is used to filter data based on conditions. For more information. see WHERE clause (where_condition). If you execute the DELETE statement on a table without a WHERE clause, all data in the table is deleted.
Examples
Example 1: Create a non-partitioned transactional table named acid_delete and insert data into the table. Then, execute the
DELETE
statement to delete the rows that meet the specified conditions from the table. Sample statements:-- Create a non-partitioned transactional table named acid_delete. create table if not exists acid_delete(id bigint) tblproperties ("transactional"="true"); -- Insert data into the table. insert overwrite table acid_delete values(1),(2),(3),(2); -- Query the table to check whether data is inserted. select * from acid_delete; -- The following result is returned: +------------+ | id | +------------+ | 1 | | 2 | | 3 | | 2 | +------------+ -- Delete the rows whose value of the id column is 2. If you execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. delete from acid_delete where id = 2; -- Query the table to check whether the table contains only the rows whose values of the id column are 1 and 3. select * from acid_delete; -- The following result is returned: +------------+ | id | +------------+ | 1 | | 3 | +------------+
Example 2: Create a partitioned transactional table named acid_delete_pt and insert data into the table. Then, execute the
DELETE
statement to delete the rows that meet the specified conditions from the table. Sample statements:-- Create a partitioned transactional table named acid_delete_pt. create table if not exists acid_delete_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true"); -- Add partitions to the table. alter table acid_delete_pt add if not exists partition (ds= '2019'); alter table acid_delete_pt add if not exists partition (ds= '2018'); -- Insert data into the partitions. insert overwrite table acid_delete_pt partition (ds='2019') values(1),(2),(3); insert overwrite table acid_delete_pt partition (ds='2018') values(1),(2),(3); -- Query the table to check whether data is inserted. select * from acid_delete_pt; -- The following result is returned: +------------+------------+ | id | ds | +------------+------------+ | 1 | 2018 | | 2 | 2018 | | 3 | 2018 | | 1 | 2019 | | 2 | 2019 | | 3 | 2019 | +------------+------------+ -- Delete the rows whose values of the id and ds columns are 2 and 2019. If you execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. delete from acid_delete_pt where ds='2019' and id = 2; -- Query the table to check whether the rows whose values of the id and ds columns are 2 and 2019 are deleted. select * from acid_delete_pt; -- The following result is returned: +------------+------------+ | id | ds | +------------+------------+ | 1 | 2018 | | 2 | 2018 | | 3 | 2018 | | 1 | 2019 | | 3 | 2019 | +------------+------------+
Example 3: Create a destination table named acid_delete_t and an associated table named acid_delete_s. Then, delete the rows that meet the specified conditions from the destination table based on the associated table. Sample statements:
-- Create a destination table named acid_delete_t and an associated table named acid_delete_s. create table if not exists acid_delete_t(id int,value1 int,value2 int) tblproperties ("transactional"="true"); create table if not exists acid_delete_s(id int,value1 int,value2 int); -- Insert data into the tables. insert overwrite table acid_delete_t values(2,20,21),(3,30,31),(4,40,41); insert overwrite table acid_delete_s values(1,100,101),(2,200,201),(3,300,301); -- Delete the rows in the acid_delete_t table whose value of the id column does not match that of the rows in the acid_delete_s table. If you want to execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. delete from acid_delete_t where not exists (select * from acid_delete_s where acid_delete_t.id=acid_delete_s.id); -- Query the acid_delete_t table to check whether the table contains only the rows whose values of the id column are 2 and 3. select * from acid_delete_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 20 | 21 | | 3 | 30 | 31 | +------------+------------+------------+
Example 4: Create a Delta table named mf_dt, import data to the table, and then execute the
DELETE
statement to delete rows that meet the specified conditions from the table. Sample statements:-- Create a Delta table named mf_dt. create table if not exists mf_dt (pk bigint not null primary key, val bigint not null) partitioned by (dd string, hh string) tblproperties ("transactional"="true"); -- Insert data into the table. insert overwrite table mf_dt partition (dd='01', hh='02') values (1, 1), (2, 2), (3, 3); -- Query the table to check whether data is inserted. select * from mf_dt where dd='01' and hh='02'; -- The following result is returned: +------------+------------+----+----+ | pk | val | dd | hh | +------------+------------+----+----+ | 1 | 1 | 01 | 02 | | 3 | 3 | 01 | 02 | | 2 | 2 | 01 | 02 | +------------+------------+----+----+ -- Delete the row whose value of the val column is 2 from the dd='01' and hh='02' partition. delete from mf_dt where val = 2 and dd='01' and hh='02'; -- Query the result table. Only the rows whose values of the val column are 1 and 3 exist in the specified partition. select * from mf_dt where dd='01' and hh='02'; -- The following result is returned: +------------+------------+----+----+ | pk | val | dd | hh | +------------+------------+----+----+ | 1 | 1 | 01 | 02 | | 3 | 3 | 01 | 02 | +------------+------------+----+----+
ALTER TABLE CLEAR COLUMN
You can execute the ALTER TABLE CLEAR COLUMN
statement to clear data in columns that are no longer required in a common table from a disk and set the values in the columns to null. This can reduce storage costs.
Syntax
ALTER TABLE <table_name> [partition ( <pt_spec>[, <pt_spec>....] )] CLEAR COLUMN column1[, column2, column3, ...] [without touch];
Parameters
table_name: the name of the table on which you want to execute the ALTER TABLE CLEAR COLUMN statement.
column1 , column2 ...
: the names of columns whose data needs to be cleared.partition: the partition on which you want to execute the ALTER TABLE CLEAR COLUMN statement. If no partition is specified, the statement is executed on all partitions in the table.
pt_spec: the description of the partition. Configure this parameter in the
(partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...)
format.without touch: If you configure this parameter, the
time when the data was last modified
is not updated. Otherwise, thetime
is updated.
NoteBy default, the without touch parameter is configured. You can also delete data from columns in the future even if the without touch parameter is not configured. If you do not configure the without touch parameter, the
time when the data was last modified
is updated.Limits
The ALTER TABLE CLEAR COLUMN statement cannot be executed for columns that have the non-nullable property. You can remove the non-nullable property for columns.
alter table <table_name> change column <old_col_name> null;
The ALTER TABLE CLEAR COLUMN statement cannot be executed on atomicity, consistency, isolation, durability (ACID) tables.
The ALTER TABLE CLEAR COLUMN statement cannot be executed on clustered tables.
The ALTER TABLE CLEAR COLUMN statement cannot be executed to clear data in the subcolumns within a column of the NESTED type.
The ALTER TABLE CLEAR COLUMN statement cannot be executed to clear data in all columns of a table. If you want to clear data in all columns of a table, you can execute the DROP TABLE statement, which provides better performance.
Precautions
The ALTER TABLE CLEAR COLUMN statement does not change the Archive property of a table.
You may fail to execute the ALTER TABLE CLEAR COLUMN statement to clear data in a column of the NESTED type.
The failure occurs if you clear nested data that is in column-oriented storage mode but the mode is disabled for the data.
The execution of the ALTER TABLE CLEAR COLUMN statement depends on online storage services. If you have a large number of jobs to run, the statement may need to queue and require a long period of time to complete.
When you execute the ALTER TABLE CLEAR COLUMN statement, computing resources are consumed to perform data read and write operations. If you use the subscription billing method, your computing resources are occupied. If you use the pay-as-you-go billing method, you are charged for executing the ALTER TABLE CLEAR COLUMN statement based on the same billing rule as an SQL job. The ALTER TABLE CLEAR COLUMN statement for the pay-as-you-go billing method is in invitational preview. Users who use this billing method can execute the statement free of charge.
Examples
-- Create a table. create table if not exists mf_cc(key string, value string, a1 BIGINT , a2 BIGINT , a3 BIGINT , a4 BIGINT) partitioned by(ds string, hr string); -- Add a partition to the table. alter table mf_cc add if not exists partition (ds='20230509', hr='1641'); -- Insert data into the partition. insert into mf_cc partition (ds='20230509', hr='1641') values("key","value",1,22,3,4); -- Query the partition to check whether data is inserted. select * from mf_cc where ds='20230509' and hr='1641'; -- The following result is returned: +-----+-------+------------+------------+--------+------+---------+-----+ | key | value | a1 | a2 | a3 | a4 | ds | hr | +-----+-------+------------+------------+--------+------+---------+-----+ | key | value | 1 | 22 | 3 | 4 | 20230509| 1641| +-----+-------+------------+------------+--------+------+---------+-----+ -- Clear data from columns. alter table mf_cc partition(ds='20230509', hr='1641') clear column key,a1 without touch; -- Query data from the columns. select * from mf_cc where ds='20230509' and hr='1641'; -- The following result is returned. The values of the key and a1 columns become null. +-----+-------+------------+------------+--------+------+---------+-----+ | key | value | a1 | a2 | a3 | a4 | ds | hr | +-----+-------+------------+------------+--------+------+---------+-----+ | null| value | null | 22 | 3 | 4 | 20230509| 1641| +-----+-------+------------+------------+--------+------+---------+-----+
The following figure shows the change of the total storage space of the
lineitem
table whose all columns are cleared one by one by executing the ALTER TABLE CLEAR COLUMN statement multiple times. The data storage format of the lineitem table is AliORC. Thelineitem
table contains 16 columns, which are of the BIGINT, DECIMAL, CHAR, DATE, and VARCHAR data types.After the ALTER TABLE CLEAR COLUMN statement is repeatedly executed to change the values of all columns in the table to null in sequence, the storage space that is occupied by the table changes from 186,783,526 bytes to 236,715 bytes, reduced by 99.97%.
NoteThe size of the storage space that is released by executing the ALTER TABLE CLEAR COLUMN statement is related to the data types of columns and the values in the columns. In this example, the storage space of the DECIMAL-type
l_extendedprice
column is changed from 146,538,799 bytes to 111,138,117 bytes, reduced by 24.2%. This is significantly greater than the average percentage of the storage space that is released.After the values of all columns are set to null, the storage space that is occupied by the table becomes 236,715 bytes, instead of 0. This is because the structures of files generated for the table still exist. Null values also occupy a small amount of storage space, and the system retains the footer information of the files.
UPDATE
You can execute the UPDATE
statement to update the values of one or more columns of the rows that meet the specified conditions in transactional tables or Delta tables.
Syntax
-- Method 1 update <table_name> set <col1_name> = <value1> [, <col2_name> = <value2> ...] [WHERE <where_condition>]; -- Method 2 update <table_name> set (<col1_name> [, <col2_name> ...]) = (<value1> [, <value2> ...])[WHERE <where_condition>]; -- Method 3 UPDATE <table_name> SET <col1_name> = <value1> [ , <col2_name> = <value2> , ... ] [ FROM <additional_tables> ] [ WHERE <where_condition> ]
Parameters
table_name: required. The name of the transactional table on which you want to execute the
UPDATE
statement.col1_name and col2_name: the columns that you want to update. You must specify at least one column. This parameter specifies the names of the columns that contain the row that meets specified filter conditions.
value1 and value2: the new values that you want to assign to the specified columns. You must update the value of at least one column. This parameter specifies the new values of the columns corresponding to the row that meets specified filter conditions.
where_condition: optional. A WHERE clause that is used to filter data based on conditions. For more information, see WHERE clause (where_condition). If you execute the UPDATE statement on a table without a WHERE clause, all data in the table is updated.
additional_tables: optional. A from clause.
The
UPDATE
statement can be used with the from clause. The from clause makes the use of theUPDATE
statement more convenient. The following table describes the UPDATE statements that are used with and without the from clause.Scenario
Sample code
Without the from clause
update target set v = (select min(v) from src group by k where target.k = src.key) where target.k in (select k from src);
With the from clause
update target set v = b.v from (select k, min(v) v from src group by k) b where target.k = b.k;
The following conclusions are obtained based on the preceding sample code:
If you use multiple rows of data in the source table to update one row of data in the destination table, you must write aggregate operations to ensure the uniqueness of the data source. The code of aggregate operations is simpler and easier to understand when you use the from clause than that when you do not use the from clause.
If you need to only update the intersection data during join operations, you must use a from clause or a where clause. The where clause is more complex than the from clause.
Examples
Example 1: Create a non-partitioned table named acid_update and insert data into the table. Then, execute the
UPDATE
statement to update the columns of the rows that meet the specified conditions in the table. Sample statements:-- Create a non-partitioned table named acid_update. create table if not exists acid_update(id bigint) tblproperties ("transactional"="true"); -- Insert data into the table. insert overwrite table acid_update values(1),(2),(3),(2); -- Query the table to check whether data is inserted. select * from acid_update; -- The following result is returned: +------------+ | id | +------------+ | 1 | | 2 | | 3 | | 2 | +------------+ -- Update the value 2 in the id column to 4. update acid_update set id = 4 where id = 2; -- Query the partition to check whether the value 2 is updated to 4 in the id column. select * from acid_update; -- The following result is returned: +------------+ | id | +------------+ | 1 | | 3 | | 4 | | 4 | +------------+
Example 2: Create a partitioned table named acid_update and insert data into the table. Then, execute the
UPDATE
statement to update the columns of the rows that meet the specified conditions in the table. Sample statements:-- Create a partitioned table named acid_update_pt. create table if not exists acid_update_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true"); -- Add a partition to the table. alter table acid_update_pt add if not exists partition (ds= '2019'); -- Insert data into the partition. insert overwrite table acid_update_pt partition (ds='2019') values(1),(2),(3); -- Query the partition to check whether data is inserted. select * from acid_update_pt where ds = '2019'; -- The following result is returned: +------------+------------+ | id | ds | +------------+------------+ | 1 | 2019 | | 2 | 2019 | | 3 | 2019 | +------------+------------+ -- Update the value 2 of the id column in the 2019 partition to 4. update acid_update_pt set id = 4 where ds = '2019' and id = 2; -- Query the partition to check whether the value 2 is updated to 4 in the id column. select * from acid_update_pt where ds = '2019'; -- The following result is returned: +------------+------------+ | id | ds | +------------+------------+ | 4 | 2019 | | 1 | 2019 | | 3 | 2019 | +------------+------------+
Example 3: Create a transactional table named acid_update_t that you want to update and an associated table named acid_update_s. Then, update the values of multiple columns at a time in the acid_update_t table. Sample statements:
-- Create a transactional table named acid_update_t that you want to update and an associated table named acid_update_s. create table if not exists acid_update_t(id int,value1 int,value2 int) tblproperties ("transactional"="true"); create table if not exists acid_update_s(id int,value1 int,value2 int); -- Insert data into the tables. insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41); insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301); -- Method 1: Update the values of specific columns with constants. update acid_update_t set (value1, value2) = (60,61); -- Query the acid_update_t table to check whether data is updated as expected. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 60 | 61 | | 3 | 60 | 61 | | 4 | 60 | 61 | +------------+------------+------------+ -- Method 2: Use the data in the acid_update_s table to update all rows in the acid_update_t table. If specific rows in the acid_update_t table cannot be matched, null values are assigned to the rows. update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id); -- Query the acid_update_t table to check whether data is updated as expected. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 200 | 201 | | 3 | 300 | 301 | | 4 | NULL | NULL | +------------+------------+------------+ -- Method 3: Use the data in the acid_update_s table to update the acid_update_t table. Add filter conditions for the acid_update_t table to update only the rows that intersect with those in the acid_update_s table. update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id) where acid_update_t.id in (select id from acid_update_s); -- Query the acid_update_t table to check whether data is updated as expected. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 200 | 201 | | 3 | 300 | 301 | | 4 | NULL | NULL | +------------+------------+------------+ -- Method 4: Use the aggregate results of the acid_update_t and acid_update_s tables to update the acid_update_t table. update acid_update_t set (id, value1, value2) = (select id, max(value1),max(value2) from acid_update_s where acid_update_t.id = acid_update_s.id group by acid_update_s.id) where acid_update_t.id in (select id from acid_update_s); -- Query the acid_update_t table to check whether data is updated as expected. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 200 | 201 | | 3 | 300 | 301 | | 4 | NULL | NULL | +------------+------------+------------+
Example 4: Perform join operations on two tables. Sample statements:
-- Create a destination table named acid_update_t and create a table named acid_update_s for joins. create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true"); create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint); -- Insert data into these tables. insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41); insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301); -- Query data from the acid_update_t table. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 20 | 21 | | 3 | 30 | 31 | | 4 | 40 | 41 | +------------+------------+------------+ -- Query data from the acid_update_s table. select * from acid_update_s; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 1 | 100 | 101 | | 2 | 200 | 201 | | 3 | 300 | 301 | +------------+------------+------------+ -- Use the data in the acid_update_s table to update the acid_update_t table. Add filter conditions for the acid_update_t table to update only the intersection of the two tables. update acid_update_t set value1 = b.value1, value2 = b.value2 from acid_update_s b where acid_update_t.id = b.id; -- Value 20 is updated to 200, value 21 is updated to 201, value 30 is updated to 300, and value 31 is updated to 301. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 4 | 40 | 41 | | 2 | 200 | 201 | | 3 | 300 | 301 | +------------+------------+------------+
Example 5: Perform complex join operations on more than two tables. Sample statements:
-- Create a destination table named acid_update_t and create a table named acid_update_s for joins. create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true"); create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint); create table if not exists acid_update_m(id bigint,value1 bigint,value2 bigint); -- Insert data into these tables. insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41),(5,50,51); insert overwrite table acid_update_s values (1,100,101),(2,200,201),(3,300,301),(4,400,401),(5,500,501); insert overwrite table acid_update_m values(3,30,101),(4,400,201),(5,300,301); -- Query data from the acid_update_t table. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 2 | 20 | 21 | | 3 | 30 | 31 | | 4 | 40 | 41 | | 5 | 50 | 51 | +------------+------------+------------+ -- Query data from the acid_update_s table. select * from acid_update_s; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 1 | 100 | 101 | | 2 | 200 | 201 | | 3 | 300 | 301 | | 4 | 400 | 401 | | 5 | 500 | 501 | +------------+------------+------------+ -- Query data from the acid_update_m table. select * from acid_update_m; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 3 | 30 | 101 | | 4 | 400 | 201 | | 5 | 300 | 301 | +------------+------------+------------+ -- Use the data in the acid_update_s table to update the acid_update_t table. Use the where clause to filter values in the acid_update_s, acid_update_t, and acid_update_m tables. update acid_update_t set value1 = acid_update_s.value1, value2 = acid_update_s.value2 from acid_update_s where acid_update_t.id = acid_update_s.id and acid_update_s.id > 2 and acid_update_t.value1 not in (select value1 from acid_update_m where id = acid_update_t.id) and acid_update_s.value1 not in (select value1 from acid_update_m where id = acid_update_s.id); -- View the update result. Only the row of data that contains the value of 5 in the id column of the acid_update_t table meets the condition. The value in the value1 column is updated to 500, and the value in the value2 column is updated to 501. select * from acid_update_t; -- The following result is returned: +------------+------------+------------+ | id | value1 | value2 | +------------+------------+------------+ | 5 | 500 | 501 | | 2 | 20 | 21 | | 3 | 30 | 31 | | 4 | 40 | 41 | +------------+------------+------------+
Example 6: Create a Delta table named mf_dt, import data into the table, and then execute the
UPDATE
statement to delete rows that meet the specified conditions from the table. Sample statements:-- Create a Delta table named mf_dt. create table if not exists mf_dt (pk bigint not null primary key, val bigint not null) partitioned by (dd string, hh string) tblproperties ("transactional"="true"); -- Insert data into the table. insert overwrite table mf_dt partition (dd='01', hh='02') values (1, 1), (2, 2), (3, 3); -- Query the table to check whether data is inserted. select * from mf_dt where dd='01' and hh='02'; -- The following result is returned: +------------+------------+----+----+ | pk | val | dd | hh | +------------+------------+----+----+ | 1 | 1 | 01 | 02 | | 3 | 3 | 01 | 02 | | 2 | 2 | 01 | 02 | +------------+------------+----+----+ -- Update the value of the val column in the row whose value of the pk column is 3 in the dd='01' and hh='02' partition to 30. -- Method 1 update mf_dt set val = 30 where pk = 3 and dd='01' and hh='02'; -- Method 2 update mf_dt set val = delta.val from (select pk, val from values (3, 30) t (pk, val)) delta where delta.pk = mf_dt.pk and mf_dt.dd='01' and mf_dt.hh='02'; -- View the update result. select * from mf_dt where dd='01' and hh='02'; -- The following result is returned. The value of the val column in the row whose value of the pk column is 3 is updated to 30. +------------+------------+----+----+ | pk | val | dd | hh | +------------+------------+----+----+ | 1 | 1 | 01 | 02 | | 3 | 30 | 01 | 02 | | 2 | 2 | 01 | 02 | +------------+------------+----+----+
ALTER TABLE COMPACT
The base files and delta files of a transactional table occupy the physical storage. You cannot directly read such files. If you execute the UPDATE
or DELETE
statement on a transactional table, the data in the base files is not updated, but a delta file is generated for each operation. In this case, the more delete or update operations are performed, the more storage space the table occupies. The number of delta files increases. As a result, you are charged more for storage usage and subsequent queries.
If you execute the UPDATE
or DELETE
statement on a table or a partition of the table multiple times, a large number of delta files are generated. When the system reads data from the table, the system loads the delta files to identify the deleted and updated rows. A large number of delta files reduce the efficiency of reading data. In this case, you can merge the base files with the delta files to reduce storage usage and improve the read efficiency.
Syntax
alter table <table_name> [partition (<partition_key> = '<partition_value>' [, ...])] compact {minor|major};
Parameters
table_name: required. The name of the transactional table for which you want to merge the base files and delta files.
partition_key: optional. The name of a partition key column in the partitioned transactional table.
partition_value: optional. The value of a partition key column in the partitioned transactional table.
major|minor: One of them must be specified. Differences between minor compaction and major compaction:
minor
: merges each base file with all delta files that are generated based on the base file and deletes the delta files.major
: merges each base file with all delta files that are generated based on the base file, deletes the delta files, and merges small base files. If the size of a base file is less than 32 MB or a delta file is generated, the effect of merging files is equivalent to the effect of executing theINSERT OVERWRITE
statement. However, if the size of a base file is greater than or equal to 32 MB and no delta files are generated, the data of the table is not overwritten.
Precaution
The small files merged through the Compact operation will be deleted after 1 day. If you use the Backup and restoration function to recover historical data that depends on these small files, the recovery will fail due to their absence.
Examples
Example 1: Merge files of the acid_delete table. Sample statement:
alter table acid_delete compact minor;
The following result is returned:
Summary: Nothing found to merge, set odps.merge.cross.paths=true if cross path merge is permitted. OK
Example 2: Merge files of the acid_update_pt table. Sample statement:
alter table acid_update_pt partition (ds = '2019') compact major;
The following result is returned:
Summary: table name: acid_update_pt /ds=2019 instance count: 2 run time: 6 before merge, file count: 8 file size: 2613 file physical size: 7839 after merge, file count: 2 file size: 679 file physical size: 2037 OK
FAQ
Issue 1:
Problem description: When I execute the
UPDATE
statement, the following error message is returned:ODPS-0010000:System internal error - fuxi job failed, caused by: Data Set should contain exactly one row
.Cause: The rows that you want to update do not match the rows that are queried by subqueries. In this case, the system cannot determine which rows need to be updated. Sample statement:
update store set (s_county, s_manager) = (select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk) where s_store_sk in (select s_store_sk from store_delta);
The
select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk
subquery is used to filter the data in the store_delta table. Then, the data that meets the specified condition is used to update the data of the store table. For example, three rows that have the s_store_sk column in the store table are[1, 2, 3]
. If the rows that have the s_store_sk column in the store_delta table are[1, 1]
and do not match the rows in the store table, the preceding error message is returned.Solution: Make sure that the rows that you want to update exactly match the rows that are queried by subqueries.
Issue 2:
Problem description: When I run the
compact
command in DataWorks DataStudio, the following error message is returned:ODPS-0130161:[1,39] Parse exception - invalid token 'minor', expect one of 'StringLiteral','DoubleQuoteStringLiteral'
.Cause: The MaxCompute client version that corresponds to the exclusive resource group of DataWorks does not support the
compact
command.Solution: Join the DataWorks DingTalk group and contact the technical support team to update the MaxCompute client version that corresponds to the exclusive resource group.