Insert data into or overwrite data in a table or a static partition (INSERT INTO and INSERT OVERWRITE) - MaxCompute

MaxCompute allows you to execute the INSERT INTO or INSERT OVERWRITE statement to insert data into or overwrite data in a table or a static partition.

You can execute the statements on the following platforms:

Prerequisites

Before you execute the INSERT INTO or INSERT OVERWRITE statement, make sure that you are granted the Update permission on the destination table and the Select permission on the metadata of the source table. For more information, see MaxCompute permissions.

Features

When you use MaxCompute SQL to process data, you can execute the INSERT INTO or INSERT OVERWRITE statement to save the execution results of the SELECT statements to the destination table. Differences between the two statements:

INSERT INTO: inserts data into a table or a static partition of a table. You can specify the values of partition key columns in this statement to insert data into a specified partition. If you want to insert a small amount of test data, you can use this statement with VALUES.
INSERT OVERWRITE: clears a specified table or static partitions and inserts data into the table or the static partitions of the table.
Note
- The INSERT syntax in MaxCompute is different from that in MySQL or Oracle. You must add the TABLE keyword and table_name to INSERT OVERWRITE. You do not need to add the TABLE keyword to INSERT INTO.
- If you execute the INSERT OVERWRITE statement on the same partition several times, the size of the partition into which data is inserted may be different each time you run the DESC command. This is because the logic to split files changes after you execute the SELECT and INSERT OVERWRITE statements in sequence for the same partition in a table. After you execute the INSERT OVERWRITE statement, the total length of data remains the same. This does not affect the storage fees.

For information about how to insert data into a dynamic partition, see Insert or overwrite data into dynamic partitions (DYNAMIC PARTITION).

Limits

When you execute the INSERT INTO statement or the INSERT OVERWRITE statement to insert or update data into a table or a static partition of a table, take note of the following limits:
- INSERT INTO: This statement cannot be used to insert data into a clustered table.
- INSERT OVERWRITE: This statement does not allow you to specify the columns into which you want to insert data. You can execute the INSERT INTO statement to specify the columns. For example, if you execute create table t(a string, b string); insert into t(a) values ('1');, 1 is inserted into Column a, and NULL or the default value is inserted into Column b.
- MaxCompute does not provide the locking mechanism for the tables for which INSERT operations are being performed. We recommend that you do not execute the INSERT INTO statement or the INSERT OVERWRITE statement for a table at the same time.

The INSERT INTO statement and the INSERT OVERWRITE statement have the following limits on Delta tables:
- When you execute the INSERT OVERWRITE statement to insert data into a Delta table, the system removes duplicate data from multiple rows of data that have the same primary key value and inserts only the row that is ranked first into the table during the computing process. You cannot specify the data record that will be inserted into the table. The INSERT OVERWRITE statement inserts full data into the Delta table. Default deduplication can ensure the uniqueness of primary key values.
- When you execute the INSERT INTO statement to insert data into a Delta table, the system does not deduplicate data that has the same primary key value but inserts all the data into the table by default. If you set Flag(odps.sql.insert.acidtable.deduplicate.enable) to true, the system deduplicates the data first.

Syntax

insert {into|overwrite} table <table_name> [partition (<pt_spec>)] [(<col_name> [,<col_name> ...)]]
<select_statement>
from <from_statement>
[zorder by <zcol_name> [, <zcol_name> ...]];

table_name: required. The names of the tables into which you want to insert data.
pt_spec: optional. The partition into which you want to insert data. Only constants are supported. Expressions, such as functions, are not supported. The value of this parameter is in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format.
col_name: optional. The name of the column in the table into which you want to insert data. INSERT OVERWRITE does not allow you to specify [(<col_name> [,<col_name> ...)].
select_statement: required. The SELECT clause that is used to query the data that you want to insert into the destination table from the source table. For more information about the SELECT clause, see SELECT syntax.
Note
- The mappings between the source and destination tables are based on the column sequence in select_statement, instead of the mappings between column names in the tables.
- If the destination table has static partitions and you want to insert data into a static partition, you cannot include partition key columns in select_statement.
from_statement: required. The FROM clause. This clause specifies the data source. For example, you can specify the name of a source table in this clause.
zorder by <zcol_name> [, <zcol_name> ...]: optional. If you write data to a table or a partition, you can use this clause to co-locate rows that have similar data records based on the columns specified in select_statement. This improves filtering performance for queries and reduces storage costs. The ORDER BY x, y clause sorts data records based on the sequence of x coming before y. The ZORDER BY x, y clause co-locates rows that have similar x values and rows that have similar y values. For column-based data filtering and sorting in an SQL SELECT statement, the ORDER BY clause filters and sorts data based on x, whereas the ZORDER BY clause filters and sorts data based on x or based on both x and y. This increases the column compression ratio.

The ZORDER BY clause supports two modes: local Z-Ordering and global Z-Ordering. The default mode is local Z-Ordering. In local Z-Ordering mode, only data in a single file, instead of global data, is sorted based on the ZORDER BY clause. If data is distributed in multiple files, data is loosely aggregated. As a result, data skipping may be performed in a less effective manner. To resolve this issue, MaxCompute of the latest version supports the global Z-Ordering mode.
- Local Z-Ordering.
- Global Z-Ordering: To use this mode, you must add the configuration set odps.sql.default.zorder.type=global;.
The SORT BY clause specifies the method used to sort data in a single file. If you do not specify the SORT BY clause, data in a single file is sorted in local Z-Ordering mode.
When you use the ZORDER BY clause, take note of the following limits:
- You can execute the ZORDER BY clause to sort data in only one partition in a partitioned table at a time.
- The number of fields for which you execute the ZORDER BY clause must range from 2 to 4.
- If the destination table is a clustered table, the ZORDER BY clause is not supported.
- ZORDER BY can be used together with DISTRIBUTE BY but cannot be used together with ORDER BY, CLUSTER BY, or SORT BY.
Note
If you use the ZORDER BY clause to insert data, more resources and time are consumed.

Examples: common tables

Example 1: Execute the INSERT INTO statement to append data to a non-partitioned table named websites. Sample statements:

-- Create a non-partitioned table named websites. 
create table if not exists websites
(id int,
 name string,
 url string
);
-- Create a non-partitioned table named apps.
create table if not exists apps
(id int,
 app_name string,
 url string
);
-- Append data to the apps table. The abbreviated form of INSERT INTO TABLE table_name is INSERT INTO table_name.
insert into apps (id,app_name,url) values 
(1,'Aliyun','https://www.aliyun.com');
-- Copy data from the apps table and append the data to the websites table.
insert into websites (id,name,url) select id,app_name,url
from  apps;
-- Execute the SELECT statement to view data in the websites table. 
select * from websites;
-- The following result is returned: 
+------------+------------+------------+
| id         | name       | url        |
+------------+------------+------------+
| 1          | Aliyun     | https://www.aliyun.com |
+------------+------------+------------+

Example 2: Execute the INSERT INTO statement to append data to a partitioned table named sale_detail. Sample statements:

-- Create a partitioned table named sale_detail. 
create table if not exists sale_detail
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);

-- Add a partition to the sale_detail table. This operation is optional. If you do not create a partition in advance, a partition is automatically created when you write data to the table. 
alter table sale_detail add partition (sale_date='2013', region='china');

-- Append data to the sale_detail table. The abbreviated form of INSERT INTO TABLE table_name is INSERT INTO table_name. The TABLE keyword in INSERT OVERWRITE TABLE table_name cannot be omitted. 
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);

-- Enable a full table scan only for the current session. Execute the SELECT statement to view data in the sale_detail table. 
set odps.sql.allow.fullscan=true; 
select * from sale_detail;

-- The following result is returned: 
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

Example 3: Execute the INSERT OVERWRITE statement to overwrite the data in the sale_detail_insert table. Sample statements:

-- Create the sale_detail_insert table that has the same schema as the sale_detail table. 
create table sale_detail_insert like sale_detail;

-- Add a partition to the sale_detail_insert table. This operation is optional. If you do not create a partition in advance, a partition is automatically created when you write data to the table. 
alter table sale_detail_insert add partition (sale_date='2013', region='china');

-- Extract data from the sale_detail table and insert the data into the sale_detail_insert table. Names of partition key columns in the sale_detail_insert table do not need to be declared and cannot be rearranged. 
-- If the sale_detail_insert table contains static partitions, the values of partition key columns are declared in PARTITION(). These values do not need to be included in select_statement. You need to search for column names based on only the sequence of common columns in the sale_detail_insert table and sequentially map the declared column values to the columns in the sale_detail_insert table. If the sale_detail_insert table contains dynamic partitions, the names of partition key columns must be included in select_statement. For more information, see Insert or overwrite data into dynamic partitions (DYNAMIC PARTITION). 
insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
  select 
  shop_name, 
  customer_id,
  total_price 
  from sale_detail
  zorder by customer_id, total_price;

-- Enable a full table scan only for the current session. Execute the SELECT statement to view the data in the sale_detail_insert table. 
set odps.sql.allow.fullscan=true;
select * from sale_detail_insert;

-- The following result is returned: 
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

Example 4: Execute the INSERT OVERWRITE statement to overwrite the data in the sale_detail_insert table and adjust the sequence of columns in select_statement. The mappings between the source and destination tables are based on the sequence of columns in select_statement, instead of the mappings between column names in the two tables. Sample statements:
```
insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
    select customer_id, shop_name, total_price from sale_detail;    

set odps.sql.allow.fullscan=true;
select * from sale_detail_insert;                  
```
The following result is returned:
```
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| c1         | s1          | 100.1       | 2013       | china      |
| c2         | s2          | 100.2       | 2013       | china      |
| c3         | s3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+
```
When you create the sale_detail_insert table, the column sequence is defined as shop_name string, customer_id string, and then total_price bigint. However, data is inserted from the sale_detail table into the sale_detail_insert table based on the sequence of customer_id, shop_name, and then total_price. As a result, the data in the sale_detail.customer_id column is inserted into the sale_detail_insert.shop_name column, and the data in the sale_detail.shop_name column is inserted into the sale_detail_insert.customer_id column.
Example 5: If you insert data into a partition, the partition key columns of the partition cannot be included in select_statement. After the following statement is executed, an error is returned. This is because sale_date and region are partition key columns. These columns cannot be included in select_statement if the INSERT OVERWRITE or INSERT INTO statement is used to insert or update data into a static partition. Sample statement of incorrect usage:
```
insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
   select shop_name, customer_id, total_price, sale_date, region from sale_detail;
```

Example 6: pt_spec in PARTITION() must be constants instead of expressions. Sample statement of incorrect usage:

insert overwrite table sale_detail_insert partition (sale_date=datepart('2016-09-18 01:10:00', 'yyyy') , region='china')
   select shop_name, customer_id, total_price from sale_detail;

Example 7: Execute the INSERT OVERWRITE statement to overwrite data in the mf_src table and data to the mf_zorder_src table and sort data in the mf_zorder_src table in global Z-Ordering mode. Sample statements:

-- Create the mf_src table, insert data into the table, and then query data from the table. 
create table mf_src (key string, value string);

insert overwrite table mf_src
select a, b from values ('1', '1'),('3', '3'),('2', '2')
as t(a, b);

select * from mf_src;

-- The following result is returned:
+-----+-------+
| key | value |
+-----+-------+
| 1   | 1     |
| 3   | 3     |
| 2   | 2     |
+-----+-------+

-- Create the mf_zorder_src table that has the same schema as the mf_src table. 
create table mf_zorder_src like mf_src;

-- Insert data into the mf_zorder_src table, sort data in the table in global Z-Ordering mode, and then query data from the table. 
set odps.sql.default.zorder.type=global;
insert overwrite table mf_zorder_src
select key, value from mf_src 
zorder by key, value;

select * from mf_zorder_src;

-- The following result is returned:
+-----+-------+
| key | value |
+-----+-------+
| 1   | 1     |
| 2   | 2     |
| 3   | 3     |
+-----+-------+

Example 8: Execute the INSERT OVERWRITE statement to overwrite data in an existing table named target. Sample statements:

-- Overwrite data in an existing table named target.
set odps.sql.default.zorder.type=global;
insert overwrite table target
select key, value from target 
zorder by key, value;

Examples: Delta tables

Create a Delta table named mf_dt and execute the INSERT statements to insert data into and overwrite data in the table.

-- Create a Delta table named mf_dt. 
create table if not exists mf_dt (pk bigint not null primary key, 
                  val bigint not null) 
                  partitioned by (dd string, hh string) 
                  tblproperties ("transactional"="true");
                  
-- Insert test data into the partition whose partition key columns dd and hh are both 01 in the mf_dt table. 
insert overwrite table mf_dt partition (dd='01', hh='01') 
                 values (1, 1), (2, 2), (3, 3);

-- Query data from the destination partition in the mf_dt table.                
select * from mf_dt where dd='01' and hh='01';
-- The following result is returned:
+------------+------------+----+----+
| pk         | val        | dd | hh |
+------------+------------+----+----+
| 1          | 1          | 01 | 01 |
| 3          | 3          | 01 | 01 |
| 2          | 2          | 01 | 01 |
+------------+------------+----+----+

-- Execute the INSERT INTO statement to append data to the destination partition in the mf_dt table. 
insert into table mf_dt partition(dd='01', hh='01') 
            values (3, 30), (4, 4), (5, 5);
            
select * from mf_dt where dd='01' and hh='01';
-- The following result is returned:
+------------+------------+----+----+
| pk         | val        | dd | hh |
+------------+------------+----+----+
| 1          | 1          | 01 | 01 |
| 3          | 30         | 01 | 01 |
| 4          | 4          | 01 | 01 |
| 5          | 5          | 01 | 01 |
| 2          | 2          | 01 | 01 |
+------------+------------+----+----+

-- Execute the INSERT OVERWRITE statement to overwrite data in the destination partition of the mf_dt table. 
insert overwrite table mf_dt partition (dd='01', hh='01') 
                 values (1, 1), (2, 2), (3, 3);
select * from mf_dt where dd='01' and hh='02';
-- The following result is returned: 
+------------+------------+----+----+
| pk         | val        | dd | hh |
+------------+------------+----+----+
| 1          | 1          | 01 | 01 |
| 3          | 3          | 01 | 01 |
| 2          | 2          | 01 | 01 |
+------------+------------+----+----+

-- Execute the INSERT INTO statement to write data to the partition whose partition key column dd is 01 and partition key column hh is 02 in the mf_dt table. 
insert overwrite table mf_dt partition (dd='01', hh='02') 
                 values (1, 11), (2, 22), (3, 32);
select * from mf_dt where dd='01' and hh='02';
-- The following result is returned: 
+------------+------------+----+----+
| pk         | val        | dd | hh |
+------------+------------+----+----+
| 1          | 11         | 01 | 02 |
| 3          | 32         | 01 | 02 |
| 2          | 22         | 01 | 02 |
+------------+------------+----+----+

-- Enable a full table scan only for the current session. Execute the SELECT statement to query data in the mf_dt table. 
set odps.sql.allow.fullscan=true;
select * from mf_dt;
-- The following result is returned: 
+------------+------------+----+----+
| pk         | val        | dd | hh |
+------------+------------+----+----+
| 1          | 11         | 01 | 02 |
| 3          | 32         | 01 | 02 |
| 2          | 22         | 01 | 02 |
| 1          | 1          | 01 | 01 |
| 3          | 3          | 01 | 01 |
| 2          | 2          | 01 | 01 |
+------------+------------+----+----+

Best practices

The Z-Ordering feature is not suitable for all business scenarios. MaxCompute does not provide a guideline on whether and how to use the Z-Ordering feature. In most cases, you must determine whether to use the Z-Ordering feature based on your business requirements. You must also comprehensively evaluate whether the additional computing costs that are generated when you sort data by using the Z-Ordering feature is less than the costs that are reduced in terms of storage and downstream consumption and computing. The following descriptions provide suggestions on using the Z-Ordering feature based on experience. You can also provide your suggestions and feedback.

Scenarios in which you need to preferentially use clustered indexing rather than Z-Ordering

If a filter condition is composed of combinations of prefixes, such as a, the combination of a and b, or the combination of a, b, and c, clustered indexing (ORDER BY a, b, c) is more effective than the ZORDER BY clause. The ORDER BY clause provides a better sorting effect on the first field but has small impacts on the remaining fields. The ZORDER BY clause provides the same weight for each field, and its sorting result on a specific field is less effective than the sorting result of the ORDER BY clause on the first column.
If some fields are frequently used as join keys, range clustering or hash clustering is more suitable for the fields. The Z-Ordering feature of MaxCompute can be used to sort data only in a single file. The SQL engine is imperceptible to data that is sorted by using the Z-Ordering feature but is perceptible to data that is sorted by using clustered indexing. You can better optimize the performance of joins by using clustered indexing in the stage of query planning.
If you want to frequently execute the GROUP BY or ORDER BY clause for specific fields, clustered indexing can provide better performance.

Suggestions on using the Z-Ordering feature

Select the fields that are frequently used in filter conditions, especially fields that are frequently joined for filtering.
Make sure that the number of fields for which you execute the ZORDER BY clause does not exceed 4. If you execute the ZORDER BY clause on a large number of fields, the sorting performance for each field becomes poor. If only one field needs to be sorted, we recommend that you use clustered indexing rather than Z-Ordering.
The number of distinct values of the fields that are selected cannot be excessively large or small. For example, the gender field has only two distinct values. In this case, sorting of the values is meaningless. However, if a field has an excessively large number of distinct values, sorting of the values generates high costs because the ZORDER BY clause needs to cache all values of the field in the memory to calculate the Z value.
Make sure that the data amount of the table that you want to sort is not excessively large or small. If the data amount of the table that you want to sort is excessively small, the effect of sorting by using Z-Ordering is not apparent. If the data amount of the table that you want to sort is excessively large, high costs are generated when data is sorted by using Z-Ordering. For example, the output of a baseline task may be delayed.