ApsaraDB for SelectDB:Import data by using INSERT INTO statements

Last Updated: May 09, 2024

ApsaraDB for SelectDB is compatible with the standard SQL syntax. You can use a standard INSERT INTO statement to import data.

Background information

INSERT INTO is a common method of importing data into databases such as MySQL. ApsaraDB for SelectDB is compatible with the standard SQL syntax and allows you to execute INSERT INTO statements to import data. INSERT INTO statements are divided into the following two types:

  • INSERT INTO tbl SELECT ...

  • INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1, 3, ...);

We recommend that you do not use statements of the second type (INSERT INTO VALUES) in a production environment.

INSERT INTO SELECT statement

The INSERT INTO SELECT statement can efficiently process data in ApsaraDB for SelectDB and in external data lakes by using the large number of SQL functions and the federated query capability of ApsaraDB for SelectDB. The processed data is then imported into a new table in ApsaraDB for SelectDB for further analysis.

Perform ETL operations on data in an internal table

If the data is already stored in a table in ApsaraDB for SelectDB, you can perform extract, transform, and load (ETL) operations on the data by executing the INSERT INTO SELECT statement and then import the data to a new table in ApsaraDB for SelectDB. The following sample code provides an example.

INSERT INTO bj_store_sales
SELECT id, total, user_id, sale_timestamp FROM store_sales WHERE region = "bj";

Synchronize data from a data lake

If data is stored in an external system such as a data lake, you can create a catalog in ApsaraDB for SelectDB, map the catalog to the data in the external system, and then execute the INSERT INTO SELECT statement to import the data to a table in ApsaraDB for SelectDB. ApsaraDB for SelectDB supports data sources such as Hive, Iceberg, Hudi, Elasticsearch, and Java Database Connectivity (JDBC). For more information, see Data lakehouse.

In the following example, a Hive data source is used to show how to synchronize data from a data lake to a table in ApsaraDB for SelectDB.

  • Create a catalog for the Hive data source so that the data can be accessed by using federated queries. Sample code:

CREATE CATALOG test_catalog COMMENT 'hive catalog' PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:7004',
    'dfs.nameservices'='HANN',
    'dfs.ha.namenodes.HANN'='nn1,nn2',
    'dfs.namenode.rpc-address.HANN.nn1'='nn1_host:rpc_port',
    'dfs.namenode.rpc-address.HANN.nn2'='nn2_host:rpc_port',
    'dfs.client.failover.proxy.provider.HANN'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
  • Execute the INSERT INTO SELECT statement to synchronize data from the Hive data source to the table in ApsaraDB for SelectDB, and specify a unique label for the data import job. Sample code:

INSERT INTO bj_store_sales
WITH LABEL test_label
SELECT id, total, user_id, sale_timestamp FROM test_catalog.test_db.store_sales WHERE region = "bj";

INSERT INTO VALUES statement

The INSERT INTO VALUES statement is a common method of writing data to databases such as MySQL databases. We recommend that you use the INSERT INTO VALUES statement only in a test environment. Typically, you can send a request for writing data by using an SQL client or a JDBC program.

The following sample code creates the table in ApsaraDB for SelectDB into which data is imported in the subsequent examples.

CREATE TABLE test_table
(
    id int,
    name varchar(50),
    age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");

Import data by using an SQL client

The following example imports data into the test_table table in multiple batches within a single transaction. The values are for illustration only.

BEGIN;
INSERT INTO test_table VALUES (1, 'zhangsan', 18), (2, 'lisi', 20), (3, 'wangwu', 22);
INSERT INTO test_table VALUES (4, 'zhaoliu', 25), (5, 'sunqi', 30), (6, 'zhouba', 35);
INSERT INTO test_table VALUES (7, 'wujiu', 40), (8, 'zhengshi', 45), (9, 'fengyi', 50);
COMMIT;

Import data by using a JDBC program

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InsertDemo {
    public static void main(String[] args) throws Exception {
        // The number of INSERT statements that are executed in one transaction. 
        int insertNum = 10;
        // The number of data entries that each INSERT statement imports. 
        int batchSize = 10000;
        // The VPC endpoint and MySQL port number of the ApsaraDB for SelectDB instance. You can view them in the console. 
        String url = "jdbc:mysql://<IP address>:<MySQL port number>/test_db?useLocalSessionState=true";
        // The database account and password of the ApsaraDB for SelectDB instance. 
        Connection connection = DriverManager.getConnection(url, "admin", "password");
        Statement statement = connection.createStatement();
        statement.execute("begin");
        // Append multiple INSERT statements. 
        for (int num = 0; num < insertNum; num++) {
            StringBuilder sql = new StringBuilder();
            sql.append("INSERT INTO test_table VALUES ");
            for (int i = 0; i < batchSize; i++) {
                if (i > 0) {
                    sql.append(",");
                }
                // Append a row of data that contains the id, name, and age fields. Modify the fields based on your business requirements. 
                sql.append("(").append(num * batchSize + i).append(",'zhangsan',18)");
            }
            // Add the statement to the batch: INSERT INTO test_table VALUES (...),(...),... 
            statement.addBatch(sql.toString());
        }
        statement.addBatch("commit");
        statement.executeBatch();
        // Close the resources. 
        statement.close();
        connection.close();
    }
}

Best practices

  • Returned results

    An INSERT INTO operation is synchronous: the operation is complete when a result is returned. You need to handle the returned result, as described in the following cases.

    1. The INSERT INTO operation is successful and the result set is empty.

      If the result set of the SELECT clause in the INSERT INTO statement is empty, a result similar to the following output is returned:

      INSERT INTO tbl1 SELECT * FROM empty_tbl;
      Query OK, 0 rows affected (0.02 sec)

      Query OK indicates that the operation is successful. 0 rows affected indicates that no data is imported.

    2. The INSERT INTO operation is successful and the result set is not empty.

      If the result set of the SELECT clause is not empty, a result similar to one of the following outputs is returned:

      INSERT INTO tbl1 SELECT * FROM tbl2;
      Query OK, 4 rows affected (0.38 sec)
      {'label':'insert_8510c568-9eda-****-9e36-6adc7d35291c', 'status':'visible', 'txnId':'4005'}
       
      INSERT INTO tbl1 with label my_label1 SELECT * FROM tbl2;
      Query OK, 4 rows affected (0.38 sec)
      {'label':'my_label1', 'status':'visible', 'txnId':'4005'}
       
      INSERT INTO tbl1 SELECT * FROM tbl2;
      Query OK, 2 rows affected, 2 warnings (0.31 sec)
      {'label':'insert_f0747f0e-7a35-****-affa-13a235f4020d', 'status':'visible', 'txnId':'4005'}
       
      INSERT INTO tbl1 SELECT * FROM tbl2;
      Query OK, 2 rows affected, 2 warnings (0.31 sec)
      {'label':'insert_f0747f0e-7a35-****-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'}

      Query OK indicates that the operation is successful. 4 rows affected indicates that four rows of data are imported. 2 warnings indicates that two rows of data are filtered out. A JSON string is also returned.

      {'label':'my_label1', 'status':'visible', 'txnId':'4005'}
      {'label':'insert_f0747f0e-7a35-****-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'}
      {'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'}

      label indicates the label that you specified or that was automatically generated. The label identifies the data import job and is unique within a database. status indicates whether the imported data is visible: visible is returned if the data is visible, and committed is returned if the data is not yet visible. txnId indicates the ID of the import transaction that corresponds to the INSERT operation. err indicates an unexpected error, if any.

      To query the rows that are filtered out, execute the following statement:

      SHOW LOAD WHERE label="xxx";

      The URL in the returned results can be used to query the error data. For more information, see the error row summary in this topic. The imported data may be temporarily invisible, but it eventually becomes visible. You can execute the following statement to check whether the data is visible:

      SHOW TRANSACTION WHERE id=4005;

      If visible is displayed in the TransactionStatus column in the returned results, the data is visible.

    3. The INSERT INTO operation fails.

      If the INSERT INTO operation fails, no data is imported and a result similar to the following output is returned:

      INSERT INTO tbl1 SELECT * FROM tbl2 WHERE k1 = "a";
      ERROR 1064 (HY000): all partitions have no load data. url: http://10.74.167.16:8042/api/_load_error_log?file=__shard_2/error_log_insert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de8507c0bf8a2

      ERROR 1064 (HY000): all partitions have no load data indicates the cause of the failure. The URL in the returned result can be used to query the error data:

      SHOW LOAD WARNINGS ON "url";
  • Timeout period

    The timeout period of an INSERT INTO operation is specified by the query_timeout session variable. The default timeout period is 5 minutes. If an import job is not completed within the timeout period, the system cancels the job. You can adjust the timeout period as shown in the following sketch.
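
    The following statements are a minimal sketch; the 600-second value is for illustration only. Set a value that matches your workload.

    -- Query the current timeout period, in seconds.
    SHOW VARIABLES LIKE 'query_timeout';
    -- Extend the timeout period for the current session.
    SET query_timeout = 600;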

  • Label and atomicity

    The INSERT INTO operation also ensures the atomicity of import jobs: either all data in a statement is imported, or none of it is. If you want to use common table expressions (CTEs) to define subqueries in an INSERT INTO statement, you must specify WITH LABEL and the column list, as shown in the following sketch.
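
    The following sketch reuses the bj_store_sales and store_sales tables from the preceding examples; the label name is illustrative.

    INSERT INTO bj_store_sales WITH LABEL cte_label_1 (id, total, user_id, sale_timestamp)
    WITH cte AS (
        SELECT id, total, user_id, sale_timestamp FROM store_sales WHERE region = "bj"
    )
    SELECT * FROM cte;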

  • Filtering threshold

    If you import data by using an INSERT INTO statement, you cannot specify a filtering threshold by specifying the max_filter_ratio parameter. The default filtering threshold is 1, which indicates that all error rows can be ignored.

    In business scenarios in which no data is allowed to be filtered out, you can set the enable_insert_strict session variable to true. This ensures that the INSERT INTO statement fails if any data would be filtered out.
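
    For example, enable the strict mode for the current session:

    SET enable_insert_strict = true;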

  • Performance issues

    We recommend that you do not use the INSERT INTO VALUES statement to import data, especially in a production environment with large data volumes. If you must use the INSERT INTO VALUES statement, merge multiple rows of data into one INSERT INTO statement to import data in batches, as shown in the following sketch. We recommend that you import 1,000 to 1,000,000 data entries in a single batch.
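
    The following sketch reuses the test_table table from the preceding examples; the values are illustrative.

    -- Not recommended: one row per statement.
    INSERT INTO test_table VALUES (1, 'zhangsan', 18);
    INSERT INTO test_table VALUES (2, 'lisi', 20);

    -- Recommended: merge multiple rows into one statement.
    INSERT INTO test_table VALUES
        (1, 'zhangsan', 18),
        (2, 'lisi', 20),
        (3, 'wangwu', 22);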

  • Update of specific columns

    By default, an INSERT INTO statement writes data by rows. For tables that use the merge-on-write (MOW) implementation of the Unique data model, you can configure the following session variable to update specific columns based on your business requirements.

    SET enable_unique_key_partial_update = true;

    By default, the enable_insert_strict session variable is set to true, which indicates that the strict mode is enabled for INSERT INTO statements. In strict mode, keys that do not exist cannot be updated. Therefore, if you want to insert a key that does not exist when you use an INSERT INTO statement to update specific columns, you must set the enable_insert_strict session variable to false. The following sketch shows the complete workflow.
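
    The user_profile table and its data in the following sketch are illustrative. The table must use the Unique data model, and merge-on-write must be enabled.

    -- Create an illustrative table that uses the Unique data model with merge-on-write enabled.
    CREATE TABLE user_profile
    (
        id int,
        name varchar(50),
        age int
    )
    UNIQUE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 4
    PROPERTIES(
        "replication_num" = "1",
        "enable_unique_key_merge_on_write" = "true"
    );

    -- Enable updates of specific columns for the current session.
    SET enable_unique_key_partial_update = true;
    -- Allow keys that do not exist to be inserted.
    SET enable_insert_strict = false;

    -- Update only the age column. If key 1 already exists, its name value is retained.
    INSERT INTO user_profile (id, age) VALUES (1, 30);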