
ApsaraDB for SelectDB: Import data by using OSS

Last Updated: Jun 07, 2024

ApsaraDB for SelectDB works with Object Storage Service (OSS) to provide the high-speed import feature over an internal network. You can use this feature to import hundreds of GB of data at a time. This topic describes how to use OSS to import data to an ApsaraDB for SelectDB instance.

Prerequisites

  • An AccessKey pair is obtained. For more information, see Create an AccessKey pair.

  • OSS is activated and an OSS bucket is created. The OSS bucket resides in the same region as the ApsaraDB for SelectDB instance to which you want to import data. For more information, see Get started by using the OSS console.

  • OSS can be accessed over a virtual private cloud (VPC), so that data can be imported over the internal network.

Syntax for importing data

LOAD LABEL load_label
(
    data_desc1[, data_desc2, ...]
)
WITH S3
(
    "AWS_PROVIDER" = "OSS",
    "AWS_REGION" = "REGION",
    "AWS_ENDPOINT" = "ENDPOINT",
    "AWS_ACCESS_KEY" = "AccessKey ID",
    "AWS_SECRET_KEY" = "AccessKey Secret"
)
PROPERTIES
(
    "key1" = "value1", ...
);
Note

ApsaraDB for SelectDB allows you to import data from all object storage systems that are compatible with the Amazon Simple Storage Service (Amazon S3) protocol. Therefore, the syntax for importing data uses keywords such as AWS and S3.

Parameters

The data_desc1 parameter is used to describe a group of files to be imported. The following sample code provides the complete definition of the parameter.

[MERGE|APPEND|DELETE]
DATA INFILE
(
"file_path1"[, file_path2, ...]
)
[NEGATIVE]
INTO TABLE `table_name`
[PARTITION (p1, p2, ...)]
[COLUMNS TERMINATED BY "column_separator"]
[FORMAT AS "file_type"]
[(column_list)]
[COLUMNS FROM PATH AS (c1, c2, ...)]
[PRECEDING FILTER predicate]
[SET (column_mapping)]
[WHERE predicate]
[DELETE ON expr]
[ORDER BY source_sequence]
[PROPERTIES ("key1"="value1", ...)]

The following describes the parameters in data_desc.

  • [MERGE|APPEND|DELETE]: The type of data merging. Default value: APPEND, which appends the new data to the existing data. You can set this parameter to MERGE or DELETE only for tables that use the Unique key model. If you set this parameter to MERGE, you must use the DELETE ON clause to specify the delete flag column. If you set this parameter to DELETE, all data specified in this import job is deleted. A sample data_desc clause that uses MERGE is shown after this list.

  • DATA INFILE: The paths of the files to be imported. You can specify multiple paths, and wildcards are supported. Each path must point to a file. If a path points to a directory, the import fails.

  • NEGATIVE: Imports the data of this batch in negative mode. This mode is applicable only to aggregate tables whose value columns are integers aggregated by using the SUM function. In this mode, the values of the SUM-aggregated columns in the imported data are negated to offset erroneous data that was previously imported.

  • PARTITION(p1, p2, ...): The partitions of the table into which data is imported. Data that does not belong to the specified partitions is ignored.

  • COLUMNS TERMINATED BY: The column delimiter. The column delimiter is valid only for files in the CSV format and must be a single-byte character.

  • FORMAT AS: The format of the files to be imported. Valid values: CSV, PARQUET, and ORC. Default value: CSV.

  • column_list: The order of the columns in the source files. For more information, see Converting Source Data.

  • COLUMNS FROM PATH AS: The columns that are extracted from the file paths.

  • PRECEDING FILTER predicate: The conditions that are used to pre-filter data. The raw data is first mapped to source data rows based on the column_list and COLUMNS FROM PATH AS parameters, and is then filtered based on these conditions. For more information, see Converting Source Data.

  • SET (column_mapping): The conversion functions that are applied to the specified columns.

  • WHERE predicate: The conditions based on which the imported data is filtered. For more information, see Converting Source Data.

  • DELETE ON expr: This parameter must be used together with the MERGE merging type and is applicable only to tables that use the Unique key model. It specifies the delete flag column in the data to be imported and the expression that determines whether a row is deleted.

  • ORDER BY: The column in the data to be imported that maps to the sequence_col column. This parameter is applicable only to tables that use the Unique key model and ensures that data is imported in the correct order.

  • PROPERTIES ("key1"="value1", ...): The parameters that specify the format of the files to be imported. For example, if the files are in the JSON format, you can specify parameters such as json_root, jsonpaths, and fuzzy_parse.
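To show how these clauses combine, the following is a minimal sketch of a data_desc clause that uses MERGE. The table example_unique_table, its columns, and the to_delete flag column are hypothetical and used only for illustration; replace them with your own table, columns, and file path.

-- Sketch only: example_unique_table, its columns, tmp_age, and to_delete are hypothetical.
-- This clause replaces data_desc1 in the LOAD LABEL ... ( ... ) WITH S3 ( ... ) syntax.
MERGE
DATA INFILE("s3://your_bucket_name/example_data.csv")
INTO TABLE `example_unique_table`
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(id, name, tmp_age, to_delete)
SET (age = tmp_age + 1)
WHERE id > 0
DELETE ON to_delete = 1

In this sketch, rows for which to_delete = 1 evaluates to true are marked as deleted, and the remaining rows that pass the WHERE condition are merged into the table.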

The following describes the parameters that are used to configure the cloud storage service.

  • AWS_PROVIDER: The provider of the object storage service. Set the value to OSS.

  • AWS_REGION: The region in which the OSS bucket to be accessed resides.

  • AWS_ENDPOINT: The endpoint that is used to access OSS. For more information about how to obtain the endpoint, see Regions and endpoints.

    Important: Make sure that the OSS bucket and your ApsaraDB for SelectDB instance reside in the same region.

  • AWS_ACCESS_KEY: The AccessKey ID that is used to access OSS.

  • AWS_SECRET_KEY: The AccessKey secret that is used to access OSS.

The following describes the parameters of PROPERTIES.

  • timeout: The timeout period of the import job. Unit: seconds. Default value: 14400, which indicates 4 hours.

  • max_filter_ratio: The maximum ratio of rows that can be filtered out, for example, because they do not conform to the table schema. Valid values: 0 to 1. Default value: 0. If the ratio of filtered rows exceeds this value, the import job fails.

  • exec_mem_limit: The maximum amount of memory that is available to the import job. Unit: bytes. Default value: 2147483648, which indicates 2 GiB.

  • strict_mode: Specifies whether to enable strict mode for the import job. Default value: false.

  • timezone: The time zone that is used by time zone-related functions, such as strftime, alignment_timestamp, and from_unixtime. For more information, see List of all time zones. Default value: Asia/Shanghai.

  • load_parallelism: The degree of parallelism of the import job. Default value: 1. You can increase this value to run more concurrent execution instances of the import plan and accelerate data import.

  • send_batch_parallelism: The degree of parallelism that is used to send batches of data for processing. If this value is greater than the value of the max_send_batch_parallelism_per_job parameter in the backend (BE) configuration, the BE uses the value of max_send_batch_parallelism_per_job.

  • load_to_single_tablet: Specifies whether to import data to only one tablet of the corresponding partition. Default value: false. This parameter is available only if you import data to a table that uses the Duplicate key model and random bucketing.
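For example, the following PROPERTIES clause is a sketch that combines several of these parameters; the values are illustrative only and should be adjusted to your workload.

PROPERTIES
(
    "timeout" = "3600",
    "max_filter_ratio" = "0.1",
    "exec_mem_limit" = "4294967296",
    "strict_mode" = "true"
)

This sketch sets a 1-hour timeout, tolerates up to 10% filtered rows, grants 4 GiB of memory to the job, and enables strict mode.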

Examples

  1. Create an ApsaraDB for SelectDB table to which you want to import data. Sample code:

    CREATE TABLE test_table
    (
        id int,
        name varchar(50),
        age int,
        address varchar(50),
        url varchar(500)
    )
    DISTRIBUTED BY HASH(id) BUCKETS 4
    PROPERTIES("replication_num" = "1");
  2. Create a file to be imported, which is named test_file.txt, and upload it to the OSS bucket. Sample data:

    1,yang,32,shanghai,http://example.com
    2,wang,22,beijing,http://example.com
    3,xiao,23,shenzhen,http://example.com
    4,jess,45,hangzhou,http://example.com
    5,jack,14,shanghai,http://example.com
    6,tomy,25,hangzhou,http://example.com
    7,lucy,45,shanghai,http://example.com
    8,tengyin,26,shanghai,http://example.com
    9,wangli,27,shenzhen,http://example.com
    10,xiaohua,37,shanghai,http://example.com
  3. Import data by using OSS. Sample code:

    LOAD LABEL test_db.test_label_1
    (
        DATA INFILE("s3://your_bucket_name/test_file.txt")
        INTO TABLE test_table
        COLUMNS TERMINATED BY ","
    )
    WITH S3
    (
        "AWS_PROVIDER" = "OSS",
        "AWS_REGION" = "oss-cn-beijing",  
        "AWS_ENDPOINT" = "oss-cn-beijing-internal.aliyuncs.com",
        "AWS_ACCESS_KEY" = "<your_access_key>",
        "AWS_SECRET_KEY"="<your_secret_key>"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );
    
    Note

    The file path that is specified in DATA INFILE must start with s3://.
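The LOAD statement runs asynchronously. In Apache Doris-compatible engines such as ApsaraDB for SelectDB, you can typically check the progress and result of an import job by its label. The following is a sketch that assumes the database and label from the preceding example:

-- Check the state, progress, and error messages of the import job.
SHOW LOAD FROM test_db WHERE LABEL = "test_label_1";

In most cases, the State column changes to FINISHED when the job succeeds; if the job fails, the ErrorMsg column describes the cause.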