
ApsaraDB for SelectDB: Import data by using OSS

Last Updated: Jun 07, 2024

ApsaraDB for SelectDB works with Object Storage Service (OSS) to provide the high-speed import feature over an internal network. You can use this feature to import hundreds of GB of data at a time. This topic describes how to use OSS to import data to an ApsaraDB for SelectDB instance.

Prerequisites

  • An AccessKey pair is obtained. For more information, see Create an AccessKey pair.

  • OSS is activated and an OSS bucket is created. The OSS bucket resides in the same region as the ApsaraDB for SelectDB instance to which you want to import data. For more information, see Get started by using the OSS console.

  • OSS can be accessed over a virtual private cloud (VPC), so that data can be imported over the internal network.

Syntax for importing data

LOAD LABEL load_label
(
    data_desc1[, data_desc2, ...]
)
WITH S3
(
    "AWS_PROVIDER" = "OSS",
    "AWS_REGION" = "REGION",
    "AWS_ENDPOINT" = "ENDPOINT",
    "AWS_ACCESS_KEY" = "AccessKey ID",
    "AWS_SECRET_KEY" = "AccessKey Secret"
)
PROPERTIES
(
    "key1" = "value1", ...
);
Note

ApsaraDB for SelectDB allows you to import data from all object storage systems that are compatible with the Amazon Simple Storage Service (Amazon S3) protocol. Therefore, the syntax for importing data uses keywords such as AWS and S3.

Parameters

The data_desc1 parameter is used to describe a group of files to be imported. The following sample code provides the complete definition of the parameter.

[MERGE|APPEND|DELETE]
DATA INFILE
(
"file_path1"[, file_path2, ...]
)
[NEGATIVE]
INTO TABLE `table_name`
[PARTITION (p1, p2, ...)]
[COLUMNS TERMINATED BY "column_separator"]
[FORMAT AS "file_type"]
[(column_list)]
[COLUMNS FROM PATH AS (c1, c2, ...)]
[PRECEDING FILTER predicate]
[SET (column_mapping)]
[WHERE predicate]
[DELETE ON expr]
[ORDER BY source_sequence]
[PROPERTIES ("key1"="value1", ...)]

The following describes the parameters in data_desc.

  • [MERGE|APPEND|DELETE]: The type of data merging. Default value: APPEND, which appends the new data to the existing data. You can set this parameter to MERGE or DELETE only for tables that use the Unique key model. If you set this parameter to MERGE, you must use the DELETE ON clause to specify the delete flag column. If you set this parameter to DELETE, all data specified in this import job is deleted. A sample data_desc clause that uses MERGE is shown after this list.

  • DATA INFILE: The paths of the files to be imported. You can specify multiple paths, and wildcards are supported. Each path must point to a file. If a path points to a directory, the import fails.

  • NEGATIVE: Imports the data of this batch in negative mode. This mode is applicable only to aggregate tables whose value columns are integers aggregated by using the SUM function. In this mode, the values of the SUM-aggregated columns in the imported data are negated to offset erroneous data that was previously imported.

  • PARTITION(p1, p2, ...): The partitions of the table into which data is imported. Data that does not belong to the specified partitions is ignored.

  • COLUMNS TERMINATED BY: The column delimiter. The column delimiter is valid only for files in the CSV format and must be a single-byte character.

  • FORMAT AS: The format of the files to be imported. Valid values: CSV, PARQUET, and ORC. Default value: CSV.

  • column_list: The order of the columns in the source files. For more information, see Converting Source Data.

  • COLUMNS FROM PATH AS: The columns that are extracted from the file paths.

  • PRECEDING FILTER predicate: The conditions that are used to pre-filter data. The raw data is first mapped to source data rows based on the column_list and COLUMNS FROM PATH AS parameters, and is then filtered based on these conditions. For more information, see Converting Source Data.

  • SET (column_mapping): The conversion functions that are applied to the specified columns.

  • WHERE predicate: The conditions based on which the imported data is filtered. For more information, see Converting Source Data.

  • DELETE ON expr: This parameter must be used together with the MERGE merging type and is applicable only to tables that use the Unique key model. It specifies the delete flag column in the data to be imported and the expression that determines whether a row is deleted.

  • ORDER BY: The column in the data to be imported that maps to the sequence_col column. This parameter is applicable only to tables that use the Unique key model and ensures that data is imported in the correct order.

  • PROPERTIES ("key1"="value1", ...): The parameters that specify the format of the files to be imported. For example, if the files are in the JSON format, you can specify parameters such as json_root, jsonpaths, and fuzzy_parse.
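To show how these clauses combine, the following is a minimal sketch of a data_desc clause that uses MERGE. The table example_unique_table, its columns, and the to_delete flag column are hypothetical and used only for illustration; replace them with your own table, columns, and file path.

-- Sketch only: example_unique_table, its columns, tmp_age, and to_delete are hypothetical.
-- This clause replaces data_desc1 in the LOAD LABEL ... ( ... ) WITH S3 ( ... ) syntax.
MERGE
DATA INFILE("s3://your_bucket_name/example_data.csv")
INTO TABLE `example_unique_table`
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(id, name, tmp_age, to_delete)
SET (age = tmp_age + 1)
WHERE id > 0
DELETE ON to_delete = 1

In this sketch, rows for which to_delete = 1 evaluates to true are marked as deleted, and the remaining rows that pass the WHERE condition are merged into the table.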

The following describes the parameters that are used to configure the cloud storage service.

  • AWS_PROVIDER: The provider of the object storage service. Set the value to OSS.

  • AWS_REGION: The region in which the OSS bucket to be accessed resides.

  • AWS_ENDPOINT: The endpoint that is used to access OSS. For more information about how to obtain the endpoint, see Regions and endpoints.

    Important: Make sure that the OSS bucket and your ApsaraDB for SelectDB instance reside in the same region.

  • AWS_ACCESS_KEY: The AccessKey ID that is used to access OSS.

  • AWS_SECRET_KEY: The AccessKey secret that is used to access OSS.

The following describes the parameters of PROPERTIES.

  • timeout: The timeout period of the import job. Unit: seconds. Default value: 14400, which indicates 4 hours.

  • max_filter_ratio: The maximum ratio of rows that can be filtered out, for example, because they do not conform to the table schema. Valid values: 0 to 1. Default value: 0. If the ratio of filtered rows exceeds this value, the import job fails.

  • exec_mem_limit: The maximum amount of memory that is available to the import job. Unit: bytes. Default value: 2147483648, which indicates 2 GiB.

  • strict_mode: Specifies whether to enable strict mode for the import job. Default value: false.

  • timezone: The time zone that is used by time zone-related functions, such as strftime, alignment_timestamp, and from_unixtime. For more information, see List of all time zones. Default value: Asia/Shanghai.

  • load_parallelism: The degree of parallelism of the import job. Default value: 1. You can increase this value to run more concurrent execution instances of the import plan and accelerate data import.

  • send_batch_parallelism: The degree of parallelism that is used to send batches of data for processing. If this value is greater than the value of the max_send_batch_parallelism_per_job parameter in the backend (BE) configuration, the BE uses the value of max_send_batch_parallelism_per_job.

  • load_to_single_tablet: Specifies whether to import data to only one tablet of the corresponding partition. Default value: false. This parameter is available only if you import data to a table that uses the Duplicate key model and random bucketing.
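For example, the following PROPERTIES clause is a sketch that combines several of these parameters; the values are illustrative only and should be adjusted to your workload.

PROPERTIES
(
    "timeout" = "3600",
    "max_filter_ratio" = "0.1",
    "exec_mem_limit" = "4294967296",
    "strict_mode" = "true"
)

This sketch sets a 1-hour timeout, tolerates up to 10% filtered rows, grants 4 GiB of memory to the job, and enables strict mode.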

Examples

  1. Create an ApsaraDB for SelectDB table to which you want to import data. Sample code:

    CREATE TABLE test_table
    (
        id int,
        name varchar(50),
        age int,
        address varchar(50),
        url varchar(500)
    )
    DISTRIBUTED BY HASH(id) BUCKETS 4
    PROPERTIES("replication_num" = "1");
  2. Create a file to be imported, which is named test_file.txt, and upload it to the OSS bucket. Sample data:

    1,yang,32,shanghai,http://example.com
    2,wang,22,beijing,http://example.com
    3,xiao,23,shenzhen,http://example.com
    4,jess,45,hangzhou,http://example.com
    5,jack,14,shanghai,http://example.com
    6,tomy,25,hangzhou,http://example.com
    7,lucy,45,shanghai,http://example.com
    8,tengyin,26,shanghai,http://example.com
    9,wangli,27,shenzhen,http://example.com
    10,xiaohua,37,shanghai,http://example.com
  3. Import data by using OSS. Sample code:

    LOAD LABEL test_db.test_label_1
    (
        DATA INFILE("s3://your_bucket_name/test_file.txt")
        INTO TABLE test_table
        COLUMNS TERMINATED BY ","
    )
    WITH S3
    (
        "AWS_PROVIDER" = "OSS",
        "AWS_REGION" = "oss-cn-beijing",  
        "AWS_ENDPOINT" = "oss-cn-beijing-internal.aliyuncs.com",
        "AWS_ACCESS_KEY" = "<your_access_key>",
        "AWS_SECRET_KEY"="<your_secret_key>"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );
    
    Note

    The file path that is specified in DATA INFILE must start with s3://.
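The LOAD statement runs asynchronously. In Apache Doris-compatible engines such as ApsaraDB for SelectDB, you can typically check the progress and result of an import job by its label. The following is a sketch that assumes the database and label from the preceding example:

-- Check the state, progress, and error messages of the import job.
SHOW LOAD FROM test_db WHERE LABEL = "test_label_1";

In most cases, the State column changes to FINISHED when the job succeeds; if the job fails, the ErrorMsg column describes the cause.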