All Products
Search
Document Center

MaxCompute:Write data to OSS

Last Updated:Nov 13, 2024

MaxCompute allows you to execute INSERT statements to write data from a MaxCompute project to an Object Storage Service (OSS) directory by using OSS external tables. This topic describes how to write data to OSS and provides examples of data write operations.

Background information

You can write data in an internal table of MaxCompute to OSS. After you read data from an OSS external table and process the data, you can write the processed data back to OSS.

  • Write data to OSS by using a built-in text extractor or a built-in open source data extractor.

    If you need to write data in the CSV or TSV format, an open source format, or a compression format that is supported by MaxCompute to OSS, you can create an OSS external table by using a built-in text extractor or a built-in open source data extractor of MaxCompute and execute an INSERT statement to write data to OSS by using the OSS external table.

  • Write data to OSS by using a custom extractor.

    If you need to write data in a custom format to OSS, you can create an OSS external table by using a custom extractor and execute an INSERT statement to write data to OSS by using the OSS external table.

    For more information, see Example: Write data to OSS by using a custom extractor.

  • Write data to OSS by using the multipart upload feature.

    If you need to write data in an open source format to OSS, you can create an OSS external table by using a built-in open source data extractor, enable or disable the multipart upload feature, and execute an INSERT statement to write data to OSS by using the OSS external table. To enable or disable the multipart upload feature, you must configure the odps.sql.unstructured.oss.commit.mode parameter at the session or project level. By default, this feature is disabled. For more information about the multipart upload feature, see Multipart upload.

    Valid values of the odps.sql.unstructured.oss.commit.mode parameter:

    • False: Data that is written from MaxCompute to an OSS external table is stored in the .odps folder in the directory specified by LOCATION. A .meta file is included in the .odps folder to ensure that the data written to OSS is consistent with the data in MaxCompute. Only MaxCompute can correctly process the data in the .odps folder. If another data processing engine parses the data in this folder, an error is returned.

    • True: MaxCompute uses the multipart upload feature to ensure data consistency in two-phase commit mode. In this case, the .odps folder and the .meta file are not generated. Other data processing engines can parse the data that is written from MaxCompute to OSS.

    You can configure the odps.sql.unstructured.oss.commit.mode parameter at the project level and session level.

    • Configure the odps.sql.unstructured.oss.commit.mode parameter at the project level. The configuration takes effect on the project. Sample command:

      setproject odps.sql.unstructured.oss.commit.mode =true;
    • Configure the odps.sql.unstructured.oss.commit.mode parameter at the session level. The configuration takes effect only on the current task. Sample command:

      set odps.sql.unstructured.oss.commit.mode =true;
    Important

    In extreme cases, if a job that executes the INSERT OVERWRITE statement fails, historical data is deleted but new data is not written to the destination table.

    Cause: A hardware failure occurs or the metadata fails to be updated. This issue rarely occurs. If this issue occurs, new data cannot be written to the destination table, and the deleted historical data cannot be restored because the delete operation in OSS cannot be rolled back.

    Solutions:

    • If you want to use historical data to overwrite the data of an OSS external table, you must back up the OSS data. This way, you can use the backup data to overwrite the OSS external table when the job fails. For example, you can execute the insert overwrite table T select * from table T; statement to overwrite table T by using the historical data of table T.

    • If a job that executes the INSERT OVERWRITE statement can be repeatedly submitted, you can submit the job again when the job fails.

Precautions

The bandwidth provided by OSS is limited. If the traffic generated in a short period of time exceeds the upper limit of the bandwidth allowed by the OSS instance due to frequent data reading from or writing to OSS, the bandwidth is used up. In this case, the speed at which data is read from or written to the OSS external table is decreased. For more information about the bandwidth occupied by OSS, see Does OSS throttle bandwidth and QPS?

Prerequisites

Before you write data to OSS, make sure that the following requirements are met:

  • The Alibaba Cloud account or RAM user is granted access permissions on OSS.

    For more information about authorization, see STS authorization for OSS.

  • (Optional) An OSS bucket, OSS directories, and OSS data files are prepared.

    Note

    MaxCompute can automatically create an OSS directory in the OSS console. For an SQL statement that is used to create an external table and a user-defined function (UDF), you can execute the SQL statement to read data from or write data to the external table and UDF. You can also manually create an OSS directory.

    For more information about how to create an OSS bucket, how to create an OSS directory, and how to upload data files to an OSS bucket, see Create a bucket, Create directories, and Simple upload.

Example: Write data to a non-partitioned OSS external by using a built-in text extractor

Read data from the OSS external table mc_oss_csv_external1 that is created in Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor and write the data in the CSV format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/output/.

Perform the following steps:

  1. Log on to the MaxCompute client and execute the following statement to create an OSS external table. The OSS external table is mapped to the OSS directory to which you want to write data.

    create external table if not exists mc_oss_csv_external4
    (
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitute double,
    locationLongtitue double,
    recordTime string,
    direction string
    )
    stored by 'com.aliyun.odps.CsvStorageHandler' 
    with serdeproperties (
     'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
    ) 
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/output/';                       

    For more information about the syntax that is used to create an OSS external table and the parameters that you need to configure when you create an OSS external table, see Create an OSS external table.

  2. Execute the INSERT OVERWRITE or INSERT INTO statement on the MaxCompute client to write data to the OSS external table that you created.

    insert into table mc_oss_csv_external4 select * from mc_oss_csv_external1;

    If the statement is successfully executed, you can view the exported files in the OSS directory. A folder named .odps is generated in the output folder. The .odps folder contains a .meta file and a folder where .csv files are saved. Example of a file in the OSS directory: output/.odps/20210330*********/R2_1_0_0-0_TableSink1-0-.csv.

    导出结果

    Note
    • The .meta file in the .odps folder is an extra macro data file that is generated by MaxCompute. This file records valid data in the .odps folder. If the INSERT statement is successfully executed, all data in the .odps folder is valid. MaxCompute parses the .meta file only when the job that executes the INSERT statement fails. If the INSERT OVERWRITE statement is used and the job that executes the statement fails or is terminated, execute the INSERT OVERWRITE statement again.

    • If you use a built-in extractor of MaxCompute to process TSV or CSV files, the number of files generated in the OSS directory is the same as the number of concurrent SQL jobs.

    • If you execute the insert overwrite ... select ... from ...; statement and 1,000 mappers are allocated on the source table specified by from_tablename, 1,000 TSV or CSV files will be generated.

    • You can use flexible semantics and configurations of MaxCompute to limit the number of files that can be generated. If an outputer runs in a mapper, you can change the value of the odps.stage.mapper.split.size parameter to determine the number of files that can be generated and the number of concurrent mappers. If an outputer runs in a reducer, you can change the value of the odps.stage.reducer.num parameter to adjust the number of files that can be generated. If an outputer runs in a joiner, you can change the value of the odps.stage.joiner.num parameter to adjust the number of files that can be generated.

Example: Write data to a partitioned OSS external table by using a built-in text extractor

Read data from the OSS external table mc_oss_csv_external2 that is created in Example: Create a partitioned table as an OSS external table by using a built-in text extractor and write the data in the CSV format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/output/.

Perform the following steps:

  1. Log on to the MaxCompute client and execute the following statement to create an OSS external table. The OSS external table is mapped to the OSS directory to which you want to write data.

    create external table if not exists mc_oss_csv_external5
    (
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitute double,
    locationLongtitue double,
    recordTime string
    )
    partitioned by (
    direction string
    )
    stored by 'com.aliyun.odps.CsvStorageHandler' 
    with serdeproperties (
     'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
    ) 
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/output/';                      

    For more information about the syntax that is used to create an OSS external table and the parameters that you need to configure when you create an OSS external table, see Create an OSS external table.

  2. Execute the INSERT OVERWRITE or INSERT INTO statement on the MaxCompute client to write data to the OSS external table that you created.

    insert into table mc_oss_csv_external5 partition (direction) select * from mc_oss_csv_external2;

    If the statement is successfully executed, you can view the exported files in the OSS directory. Subdirectories that are mapped to the partitions specified in the INSERT statement are generated in the output folder. A folder named .odps is generated in each subdirectory. The .odps folder contains the .meta file and a folder where .csv files are saved. Example of a file in the OSS directory: output/direction=N/.odps/20210330*********/R2_1_0_0-0_TableSink1-0-.csv.

    导出结果

    Note
    • The .meta file in the .odps folder is an extra macro data file that is generated by MaxCompute. This file records valid data in the .odps folder. If the INSERT statement is successfully executed, all data in the .odps folder is valid. MaxCompute parses the .meta file only when the job that executes the INSERT statement fails. If the INSERT OVERWRITE statement is used and the job that executes the statement fails or is terminated, execute the INSERT OVERWRITE statement again.

    • If you use a built-in extractor of MaxCompute to process TSV or CSV files, the number of files generated in the OSS directory is the same as the number of concurrent SQL jobs.

    • If you execute the insert overwrite ... select ... from ...; statement and 1,000 mappers are allocated on the source table specified by from_tablename, 1,000 TSV or CSV files will be generated.

    • You can use flexible semantics and configurations of MaxCompute to limit the number of files that can be generated. If an outputer runs in a mapper, you can change the value of the odps.stage.mapper.split.size parameter to determine the number of files that can be generated and the number of concurrent mappers. If an outputer runs in a reducer, you can change the value of the odps.stage.reducer.num parameter to adjust the number of files that can be generated. If an outputer runs in a joiner, you can change the value of the odps.stage.joiner.num parameter to adjust the number of files that can be generated.

Example: Write data in a compression format to OSS by using a built-in text extractor

Read data from the OSS external table mc_oss_csv_external1 that is created in Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor and write the data in the GZIP format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/output/.

Perform the following steps:

  1. Log on to the MaxCompute client and execute the following statement to create an OSS external table. The OSS external table is mapped to the OSS directory to which you want to write data.

    create external table if not exists mc_oss_csv_external6
    (
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitute double,
    locationLongtitue double,
    recordTime string,
    direction string
    )
    stored by 'com.aliyun.odps.CsvStorageHandler' 
    with serdeproperties (
     'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole',
     'odps.text.option.gzip.input.enabled'='true',
     'odps.text.option.gzip.output.enabled'='true' 
    ) 
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/output/';

    For more information about the syntax that is used to create an OSS external table and the parameters that you need to configure when you create an OSS external table, see Create an OSS external table.

  2. Execute the INSERT OVERWRITE or INSERT INTO statement on the MaxCompute client to write data to the OSS external table that you created.

    insert into table mc_oss_csv_external6 select * from mc_oss_csv_external1;

    If the statement is successfully executed, you can view the exported files in the OSS directory. A folder named .odps is generated in the output folder. The .odps folder contains the .meta file and a folder where .gz files are saved. Example of a file in the OSS directory: output/.odps/20220413*********/M1_0_0-0_TableSink1-0-.csv.gz.

    导出结果

    Note
    • The .meta file in the .odps folder is an extra macro data file that is generated by MaxCompute. This file records valid data in the .odps folder. If the INSERT statement is successfully executed, all data in the .odps folder is valid. MaxCompute parses the .meta file only when the job that executes the INSERT statement fails. If the INSERT OVERWRITE statement is used and the job that executes the statement fails or is terminated, execute the INSERT OVERWRITE statement again.

    • If you use a built-in extractor of MaxCompute to process TSV or CSV files, the number of files generated in the OSS directory is the same as the number of concurrent SQL jobs.

    • If you execute the insert overwrite ... select ... from ...; statement and 1,000 mappers are allocated on the source table specified by from_tablename, 1,000 TSV or CSV files will be generated.

    • You can use flexible semantics and configurations of MaxCompute to limit the number of files that can be generated. If an outputer runs in a mapper, you can change the value of the odps.stage.mapper.split.size parameter to determine the number of files that can be generated and the number of concurrent mappers. If an outputer runs in a reducer, you can change the value of the odps.stage.reducer.num parameter to adjust the number of files that can be generated. If an outputer runs in a joiner, you can change the value of the odps.stage.joiner.num parameter to adjust the number of files that can be generated.

Example: Write data to OSS by using a built-in open source data extractor

Read data from the OSS external table mc_oss_csv_external1 that is created in Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor and write the data in the ORC format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo4/output/.

Perform the following steps:

  1. Log on to the MaxCompute client and execute the following statement to create an OSS external table. The OSS external table is mapped to the OSS directory to which you want to write data.

    create external table if not exists mc_oss_orc_external
    (
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitute double,
    locationLongtitue double,
    recordTime string,
    direction string
    )
    stored as orc  
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo4/output/';

    For more information about the syntax that is used to create an OSS external table and the parameters that you need to configure when you create an OSS external table, see Create an OSS external table.

  2. Execute the INSERT OVERWRITE or INSERT INTO statement on the MaxCompute client to write data to the OSS external table that you created.

    insert into table mc_oss_orc_external select * from mc_oss_csv_external1;

    If the statement is successfully executed, you can view the exported files in the OSS directory. A folder named .odps is generated in the output folder. The .odps folder contains the .meta file and a folder where ORC files are saved. Example of a file in the OSS directory: output/.odps/20220413*********/M1_0_0-0_TableSink1.

    导出结果

    Note
    • The .meta file in the .odps folder is an extra macro data file that is generated by MaxCompute. This file records valid data in the .odps folder. If the INSERT statement is successfully executed, all data in the .odps folder is valid. MaxCompute parses the .meta file only when the job that executes the INSERT statement fails. If the INSERT OVERWRITE statement is used and the job that executes the statement fails or is terminated, execute the INSERT OVERWRITE statement again.

    • If you use a built-in extractor of MaxCompute to process TSV or CSV files, the number of files generated in the OSS directory is the same as the number of concurrent SQL jobs.

    • If you execute the insert overwrite ... select ... from ...; statement and 1,000 mappers are allocated on the source table specified by from_tablename, 1,000 TSV or CSV files will be generated.

    • You can use flexible semantics and configurations of MaxCompute to limit the number of files that can be generated. If an outputer runs in a mapper, you can change the value of the odps.stage.mapper.split.size parameter to determine the number of files that can be generated and the number of concurrent mappers. If an outputer runs in a reducer, you can change the value of the odps.stage.reducer.num parameter to adjust the number of files that can be generated. If an outputer runs in a joiner, you can change the value of the odps.stage.joiner.num parameter to adjust the number of files that can be generated.

Example: Write data to OSS by using a custom extractor

This example shows how to write data from MaxCompute to an OSS directory by using the custom extractor that is developed in Example: Create an OSS external table by using a custom extractor.

Perform the following steps:

  1. Log on to the MaxCompute client and execute the following statement to create an OSS external table. The OSS external table is mapped to the OSS directory to which you want to write data.

    create external table if not exists mc_oss_external 
    (
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitute double,
    locationLongtitue double,
    recordTime string,
    direction string
    )
    stored by 'com.aliyun.odps.udf.example.text.TextStorageHandler' 
      with serdeproperties (
    'delimiter'='|',  
    'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
    )
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SampleData/output'
    using 'javatest-1.0-SNAPSHOT.jar'; 

    For more information about the syntax that is used to create an OSS external table and the parameters that you need to configure when you create an OSS external table, see Create an OSS external table.

  2. Execute the INSERT OVERWRITE or INSERT INTO statement on the MaxCompute client to write data to the OSS external table that you created.

    insert into table mc_oss_external select * from mc_oss_csv_external1;

    If the statement is successfully executed, you can view the exported files in the OSS directory. A folder named .odps is generated in the output folder. The .odps folder contains the .meta file and a folder where TXT files are saved. Example of a file in the OSS directory: output/.odps/20220413*********/M1_0_0-0_TableSink1-0.

    导出结果

    Note
    • The .meta file in the .odps folder is an extra macro data file that is generated by MaxCompute. This file records valid data in the .odps folder. If the INSERT statement is successfully executed, all data in the .odps folder is valid. MaxCompute parses the .meta file only when the job that executes the INSERT statement fails. If the INSERT OVERWRITE statement is used and the job that executes the statement fails or is terminated, execute the INSERT OVERWRITE statement again.

    • If you use a built-in extractor of MaxCompute to process TSV or CSV files, the number of files generated in the OSS directory is the same as the number of concurrent SQL jobs.

    • If you execute the insert overwrite ... select ... from ...; statement and 1,000 mappers are allocated on the source table specified by from_tablename, 1,000 TSV or CSV files will be generated.

    • You can use flexible semantics and configurations of MaxCompute to limit the number of files that can be generated. If an outputer runs in a mapper, you can change the value of the odps.stage.mapper.split.size parameter to determine the number of files that can be generated and the number of concurrent mappers. If an outputer runs in a reducer, you can change the value of the odps.stage.reducer.num parameter to adjust the number of files that can be generated. If an outputer runs in a joiner, you can change the value of the odps.stage.joiner.num parameter to adjust the number of files that can be generated.