
DataWorks: Upload data

Last Updated: Nov 13, 2024

The data upload feature of DataWorks allows you to upload data from sources such as on-premises files, DataAnalysis workbooks, and Object Storage Service (OSS) objects to compute engines such as MaxCompute, E-MapReduce (EMR) Hive, and Hologres for analysis and management. The feature provides an easy-to-use data transmission service that helps you quickly implement data-driven business. This topic describes how to use the data upload feature to upload data.

Precautions

If you perform cross-border data transmission operations, such as transmitting data from China to outside China, or transmitting data between different countries or regions, make sure that you understand and comply with relevant compliance declarations beforehand. Otherwise, data may fail to be uploaded, and you may be held legally responsible. For more information, see Appendix: Compliance statement for cross-border data uploads.

Feature description

You can use the data upload feature to upload data from on-premises files, DataWorks DataAnalysis workbooks, and Object Storage Service (OSS) objects to tables of the MaxCompute, E-MapReduce (EMR) Hive, and Hologres compute engines. The following requirements apply to the different data sources:

  • On-premises files:

    • You can upload files in the CSV or XLSX format. If you upload a CSV file, the file can be up to 5 GB in size. If you upload an XLSX file, the file can be up to 100 MB in size.

    • By default, only the data of the first sheet in a file is uploaded. If you want to upload the data of multiple sheets, create a separate file for each sheet and make sure that the sheet from which you want to upload data is the first sheet of the created file (see the sketch after this list).

  • OSS objects: You can upload data only from buckets in the same region as the current DataWorks workspace.
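For example, the following is a minimal sketch that uses pandas to split a multi-sheet workbook into one single-sheet XLSX file per sheet; the file name workbook.xlsx is a placeholder.

```python
import pandas as pd

# Read every sheet of the workbook; sheet_name=None returns a dict
# that maps each sheet name to a DataFrame.
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)

# Write each sheet to its own single-sheet file so that the sheet to
# upload is always the first (and only) sheet of its file.
for name, frame in sheets.items():
    frame.to_excel(f"{name}.xlsx", index=False)
```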

Limits

You can upload data only to tables that you own. In addition, the file size limits apply: a CSV file can be up to 5 GB in size, and an XLSX file can be up to 100 MB in size.

Billing

You are charged the following fees for data uploads:

  • Data transmission fee

  • Computing and storage fees when new tables are created

The fees are included in the bill of the related compute engine service. For billing details, see MaxCompute billing overview, Hologres billing overview, and E-MapReduce billing overview.

Prerequisites

  • The required data source is added to store the data that you want to upload, so that you can analyze and manage the data in the data source. For information about how to add data sources, see Add a MaxCompute data source, Add a Hive data source, and Add a Hologres data source.

  • Optional. If you want to upload OSS objects, the following conditions must be met:

    • OSS is activated, a bucket is created, and the data that you want to upload is stored in the bucket. For more information, see Create a bucket and Upload objects (a sketch that uploads a local file to OSS appears after this list).

    • The Alibaba Cloud account that you want to use to upload data is granted permissions to access the destination bucket. For information about how to grant permissions to an Alibaba Cloud account, see Overview.

  • Optional. If you want to upload workbooks, a workbook must be created and data must be imported into the workbook in DataAnalysis. For more information, see Create and manage a workbook and Import data to a workbook.
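If you still need to stage a local file in OSS, the following is a minimal sketch that uploads the file by using the oss2 SDK; the credentials, endpoint, bucket name, and object name are placeholders.

```python
import oss2

# Placeholder credentials; the account must have write access to the bucket.
auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "my-bucket")

# Store the local CSV file as an OSS object so that it can later be
# selected on the Upload Data page.
bucket.put_object_from_file("uploads/data.csv", "data.csv")
```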

Go to the Upload Data page

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the upper-left corner of the DataStudio page, click the icon and choose All Products > Data Integration > Upload and Download.

  3. In the left-side navigation pane of the Upload and Download page, click the icon to go to the Upload Data page.

  4. Click Upload Data and upload the desired data by following the on-screen instructions.

Upload data

DataWorks allows you to upload on-premises files, DataAnalysis workbooks, and OSS objects to MaxCompute, EMR Hive, or Hologres. The upload settings vary based on the type of data that you want to upload.

Upload on-premises files

  1. Select the data that you want to upload.

    • Data Source: Select Local File.

    • Specify Data to Be Uploaded: For the Select File parameter, click the dotted-line rectangle to select an on-premises file, or drag the file to the rectangle. Then, configure the Whether To Remove Dirty Data parameter. Valid values:

      • Yes: If dirty data is identified, the platform ignores the dirty data and continues the upload.

      • No: If dirty data is identified, the data upload is blocked.

    Note
    • You can upload files in the CSV or XLSX format. If you upload a CSV file, the file can be up to 5 GB in size. If you upload an XLSX file, the file can be up to 100 MB in size.

    • By default, only the data of the first sheet in a file is uploaded. If you want to upload the data of multiple sheets in a file, create a separate file for each sheet and make sure that the sheet from which you want to upload data is the first sheet of the created file.

    • Dirty data: For example, if the value of a cell in a file is of the STRING type but is mapped to a destination field of the INT type, the row fails to be written and is identified as dirty data. The exact definition of dirty data is determined by the judgment logic of the platform. A sketch that pre-checks a file for such rows appears after this note.
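    As an illustration only (the platform applies its own judgment logic), the following sketch uses pandas to flag rows of a CSV file whose values cannot be cast to an INT destination field; the file and column names are placeholders.

    ```python
    import pandas as pd

    df = pd.read_csv("data.csv")

    # Try to cast the column that is mapped to an INT destination field.
    # Non-empty values that fail the cast would be rejected as dirty data.
    casted = pd.to_numeric(df["quantity"], errors="coerce")
    dirty_rows = df[casted.isna() & df["quantity"].notna()]
    print(dirty_rows)
    ```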

  2. Configure the table in which you want to store the data that you want to upload.

    You can store the data that you want to upload in an existing table or a new table of a specified data source.

    The following parameters are available:

    • Compute Engine: You can upload data only to MaxCompute, EMR Hive, or Hologres.

    • MaxCompute Project Name or Data Source: the project or data source in which you want to store the data that you want to upload. The required parameters vary based on the type of compute engine. You can check the parameters in the DataWorks console.

      Note
      • If you set the Compute Engine parameter to EMR Hive, you can select only a data source that is added in Alibaba Cloud instance mode.

      • Projects in the production environment are differentiated from projects in the development environment:

        • If you select a project in the production environment, you can select only a table in the production environment as the destination table.

        • If you select a project in the development environment, you can select only a table in the development environment as the destination table.

    • Destination Table (set to Existing Table):

      • Select Destination Table: Select the table in which you want to store the data that you want to upload. You can enter a keyword to search for the desired table.

        Note
        You can upload data only to a table that you own. For more information, see the Limits section in this topic.

      • Upload Method: the method that is used to add data to the destination table. Configure this parameter based on the mappings between source fields and destination fields, which are configured in the next step.

        • Clear Table Data First: The system clears the data in the destination table and then imports all data into the mapped fields in the destination table.

        • Append: The data that you want to upload is appended to the mapped fields in the destination table.

      • Policy For Primary Key Conflict: the policy that is used to handle primary key conflicts in the destination table during the data upload. Valid values:

        • Ignore: The uploaded data is ignored, and the data in the destination table is not updated.

        • Update (replace): The uploaded data overwrites all old data in the destination table. NULL is forcefully written to the fields for which column mappings are not configured.

        • update: The uploaded data overwrites only the field data for which column mappings are configured in the destination table.

        Note
        This parameter is required only for a Hologres compute engine.

    • Destination Table (set to Create Table):

      • Table Name: the name of the new table.

      • Table Type: Select Non-partitioned Table or Partitioned Table. If you select Partitioned Table, you must specify the partition field and its value.

      • Lifecycle: the validity period of the table. After the validity period elapses, the table may become unavailable. For more information about the table lifecycle, see Lifecycle and Lifecycle management operations.

    Note
    You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select it for the Destination Table parameter. For information about how to create a table, see Manage tables. A sketch that creates a MaxCompute destination table from code appears after this note.
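    For a MaxCompute project, the following is a minimal sketch that creates a partitioned destination table with a lifecycle by using the PyODPS SDK, as an alternative to creating the table in the console; the credentials, endpoint, project, and table names are placeholders.

    ```python
    from odps import ODPS

    # Placeholder credentials, project, and endpoint; replace with your own values.
    o = ODPS("<access_key_id>", "<access_key_secret>", project="my_project",
             endpoint="https://service.cn-shanghai.maxcompute.aliyun.com/api")

    # Create a partitioned destination table with a 30-day lifecycle, mirroring
    # the Table Type and Lifecycle parameters on the Upload Data page.
    o.create_table(
        "uploaded_orders",
        ("order_id BIGINT, amount DOUBLE", "pt STRING"),  # (columns, partition columns)
        if_not_exists=True,
        lifecycle=30,
    )
    ```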

  3. Preview the data that you want to upload and specify fields in the destination table.

    After you select the data that you want to upload and the destination table in which you want to store the data, you can preview the data details and configure mappings between the fields in the source file and the fields in the destination table. The data can be uploaded only after you configure the mappings.

    Note

    You can preview only the first 20 data records.

    The following parameters are available:

    • Settings for fields in the destination table when Destination Table is set to Existing Table: You must configure mappings between the fields in the data file and the fields in the destination table. The data can be uploaded only after you configure the mappings. The mapping methods are Mapping by Column Name and Mapping by Order. You can also change the name of a mapped field in the destination table.

      Note
      • If no mapping exists between the data that you want to upload and the destination fields, the data is dimmed and is not uploaded.

      • One-to-many mappings are not supported.

      • The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.

    • Settings for fields in the destination table when Destination Table is set to Create Table: You can click Intelligent Field Generation to allow the system to fill in the field information. You can also manually modify the field information.

      Note
      • The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.

      • You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select it for the Destination Table parameter. For information about how to create a table, see Manage tables.

    • File Encoding Format: If the data that you want to upload to the destination table contains garbled characters, you can switch to another available encoding format. Valid values: UTF-8, GB18030, and Big5. A sketch that converts a file to UTF-8 appears after this list.

    • Ignore First Row: specifies whether to upload the first row of the data file to the destination table. In most cases, the first row contains column names.

      • If you select the check box, the first row of the file is not uploaded to the destination table.

      • If you do not select the check box, the first row of the file is uploaded to the destination table.
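    If the source file is not UTF-8-encoded, the following is a minimal sketch that converts it before the upload, assuming the source encoding is GB18030; the file names are placeholders.

    ```python
    # Convert a GB18030-encoded file to UTF-8 before the upload so that the
    # preview does not show garbled characters.
    with open("data_gb18030.csv", "r", encoding="gb18030") as src, \
         open("data_utf8.csv", "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
    ```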

  4. Click Upload Data to upload the data.

Upload DataAnalysis workbooks

  1. Select the data that you want to upload.

    • Data Source: Select Workbook.

    • Specify Data to Be Uploaded: Select a created workbook and configure the Whether To Remove Dirty Data parameter.

      • Yes: If dirty data is identified, the platform ignores the dirty data and continues the upload.

      • No: If dirty data is identified, the data upload is blocked.

    Note
    • For information about how to create a workbook and import data to the workbook, see Create and manage a workbook and Import data to a workbook.

    • Dirty data: For example, if the value of a cell is of the STRING type but is mapped to a destination field of the INT type, the row fails to be written and is identified as dirty data. The exact definition of dirty data is determined by the judgment logic of the platform.

  2. Configure the table in which you want to store the data that you want to upload.

    The parameters are the same as those for on-premises files. For more information, see Step 2 in the Upload on-premises files section of this topic.

  3. Preview the data that you want to upload and specify fields in the destination table.

    The preview and field mapping parameters are the same as those for on-premises files. For more information, see Step 3 in the Upload on-premises files section of this topic.

  4. Click Upload Data and upload the desired data by following the on-screen instructions.

Upload OSS objects

  1. Select the data that you want to upload.

    • Data Source: Select Alibaba Cloud OSS.

    • Specify Data to Be Uploaded: Select an object in the created bucket and configure the Whether To Remove Dirty Data parameter.

      • Yes: If dirty data is identified, the platform ignores the dirty data and continues the upload.

      • No: If dirty data is identified, the data upload is blocked.

    Note
    • You can upload data only from buckets in the same region as the current DataWorks workspace (see the sketch after this note). For information about how to create a bucket, see Create a bucket.

    • Dirty data: For example, if the value of a cell is of the STRING type but is mapped to a destination field of the INT type, the row fails to be written and is identified as dirty data. The exact definition of dirty data is determined by the judgment logic of the platform.
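    The following is a minimal sketch that checks the region of a bucket by using the oss2 SDK; the credentials, endpoint, and bucket name are placeholders.

    ```python
    import oss2

    # Placeholder credentials, endpoint, and bucket name.
    auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
    bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "my-bucket")

    # The bucket must reside in the same region as the DataWorks workspace,
    # for example oss-cn-shanghai for a workspace in the China (Shanghai) region.
    print(bucket.get_bucket_info().location)
    ```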

  2. Configure the table in which you want to store the data that you want to upload.

    The parameters are the same as those for on-premises files. For more information, see Step 2 in the Upload on-premises files section of this topic.

  3. Preview the data that you want to upload and specify fields in the destination table.

    The preview and field mapping parameters are the same as those for on-premises files. For more information, see Step 3 in the Upload on-premises files section of this topic.

  4. Click Upload Data to upload the data.

What to do next

After you upload the data, you can perform the following operations based on your business requirements:

  • Data query: You can use DataAnalysis to query and analyze the data. For more information, see SQL query. A sketch that queries an uploaded table from code appears after this list.

  • View the details of the uploaded data: On the Upload Data page, you can click the name of the destination table to go to the DataMap page and view the details of the destination table. For more information, see Query and manage common data.
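For a quick check from code, the following is a minimal sketch that queries an uploaded MaxCompute table by using the PyODPS SDK; the credentials, endpoint, project, and table names are placeholders.

```python
from odps import ODPS

# Placeholder credentials, project, and endpoint; replace with your own values.
o = ODPS("<access_key_id>", "<access_key_secret>", project="my_project",
         endpoint="https://service.cn-shanghai.maxcompute.aliyun.com/api")

# Read back a few rows of the uploaded table to verify the import.
with o.execute_sql("SELECT * FROM uploaded_orders LIMIT 10").open_reader() as reader:
    for record in reader:
        print(record.values)
```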

Appendix: Compliance statement for cross-border data uploads

Important

If you perform cross-border data transmission operations, such as transmitting data from China to outside China, or transmitting data between different countries or regions, make sure that you understand and comply with relevant compliance declarations beforehand. Otherwise, data may fail to be uploaded, and you may be held legally responsible.

Your business data in the cloud will be transmitted to the selected region or product deployment area when you perform cross-border data operations. You must make sure that the relevant operations comply with the following requirements:

  • You have the required permissions to process relevant business data in the cloud.

  • You use adequate data security protection technologies and strategies.

  • Data transmission operations comply with relevant laws and regulations. For example, the data to be transmitted does not contain content that is restricted or prohibited from transmission or disclosure by applicable laws.

We recommend that you obtain professional legal or compliance advice before you perform data uploads that may involve cross-border data transmission, to ensure compliance with all applicable laws, regulations, and regulatory policies. For example, obtain valid authorization from the owners of the personal information, complete the filing of related contract provisions, and fulfill legal obligations such as carrying out necessary security assessments.

If you perform cross-border data operations without adhering to this compliance statement, you shall bear the corresponding legal consequences. You shall also be liable for any resulting losses suffered by Alibaba Cloud and its affiliated companies.
