The data upload feature of DataWorks allows you to upload data from sources such as on-premises files, DataAnalysis workbooks, and Object Storage Service (OSS) objects to compute engines such as MaxCompute, E-MapReduce (EMR) Hive, and Hologres for analysis and management. The feature provides an easy-to-use data transmission service that helps you quickly implement data-driven business. This topic describes how to use the data upload feature to upload data.
Precautions
If you perform cross-border data transmission operations, such as transmitting data from China to outside China, or transmitting data between different countries or regions, make sure that you understand and comply with relevant compliance declarations beforehand. Otherwise, data may fail to be uploaded, and you may be held legally responsible. For more information, see Appendix: Compliance statement for cross-border data uploads.
Feature description
You can use the data upload feature to upload data in on-premises files, DataWorks DataAnalysis workbooks, and OSS objects to tables of the MaxCompute, EMR Hive, and Hologres compute engines. The following requirements apply to the different data sources:
On-premises files:
You can upload files in the CSV or XLSX format. A CSV file can be up to 5 GB in size, and an XLSX file can be up to 100 MB in size.
By default, only the data of the first sheet in a file is uploaded. If you want to upload the data of multiple sheets in a file, create a separate file for each sheet and make sure that the sheet whose data you want to upload is the first sheet of the created file. For a quick way to split a multi-sheet workbook, see the sketch after this list.
OSS objects: You can upload data only from buckets in the same region as the current DataWorks workspace.
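For example, if you maintain a multi-sheet workbook, you can split it into single-sheet files before you upload it. The following Python sketch shows one way to do this with pandas and openpyxl; the file name report.xlsx is a hypothetical placeholder, not a value from this topic.

```python
# Split a multi-sheet Excel workbook into single-sheet files so that each
# sheet becomes "the first sheet" of its own file for upload.
# Requires: pip install pandas openpyxl
import pandas as pd

SOURCE = "report.xlsx"  # hypothetical multi-sheet workbook

# sheet_name=None loads every sheet into a dict of {sheet name: DataFrame}.
sheets = pd.read_excel(SOURCE, sheet_name=None)

for name, frame in sheets.items():
    # Each output file contains exactly one sheet, which is therefore the
    # first sheet that the data upload feature reads.
    out = f"{SOURCE.rsplit('.', 1)[0]}_{name}.xlsx"
    frame.to_excel(out, index=False)
    print(f"wrote {out} ({len(frame)} rows)")
```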
Limits
Resource group: Data upload requires both a resource group for scheduling and a resource group for Data Integration.
Upload data to MaxCompute:
Serverless resource groups and old-version resource groups are supported. Old-version resource groups consist of exclusive resource groups for scheduling and exclusive resource groups for Data Integration. We recommend that you use serverless resource groups. Make sure that the data source used by data upload tasks is connected to the selected resource groups.
The selected exclusive resource groups and serverless resource groups must be associated with the DataWorks workspace where the table that is used to receive data resides.
Upload data to EMR Hive or Hologres:
Only serverless resource groups and exclusive resource groups are supported. Exclusive resource groups consist of exclusive resource groups for scheduling and exclusive resource groups for Data Integration. You must select an exclusive resource group or a serverless resource group for the compute engine type on the System Management page of DataAnalysis.
The selected resource groups must be associated with the DataWorks workspace where the table that is used to receive data resides. Make sure that the data source used by data upload tasks is connected to the selected resource groups.
Note: For information about how to configure a resource group for a compute engine in DataAnalysis, see System management.
For information about how to establish network connections between a resource group and a data source, see Establish a network connection between a resource group and a data source.
For information about how to associate an exclusive resource group with a workspace, see Create and use an exclusive resource group for scheduling and Create and use an exclusive resource group for Data Integration.
Table: You can upload data only to a table that you own. You can use one of the following methods to determine whether you are the owner of a table:
If Table Owner is displayed on the details page of a table in Data Map, you are the owner of the table. For information about how to view the details of a table, see the View the details of a table section in the "MaxCompute table data" topic.
If you create a table to store the uploaded data, you are the owner of the table. For more information, see the Upload data to a created table section in this topic.
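If you are unsure whether you own a table, you can also check the owner programmatically. The following is a minimal sketch that assumes you use the PyODPS library and have valid credentials; all angle-bracket values are hypothetical placeholders.

```python
# Check the owner of a MaxCompute table before uploading to it.
# Requires: pip install pyodps
from odps import ODPS

# Hypothetical placeholder credentials and names.
o = ODPS(
    access_id="<AccessKey ID>",
    secret_access_key="<AccessKey Secret>",
    project="<project_name>",
    endpoint="<maxcompute_endpoint>",
)

table = o.get_table("<table_name>")
# The owner attribute holds the account that owns the table; compare it
# with your own account to decide whether you can upload to the table.
print("table owner:", table.owner)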
Billing
You are charged the following fees for data uploads:
Data transmission fee
Computing and storage fees when new tables are created
The fees are included in the bills of the related compute engine service. For information about the billing details, see the following topics about the billing rules of the related compute engine service: MaxCompute billing overview, Hologres billing overview, and E-MapReduce billing overview.
Prerequisites
A data source is added to store the data that you want to upload so that you can analyze and manage the data in the data source. For information about how to add data sources, see Add a MaxCompute data source, Add a Hive data source, and Add a Hologres data source.
Optional. If you want to upload OSS objects, the following conditions must be met:
OSS is activated, a bucket is created, and the data that you want to upload is stored in the bucket as objects. For more information, see Create a bucket and Upload objects, and the sketch after this list.
The Alibaba Cloud account that you want to use to upload data is granted permissions to access the destination bucket. For information about how to grant permissions to an Alibaba Cloud account, see Overview.
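As a reference for these prerequisites, the following sketch stages a local file in a bucket by using the oss2 library. The endpoint, bucket name, and object key are hypothetical placeholders; the bucket must be in the same region as your DataWorks workspace.

```python
# Stage an on-premises file in an OSS bucket so that it can be selected
# on the Upload Data page. Requires: pip install oss2
import oss2

# Hypothetical placeholder values.
auth = oss2.Auth("<AccessKey ID>", "<AccessKey Secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<bucket_name>")

# Upload a local CSV file as an object; the object key is the path
# under which the file appears in the bucket.
bucket.put_object_from_file("upload/sales.csv", "sales.csv")
```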
Optional. If you want to upload workbooks, a workbook must be created and data must be imported to the workbook in DataAnalysis. For more information, see Create and manage a workbook and Import data to a workbook.
Go to the Upload Data page
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the upper-left corner of the DataStudio page, click the menu icon and choose Upload and Download.
In the left-side navigation pane of the Upload and Download page, click the upload icon to go to the Upload Data page.
Click Upload Data and upload the desired data by following the on-screen instructions.
Upload data
DataWorks allows you to upload on-premises files, DataAnalysis workbooks, and OSS objects to MaxCompute, EMR Hive, or Hologres. The upload settings vary based on the type of data that you want to upload.
Upload on-premises files
Select the data that you want to upload.
Data Source: Select Local File.
Specify Data to Be Uploaded: Click the dotted-line rectangle for the Select File parameter to select an on-premises file, or drag the file to the rectangle. Then, configure the Whether To Remove Dirty Data parameter. Valid values:
Yes: If dirty data is identified, the platform ignores it and continues to upload data.
No: If dirty data is identified, the data upload is blocked.
Note: You can upload files in the CSV or XLSX format. A CSV file can be up to 5 GB in size, and an XLSX file can be up to 100 MB in size. By default, only the data of the first sheet in a file is uploaded. If you want to upload the data of multiple sheets in a file, create a separate file for each sheet and make sure that the sheet whose data you want to upload is the first sheet of the created file.
Dirty data: For example, if the data of a cell in a file is of the string type but is mapped to a destination field of the INT type, the data in the row fails to be written and is identified as dirty data. Whether specific data is dirty is determined by the judgment logic of the platform.
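The exact dirty-data logic is internal to the platform, but you can approximate the example above with a local pre-check. The following pandas sketch flags rows whose values cannot be parsed as integers; the file and column names are hypothetical.

```python
# Pre-check a CSV column that will be mapped to an INT destination field:
# rows whose values cannot be parsed as integers would likely be treated
# as dirty data. Requires: pip install pandas
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical source file
col = "quantity"               # column mapped to an INT field

# errors="coerce" turns unparseable values into NaN instead of raising.
parsed = pd.to_numeric(df[col], errors="coerce")
dirty = df[parsed.isna() & df[col].notna()]

print(f"{len(dirty)} potentially dirty rows in column '{col}':")
print(dirty)
```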
Configure the table in which you want to store the data that you want to upload.
You can store the data that you want to upload in an existing table or a new table of a specified data source.
The following table describes the parameters.
Parameter
Description
Compute Engine
You can upload data only to MaxCompute, EMR Hive, or Hologres.
MaxCompute Project Name or Data Source
The project or data source in which you want to store the data that you want to upload. The required parameters vary based on the type of compute engine. You can check the parameters in the DataWorks console.
Note: If you set the Compute Engine parameter to EMR Hive, you can select only a data source that is added in Alibaba Cloud instance mode.
Projects in the production environment are differentiated from projects in the development environment.
If you select a project in the production environment, you can select only a table in the production environment as the destination table.
If you select a project in the development environment, you can select only a table in the development environment as the destination table.
Destination Table (set to Existing Table)
Select Destination Table: Select the table in which you want to store the data that you want to upload. You can enter a keyword to search for the desired table.
Note: You can upload data only to a table that you own. For more information, see the Limits section in this topic.
Upload Method: the method that is used to add data to the destination table. Configure this parameter based on the mappings between source fields and destination fields, which are configured in the next step.
If you set the Upload Method parameter to Clear Table Data First, the system clears data in the destination table, and then imports all data to the mapped fields in the destination table.
If you set the Upload Method parameter to Append, the data that you want to upload is appended to the mapped fields in the destination table.
Policy For Primary Key Conflict: the policy that is used to handle a primary key conflict in the destination table during data upload. Valid values:
Ignore: The uploaded data is ignored. The data in the destination table is not updated.
Update (replace): The uploaded data overwrites all old data in the destination table. NULL is forcefully written to the fields for which column mappings are not configured.
update: The uploaded data overwrites only the field data for which column mappings are configured in the destination table.
Note: This parameter is required only for a Hologres compute engine.
Destination Table (set to Create Table)
Table Name: the name of the new table.
Table Type: Select Non-partitioned Table or Partitioned Table. If you set the parameter to Partitioned Table, you must specify the partition field and the value of the field.
Lifecycle: the validity period of the table. After the validity period elapses, the table may become unavailable. For more information about the table lifecycle, see Lifecycle and Lifecycle management operations. For a programmatic sketch of these table settings, see the example after this procedure.
Note: You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select the table for the Destination Table parameter on the Upload Data page. For information about how to create a table, see Manage tables.
Preview the data that you want to upload and specify fields in the destination table.
After you select the data that you want to upload and the destination table in which you want to store the data, you can preview the data details and configure mappings between fields in the source file and fields in the destination table. Data can be uploaded only after you configure the mappings.
Note: You can preview only the first 20 data records.
The following table describes the parameters.
Parameter
Description
Settings for fields in the destination table when Destination Table is set to Existing Table
You must configure mappings between fields in the data file and fields in the destination table. Data can be uploaded only after you configure the mappings. The mapping methods are Mapping by Column Name and Mapping by Order. You can also configure the name of a mapped field in the destination table.
Note: If no mapping exists between the data that you want to upload and the destination fields, the data is dimmed and not uploaded.
One-to-more mappings are not supported.
The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.
Settings for fields in the destination table when Destination Table is set to Create Table
You can click Intelligent Field Generation to allow the system to fill out the field information. You can also manually modify the field information.
Note: The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.
You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select the table for the Destination Table parameter on the Upload Data page. For information about how to create a table, see Manage tables.
File Encoding Format
If the data that you want to upload to the destination table contains garbled characters, you can switch to other available encoding formats. Valid values: UTF-8, GB18030, and Big5.
Ignore First Row
Specifies whether to upload the first row of the data file to the destination table. In most cases, the first row contains column names.
If you select the check box, the first row of the file is not uploaded to the destination table.
If you do not select the check box, the first row of the file is uploaded to the destination table.
Click Upload Data to upload the data.
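For reference, the Table Name, Table Type, and Lifecycle settings that the Create Table option configures correspond roughly to the following PyODPS sketch. It reuses the connection object o from the ownership sketch earlier in this topic, and all table and column names are hypothetical.

```python
# Create a partitioned MaxCompute table with a 30-day lifecycle, similar
# to what the Create Table option on the Upload Data page configures.
table = o.create_table(
    "uploaded_sales",                       # Table Name
    ("id bigint, name string, qty bigint",  # columns
     "pt string"),                          # partition field -> Partitioned Table
    if_not_exists=True,
    lifecycle=30,                           # Lifecycle, in days
)
print("created:", table.name)
```

A table created under your own account this way also satisfies the ownership limit described in the Limits section.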
Upload DataAnalysis workbooks
Select the data that you want to upload.
Data Source: Select Workbook.
Specify Data to Be Uploaded: Select a created workbook and configure the Whether To Remove Dirty Data parameter.
Yes: If dirty data is identified, the platform ignores it and continues to upload data.
No: If dirty data is identified, the data upload is blocked.
Note: For information about how to create a workbook and import data to the workbook, see Create and manage a workbook and Import data to a workbook.
Dirty data: For example, if the data of a cell in a file is of the string type but is mapped to a destination field of the INT type, the data in the row fails to be written and is identified as dirty data. Whether specific data is dirty is determined by the judgment logic of the platform.
Configure the table in which you want to store the data that you want to upload.
You can store the data that you want to upload in an existing table or a new table of a specified data source.
The following table describes the parameters.
Parameter
Description
Compute Engine
You can upload data only to MaxCompute, EMR Hive, or Hologres.
MaxCompute Project Name or Data Source
The project or data source in which you want to store the data that you want to upload. The required parameters vary based on the type of compute engine. You can check the parameters in the DataWorks console.
Note: If you set the Compute Engine parameter to EMR Hive, you can select only a data source that is added in Alibaba Cloud instance mode.
Projects in the production environment are differentiated from projects in the development environment.
If you select a project in the production environment, you can select only a table in the production environment as the destination table.
If you select a project in the development environment, you can select only a table in the development environment as the destination table.
Destination Table (set to Existing Table)
Select Destination Table: Select the table in which you want to store the data that you want to upload. You can enter a keyword to search for the desired table.
Note: You can upload data only to a table that you own. For more information, see the Limits section in this topic.
Upload Method: the method that is used to add data to the destination table. Configure this parameter based on the mappings between source fields and destination fields, which are configured in the next step.
If you set the Upload Method parameter to Clear Table Data First, the system clears data in the destination table, and then imports all data to the mapped fields in the destination table.
If you set the Upload Method parameter to Append, the data that you want to upload is appended to the mapped fields in the destination table.
Policy For Primary Key Conflict: the policy that is used to handle a primary key conflict in the destination table during data upload. Valid values:
Ignore: The uploaded data is ignored. The data in the destination table is not updated.
Update (replace): The uploaded data overwrites all old data in the destination table. NULL is forcefully written to the fields for which column mappings are not configured.
update: The uploaded data overwrites only the field data for which column mappings are configured in the destination table.
Note: This parameter is required only for a Hologres compute engine.
Destination Table (set to Create Table)
Table Name: the name of the new table.
Table Type: Select Non-partitioned Table or Partitioned Table. If you set the parameter to Partitioned Table, you must specify the partition field and the value of the field.
Lifecycle: the validity period of the table. After the validity period elapses, the table may become unavailable. For more information about the table lifecycle, see Lifecycle and Lifecycle management operations.
Note: You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select the table for the Destination Table parameter on the Upload Data page. For information about how to create a table, see Manage tables.
Preview the data that you want to upload and specify fields in the destination table.
After you select the data that you want to upload and the destination table in which you want to store the data, you can preview the data details and configure mappings between fields in the source file and fields in the destination table. Data can be uploaded only after you configure the mappings.
Note: You can preview only the first 20 data records.
The following table describes the parameters.
Parameter
Description
Settings for fields in the destination table when Destination Table is set to Existing Table
You must configure mappings between fields in the data file and fields in the destination table. Data can be uploaded only after you configure the mappings. The mapping methods are Mapping by Column Name and Mapping by Order. You can also configure the name of a mapped field in the destination table.
Note: If no mapping exists between the data that you want to upload and the destination fields, the data is dimmed and not uploaded.
One-to-more mappings are not supported.
The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.
Settings for fields in the destination table when Destination Table is set to Create Table
You can click Intelligent Field Generation to allow the system to fill out the field information. You can also manually modify the field information.
Note: The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.
You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select the table for the Destination Table parameter on the Upload Data page. For information about how to create a table, see Manage tables.
File Encoding Format
If the data that you want to upload to the destination table contains garbled characters, you can switch to other available encoding formats. Valid values: UTF-8, GB18030, and Big5.
Ignore First Row
Specifies whether to upload the first row of the data file to the destination table. In most cases, the first row contains column names.
If you select the check box, the first row of the file is not uploaded to the destination table.
If you do not select the check box, the first row of the file is uploaded to the destination table.
Click Upload Data and upload the desired data by following the on-screen instructions.
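The field mapping step above offers Mapping by Column Name and Mapping by Order. The platform applies these mappings itself; the following pandas sketch is only a local illustration of how the two methods pair source columns with destination fields, using hypothetical column names.

```python
# Local illustration of the two field mapping methods.
# Requires: pip install pandas
import pandas as pd

src = pd.DataFrame({"name": ["a", "b"], "qty": [1, 2]})  # source columns
dest_fields = ["qty", "name"]                            # destination field order

# Mapping by Column Name: pair columns whose names match, regardless of order.
by_name = src[[c for c in dest_fields if c in src.columns]]

# Mapping by Order: pair the first source column with the first destination
# field, the second with the second, and so on.
by_order = src.copy()
by_order.columns = dest_fields[: len(src.columns)]

print(by_name)   # columns selected and ordered by matching names
print(by_order)  # original columns relabeled positionally
```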
Upload OSS objects
Select the data that you want to upload.
Data Source: Select Alibaba Cloud OSS.
Specify Data to Be Uploaded: Select an object in the created bucket and configure the Whether To Remove Dirty Data parameter.
Yes: If dirty data is identified, the platform ignores it and continues to upload data.
No: If dirty data is identified, the data upload is blocked.
Note: You can upload data only from buckets in the same region as the current DataWorks workspace. For information about how to create a bucket, see Create a bucket.
Dirty data: For example, if the data of a cell in a file is of the string type but is mapped to a destination field of the INT type, the data in the row fails to be written and is identified as dirty data. Whether specific data is dirty is determined by the judgment logic of the platform.
Configure the table in which you want to store the data that you want to upload.
You can store the data that you want to upload in an existing table or a new table of a specified data source.
The following table describes the parameters.
Parameter
Description
Compute Engine
You can upload data only to MaxCompute, EMR Hive, or Hologres.
MaxCompute Project Name or Data Source
The project or data source in which you want to store the data that you want to upload. The required parameters vary based on the type of compute engine. You can check the parameters in the DataWorks console.
Note: If you set the Compute Engine parameter to EMR Hive, you can select only a data source that is added in Alibaba Cloud instance mode.
Projects in the production environment are differentiated from projects in the development environment.
If you select a project in the production environment, you can select only a table in the production environment as the destination table.
If you select a project in the development environment, you can select only a table in the development environment as the destination table.
Destination Table (set to Existing Table)
Select Destination Table: Select the table in which you want to store the data that you want to upload. You can enter a keyword to search for the desired table.
Note: You can upload data only to a table that you own. For more information, see the Limits section in this topic.
Upload Method: the method that is used to add data to the destination table. Configure this parameter based on the mappings between source fields and destination fields, which are configured in the next step.
If you set the Upload Method parameter to Clear Table Data First, the system clears data in the destination table, and then imports all data to the mapped fields in the destination table.
If you set the Upload Method parameter to Append, the data that you want to upload is appended to the mapped fields in the destination table.
Policy For Primary Key Conflict: the policy that is used to handle a primary key conflict in the destination table during data upload. Valid values:
Ignore: The uploaded data is ignored. The data in the destination table is not updated.
Update (replace): The uploaded data overwrites all old data in the destination table. NULL is forcefully written to the fields for which column mappings are not configured.
update: The uploaded data overwrites only the field data for which column mappings are configured in the destination table.
Note: This parameter is required only for a Hologres compute engine.
Destination Table (set to Create Table)
Table Name: the name of the new table.
Table Type: Select Non-partitioned Table or Partitioned Table. If you set the parameter to Partitioned Table, you must specify the partition field and the value of the field.
Lifecycle: the validity period of the table. After the validity period elapses, the table may become unavailable. For more information about the table lifecycle, see Lifecycle and Lifecycle management operations.
Note: You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select the table for the Destination Table parameter on the Upload Data page. For information about how to create a table, see Manage tables.
Preview the data that you want to upload and specify fields in the destination table.
After you select the data that you want to upload and the destination table in which you want to store the data, you can preview the data details and configure mappings between fields in the source file and fields in the destination table. Data can be uploaded only after you configure the mappings.
Note: You can preview only the first 20 data records.
The following table describes the parameters.
Parameter
Description
Settings for fields in the destination table when Destination Table is set to Existing Table
You must configure mappings between fields in the data file and fields in the destination table. Data can be uploaded only after you configure the mappings. The mapping methods are Mapping by Column Name and Mapping by Order. You can also configure the name of a mapped field in the destination table.
Note: If no mapping exists between the data that you want to upload and the destination fields, the data is dimmed and not uploaded.
One-to-more mappings are not supported.
The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.
Settings for fields in the destination table when Destination Table is set to Create Table
You can click Intelligent Field Generation to allow the system to fill out the field information. You can also manually modify the field information.
Note: The Field Name and Field Type parameters of the source file must be configured. Otherwise, the data cannot be uploaded.
You cannot set the Destination Table parameter to Create Table for an EMR Hive or Hologres compute engine on the Upload Data page. You must create a table in DataStudio before you can select the table for the Destination Table parameter on the Upload Data page. For information about how to create a table, see Manage tables.
File Encoding Format
If the data that you want to upload to the destination table contains garbled characters, you can switch to other available encoding formats. Valid values: UTF-8, GB18030, and Big5.
Ignore First Row
Specifies whether to upload the first row of the data file to the destination table. In most cases, the first row contains column names.
If you select the check box, the first row of the file is not uploaded to the destination table.
If you do not select the check box, the first row of the file is uploaded to the destination table.
Click Upload Data to upload the data.
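The File Encoding Format and Ignore First Row parameters described above behave like common file-reading options. The following sketch is a local illustration with pandas; the file name and encoding are hypothetical, and reading a GB18030 file as UTF-8 is what typically produces the garbled characters the encoding parameter addresses.

```python
# Show how File Encoding Format and Ignore First Row map to common
# file-reading options. Requires: pip install pandas
import pandas as pd

path = "sales_gb18030.csv"  # hypothetical file saved in GB18030

# File Encoding Format: read with the encoding the file was saved in.
# Ignore First Row selected: the first row is treated as column names
# and is not uploaded as data.
with_header = pd.read_csv(path, encoding="gb18030", header=0)

# Ignore First Row not selected: every row, including the first, is data.
no_header = pd.read_csv(path, encoding="gb18030", header=None)

print(with_header.head())
print(no_header.head())
```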
What to do next
After you upload the data, you can perform the following operations based on your business requirements:
Data query: You can use DataAnalysis to query and analyze data. For more information, see SQL query.
View the details of the uploaded data: On the Upload Data page, you can click the name of the destination table to go to the Data Map page and view the details of the destination table. For more information, see Query and manage common data.
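For example, after the upload completes, a query like the following PyODPS sketch mirrors what an SQL query in DataAnalysis returns. It reuses the connection object o from the earlier sketches; the table name is hypothetical.

```python
# Query the uploaded table to verify the data.
with o.execute_sql("SELECT * FROM uploaded_sales LIMIT 10").open_reader() as reader:
    for record in reader:
        print(record)
```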
Appendix: Compliance statement for cross-border data uploads
If you perform cross-border data transmission operations, such as transmitting data from China to outside China, or transmitting data between different countries or regions, make sure that you understand and comply with relevant compliance declarations beforehand. Otherwise, data may fail to be uploaded, and you may be held legally responsible.
Your business data in the cloud will be transmitted to the selected region or product deployment area when you perform cross-border data operations. You must make sure that the relevant operations comply with the following requirements:
You have the required permissions to process relevant business data in the cloud.
You use adequate data security protection technologies and strategies.
Data transmission operations comply with relevant laws and regulations. For example, the data to be transmitted does not contain content that is restricted or prohibited from transmission or disclosure by applicable laws.
We recommend that you obtain professional legal or compliance advice before you upload data in a way that may involve cross-border data transmission, to ensure compliance with all applicable laws, regulations, and regulatory policies. For example, obtain valid authorization from the individuals to whom the personal information belongs, complete the endorsement and filing of related contract provisions, and fulfill legal duties such as carrying out necessary security assessments.
If you perform cross-border data operations without adhering to this compliance statement, you shall bear the corresponding legal consequences. You shall also be liable for any resulting losses suffered by Alibaba Cloud and its affiliated companies.
References
DataStudio also allows you to upload data in on-premises CSV files or text files to a MaxCompute table. For more information, see Import data to a MaxCompute table.
For more information about the operations that you can perform on a MaxCompute table, see Create and manage MaxCompute tables.
For more information about the operations that you can perform on a Hologres table, see Create a Hologres table.
For more information about the operations that you can perform on an EMR table, see Create an EMR table.