MaxCompute allows you to use the Data Integration service of DataWorks to import data from other data sources to MaxCompute in offline mode or in real time. You can also use DataWorks to import data from specific types of on-premises files to MaxCompute. This topic describes the procedures and usage notes for using DataWorks to import data to MaxCompute.
Prerequisites
DataWorks is activated, and a MaxCompute compute engine is associated with a DataWorks workspace. A table is created in the MaxCompute compute engine and is used to store the data that is synchronized to MaxCompute. For more information, see Create a MaxCompute project and Create tables.
The data that you want to import to MaxCompute is prepared.
Scenario 1: Import data from an on-premises CSV file to MaxCompute
Synchronization capabilities
You can import a CSV file to MaxCompute as an on-premises file or from an Alibaba Cloud Object Storage Service (OSS) bucket.
If you upload a CSV file to MaxCompute as an on-premises file, the file can be up to 5 GB in size.
If you upload a CSV file to MaxCompute from an OSS bucket, the bucket must be in the same region as the current MaxCompute project.
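The size limit for on-premises uploads can be checked locally before you open the Data Upload page. The following Python sketch is illustrative only: the function name is hypothetical, and reading 5 GB as 5 × 1024³ bytes is an assumption that this topic does not confirm.

```python
import os

# Assumption: "5 GB" is read as 5 * 1024**3 bytes; DataWorks may count differently.
MAX_UPLOAD_BYTES = 5 * 1024 ** 3


def can_upload_as_local_file(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the path is a .csv file no larger than the size limit."""
    is_csv = path.lower().endswith(".csv")
    # The size is only read for .csv paths because of short-circuit evaluation.
    return is_csv and os.path.getsize(path) <= limit
```

A file that fails this check might be a candidate for the OSS-based upload path instead, provided that the bucket is in the same region as the MaxCompute project.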
Operation entry points
Go to the DataStudio page.
Log on to the DataWorks console. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.
In the upper-left corner of the DataStudio page, click the icon and choose . In the left-side navigation pane of the Upload and Download page, click the upload icon. The Data Upload page appears.
Click Data Upload and upload the desired data by following the on-screen instructions.
General operation instructions
For more information, see Upload data.
Scenario 2: Import data from on-premises files to MaxCompute
Synchronization capabilities
You can import data from CSV files and custom text files to MaxCompute. Custom text files in the .txt, .csv, and .log formats are supported.
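As a rough illustration, the supported file extensions can be used to pre-filter a list of candidate files before import. This is a minimal sketch; the helper name is hypothetical, and it checks only the file extension, not the file content.

```python
from pathlib import Path

# Extensions listed in this topic for custom text files.
SUPPORTED_SUFFIXES = {".txt", ".csv", ".log"}


def importable_files(paths):
    """Keep only the paths whose extension is supported for import."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]
```

The comparison is case-insensitive, so a file named `b.LOG` is kept. Whether DataWorks itself treats extensions case-insensitively is not stated in this topic.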
Operation entry points and general operation instructions
Log on to the DataWorks console. In the left-side navigation pane, choose . On the page that appears, select a workspace from the drop-down list and click Go to DataStudio. You can import data from on-premises files to MaxCompute by using the following entry points:
- Click the Import icon in the top toolbar of the Scheduled Workflow pane.
- In the Scheduled Workflow pane, find the required workflow, right-click the table to which you want to import data in the MaxCompute folder, and then select Import Data.
- In a workspace in standard mode, click Workspace Tables in the left-side navigation pane, right-click the table to which you want to import data, and then select Import Data.
For more information, see Import data to a MaxCompute table.
Note: If the table that you create to store synchronized data cannot be found when you search for the table, you can manually synchronize the table in DataMap and then search for the table again. For information about how to manually synchronize a table, see My Data.
In the upper-left corner of the DataStudio page, click the icon and choose . In the left-side navigation pane of the page that appears, click the icon. On the Data Upload page, click Data Upload.
Note: The upload and download feature of DataWorks allows you to upload data as an on-premises file. Only CSV files with a maximum size of 5 GB are supported.
For more information, see Upload data.
Scenario 3: Import data from other data sources to MaxCompute
Synchronization capabilities
DataWorks Data Integration allows you to synchronize data from other data sources to MaxCompute. For example, you can synchronize data from databases such as an ApsaraDB RDS database to MaxCompute. The data synchronization principles and capabilities vary based on the synchronization scenario.
The batch synchronization feature provides readers and writers for you to read data from and write data to data sources in offline mode.
The real-time synchronization feature allows you to configure a data synchronization task by using different data sources to synchronize incremental data from a single table or all tables in a database in real time.
The solution-based synchronization feature provides data synchronization solutions that are used to synchronize data between different data sources in various scenarios, such as batch synchronization of data from all tables in a database, or one-time full synchronization combined with real-time incremental synchronization.
The following list describes the capabilities of synchronizing data from or to MaxCompute data sources, grouped by synchronization type. Not every capability is supported for MaxCompute data sources.
Batch synchronization
- Read data from a single table
- Write data to a single table
Real-time synchronization
- Read incremental data from a single table
- Write incremental data to a single table
- Read incremental data from all tables in a database
- Write incremental data of all tables in a database to a destination
Solution-based synchronization
- Read data from all tables in a database in offline mode
- Write data of all tables in a database to a destination in offline mode
- Read full data at a time and incremental data in real time from a table or all tables in a database
- Write full data and incremental data of a table or all tables in a database to a destination
Note: In the offline import scenario, each batch synchronization node can import data in one or more tables to only one table in MaxCompute.
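Because each batch synchronization node writes to only one MaxCompute table, a source-to-destination mapping can be grouped by destination to see how many nodes are needed. The following Python sketch shows the idea; the function name and the table names in the usage example are hypothetical.

```python
from collections import defaultdict


def plan_batch_nodes(table_mapping):
    """Group source tables by destination so that each node writes to one table.

    table_mapping: iterable of (source_table, maxcompute_table) pairs.
    Returns a dict that maps each MaxCompute table to its list of source tables.
    """
    nodes = defaultdict(list)
    for source, destination in table_mapping:
        nodes[destination].append(source)
    return dict(nodes)
```

For example, mapping `orders_2023` and `orders_2024` to `ods_orders`, and `users` to `ods_users`, yields two groups, which corresponds to two batch synchronization nodes.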
For more information about the capabilities of synchronizing data from or to MaxCompute data sources by using DataWorks Data Integration, see MaxCompute data source.
Operation entry points and general operation instructions
Operations performed in DataStudio
Log on to the MaxCompute console. In the left-side navigation pane, click Data Development to go to the DataStudio page of DataWorks. On the DataStudio page, you can create and configure a batch synchronization node or a real-time synchronization node to synchronize data from other data sources to MaxCompute.
When you configure the batch synchronization node, select MaxCompute as the destination and another data source as the source.
When you configure the real-time synchronization node, select MaxCompute as the output and another data source as the input.
For more information, see Configure a batch synchronization node by using the codeless UI, Configure a batch synchronization node by using the code editor, and Configure a real-time synchronization node in DataStudio.
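The source-and-destination pairing of a batch synchronization node can be pictured as a job script with one reader step and one writer step. The Python sketch below builds such a structure as a dictionary. The field names (`stepType`, `parameter`, `category`, `order.hops`), the `odps` writer type, and the parameter keys are assumptions modeled on the code editor's script layout, and the data source and table names are hypothetical. Verify every field against Configure a batch synchronization node by using the code editor before you use it.

```python
import json


def build_batch_job(source_type, source_params, mc_datasource, mc_table, columns):
    """Build a job dict with one reader step and one MaxCompute writer step.

    All keys are assumed, not confirmed by this topic.
    """
    reader = {
        "stepType": source_type,      # assumed reader type name, e.g. "mysql"
        "parameter": source_params,   # reader-specific settings, passed through
        "name": "Reader",
        "category": "reader",
    }
    writer = {
        "stepType": "odps",           # assumed type name for MaxCompute
        "parameter": {
            "datasource": mc_datasource,
            "table": mc_table,
            "column": columns,
        },
        "name": "Writer",
        "category": "writer",
    }
    return {
        "type": "job",
        "steps": [reader, writer],
        # Data flows from the reader step to the writer step.
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


# Hypothetical names, for illustration only.
job = build_batch_job(
    source_type="mysql",
    source_params={"datasource": "my_rds_source", "table": ["orders"], "column": ["id", "amount"]},
    mc_datasource="my_maxcompute_source",
    mc_table="ods_orders",
    columns=["id", "amount"],
)
print(json.dumps(job, indent=2))
```

The sketch keeps the reader parameters opaque because they differ for each source type, whereas the writer side always targets a single MaxCompute table.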
Operations performed in Data Integration
Log on to the DataWorks console. On the Workspaces page, find the desired workspace and choose Shortcuts > Data Integration in the Actions column. On the Data Integration page, create a data synchronization task to synchronize data from other data sources to MaxCompute.
For more information, see Configure a data synchronization task in Data Integration.
Billing rules
When you use DataWorks Data Integration to synchronize data, you must use resource groups for Data Integration and resource groups for scheduling. You can use shared or exclusive resource groups based on your business requirements. If Internet traffic is generated when you synchronize data between data sources, you are charged for the generated Internet traffic. Billing details:
For information about the billing details of resource groups for Data Integration, see Billing of exclusive resource groups for Data Integration (subscription) and Billing of the shared resource group for Data Integration (pay-as-you-go).
For information about the billing details of Internet traffic, see Billing of Internet traffic.
For information about the billing details of resource groups for scheduling, see Billing of exclusive resource groups for scheduling (subscription) and Billing of the shared resource group for scheduling (pay-as-you-go).
Best practices
Synchronize all data in a database to MaxCompute in offline mode
Create a batch synchronization solution to synchronize all data in a database to MaxCompute
Synchronize data in an EMR Hive database to MaxCompute in offline mode