MaxCompute: Use DataWorks (offline and real-time)

Last Updated: Dec 02, 2024

MaxCompute allows you to use the Data Integration service of DataWorks to import data from other data sources to MaxCompute in offline mode or in real time. You can also use DataWorks to import data from specific types of on-premises files to MaxCompute. This topic describes the procedures and usage notes for importing data to MaxCompute by using DataWorks.

Prerequisites

  • DataWorks is activated, and a MaxCompute compute engine is associated with a DataWorks workspace. A table is created in the MaxCompute compute engine and is used to store the data that is synchronized to MaxCompute. For more information, see Create a MaxCompute project and Create tables.

  • The data that you want to import to MaxCompute is prepared.

Scenario 1: Import data from an on-premises CSV file to MaxCompute

  • Synchronization capabilities

    You can import a CSV file to MaxCompute as an on-premises file or from an Alibaba Cloud Object Storage Service (OSS) bucket.

    • If you upload a CSV file to MaxCompute as an on-premises file, the file can be up to 5 GB in size.

    • If you upload a CSV file to MaxCompute from an OSS bucket, the bucket must be in the same region as the current MaxCompute project.

  • Operation entry points

    1. Go to the DataStudio page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

    2. In the upper-left corner of the DataStudio page, click the icon and choose All Products > Data Integration > Upload and Download.

    3. In the left-side navigation pane of the Upload and Download page, click the icon to go to the Upload Data page.

    4. Click Upload Data and upload the desired data by following the on-screen instructions.

  • General operation instructions

    For more information, see Upload data.
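
The limits listed under Synchronization capabilities for this scenario can be checked locally before you start an upload. The following Python sketch is illustrative only and is not part of DataWorks: the function names are hypothetical, "5 GB" is assumed to mean 5 * 1024**3 bytes (the text does not say whether the limit is binary or decimal), and the region comparison assumes that both regions are given as plain region ID strings.

```python
import os

# Assumption: "5 GB" is read as 5 * 1024**3 bytes; the source text does not
# say whether the limit is binary or decimal.
MAX_LOCAL_CSV_BYTES = 5 * 1024**3


def check_local_csv(path: str) -> list[str]:
    """Return the problems that would block uploading a local CSV file."""
    problems = []
    if not path.lower().endswith(".csv"):
        problems.append("file is not a .csv file")
    size = os.path.getsize(path)
    if size > MAX_LOCAL_CSV_BYTES:
        problems.append(f"file is {size} bytes, which exceeds the 5 GB limit")
    return problems


def same_region(bucket_region: str, project_region: str) -> bool:
    """True if the OSS bucket and the MaxCompute project are in one region."""
    return bucket_region.strip().lower() == project_region.strip().lower()
```

An empty list from check_local_csv means that neither check failed. For the OSS path, same_region("cn-hangzhou", "cn-hangzhou") returns True; the region IDs here are example values only.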

Scenario 2: Import data from on-premises files to MaxCompute

  • Synchronization capabilities

    You can import data from CSV files and custom text files to MaxCompute. Custom text files in the .txt, .csv, and .log formats are supported.

  • Operation entry points and general operation instructions

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select a workspace from the drop-down list and click Go to DataStudio. You can import data from on-premises files to MaxCompute by using the following entry points:

    • Click the Import Data icon in the top toolbar of the Scheduled Workflow pane.

    • In the Scheduled Workflow pane, find the desired workflow, right-click the name of the table to which you want to import data in the MaxCompute folder, and then select Import Data.

    • In a workspace in standard mode, click Workspace Tables in the left-side navigation pane of the DataStudio page, right-click the name of the table to which you want to import data, and then select Import Data.

    For more information, see Import data to a MaxCompute table.

    Note

    If you cannot find the table that you created to store the synchronized data, manually synchronize the table in DataMap and then search for it again. For information about how to manually synchronize a table, see My Data.

    • In the upper-left corner of the DataStudio page, click the icon and choose All Products > Upload and Download. In the left-side navigation pane of the page that appears, click the icon. On the Data Upload page, click Data Upload.

      Note

      The upload and download feature of DataWorks allows you to upload data as an on-premises file. Only CSV files with a maximum size of 5 GB are supported.

      For more information, see Upload data.
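
The file formats listed under Synchronization capabilities for this scenario (.txt, .csv, and .log) can be used to filter candidate files before an import. The following Python sketch is illustrative only: the function names are hypothetical, and matching extensions without regard to case is an assumption that the text does not confirm.

```python
from pathlib import Path

# Extensions that the text lists for on-premises files in this scenario.
SUPPORTED_SUFFIXES = {".txt", ".csv", ".log"}


def is_supported_local_file(filename: str) -> bool:
    """True if the file extension is one of the listed formats.

    Assumption: extensions are compared case-insensitively.
    """
    return Path(filename).suffix.lower() in SUPPORTED_SUFFIXES


def split_by_support(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Split file names into (supported, unsupported), keeping input order."""
    supported = [n for n in filenames if is_supported_local_file(n)]
    unsupported = [n for n in filenames if not is_supported_local_file(n)]
    return supported, unsupported
```

This check looks only at the file name. It does not validate the file contents or the 5 GB limit that applies to the upload and download feature.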

Scenario 3: Import data from other data sources to MaxCompute

  • Synchronization capabilities

    DataWorks Data Integration allows you to synchronize data from other data sources to MaxCompute. For example, you can synchronize data from databases such as an ApsaraDB RDS database to MaxCompute. The data synchronization principles and capabilities vary based on the synchronization scenario.

    • The batch synchronization feature provides readers and writers for you to read data from and write data to data sources in offline mode.

    • The real-time synchronization feature allows you to configure a data synchronization task by using different data sources to synchronize incremental data from a single table or all tables in a database in real time.

    • The solution-based synchronization feature provides data synchronization solutions that are used to synchronize data between different data sources in various scenarios, such as batch synchronization of data from all tables in a database, and one-time full synchronization and real-time incremental synchronization.

    The following table describes the capabilities of synchronizing data from or to MaxCompute data sources.

    Batch synchronization

    • Read data from a single table: Supported

    • Write data to a single table: Supported

    Real-time synchronization

    • Read incremental data from a single table: -

    • Write incremental data to a single table: Supported

    • Read incremental data from all tables in a database: -

    • Write incremental data of all tables in a database to a destination: Supported

    Solution-based synchronization

    • Read data from all tables in a database in offline mode: -

    • Write data of all tables in a database to a destination in offline mode: Supported

    • Read full data at a time and incremental data in real time from a table or all tables in a database: -

    • Write full data and incremental data of a table or all tables in a database to a destination: Supported
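
The capability table above can also be expressed as a small lookup, which may help when you check a planned synchronization task against it. The following Python sketch reads each "-" cell as not supported and every other cell as supported. This reading is an assumption, because the table carries no legend. The names CAPABILITIES and supported_capabilities are hypothetical.

```python
# Capability table for MaxCompute data sources, keyed by synchronization
# feature. Assumption: a "-" cell means not supported (False) and any other
# cell means supported (True).
CAPABILITIES = {
    "Batch synchronization": {
        "Read data from a single table": True,
        "Write data to a single table": True,
    },
    "Real-time synchronization": {
        "Read incremental data from a single table": False,
        "Write incremental data to a single table": True,
        "Read incremental data from all tables in a database": False,
        "Write incremental data of all tables in a database to a destination": True,
    },
    "Solution-based synchronization": {
        "Read data from all tables in a database in offline mode": False,
        "Write data of all tables in a database to a destination in offline mode": True,
        "Read full data at a time and incremental data in real time "
        "from a table or all tables in a database": False,
        "Write full data and incremental data of a table or all tables "
        "in a database to a destination": True,
    },
}


def supported_capabilities(feature: str) -> list[str]:
    """Return the capabilities marked as supported for one feature."""
    return [name for name, ok in CAPABILITIES[feature].items() if ok]
```

Under this reading, only batch synchronization can read from MaxCompute, and every feature can write to it.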

    Note

    In the offline import scenario, each batch synchronization node can import data from one or more source tables into only one MaxCompute table.

    For more information about the capabilities of synchronizing data from or to MaxCompute data sources by using DataWorks Data Integration, see MaxCompute data source.
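
The note about batch synchronization nodes implies that a set of table mappings needs at least one node per destination table. The following Python sketch groups mappings by destination to show this. It is a planning aid only and does not call any DataWorks API. The function name and the table names in the usage note are hypothetical.

```python
from collections import defaultdict


def plan_batch_nodes(mappings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (source_table, destination_table) pairs by destination table.

    Each key of the result stands for one batch synchronization node, because
    a node can write to only one MaxCompute table. The value lists the source
    tables that the node would read.
    """
    nodes = defaultdict(list)
    for source, destination in mappings:
        nodes[destination].append(source)
    return dict(nodes)
```

For example, the mappings ("orders_2023", "orders"), ("orders_2024", "orders"), and ("users", "users") produce two groups, one for orders with two source tables and one for users with one source table, so at least two batch synchronization nodes are needed.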

  • Operation entry points and general operation instructions

  • Billing rules

    When you use DataWorks Data Integration to synchronize data, you must use resource groups for Data Integration and resource groups for scheduling. You can use shared or exclusive resource groups based on your business requirements. If Internet traffic is generated when you synchronize data between data sources, you are charged for the generated Internet traffic. Billing details:

Best practices

Synchronize all data in a database to MaxCompute in offline mode

Synchronize incremental data in a database to MaxCompute in offline mode

Synchronize data from tables in sharded databases to MaxCompute

Synchronize full and incremental data in a database to MaxCompute in real time