DataWorks: Migrate EMR projects to DataWorks

Last Updated: Nov 14, 2024

DataWorks allows you to register only E-MapReduce (EMR) DataLake clusters to a DataWorks workspace. To use the projects that you created in Hadoop clusters, you must migrate the projects to a DataWorks workspace for data development. This topic describes how to migrate EMR projects to a DataWorks workspace.

Prerequisites

Background information

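You can use one of the following methods to migrate workflows (nodes and scheduling settings), manually executed jobs, resources, and data sources from an EMR cluster to a DataWorks workspace:

  • Method 1: Use DataWorks Migration Assistant to export an EMR project as a package and then import the package to a DataWorks workspace.

  • Method 2: Package an EMR project by using a tool and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace.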

After you trigger the migration, you can go to the Migration Assistant page in the DataWorks console to view the migration progress, migration results, and migration reports. For more information, see the View the migration reports and result section in this topic.

The following table lists the mappings between the original job types in EMR projects and the job types after the EMR projects are migrated to a DataWorks workspace.

Original job type    Job type after project migration to DataWorks

SQOOP                Data Integration (Batch synchronization)
SPARK_SQL            EMR_SPARK_SQL
SPARK                EMR_SPARK
SHELL                EMR_SHELL
PRESTO_SQL           EMR_PRESTO
MR                   EMR_MR
IMPALA_SQL           EMR_IMPALA
HIVE_SQL             EMR_HIVE
HIVE                 EMR_SHELL
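
The job type mapping above can also be expressed as a lookup, which may help when you want to check ahead of time whether every job type in a project has a listed counterpart. The following is a minimal sketch, not part of DataWorks or Migration Assistant: the names EMR_TO_DATAWORKS_JOB_TYPE, map_job_type, and unmapped_job_types are hypothetical, and the sketch assumes that job types absent from the table have no documented mapping.

```python
# Mapping copied from the table in this topic.
# The dictionary and function names are hypothetical and used for illustration only.
EMR_TO_DATAWORKS_JOB_TYPE = {
    "SQOOP": "Data Integration (Batch synchronization)",
    "SPARK_SQL": "EMR_SPARK_SQL",
    "SPARK": "EMR_SPARK",
    "SHELL": "EMR_SHELL",
    "PRESTO_SQL": "EMR_PRESTO",
    "MR": "EMR_MR",
    "IMPALA_SQL": "EMR_IMPALA",
    "HIVE_SQL": "EMR_HIVE",
    "HIVE": "EMR_SHELL",
}


def map_job_type(emr_job_type):
    """Return the DataWorks job type listed for an original EMR job type."""
    try:
        return EMR_TO_DATAWORKS_JOB_TYPE[emr_job_type]
    except KeyError:
        # Assumption: a type that is not in the table has no documented mapping.
        raise ValueError(f"No documented mapping for job type: {emr_job_type}")


def unmapped_job_types(job_types):
    """Return the sorted, de-duplicated job types that are not in the table."""
    return sorted({t for t in job_types if t not in EMR_TO_DATAWORKS_JOB_TYPE})
```

For example, unmapped_job_types(["SPARK", "FLINK", "MR"]) returns only the entries that the table does not cover, where FLINK is a made-up input used for illustration.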

Method 1: Use DataWorks Migration Assistant to export an EMR project as a package and then import the package to a DataWorks workspace

In the DataWorks console, you can export the nodes, scheduling settings, manually executed jobs, resources, and data sources that are stored in an EMR cluster as a package, and then import the package to a DataWorks workspace. The migration policies that Migration Assistant provides vary based on the DataWorks edition, and the permissions to use Migration Assistant vary based on the role. For more information, see the Limits section of the "Create and view export tasks" topic.

Note

If you use the Migration Assistant service as a RAM user, make sure that the AliyunEMRFullAccess policy is attached to the RAM user. Otherwise, the system reports an error when you select a value from the Project Name drop-down list. For information about how to attach a policy to a RAM user, see Grant permissions to a RAM user.

  1. Go to the Migration Assistant page in the DataWorks console.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

    2. On the DataStudio page, click the icon in the upper-left corner and choose All Products > More > Migration Assistant.

  2. Export a project from EMR as a package.

    1. In the left-side navigation pane of the Migration Assistant page, choose Task Migration to Cloud > Scheduling Engine Export.

    2. On the Schemes of Scheduling Engine Export page, click the EMR tab. Then, click Create Export Task.

    3. In the Create Export Task dialog box, configure the parameters.

    4. After the project is exported, return to the Schemes of Scheduling Engine Export page to view the export result. Click Download Export Package in the Actions column that corresponds to the export task to download the exported package to your on-premises machine.

      Note

      The download link is valid for 30 days. We recommend that you download the package before the validity period ends. After the validity period ends, you need to re-export the project if you want to download the package.

  3. Import the downloaded package to a DataWorks workspace.

    1. Create an import task.

      In the left-side navigation pane of the Migration Assistant page, choose Task Migration to Cloud > Scheduling Engine Import. On the page that appears, click Create Import Task.

    2. In the Create Import Task dialog box, configure the parameters and click OK. The following table describes the parameters.

      Parameter

      Description

      Import Package Name

      The name of the import task. You can specify a custom name for the import task.

      Scheduling Engine

      The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected.

      Upload Method

      The source of the package that you want to import. Valid values:

      • Local Upload: Select this mode if the package is less than or equal to 30 MB in size.

      • OSS Object: Select this mode if the package exceeds 30 MB in size. If you select this mode, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of a specified object in the View Details panel of the object in the OSS console.

        Note

        For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs.

      Select File

      The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements.

      Note

      This parameter is required only if you select Local Upload for the Upload Method parameter.

      OSS Endpoint

      The URL of the OSS object that stores the package of the EMR project that you want to import.

      Note

      This parameter is required only if you select OSS Object for the Upload Method parameter.

      File Name

      The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters.

      Note

      This parameter is required only if you select Local Upload for the Upload Method parameter.

      Remarks

      The description of the import task.

    3. On the Edit Import Task page, check the project that you want to import and click Start Import in the upper-right corner.

    4. The system starts to import the project.

      You can go to the Import Task List page to view the migration progress. For more information, see the View the migration reports and result section in this topic.
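
The Upload Method rule in the parameter table above (Local Upload for packages of 30 MB or less, OSS Object for larger packages) can be sketched as a small check that you run against the package file before you create the import task. This is an illustrative helper only: the names LOCAL_UPLOAD_LIMIT_BYTES and choose_upload_method are hypothetical, and the sketch assumes that 1 MB equals 1024 × 1024 bytes, which this topic does not specify.

```python
import os

# Assumption: 30 MB is interpreted as 30 * 1024 * 1024 bytes.
LOCAL_UPLOAD_LIMIT_BYTES = 30 * 1024 * 1024


def choose_upload_method(package_size_bytes):
    """Return the Upload Method value that matches the package size."""
    if package_size_bytes < 0:
        raise ValueError("package size cannot be negative")
    if package_size_bytes <= LOCAL_UPLOAD_LIMIT_BYTES:
        return "Local Upload"
    return "OSS Object"


def choose_upload_method_for_file(package_path):
    """Look up the size of a package on disk and pick the Upload Method."""
    return choose_upload_method(os.path.getsize(package_path))
```

If the helper returns OSS Object, upload the package to OSS first and enter the URL of the object in the OSS Endpoint field.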

Method 2: Package an EMR project by using a tool and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace

You can run commands to package an EMR project and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace.

Note

Before you use this method, you must install a Python environment on your on-premises machine.

  1. Package an EMR project to your on-premises machine.

    1. Download the package of the project packaging tool migrationx-reader to your on-premises machine.

    2. Run a command to package the EMR project that you want to migrate.

      Decompress the package of the project packaging tool and run the following command in the Python environment:

      python ./migrationx-reader/bin/reader.py -a aliyunemr -d . -i $accessId -k $accessKey -p $project -e emr.aliyuncs.com -r $regionId

      Take note of the following parameters:

      • $accessId and $accessKey: the AccessKey ID and AccessKey secret of the account that is used to perform the packaging operation.

      • $project: the name of the EMR project that you want to package.

      • $regionId: the ID of the region where the EMR project resides.

  2. Use DataWorks Migration Assistant to import the package of the EMR project.

    1. Create an import task.

      In the left-side navigation pane of the Migration Assistant page, choose Task Migration to Cloud > Scheduling Engine Import. On the page that appears, click Create Import Task.

    2. In the Create Import Task dialog box, configure the parameters and click OK. The following table describes the parameters.

      Parameter

      Description

      Import Package Name

      The name of the import task. You can specify a custom name for the import task.

      Scheduling Engine

      The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected.

      Upload Method

      The source of the package that you want to import. Valid values:

      • Local Upload: Select this mode if the package is less than or equal to 30 MB in size.

      • OSS Object: Select this mode if the package exceeds 30 MB in size. If you select this mode, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of a specified object in the View Details panel of the object in the OSS console.

        Note

        For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs.

      Select File

      The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements.

      Note

      This parameter is required only if you select Local Upload for the Upload Method parameter.

      OSS Endpoint

      The URL of the OSS object that stores the package of the EMR project that you want to import.

      Note

      This parameter is required only if you select OSS Object for the Upload Method parameter.

      File Name

      The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters.

      Note

      This parameter is required only if you select Local Upload for the Upload Method parameter.

      Remarks

      The description of the import task.

    3. On the Edit Import Task page, check the project that you want to import and click Start Import in the upper-right corner.

    4. The system starts to import the project.

      You can go to the Import Task List page to view the migration progress. For more information, see the View the migration reports and result section in this topic.
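
If you package projects regularly, the packaging command in step 1 of Method 2 can be wrapped in a short script so that the AccessKey pair is read from environment variables instead of being typed on the command line. The following is a sketch under stated assumptions: the flags are copied from the command in this topic, but the function names and the environment variable names EMR_ACCESS_KEY_ID and EMR_ACCESS_KEY_SECRET are hypothetical, and the script assumes that it runs from the directory into which the migrationx-reader package was decompressed.

```python
import os
import subprocess


def build_reader_command(access_id, access_key, project, region_id):
    """Build the argument list for the packaging command shown in this topic."""
    return [
        "python",
        "./migrationx-reader/bin/reader.py",
        "-a", "aliyunemr",
        "-d", ".",
        "-i", access_id,
        "-k", access_key,
        "-p", project,
        "-e", "emr.aliyuncs.com",
        "-r", region_id,
    ]


def package_emr_project(project, region_id):
    """Read the AccessKey pair from the environment and run the packaging tool."""
    # Hypothetical variable names; use whatever your environment defines.
    access_id = os.environ["EMR_ACCESS_KEY_ID"]
    access_key = os.environ["EMR_ACCESS_KEY_SECRET"]
    cmd = build_reader_command(access_id, access_key, project, region_id)
    # check=True raises CalledProcessError if the tool exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a project name contains spaces or special characters.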

View the migration reports and result

After a project is migrated, you can go to the Migration Assistant page to view the migration progress, migration result, and migration reports.

  • Import

    On the Import Task List page in Migration Assistant, find the desired import task and click View Import Report in the Actions column.

  • Export

    On the Schemes of Scheduling Engine Export page in Migration Assistant, click the EMR tab, find the desired export task, and then click View Export Report in the Actions column.