DataWorks allows you to register only E-MapReduce (EMR) DataLake clusters to a DataWorks workspace. To use the projects that you created in Hadoop clusters, you must migrate the projects to a DataWorks workspace for data development. This topic describes how to migrate EMR projects to a DataWorks workspace.
Prerequisites
DataWorks is activated and a DataWorks workspace is created. For information about how to create a DataWorks workspace, see Create and manage workspaces.
If you want to perform the migration as a RAM user, make sure that the RAM user is assigned the workspace administrator role and that the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies are attached to the RAM user. For more information, see the Add a RAM user to a workspace as a member and assign roles to the member section of the "Manage permissions on workspace-level services" topic and Grant permissions to a RAM user.
The EMR cluster whose projects you want to migrate is registered to the DataWorks workspace that you created. For more information, see Register an EMR cluster to DataWorks.
Background information
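You can use one of the following methods to migrate workflows (nodes and scheduling settings), manually executed jobs, resources, and data sources from an EMR cluster to a DataWorks workspace:
Method 1: Use DataWorks Migration Assistant to export an EMR project as a package and then import the package to a DataWorks workspace.
Method 2: Package an EMR project by using a tool and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace.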
After you trigger the migration, you can go to the Migration Assistant page in the DataWorks console to view the migration progress, migration results, and migration reports. For more information, see the View the migration reports and result section in this topic.
The following table lists the mappings between the original job types in EMR projects and the job types after the EMR projects are migrated to a DataWorks workspace.
Original job type | Job type after project migration to DataWorks |
SQOOP | Data Integration (Batch synchronization) |
SPARK_SQL | EMR_SPARK_SQL |
SPARK | EMR_SPARK |
SHELL | EMR_SHELL |
PRESTO_SQL | EMR_PRESTO |
MR | EMR_MR |
IMPALA_SQL | EMR_IMPALA |
HIVE_SQL | EMR_HIVE |
HIVE | EMR_SHELL |
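The mapping in the preceding table can also be written as a lookup, which may help if you want to estimate the node types that a project produces before you migrate it. The following Python sketch is illustrative only. The names JOB_TYPE_MAPPING and migrated_job_type are hypothetical and are not part of DataWorks or any migration tool. Only the job type names are taken from the table.

```python
# Hypothetical helper that mirrors the job type mapping table in this topic.
# Keys are the original EMR job types; values are the types after migration.
JOB_TYPE_MAPPING = {
    "SQOOP": "Data Integration (Batch synchronization)",
    "SPARK_SQL": "EMR_SPARK_SQL",
    "SPARK": "EMR_SPARK",
    "SHELL": "EMR_SHELL",
    "PRESTO_SQL": "EMR_PRESTO",
    "MR": "EMR_MR",
    "IMPALA_SQL": "EMR_IMPALA",
    "HIVE_SQL": "EMR_HIVE",
    "HIVE": "EMR_SHELL",
}


def migrated_job_type(original_type: str) -> str:
    """Return the job type that the table lists for an original EMR job type."""
    key = original_type.strip().upper()
    if key not in JOB_TYPE_MAPPING:
        # The table does not list this type, so do not guess a target type.
        raise ValueError(f"No mapping listed for EMR job type: {original_type!r}")
    return JOB_TYPE_MAPPING[key]


print(migrated_job_type("HIVE_SQL"))  # prints EMR_HIVE
```

Note that both SHELL and HIVE map to EMR_SHELL, so the mapping cannot be reversed to recover the original job type.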
Method 1: Use DataWorks Migration Assistant to export an EMR project as a package and then import the package to a DataWorks workspace
In the DataWorks console, you can export the nodes, scheduling settings, manually executed jobs, resources, and data sources that are stored in an EMR cluster as a package, and then import the package to a DataWorks workspace. The migration policies that Migration Assistant provides vary based on the DataWorks edition, and the permissions to use Migration Assistant vary based on the role. For more information, see the Limits section of the "Create and view export tasks" topic.
If you use the Migration Assistant service as a RAM user, make sure that the AliyunEMRFullAccess policy is attached to the RAM user. Otherwise, the system reports an error when you select a value from the Project Name drop-down list. For information about how to attach a policy to a RAM user, see Grant permissions to a RAM user.
Go to the Migration Assistant page in the DataWorks console.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
On the DataStudio page, click the icon in the upper-left corner and choose .
Export a project from EMR as a package.
In the left-side navigation pane of the Migration Assistant page, choose .
On the Schemes of Scheduling Engine Export page, click the EMR tab. Then, click Create Export Task.
In the Create Export Task dialog box, configure the parameters.
After the project is exported, return to the Schemes of Scheduling Engine Export page to view the export result. Click Download Export Package in the Actions column that corresponds to the export task to download the exported package to your on-premises machine.
Note: The download link is valid for 30 days. We recommend that you download the package before the validity period ends. After the validity period ends, you must export the project again if you want to download the package.
Import the downloaded package to a DataWorks workspace.
Create an import task.
In the left-side navigation pane of the Migration Assistant page, choose . On the page that appears, click Create Import Task.
In the Create Import Task dialog box, configure the parameters and click OK. The following table describes the parameters.
Parameter | Description |
Import Package Name | The name of the import task. You can specify a custom name for the import task. |
Scheduling Engine | The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected. |
Upload Method | The source of the package that you want to import. Valid values: Local Upload (select this mode if the package is less than or equal to 30 MB in size) and OSS Object (select this mode if the package exceeds 30 MB in size). If you select OSS Object, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of a specified object in the View Details panel of the object in the OSS console. Note: For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs. |
Select File | The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements. Note: This parameter is required only if you select Local Upload for the Upload Method parameter. |
OSS Endpoint | The OSS URL of the EMR project that you want to import. Note: This parameter is required only if you select OSS Object for the Upload Method parameter. |
File Name | The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters. Note: This parameter is required only if you select Local Upload for the Upload Method parameter. |
Remarks | The description of the import task. |
On the Edit Import Task page, check the project that you want to import and click Start Import in the upper-right corner.
The system starts to import the project.
You can go to the Import Task List page to view the migration progress. For more information, see the View the migration reports and result section in this topic.
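The choice between the two Upload Method values depends only on the package size, so it can be checked before you open the Create Import Task dialog box. The following Python sketch is a minimal illustration. The function names are hypothetical, and the sketch assumes that 30 MB means 30 × 1024 × 1024 bytes, which this topic does not state.

```python
import os

# Assumption: 1 MB = 1024 * 1024 bytes. The topic states only "30 MB".
LOCAL_UPLOAD_LIMIT_BYTES = 30 * 1024 * 1024


def choose_upload_method(size_bytes: int) -> str:
    """Return the Upload Method value that matches a package size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    # Packages up to and including the limit can use Local Upload;
    # larger packages must be referenced as an OSS object.
    if size_bytes <= LOCAL_UPLOAD_LIMIT_BYTES:
        return "Local Upload"
    return "OSS Object"


def upload_method_for_file(path: str) -> str:
    """Return the Upload Method value for a package file on the local disk."""
    return choose_upload_method(os.path.getsize(path))
```

If the result is OSS Object, upload the package to OSS first and then enter the object URL in the OSS Endpoint field.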
Method 2: Package an EMR project by using a tool and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace
You can run commands to package an EMR project and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace.
Before you use this method, you must install a Python environment on your on-premises machine.
Package an EMR project to your on-premises machine.
Download the package of the project packaging tool migrationx-reader to your on-premises machine.
Run a command to package the EMR project that you want to migrate.
Decompress the package of the project packaging tool and run the following command in the Python environment:
python ./migrationx-reader/bin/reader.py -a aliyunemr -d . -i $accessId -k $accessKey -p $project -e emr.aliyuncs.com -r $regionId
Take note of the following parameters:
$accessId and $accessKey: the AccessKey pair (AccessKey ID and AccessKey secret) of the account that is used to perform the packaging operation.
$project: the name of the EMR project that you want to package.
$regionId: the ID of the region where the EMR project resides.
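If you run the packaging command from a script, passing the arguments as a list avoids shell quoting problems and keeps the AccessKey pair out of the script source. The following Python sketch only assembles the same command that is shown above. The function names build_reader_command and run_reader are hypothetical, and the value "." for the -d option is copied from the command as is, because this topic does not describe that option.

```python
import subprocess


def build_reader_command(access_id, access_key, project, region_id,
                         python_bin="python",
                         reader_path="./migrationx-reader/bin/reader.py"):
    """Assemble the packaging command from this topic as an argument list."""
    return [
        python_bin, reader_path,
        "-a", "aliyunemr",
        "-d", ".",                  # copied from the documented command
        "-i", access_id,            # $accessId
        "-k", access_key,           # $accessKey
        "-p", project,              # $project: name of the EMR project
        "-e", "emr.aliyuncs.com",
        "-r", region_id,            # $regionId: region of the EMR project
    ]


def run_reader(access_id, access_key, project, region_id):
    """Run the packaging tool and raise CalledProcessError if it fails."""
    command = build_reader_command(access_id, access_key, project, region_id)
    # No shell is involved, so the values are passed to the tool unmodified.
    subprocess.run(command, check=True)
```

A caller might read the AccessKey pair from environment variables of its own choosing and then call run_reader with the project name and region ID.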
Use DataWorks Migration Assistant to import the package of the EMR project.
Create an import task.
In the left-side navigation pane of the Migration Assistant page, choose . On the page that appears, click Create Import Task.
In the Create Import Task dialog box, configure the parameters and click OK. The following table describes the parameters.
Parameter | Description |
Import Package Name | The name of the import task. You can specify a custom name for the import task. |
Scheduling Engine | The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected. |
Upload Method | The source of the package that you want to import. Valid values: Local Upload (select this mode if the package is less than or equal to 30 MB in size) and OSS Object (select this mode if the package exceeds 30 MB in size). If you select OSS Object, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of a specified object in the View Details panel of the object in the OSS console. Note: For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs. |
Select File | The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements. Note: This parameter is required only if you select Local Upload for the Upload Method parameter. |
OSS Endpoint | The OSS URL of the EMR project that you want to import. Note: This parameter is required only if you select OSS Object for the Upload Method parameter. |
File Name | The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters. Note: This parameter is required only if you select Local Upload for the Upload Method parameter. |
Remarks | The description of the import task. |
On the Edit Import Task page, check the project that you want to import and click Start Import in the upper-right corner.
The system starts to import the project.
You can go to the Import Task List page to view the migration progress. For more information, see the View the migration reports and result section in this topic.
View the migration reports and result
After a project is migrated, you can go to the Migration Assistant page to view the migration progress, migration result, and migration reports.
Import
On the Import Task List page in Migration Assistant, find the desired import task and click View Import Report in the Actions column.
Export
On the Schemes of Scheduling Engine Export page in Migration Assistant, click the EMR tab, find the desired export task, and then click View Export Report in the Actions column.