DataWorks: Prepare environment

Last Updated: Feb 20, 2025

This tutorial demonstrates how to use DataWorks for user profile analysis, covering data synchronization, transformation, and quality monitoring. To follow this tutorial successfully, you need to set up the required EMR cluster, DataWorks workspace, and environment configuration.

Business background

To develop effective business management strategies, you need basic profile data about your website users, derived from their activity on the site. This data includes the geographical and social attributes of the users. You can analyze it by time and location and use it to perform fine-grained operations on website traffic.

Prerequisites

Before you begin, read the experiment introduction so that you understand the overall user profile analysis process.

Precautions

  • The basic user information and website access logs required for testing in this experiment are provided.

  • All data in this experiment is manually generated mock data and can be used only for experimental operations in DataWorks.

  • This tutorial uses Data Development (DataStudio) (old version) for data transformation.

EMR environment preparation

Create an EMR cluster

Create an EMR cluster and integrate it with DataWorks so that you can run data processing tasks on the DataWorks platform. The key settings for the EMR cluster are as follows:

| Parameter | Value |
| --- | --- |
| Region | China (Shanghai) |
| Business Scenario | Data Lake |
| Product Version | Select the latest version. |
| Optional Services | Select components based on actual needs. The Hive component and OSS-HDFS component are mandatory in this case. |
| Metadata | DLF Unified Metadata |
| Cluster Storage Root Path | Select the OSS-HDFS instance. If the drop-down list is empty, click Create OSS-HDFS Instance. |

For detailed instructions on creating an EMR cluster, see Create a Cluster.
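The settings in the table above can also be written down as a small checklist and verified before you click through the console. The following is a minimal sketch, not an EMR API call: the dictionary keys, the function name `check_cluster_plan`, and the placeholder instance name are all hypothetical and exist only for illustration.

```python
# Hypothetical checklist: keys and values mirror the parameter table above.
# This is NOT an EMR API request format.
REQUIRED_SETTINGS = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "metadata": "DLF Unified Metadata",
}
# The tutorial states that Hive and OSS-HDFS are mandatory.
REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}


def check_cluster_plan(plan):
    """Return a list of problems; an empty list means the plan matches the table."""
    problems = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = plan.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    missing = REQUIRED_SERVICES - set(plan.get("optional_services", []))
    for service in sorted(missing):
        problems.append(f"optional_services: missing {service}")
    if not plan.get("storage_root_path"):
        problems.append("storage_root_path: select an OSS-HDFS instance")
    return problems


if __name__ == "__main__":
    # Example plan with one deliberate gap (OSS-HDFS not selected).
    plan = {
        "region": "China (Shanghai)",
        "business_scenario": "Data Lake",
        "metadata": "DLF Unified Metadata",
        "optional_services": ["Hive"],
        "storage_root_path": "my-oss-hdfs-instance",  # placeholder name
    }
    for problem in check_cluster_plan(plan):
        print(problem)
```

Product Version is left out of the check on purpose, because "the latest version" changes over time and cannot be pinned in a static checklist.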

Note

DataWorks support varies by EMR cluster configuration. Before you create an EMR cluster, see Best Practices for Configuring DataWorks on an EMR Cluster.

DataWorks environment preparation

Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Prepare an environment.

Step 1: Create a workspace

If you have an existing workspace in the China (Shanghai) region, you can use it and skip this step.

  1. Log in to the DataWorks console and switch to the China (Shanghai) region in the upper-left corner.

  2. Click Workspace in the left-side navigation pane to open the workspace list, and then click Create Workspace. Create the workspace in standard mode, which isolates the development and production environments. For details, see Create a workspace.

Step 2: Create a serverless resource group

This tutorial requires a DataWorks serverless resource group for data synchronization and scheduling. Purchase a serverless resource group and complete the necessary setup.

  1. Purchase a serverless resource group.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.

    2. Click Create Resource Group. On the purchase page, set Region And Zone to China (Shanghai), enter a Resource Group Name, and configure the other parameters as prompted. Then complete the payment by following the on-screen instructions. For details on serverless resource group billing, see Serverless resource group billing.

      Note

      In this example, a serverless resource group that is deployed in the China (Shanghai) region is used. Note that serverless resource groups do not support cross-region operations.

  2. Configure the serverless resource group.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.

    2. Find the serverless resource group that you purchased, click Associate Workspace in the Actions column, and then follow the prompts to associate the resource group with the DataWorks workspace that you created.

    3. Enable Internet access for the serverless resource group.

      The test data required for this tutorial must be retrieved over the Internet. By default, the resource group created in the previous step cannot access the Internet. To let it retrieve the data, create an Internet NAT gateway for the VPC associated with the resource group and attach an elastic IP address (EIP) to the gateway.

      1. Log on to the VPC console and go to the Internet NAT Gateway page. In the top navigation bar, select the China (Shanghai) region.

      2. Click Create Internet NAT Gateway. Configure the parameters that are described in the following table.

        | Parameter | Description |
        | --- | --- |
        | Region | Select China (Shanghai). |
        | VPC and Associate vSwitch | Select the virtual private cloud (VPC) and vSwitch with which the resource group is associated. To obtain the VPC and vSwitch with which the resource group is associated, perform the following steps: Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, click Resource Groups. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC? |
        | Access Mode | Select SNAT-enabled Mode. |
        | EIP | Select Purchase EIP. |
        | Create Service-Linked Role | Click Create Service-Linked Role to create a service-linked role. This step is required the first time you create an Internet NAT gateway. |

        Note

        Retain the default values for other parameters that are not described in the preceding table.

      3. Click Buy Now. On the Confirm page, read the terms of service, select the Terms of Service check box, and then click Activate Now.
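After the NAT gateway and EIP are active, it can help to confirm that outbound Internet access actually works before you run synchronization tasks. The following is a minimal sketch that uses only the Python standard library. It assumes you have some way to run Python on a machine or task whose traffic leaves through the same VPC and vSwitch, which this tutorial does not describe. The function name `can_reach` and its injectable `connect` parameter are hypothetical.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    `connect` is injectable so the check can be exercised without a network.
    """
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
    conn.close()
    return True
```

For example, `can_reach("example.com")` probes port 443 on a placeholder host. Replace the host with the public endpoint that serves your test data. A result of `False` only shows that the TCP connection failed. It does not say whether the cause is the NAT gateway, the EIP, or a security rule.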

For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
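Because serverless resource groups do not support cross-region operations, every resource in this tutorial has to be in China (Shanghai). The following is a minimal sketch of a consistency check, assuming you record each resource's region ID by hand. The resource names and the function `find_region_mismatches` are hypothetical. `cn-shanghai` is the region ID for China (Shanghai).

```python
TUTORIAL_REGION = "cn-shanghai"  # region ID for China (Shanghai)


def find_region_mismatches(resources, expected=TUTORIAL_REGION):
    """Return {resource_name: region} for every resource outside `expected`."""
    return {
        name: region
        for name, region in resources.items()
        if region != expected
    }


if __name__ == "__main__":
    # Hand-maintained inventory; the names are illustrative only.
    inventory = {
        "emr_cluster": "cn-shanghai",
        "dataworks_workspace": "cn-shanghai",
        "serverless_resource_group": "cn-shanghai",
        "nat_gateway": "cn-hangzhou",  # deliberate mismatch for the demo
    }
    for name, region in find_region_mismatches(inventory).items():
        print(f"{name} is in {region}, expected {TUTORIAL_REGION}")
```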

Step 3: Register the EMR cluster and complete resource group initialization

Register the EMR cluster with DataWorks to use it within the platform.

  1. Navigate to the EMR cluster registration page.

    1. Access the Management Center page.

      Log on to the DataWorks console. Switch the region to China (Shanghai), and then click More > Management Center in the left-side navigation pane. Select the desired workspace from the drop-down list and click Go To Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.

  2. Register the EMR cluster.

    On the Register E-MapReduce Cluster page, configure the cluster information. The key parameters are listed below.

    | Parameter | Value |
    | --- | --- |
    | Cluster Alibaba Cloud Account | Current Alibaba Cloud Account |
    | Cluster Type | Data Lake (datalake) |
    | Default Access Identity | Cluster Account: Hadoop |
    | Pass Proxy User Information | Pass |

  3. Initialize the resource group.

    1. On the Cluster Management page, locate the registered EMR cluster and click Resource Group Initialization in the upper right corner.

    2. Click Initialize next to the resource group you need to initialize.

    3. After the initialization is complete, click Confirm.

    Important

    Make sure that the resource group initialization succeeds. If it fails, tasks that depend on the resource group may not run correctly. If initialization fails, follow the on-screen suggestions to diagnose network connectivity issues.

For a comprehensive guide on registering an EMR cluster, refer to Register an EMR Cluster to DataWorks.

What to do next

With the environment set up, you can move on to the next tutorial. You'll learn to synchronize basic user information and website access logs to OSS, create tables, and query the data using EMR Hive nodes. For more details, see Synchronize Data.