This tutorial demonstrates how to use DataWorks for user profile analysis, covering data synchronization, transformation, and quality monitoring. To follow this tutorial successfully, you need to set up the required EMR cluster, DataWorks workspace, and environment configuration.
Business background
To develop effective business management strategies, you must obtain basic profile data of website users based on their activities on websites. The basic profile data includes the geographical and social attributes of the website users. You can analyze profile data based on time and location and perform refined operations on website traffic by using basic user profile data.
Prerequisites
To perform the tutorial operations successfully, you should read the experiment introduction to fully understand the user profile analysis process.
Precautions
Basic user information and website access logs of users that are required for tests in this experiment are provided.
The data in this experiment is manually created mock data and can be used only for experimental operations in DataWorks.
This tutorial uses Data Development (DataStudio) (old version) for data transformation.
EMR environment preparation
Create an EMR cluster
You need to create an EMR cluster to integrate with DataWorks, enabling data processing tasks on the DataWorks platform. The key configurations for creating and setting up the EMR cluster are as follows:
Parameter | Value |
Region | China (Shanghai). |
Business Scenario | Data Lake. |
Product Version | Select the latest version. |
Optional Services | Select components based on actual needs. The Hive component and OSS-HDFS component are mandatory in this case. |
Metadata | DLF Unified Metadata. |
Cluster Storage Root Path | Select the OSS-HDFS instance. If the drop-down list is empty, click Create OSS-HDFS Instance. |
For detailed instructions on creating an EMR cluster, see Create a Cluster.
DataWorks' support for EMR cluster configurations varies. Before creating an EMR cluster, refer to Best Practices for Configuring DataWorks on an EMR Cluster.
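Before you open the EMR console, it can help to write the planned settings down and compare them with the table above. The following is a minimal sketch of such a check in plain Python. It does not call any Alibaba Cloud API; the function name `check_emr_plan` and the dictionary keys are hypothetical labels chosen for this illustration, and only the expected values come from the table.

```python
from typing import Dict, List

# Hypothetical labels for illustration; these are NOT parameters of any
# Alibaba Cloud API. Only the expected values come from the tutorial table.
REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}


def check_emr_plan(plan: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the plan matches the table."""
    problems = []
    if plan.get("region") != "China (Shanghai)":
        problems.append("Region should be China (Shanghai).")
    if plan.get("business_scenario") != "Data Lake":
        problems.append("Business Scenario should be Data Lake.")
    # Hive and OSS-HDFS are mandatory; other components are optional.
    missing = REQUIRED_SERVICES - set(plan.get("optional_services", []))
    if missing:
        problems.append("Missing mandatory services: " + ", ".join(sorted(missing)))
    if plan.get("metadata") != "DLF Unified Metadata":
        problems.append("Metadata should be DLF Unified Metadata.")
    if not plan.get("cluster_storage_root_path"):
        problems.append("Select an OSS-HDFS instance as the Cluster Storage Root Path.")
    return problems


# Example plan with a deliberate mistake: OSS-HDFS is not selected.
example_plan = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "optional_services": ["Hive"],
    "metadata": "DLF Unified Metadata",
    "cluster_storage_root_path": "my-oss-hdfs-instance",  # placeholder name
}

for problem in check_emr_plan(example_plan):
    print(problem)
```

The check only mirrors the table. It does not replace the compatibility guidance in Best Practices for Configuring DataWorks on an EMR Cluster.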
DataWorks environment preparation
Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Prepare an environment.
Step 1: Create a workspace
If you have an existing workspace in the China (Shanghai) region, you can use it and skip this step.
Log in to the DataWorks console and switch to the China (Shanghai) region in the upper-left corner.
Click Workspace in the left-side navigation pane to open the list of workspaces. Click Create Workspace and create a workspace in standard mode, which isolates the development and production environments. For details, see Create a workspace.
Step 2: Create a serverless resource group
This tutorial requires a DataWorks serverless resource group for data synchronization and scheduling. Purchase a serverless resource group and complete the necessary setup.
Purchase a serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
Click Create Resource Group. On the purchase page, choose Region And Zone as China (Shanghai), enter a Resource Group Name, and set other parameters as instructed. Once complete, proceed with the payment following the on-screen instructions. For details on serverless resource group billing, see Serverless resource group billing.
Note: In this example, a serverless resource group that is deployed in the China (Shanghai) region is used. Serverless resource groups do not support cross-region operations.
Configure the serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
Find the serverless resource group that you purchased and click Associate Workspace in the Actions column. Then, follow the prompts to associate the resource group with the DataWorks workspace that you created.
Enable Internet access for the serverless resource group.
The test data required for this tutorial must be retrieved via the Internet. By default, the resource group established in the prior step lacks Internet access capability. To enable data acquisition, configure a public NAT gateway for the associated VPC, add an EIP, and connect to the public data network.
Log on to the VPC console and go to the Internet NAT Gateway page. In the top navigation bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway. Configure the parameters that are described in the following table.
Parameter | Description |
Region | Select China (Shanghai). |
VPC and Associate vSwitch | Select the virtual private cloud (VPC) and vSwitch with which the resource group is associated. To obtain the VPC and vSwitch with which the resource group is associated, perform the following steps: Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, click Resource Groups. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC? |
Access Mode | Select SNAT-enabled Mode. |
EIP | Select Purchase EIP. |
Create Service-Linked Role | Click Create Service-Linked Role to create a service-linked role. If this is the first time you create an Internet NAT gateway, this step is required. |
Note: Retain the default values for other parameters that are not described in the preceding table.
Click Buy Now. On the Confirm page, read the terms of service, select the Terms of Service check box, and then click Activate Now.
For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
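Because serverless resource groups do not support cross-region operations, every resource in this tutorial must be in the same region. The following sketch shows one way to record the region of each resource and flag any that differ. It is plain Python with no SDK calls; the function name `find_region_mismatches` and the resource labels are hypothetical, and `cn-shanghai` is the region ID of China (Shanghai).

```python
from typing import Dict


def find_region_mismatches(
    resources: Dict[str, str], expected_region: str = "cn-shanghai"
) -> Dict[str, str]:
    """Return the resources whose region differs from expected_region."""
    return {
        name: region
        for name, region in resources.items()
        if region != expected_region
    }


# Hypothetical inventory, filled in by hand from each console page.
resources = {
    "workspace": "cn-shanghai",
    "serverless_resource_group": "cn-shanghai",
    "emr_cluster": "cn-shanghai",
    "nat_gateway": "cn-hangzhou",  # deliberately wrong, to show a mismatch
}

mismatches = find_region_mismatches(resources)
for name, region in mismatches.items():
    print(f"{name} is in {region}; expected cn-shanghai")
```

An empty result means that all listed resources share the expected region.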
Step 3: Register the EMR cluster and complete resource group initialization
Register the EMR cluster with DataWorks to use it within the platform.
Navigate to the EMR cluster registration page.
Access the Management Center page.
Log on to the DataWorks console. After switching the region to China (Shanghai), click in the left-side navigation pane. Select the desired workspace from the drop-down box and click Go To Management Center.
In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.
Register the EMR cluster.
On the Register E-MapReduce Cluster page, configure the cluster information. The key parameters are listed below.
Parameter | Value |
Cluster Alibaba Cloud Account | Current Alibaba Cloud Account. |
Cluster Type | Data Lake (datalake). |
Default Access Identity | Cluster Account: Hadoop. |
Pass Proxy User Information | Pass. |
Initialize the resource group.
On the Cluster Management page, locate the registered EMR cluster and click Resource Group Initialization in the upper right corner.
Click Initialize next to the resource group you need to initialize.
After the initialization is complete, click Confirm.
Important: Ensure that the resource group initialization is successful. If it fails, tasks that depend on the resource group may not run correctly. In case of failure, diagnose network connectivity issues as prompted.
For a comprehensive guide on registering an EMR cluster, refer to Register an EMR Cluster to DataWorks.
What to do next
With the environment set up, you can move on to the next tutorial. You'll learn to synchronize basic user information and website access logs to OSS, create tables, and query the data using EMR Hive nodes. For more details, see Synchronize Data.