This tutorial describes how to perform user profile analysis. DataWorks is used to synchronize data, process data, and monitor data quality. Before you start the tutorial, you must create an E-MapReduce (EMR) cluster and a DataWorks workspace, and configure the development environment.
Prerequisites
DataWorks is activated. For more information, see Purchase guide.
Note: All data resources involved in this tutorial reside in the China (Shanghai) region. We recommend that you activate DataWorks in the China (Shanghai) region.
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
Step 1: Create an OSS bucket
This tutorial requires an OSS bucket, which is used to store user information and website access logs for data modeling and data analysis.
Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket panel, configure the parameters and click OK.
Bucket Name: Configure this parameter based on your business requirements.
Region: Select China (Shanghai).
OSS-HDFS: Turn on this switch.
For more information, see Create buckets.
Go back to the Buckets page, find the bucket, and then click the bucket name to go to the Objects page.
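Before you create the bucket in the console, it can help to sanity-check the planned values. The following Python snippet is a minimal sketch, not part of the official procedure. The function name check_bucket_plan is hypothetical, and the naming rule (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit) is stated here as an assumption. The OSS console remains the authority on valid names.

```python
import re

# Assumed OSS bucket naming rule: 3-63 characters, lowercase letters, digits,
# and hyphens only, starting and ending with a lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")

# Values this tutorial asks for in the Create Bucket panel.
TUTORIAL_REGION = "China (Shanghai)"


def check_bucket_plan(name, region, oss_hdfs):
    """Return a list of problems with a planned bucket; an empty list means none found."""
    problems = []
    if not _BUCKET_NAME.fullmatch(name):
        problems.append(f"bucket name {name!r} does not match the assumed naming rule")
    if region != TUTORIAL_REGION:
        problems.append(f"region is {region!r}, but the tutorial uses {TUTORIAL_REGION!r}")
    if oss_hdfs is not True:
        problems.append("OSS-HDFS must be turned on")
    return problems


if __name__ == "__main__":
    # "user-profile-demo" is a made-up bucket name used only for illustration.
    for problem in check_bucket_plan("user-profile-demo", "China (Shanghai)", True):
        print(problem)
```

The check only covers the three parameters that this step names. All other bucket parameters are left to the console defaults.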
Step 2: Create an EMR cluster
This tutorial also requires an EMR cluster, which needs to be registered to DataWorks. This allows you to run data processing tasks based on the EMR cluster in the DataWorks console.
For more information, see Create a cluster. When you create an EMR cluster, take note of the following items in the Software Configuration step:
Region: Select China (Shanghai).
Business Scenario: Select Data Lake.
Product Version: Select the latest version.
Optional Services: Select components based on your business requirements. This tutorial requires the Hive component.
Metadata: Select DLF Unified Metadata.
Root Storage Directory of Cluster: Select the OSS bucket for which the OSS-HDFS service is activated in Step 1.
The support of DataWorks for different configurations of an EMR cluster varies. Before you create an EMR cluster and develop EMR tasks in DataWorks based on the EMR cluster, we recommend that you read the Best practices for configuring EMR clusters used in DataWorks topic.
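The Software Configuration items in this step can be written down as a checklist and compared with the values that you plan to enter. The following Python snippet is a rough sketch under that assumption. The function review_cluster_plan and the dictionary layout are hypothetical and are not an EMR API. Product Version and Root Storage Directory of Cluster are not checked because their correct values depend on your account.

```python
# Settings that this tutorial asks for in the Software Configuration step.
REQUIRED_EMR_SETTINGS = {
    "Region": "China (Shanghai)",
    "Business Scenario": "Data Lake",
    "Metadata": "DLF Unified Metadata",
}

# Components that this tutorial requires under Optional Services.
REQUIRED_SERVICES = {"Hive"}


def review_cluster_plan(plan, services):
    """Compare a planned configuration with the tutorial checklist.

    plan maps a console field name to the value you intend to select.
    services is an iterable of component names you intend to install.
    Returns a list of human-readable issues; an empty list means no mismatch.
    """
    issues = []
    for field, expected in REQUIRED_EMR_SETTINGS.items():
        actual = plan.get(field)
        if actual != expected:
            issues.append(f"{field}: expected {expected!r}, got {actual!r}")
    missing = REQUIRED_SERVICES - set(services)
    if missing:
        issues.append("missing services: " + ", ".join(sorted(missing)))
    return issues


if __name__ == "__main__":
    planned = {
        "Region": "China (Shanghai)",
        "Business Scenario": "Data Lake",
        "Metadata": "DLF Unified Metadata",
    }
    print(review_cluster_plan(planned, ["Hive"]))
```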
Step 3: Create a DataWorks workspace
Before you develop tasks in DataWorks, you must create a DataWorks workspace.
All the data resources involved in this tutorial reside in the China (Shanghai) region. We recommend that you create a DataWorks workspace in the China (Shanghai) region. If you create a workspace in a different region and want to add a data source to the workspace, the data source may fail the network connectivity test. To simplify operations, we recommend that you set the Isolate Development and Production Environments parameter to No when you create a workspace.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspaces.
In the top navigation bar, select the China (Shanghai) region.
On the Workspaces page, click Create Workspace. In the Create Workspace panel, enter a name in the Workspace Name field. For more information, see Create a workspace.
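Because a workspace in a different region may fail the network connectivity test for a data source, it is worth confirming that every resource in this tutorial uses the same region. The following Python snippet is a minimal sketch of such a check. The function name and the resource labels are hypothetical, and cn-shanghai is used as the region ID for China (Shanghai).

```python
# Region ID for China (Shanghai), where all tutorial resources reside.
TUTORIAL_REGION = "cn-shanghai"


def resources_outside_region(resources, region=TUTORIAL_REGION):
    """Return the sorted names of resources whose region differs from `region`.

    resources maps a label of your choice (for example "oss_bucket")
    to the region ID in which that resource was created.
    """
    return sorted(name for name, actual in resources.items() if actual != region)


if __name__ == "__main__":
    # Illustrative values only; replace them with the regions of your own resources.
    planned = {
        "oss_bucket": "cn-shanghai",
        "emr_cluster": "cn-shanghai",
        "dataworks_workspace": "cn-hangzhou",
    }
    print(resources_outside_region(planned))
```

With the illustrative values above, only the workspace is reported, which signals that it should be recreated in the China (Shanghai) region before you add data sources.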
Step 4: Configure the environment required to develop EMR tasks in DataWorks
Before you can develop and run EMR tasks in DataWorks, you must perform the following steps to prepare the required environment:
Purchase and configure a serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group.
Find the serverless resource group that you purchased, click Associate Workspace in the Actions column, and then follow the prompts to associate the resource group with the DataWorks workspace that you created.
Enable the serverless resource group to access the Internet.
Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.
Click Create NAT Gateway. Configure the parameters that are described in the following table.
Parameter: Description
Region: Select China (Shanghai).
VPC and Associate vSwitch: Select the virtual private cloud (VPC) and vSwitch with which the resource group is associated.
To view the VPC and vSwitch with which the resource group is associated, perform the following operations:
Log on to the DataWorks console. In the top navigation bar, select the region in which you activated DataWorks.
In the left-side navigation pane, click Resource Groups.
On the Resource Groups page, find the created resource group and click Network Settings in the Actions column.
In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated.
For more information about VPCs and vSwitches, see What is a VPC?
Access Mode: Select SNAT for All VPC Resources.
EIP: Select Purchase EIP.
Service-linked Role: Click Create Service-linked Role to create a service-linked role if this is the first time you create a NAT gateway.
Note: You can retain the default values for the parameters that are not described in the preceding table.
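The key constraint in the NAT gateway parameters is that the gateway must use the same VPC and vSwitch as the serverless resource group. The following Python snippet sketches that comparison together with the Access Mode and EIP choices. It is an illustration only: NetworkBinding and nat_plan_issues are hypothetical names, the IDs in the example are placeholders, and nothing here calls a VPC or DataWorks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkBinding:
    """VPC and vSwitch shown on the VPC Binding tab of the resource group."""
    vpc_id: str
    vswitch_id: str


def nat_plan_issues(resource_group, nat_vpc_id, nat_vswitch_id, access_mode, eip_option):
    """Return a list of mismatches between a planned NAT gateway and this tutorial."""
    issues = []
    if nat_vpc_id != resource_group.vpc_id:
        issues.append("NAT gateway VPC differs from the VPC of the resource group")
    if nat_vswitch_id != resource_group.vswitch_id:
        issues.append("NAT gateway vSwitch differs from the vSwitch of the resource group")
    if access_mode != "SNAT for All VPC Resources":
        issues.append("Access Mode should be SNAT for All VPC Resources")
    if eip_option != "Purchase EIP":
        issues.append("EIP should be Purchase EIP")
    return issues


if __name__ == "__main__":
    # Placeholder IDs; copy the real values from the VPC Binding tab.
    binding = NetworkBinding(vpc_id="vpc-example", vswitch_id="vsw-example")
    print(nat_plan_issues(binding, "vpc-example", "vsw-example",
                          "SNAT for All VPC Resources", "Purchase EIP"))
```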
Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Confirm.
Optional. Add the RAM user that you want to use to the workspace as a member and grant the required permissions to the member.
Only workspace members can run EMR tasks in DataStudio. For more information about how to add a RAM user to a workspace as a member and grant the required permissions, see Manage permissions on workspace-level services.
Note: The Alibaba Cloud account to which a workspace belongs and the RAM user that is used to create the workspace automatically become members of the workspace and are assigned the Workspace Administrator role.
Register the EMR cluster to DataWorks and initialize the serverless resource group.
You can use the EMR cluster in DataWorks only if you register the cluster to DataWorks. For more information, see Register an EMR cluster to DataWorks.
Important: You must make sure that the initialization of the resource group is successful. Otherwise, tasks that use the resource group may fail. If the initialization of the resource group fails, you can view the failure cause and perform a network connectivity diagnosis as prompted.
Key parameters for registering an EMR cluster to DataWorks:
Alibaba Cloud Account To Which Cluster Belongs: Select Current Alibaba Cloud Account.
Cluster Type: Select Data Lake.
Default Access Identity: Select Cluster Account: Hadoop.