Simple user profile analysis workshop

This experiment uses DataWorks together with MaxCompute to demonstrate how to use DataWorks.

Experiment introduction

Background information

In this website user profile analysis scenario, DataWorks features are used to complete the following tasks:

Collect data.
Process data.
Configure rules to monitor data quality.
Visualize data on a dashboard.

Intended audiences

This experiment is intended for development engineers, data analysts, and anyone who queries data warehouses to analyze data and gain insights, such as product operations engineers.

Used services

In this experiment, the following services are used:

DataWorks
In this experiment, DataWorks is used to collect and process data, monitor the quality of data, and visualize data. You must activate DataWorks in advance. For more information, see Activate DataWorks.
MaxCompute
MaxCompute is used to implement underlying data processing and computing. You must activate MaxCompute in advance. For more information, see Activate MaxCompute.
ApsaraDB RDS for MySQL
In this experiment, ApsaraDB RDS for MySQL is used to store user information. The basic information of an ApsaraDB RDS for MySQL data source is provided by default. You do not need to separately activate this service.
Object Storage Service (OSS)
In this experiment, OSS is used to store the website access logs of users. The basic information of an OSS data source is provided by default. You do not need to separately activate this service.

Used DataWorks services

In this experiment, the following DataWorks services are used:

Step	Operation	Phase-specific goal
Collect data	Use the DataWorks Data Integration service to synchronize the user information stored in ApsaraDB RDS for MySQL and the website access logs stored in OSS to MaxCompute. Commit the related nodes to the scheduling system, and then use DataWorks scheduling parameters to periodically synchronize incremental data.	Learn how to perform the following operations: Synchronize data from different data sources to MaxCompute. Quickly trigger a node to run. View node logs.
Process data	Use the DataWorks DataStudio service to split log data into analyzable fields by using methods such as functions and regular expressions. Aggregate the processed log data and the user information tables into basic user profile data. Then, commit the nodes to the scheduling system and use DataWorks scheduling parameters to periodically cleanse data.	Learn how to perform the following operations: Create nodes in a DataWorks workflow. Configure periodic scheduling properties for nodes. Run a workflow. Visualize created data tables.
Configure rules to monitor data quality	Use the DataWorks Data Quality service to monitor dirty data that is generated when the periodic ETL operations are performed. If dirty data is detected, the node execution is blocked to prevent the dirty data from spreading.	Learn how to use the DataWorks Data Quality service to configure monitoring rules to monitor the data quality of tables generated by DataWorks nodes. This ensures that the dirty data generated during the ETL process can be detected at the earliest opportunity and effectively prevents the dirty data from spreading downstream.
Visualize data on a dashboard	Use the DataWorks DataAnalysis service to perform user profile analysis on final result tables. For example, you can analyze the geographical distribution of users and the rankings of the number of registered users in different provinces and cities.	Learn how to visualize data on a dashboard by using DataWorks.
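
The "Process data" and "Configure rules to monitor data quality" steps in the preceding table can be illustrated outside DataWorks with a minimal Python sketch. This is not the workshop's actual node code and does not call any DataWorks or MaxCompute API. The log layout (the common Nginx/Apache "combined" format), the field names, and the helper names parse_log_line and check_dirty_ratio are all assumptions made for illustration. The sketch splits one raw log line into named fields by using a regular expression, and raises an error when the share of unparseable (dirty) lines exceeds a threshold, which mimics a blocking quality rule.

```python
import re
from typing import Optional

# Assumed log layout (not specified by the workshop): "combined" access log format.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one raw log line into named fields; return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def check_dirty_ratio(lines: list, max_dirty_ratio: float = 0.0) -> list:
    """Parse all lines and raise ValueError if too many are dirty.

    Hypothetical stand-in for a blocking quality rule: a raised error
    represents stopping downstream processing.
    """
    records = [parse_log_line(line) for line in lines]
    dirty = sum(1 for record in records if record is None)
    ratio = dirty / len(records) if records else 0.0
    if ratio > max_dirty_ratio:
        raise ValueError(f"dirty ratio {ratio:.2f} exceeds {max_dirty_ratio:.2f}")
    return [record for record in records if record is not None]


if __name__ == "__main__":
    sample = (
        '192.0.2.10 - - [12/Mar/2024:10:15:32 +0800] '
        '"GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"'
    )
    fields = parse_log_line(sample)
    print(fields["ip"], fields["url"], fields["status"])
```

In the workshop itself, the equivalent logic runs as scheduled nodes on MaxCompute, and the threshold check is configured as a Data Quality rule instead of being written by hand.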

Expectations

After you complete this experiment, you will understand the main features of DataWorks.
You will also be able to independently complete common data-related tasks in DataWorks, such as data collection, data development, and task O&M.

Duration

If you complete the experiment online, it takes approximately 1 hour.

Costs

You may be charged fees when you run the experiment. To reduce costs, the lifecycle of tables created in this experiment is set to 14 days by default. To avoid fees from long-term node scheduling, after you complete the experiment, configure the Validity Period parameter for the related nodes or freeze the root node of the workflow to which the nodes belong. The root node is the zero load node named workshop_start.

Q&A

If you have questions during the workshop, join the DingTalk group for consultation.