This experiment combines the DataWorks and MaxCompute services to demonstrate how to use DataWorks.
Quick start
In this experiment, tasks for data synchronization and data processing can be imported with one click through an extract, transform, and load (ETL) workflow template. After the template is imported, you can go to the desired workspace and complete subsequent data quality monitoring and data visualization operations.
Only users who are assigned the Workspace Administrator role can import an ETL workflow template into a workspace. For more information about how to assign the Workspace Administrator role, see Manage permissions on workspace-level services.
For information about quick access to an ETL workflow template, go to the Website User Behavior Analysis page.
Experiment introduction
Background information
In this website user profile analysis scenario, the features of DataWorks are used to accomplish the following purposes:
Synchronize data.
Process data.
Configure a monitor to monitor data quality.
Visualize data on a dashboard.
Intended audience
Development engineers, data analysts, and roles such as product operations engineers who query data from data warehouses to analyze and gain insights into data
Used services
In this experiment, the following services are used:
DataWorks
In this experiment, DataWorks is used to synchronize and process data, monitor the quality of data, and visualize data. You must activate DataWorks in advance. For more information, see Activate DataWorks.
MaxCompute
MaxCompute is used to implement underlying data processing and computing. You must activate MaxCompute in advance. For more information, see Activate MaxCompute.
ApsaraDB RDS for MySQL
In this experiment, ApsaraDB RDS for MySQL is used to store user information. The basic information of an ApsaraDB RDS for MySQL data source is provided by default. You do not need to separately activate this service.
Object Storage Service (OSS)
In this experiment, the basic information of an OSS data source is provided by default. You do not need to separately activate this service.
Used DataWorks services
In this experiment, the following DataWorks services are used.
| Step | Operation | Phase-specific goal |
| --- | --- | --- |
| Synchronize data | Use the DataWorks Data Integration service to synchronize the user information stored in ApsaraDB RDS for MySQL and the website access logs stored in OSS to MaxCompute. Commit the synchronization nodes to the scheduling system, and then use DataWorks scheduling parameters to periodically synchronize incremental data. | Learn the following items: |
| Process data | Use the DataWorks DataStudio service to split log data into analyzable fields by using methods such as functions and regular expressions. Aggregate the processed log data and the user information tables into basic user profile data. Then, commit the processing nodes to the scheduling system and use DataWorks scheduling parameters to perform periodic data cleansing. | Learn the following items: |
| Monitor data quality | Use the DataWorks Data Quality service to monitor dirty data that is generated during periodic ETL operations. If dirty data is detected, node execution is blocked to prevent the dirty data from spreading. | Learn how to use the DataWorks Data Quality service to configure monitoring rules on tables generated by DataWorks nodes, so that dirty data produced during the ETL process is detected at the earliest opportunity and prevented from spreading downstream. |
| Visualize data | Use the DataWorks DataAnalysis service to perform user profile analysis on the final result tables. For example, analyze the geographical distribution of users and rank the number of registered users by province and city. | Learn how to visualize data on a dashboard by using DataWorks. |
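The synchronization step above relies on DataWorks scheduling parameters, such as the built-in $bizdate parameter, which resolves to the day before the scheduled run, to fetch only one day's incremental data. The Python sketch below only illustrates this idea; the record layout and field names are invented for the example, not taken from the workshop data.

```python
from datetime import date, timedelta

def resolve_bizdate(run_day: date) -> str:
    """Mimic the DataWorks $bizdate parameter: the day before the run, as yyyymmdd."""
    return (run_day - timedelta(days=1)).strftime("%Y%m%d")

# Hypothetical user records tagged with the day they were written.
records = [
    {"uid": 1, "dt": "20240101"},
    {"uid": 2, "dt": "20240102"},
    {"uid": 3, "dt": "20240102"},
]

def incremental_slice(rows, bizdate: str):
    """Keep only the rows for one business date, as an incremental sync filter would."""
    return [r for r in rows if r["dt"] == bizdate]

bizdate = resolve_bizdate(date(2024, 1, 3))  # a run on 2024-01-03 targets 20240102
print(bizdate)                                # 20240102
print(incremental_slice(records, bizdate))    # the two rows written on 20240102
```

In an actual sync node, this filtering happens in the data source query rather than in application code; the point is that each scheduled instance touches only its own day's data.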
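The processing step mentions splitting raw log lines into analyzable fields with functions and regular expressions. As a rough illustration, assuming a common access-log layout rather than the actual workshop data, a log line can be split into named fields like this:

```python
import re

# Assumed log layout: ip - - [time] "method path protocol" status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})'
)

line = '192.168.0.1 - - [01/Jan/2024:12:00:00 +0800] "GET /index.html HTTP/1.1" 200'

match = LOG_PATTERN.match(line)
fields = match.groupdict()  # each named group becomes an analyzable field
print(fields["ip"])      # 192.168.0.1
print(fields["path"])    # /index.html
print(fields["status"])  # 200
```

In the workshop itself, the equivalent splitting is done in SQL with built-in functions and regular expressions rather than in Python, but the transformation from one opaque string to named columns is the same.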
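The Data Quality step describes monitoring rules that block downstream nodes when dirty data appears. The snippet below is only a conceptual sketch of that behavior, with an invented threshold and field names: it checks a freshly produced batch for null keys and raises, that is, "blocks", when the dirty ratio exceeds a limit.

```python
# Hypothetical output batch from an ETL node; uid=None marks a dirty row.
batch = [
    {"uid": "u1", "city": "Hangzhou"},
    {"uid": None, "city": "Beijing"},
    {"uid": "u3", "city": "Shanghai"},
    {"uid": "u4", "city": "Shenzhen"},
]

def check_dirty_ratio(rows, key: str, max_ratio: float) -> float:
    """Fail fast, like a blocking monitoring rule, if too many rows lack the key field."""
    dirty = sum(1 for r in rows if r.get(key) is None)
    ratio = dirty / len(rows)
    if ratio > max_ratio:
        raise RuntimeError(
            f"dirty ratio {ratio:.0%} exceeds {max_ratio:.0%}; blocking downstream nodes"
        )
    return ratio

print(check_dirty_ratio(batch, "uid", max_ratio=0.5))  # 0.25 -> passes, one dirty row in four
```

In DataWorks, such rules are configured in the Data Quality console rather than written by hand, and a triggered blocking rule fails the node so that its descendants never consume the bad output.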
Expectations
After you complete this experiment, you can:
Understand the main features of DataWorks.
Independently complete common data-related tasks in DataWorks, such as data synchronization, data development, and task O&M.
Duration
If you complete the experiment online, it requires approximately 1 hour.
Costs
You may be charged fees when you run this experiment. To reduce costs, the lifecycle of tables created in this experiment is set to 14 days by default. To prevent fees from long-term node scheduling, configure the Validity Period parameter for the related nodes after you complete the experiment, or freeze the root node of the workflow to which the nodes belong. The root node is the zero load node named workshop_start.
Q&A
If you have questions during the experiment, join the DingTalk group for consultation.