To ensure that you can complete this experiment, we recommend that you familiarize yourself with the business background of the experiment and complete the preparations.
Precautions
The basic user information and website access logs that are required for this experiment are provided for you.
The data in this experiment is manually created mock data and can be used only for experimental operations in DataWorks.
Business background
To develop effective business management strategies, you must obtain basic profile data of website users based on their activities on your websites. The basic profile data includes the geographical and social attributes of the users. You can then analyze the profile data by time and location and use it to perform refined operations on website traffic.
Step 1: Familiarize yourself with existing business data
Before you perform the operations in this experiment, make sure that you are familiar with existing business data, the format of data in the project, and the basic user profile data schema that is required for business background analysis.
Object Storage Service (OSS) object user_log.txt that stores raw log data
The following code block shows the format of the raw log data that is stored in the OSS object user_log.txt:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];
The following table describes the valid information that is obtained from the raw log data.
| Field | Description |
| --- | --- |
| $remote_addr | The IP address of the client that sends the request. |
| $remote_user | The username that is used to log on to the client. |
| $time_local | The local time of the server. |
| $request | The HTTP request. An HTTP request consists of the request type, request URL, and HTTP version number. |
| $status | The status code that is returned by the server. |
| $body_bytes_sent | The number of bytes returned to the client. The number of bytes of the header is not included in the field value. |
| $http_referer | The source URL of the request. |
| $http_user_agent | The information about the client that sends the request, such as the browser used. |
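For reference, the following is a minimal sketch of how these fields could be extracted from raw log lines by using MaxCompute SQL and the built-in regexp_extract function. The raw log table ods_raw_log_d is introduced later in this experiment; the assumption that it stores each complete log line in a single STRING column named col and is partitioned by a dt column is made only for illustration.

```sql
-- A minimal sketch, not the exact processing logic used in this experiment.
-- Assumptions: each complete log line is stored in a single STRING column named col,
-- and the table is partitioned by the business date column dt.
SELECT
  regexp_extract(col, '^([^ ]+) ', 1)         AS remote_addr,     -- client IP address
  regexp_extract(col, '\\[([^\\]]+)\\]', 1)   AS time_local,      -- local time of the server
  regexp_extract(col, '"([^"]+)"', 1)         AS request,         -- request type, URL, and HTTP version
  regexp_extract(col, '" ([0-9]+) ', 1)       AS status,          -- status code returned by the server
  regexp_extract(col, '" [0-9]+ ([0-9]+)', 1) AS body_bytes_sent  -- bytes returned to the client
  -- $http_referer and $http_user_agent can be extracted with similar patterns.
FROM ods_raw_log_d
WHERE dt = '${bizdate}';
```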
MySQL table ods_user_info_d that stores user information
| Field | Description |
| --- | --- |
| uid | The username. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |
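For reference, a minimal sketch of the corresponding MaxCompute table into which the user information could be synchronized is shown below. The STRING column types, the dt partition column, and the 30-day lifecycle are assumptions for illustration; define the table based on your actual requirements.

```sql
-- A minimal sketch of the MaxCompute target table for the basic user information.
-- The column types, the dt partition, and the lifecycle are illustrative assumptions.
CREATE TABLE IF NOT EXISTS ods_user_info_d (
    uid       STRING COMMENT 'The username',
    gender    STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac    STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (dt STRING COMMENT 'The business date')
LIFECYCLE 30;
```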
Step 2: Analyze and generate the desired basic user profile data schema
Based on the valid data that you obtain from the raw data analysis and on your business requirements, confirm the final table schema that is described in the following table.
| Field | Description |
| --- | --- |
| uid | The username. |
| region | The region. |
| device | The terminal type. |
| pv | The number of page views. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |
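To make this schema concrete, the following is a minimal sketch of a possible definition for the final user profile table, which is named ads_user_info_1d in the next step. The BIGINT type for pv, the STRING types for the other columns, the dt partition, and the lifecycle are assumptions for illustration.

```sql
-- A minimal sketch of the final user profile table; types, partition, and lifecycle
-- are illustrative assumptions.
CREATE TABLE IF NOT EXISTS ads_user_info_1d (
    uid       STRING COMMENT 'The username',
    region    STRING COMMENT 'The region, derived from the client IP address',
    device    STRING COMMENT 'The terminal type',
    pv        BIGINT COMMENT 'The number of page views',
    gender    STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac    STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (dt STRING COMMENT 'The business date')
LIFECYCLE 30;
```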
Step 3: Design a data flow and a workflow
Specify table names based on naming conventions:
You can specify table names by referring to naming conventions. In this experiment, the following tables are required:
- Basic user information table: ods_user_info_d
- Raw data table that stores website access logs: ods_raw_log_d (a sketch of this table is provided after this list)
- Fact table that stores website access logs: dwd_log_info_di
- Wide table that stores website access information about users: dws_user_info_all_di
- User profile data table: ads_user_info_1d
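For the raw data layer, a minimal sketch of the ods_raw_log_d table is shown below. Storing each complete log line in a single STRING column named col is an assumption for illustration that matches the parsing sketch in Step 1; the actual table definition used in the experiment may differ.

```sql
-- A minimal sketch of the raw log table; the single-column layout is an
-- illustrative assumption.
CREATE TABLE IF NOT EXISTS ods_raw_log_d (
    col STRING COMMENT 'One complete raw log line'
)
PARTITIONED BY (dt STRING COMMENT 'The business date')
LIFECYCLE 30;
```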
Design a data flow based on specifications:
In this experiment, batch synchronization nodes in Data Integration are used to synchronize the basic user information and the website access logs of users to MaxCompute tables. Then, ODPS SQL nodes use the MaxCompute compute engine to process the synchronized data layer by layer into the final user profile data. The preceding figure shows this logic.
Specify node names based on naming conventions:
To make it easier to identify a node and its output table, each node in this experiment is named after its output table by default. We recommend that you follow the same convention in actual data development.
Design a workflow based on specifications:
Design a workflow based on a data flow: The relationships between ancestor and descendant nodes in DataWorks follow the one-to-one mapping principle for node and table names. For more information about node naming conventions, see Configure same-cycle scheduling dependencies.
Enable the workflow management design: The batch synchronization node ods_raw_log_d for OSS and the batch synchronization node ods_user_info_d for MySQL write raw business data to the data warehouse. Because this data comes from outside the data warehouse, no data lineage exists on the data warehouse side, and the two nodes do not share a unified root node. In this case, you must add a zero load node named workshop_start to manage the user profile analysis workflow to which the nodes belong.
In this experiment, the scheduling time of the zero load node workshop_start is set to 00:15 every day. This way, the user profile analysis workflow is triggered to run at 00:15 each day.
A node can start to run only after its ancestor nodes finish running. DataWorks does not actually run a zero load node; instead, it directly sets the status of the node to successful. You can configure the zero load node workshop_start as the ancestor node of the batch synchronization nodes ods_raw_log_d and ods_user_info_d. This way, the zero load node can be used to manage the user profile analysis workflow.
Step 4: Summarize requirements
| Phase | Object | Description |
| --- | --- | --- |
| Data synchronization | ods_user_info_d | Synchronize the basic user information from the MySQL table ods_user_info_d to the MaxCompute table ods_user_info_d. |
| | ods_raw_log_d | Synchronize the website access logs of users from the OSS object user_log.txt to the MaxCompute table ods_raw_log_d. |
| Data processing | dwd_log_info_di | Use methods such as functions and regular expressions to process and split the log data that is stored in the MaxCompute table ods_raw_log_d into analyzable fields. For example, you can use functions and regular expressions to convert IP addresses into regions. Then, write the processed data to the dwd_log_info_di table. |
| | dws_user_info_all_di | Aggregate the MaxCompute table ods_user_info_d that stores the basic user information and the MaxCompute table dwd_log_info_di that stores the preliminarily processed log data into the dws_user_info_all_di wide table that stores user access information. |
| | ads_user_info_1d | Process the dws_user_info_all_di wide table that stores user access information to generate the desired user profile data. For a sketch of this step and the preceding aggregation step, see the example after this table. |
| Data quality monitoring | ods_raw_log_d | Configure a rule that monitors whether the number of rows synchronized to the raw log data table is 0 on a daily basis. This rule helps prevent invalid data processing. |
| | ods_user_info_d | Configure a strong rule that monitors whether the number of rows synchronized to the user information table is 0 on a daily basis, and a weak rule that monitors whether the business primary key in the table is unique on a daily basis. These rules help prevent invalid data processing. |
| | ads_user_info_1d | Configure a weak rule that monitors the fluctuation of the number of rows in the user profile table on a daily basis, and a strong rule that monitors whether the business primary key in the table is unique on a daily basis. These rules are used to observe the fluctuation of daily unique visitors (UVs) and help you learn the application status at the earliest opportunity. |
| Data visualization | ads_user_info_1d | After the data is processed, you can use the DataAnalysis service to display the analysis result of the user profile data. |
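The following is a minimal sketch of the last two processing steps that are summarized in the preceding table: joining the processed log data with the basic user information into the wide table, and aggregating the wide table into the user profile table. The column names region and device and the dt partition follow the schemas described earlier but remain assumptions about the intermediate tables; the actual ODPS SQL nodes in this experiment may differ.

```sql
-- A minimal sketch, assuming all tables are partitioned by the business date column dt
-- and that dwd_log_info_di carries one row per page view together with a uid column.

-- Build the wide table that combines the processed log data and the basic user information.
INSERT OVERWRITE TABLE dws_user_info_all_di PARTITION (dt='${bizdate}')
SELECT COALESCE(a.uid, b.uid) AS uid,
       b.gender,
       b.age_range,
       b.zodiac,
       a.region,                      -- derived from the client IP address
       a.device                       -- terminal type parsed from $http_user_agent
FROM dwd_log_info_di a
LEFT OUTER JOIN ods_user_info_d b
  ON a.uid = b.uid
 AND b.dt = '${bizdate}'
WHERE a.dt = '${bizdate}';

-- Aggregate the wide table into the final user profile table.
INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='${bizdate}')
SELECT uid,
       MAX(region)    AS region,
       MAX(device)    AS device,
       COUNT(1)       AS pv,          -- page views per user per day
       MAX(gender)    AS gender,
       MAX(age_range) AS age_range,
       MAX(zodiac)    AS zodiac
FROM dws_user_info_all_di
WHERE dt = '${bizdate}'
GROUP BY uid;
```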