To ensure that you can complete this experiment, we recommend that you familiarize yourself with the business background of the experiment and complete the preparations.
Precautions
The basic user information and website access logs that are required for this experiment are provided for you.
The data in this experiment is manually created mock data and can be used only for experimental operations in DataWorks.
Business background
To develop effective business management strategies, you must obtain basic profile data of website users based on their activities on your websites. The basic profile data includes the geographical and social attributes of the users. You can then analyze the profile data by time and location and use it to perform refined operations on website traffic.
Step 1: Familiarize yourself with existing business data
Before you perform the operations in this experiment, make sure that you are familiar with existing business data, the format of data in the project, and the basic user profile data schema that is required for business background analysis.
Object Storage Service (OSS) object user_log.txt that stores raw log data
The following code block shows the format of the raw log data that is stored in the OSS object user_log.txt:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];
The following table describes the valid information that is obtained from the raw log data.
| Field | Description |
| --- | --- |
| $remote_addr | The IP address of the client that sends the request. |
| $remote_user | The username that is used to log on to the client. |
| $time_local | The local time of the server. |
| $request | The HTTP request. An HTTP request consists of the request type, request URL, and HTTP version number. |
| $status | The status code that is returned by the server. |
| $body_bytes_sent | The number of bytes returned to the client. The number of bytes of the header is not included in the field value. |
| $http_referer | The source URL of the request. |
| $http_user_agent | The information about the client that sends the request, such as the browser used. |
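For reference, the following is a minimal sketch of how these fields could be extracted from raw log lines by using MaxCompute SQL and the built-in regexp_extract function. The raw log table ods_raw_log_d is introduced later in this experiment; the assumption that it stores each complete log line in a single STRING column named col and is partitioned by a dt column is made only for illustration.

```sql
-- A minimal sketch, not the exact processing logic used in this experiment.
-- Assumptions: each complete log line is stored in a single STRING column named col,
-- and the table is partitioned by the business date column dt.
SELECT
  regexp_extract(col, '^([^ ]+) ', 1)         AS remote_addr,     -- client IP address
  regexp_extract(col, '\\[([^\\]]+)\\]', 1)   AS time_local,      -- local time of the server
  regexp_extract(col, '"([^"]+)"', 1)         AS request,         -- request type, URL, and HTTP version
  regexp_extract(col, '" ([0-9]+) ', 1)       AS status,          -- status code returned by the server
  regexp_extract(col, '" [0-9]+ ([0-9]+)', 1) AS body_bytes_sent  -- bytes returned to the client
  -- $http_referer and $http_user_agent can be extracted with similar patterns.
FROM ods_raw_log_d
WHERE dt = '${bizdate}';
```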
MySQL table ods_user_info_d that stores user information
| Field | Description |
| --- | --- |
| uid | The username. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |
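For reference, a minimal sketch of the corresponding MaxCompute table into which the user information could be synchronized is shown below. The STRING column types, the dt partition column, and the 30-day lifecycle are assumptions for illustration; define the table based on your actual requirements.

```sql
-- A minimal sketch of the MaxCompute target table for the basic user information.
-- The column types, the dt partition, and the lifecycle are illustrative assumptions.
CREATE TABLE IF NOT EXISTS ods_user_info_d (
    uid       STRING COMMENT 'The username',
    gender    STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac    STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (dt STRING COMMENT 'The business date')
LIFECYCLE 30;
```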
Step 2: Analyze and generate the desired basic user profile data schema
Based on the valid data that you obtain from the raw data analysis and on your business requirements, confirm the final table schema that is described in the following table.
| Field | Description |
| --- | --- |
| uid | The username. |
| region | The region. |
| device | The terminal type. |
| pv | The number of page views. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |
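To make this schema concrete, the following is a minimal sketch of a possible definition for the final user profile table, which is named ads_user_info_1d in the next step. The BIGINT type for pv, the STRING types for the other columns, the dt partition, and the lifecycle are assumptions for illustration.

```sql
-- A minimal sketch of the final user profile table; types, partition, and lifecycle
-- are illustrative assumptions.
CREATE TABLE IF NOT EXISTS ads_user_info_1d (
    uid       STRING COMMENT 'The username',
    region    STRING COMMENT 'The region, derived from the client IP address',
    device    STRING COMMENT 'The terminal type',
    pv        BIGINT COMMENT 'The number of page views',
    gender    STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac    STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (dt STRING COMMENT 'The business date')
LIFECYCLE 30;
```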
Step 3: Design a data flow and a workflow
Specify table names based on naming conventions:
You can specify table names by referring to naming conventions. In this experiment, the following tables are required:
- Basic user information table: ods_user_info_d
- Raw data table that stores website access logs: ods_raw_log_d (a sketch of this table is provided after this list)
- Fact table that stores website access logs: dwd_log_info_di
- Wide table that stores website access information about users: dws_user_info_all_di
- User profile data table: ads_user_info_1d
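For the raw data layer, a minimal sketch of the ods_raw_log_d table is shown below. Storing each complete log line in a single STRING column named col is an assumption for illustration that matches the parsing sketch in Step 1; the actual table definition used in the experiment may differ.

```sql
-- A minimal sketch of the raw log table; the single-column layout is an
-- illustrative assumption.
CREATE TABLE IF NOT EXISTS ods_raw_log_d (
    col STRING COMMENT 'One complete raw log line'
)
PARTITIONED BY (dt STRING COMMENT 'The business date')
LIFECYCLE 30;
```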
Design a data flow based on specifications:
In this experiment, batch synchronization nodes in Data Integration are used to synchronize the basic user information and the website access logs of users to MaxCompute tables. Then, ODPS SQL nodes use the MaxCompute compute engine to process the synchronized data layer by layer into the final user profile data. The preceding figure shows this logic.
Specify node names based on naming conventions:
To make it easier to identify a node and its output table, each node in this experiment is named after its output table by default. We recommend that you follow the same convention in actual data development.
Design a workflow based on specifications:
Design a workflow based on a data flow: The relationships between ancestor and descendant nodes in DataWorks follow the one-to-one mapping principle for node and table names. For more information about node naming conventions, see Configure same-cycle scheduling dependencies.
Enable the workflow management design: The batch synchronization node ods_raw_log_d for OSS and the batch synchronization node ods_user_info_d for MySQL write raw business data to the data warehouse. Because this data comes from outside the data warehouse, no data lineage exists on the data warehouse side, and the two nodes do not share a unified root node. In this case, you must add a zero load node named workshop_start to manage the user profile analysis workflow to which the nodes belong.
In this experiment, the scheduling time of the zero load node workshop_start is set to 00:15 every day. This way, the user profile analysis workflow is triggered to run at 00:15 each day.
A node can start to run only after its ancestor nodes finish running. DataWorks does not actually run a zero load node; instead, it directly sets the status of the node to successful. You can configure the zero load node workshop_start as the ancestor node of the batch synchronization nodes ods_raw_log_d and ods_user_info_d. This way, the zero load node can be used to manage the user profile analysis workflow.
Step 4: Summarize requirements
| Phase | Object | Description |
| --- | --- | --- |
| Data synchronization | ods_user_info_d | Synchronize the basic user information from the MySQL table ods_user_info_d to the MaxCompute table ods_user_info_d. |
| | ods_raw_log_d | Synchronize the website access logs of users from the OSS object user_log.txt to the MaxCompute table ods_raw_log_d. |
| Data processing | dwd_log_info_di | Use methods such as functions and regular expressions to process and split the log data that is stored in the MaxCompute table ods_raw_log_d into analyzable fields. For example, you can use functions and regular expressions to convert IP addresses into regions. Then, write the processed data to the dwd_log_info_di table. |
| | dws_user_info_all_di | Aggregate the MaxCompute table ods_user_info_d that stores the basic user information and the MaxCompute table dwd_log_info_di that stores the preliminarily processed log data into the dws_user_info_all_di wide table that stores user access information. |
| | ads_user_info_1d | Process the dws_user_info_all_di wide table that stores user access information to generate the desired user profile data. For a sketch of this step and the preceding aggregation step, see the example after this table. |
| Data quality monitoring | ods_raw_log_d | Configure a rule that monitors whether the number of rows synchronized to the raw log data table is 0 on a daily basis. This rule helps prevent invalid data processing. |
| | ods_user_info_d | Configure a strong rule that monitors whether the number of rows synchronized to the user information table is 0 on a daily basis, and a weak rule that monitors whether the business primary key in the table is unique on a daily basis. These rules help prevent invalid data processing. |
| | ads_user_info_1d | Configure a weak rule that monitors the fluctuation of the number of rows in the user profile table on a daily basis, and a strong rule that monitors whether the business primary key in the table is unique on a daily basis. These rules are used to observe the fluctuation of daily unique visitors (UVs) and help you learn the application status at the earliest opportunity. |
| Data visualization | ads_user_info_1d | After the data is processed, you can use the DataAnalysis service to display the analysis result of the user profile data. |
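The following is a minimal sketch of the last two processing steps that are summarized in the preceding table: joining the processed log data with the basic user information into the wide table, and aggregating the wide table into the user profile table. The column names region and device and the dt partition follow the schemas described earlier but remain assumptions about the intermediate tables; the actual ODPS SQL nodes in this experiment may differ.

```sql
-- A minimal sketch, assuming all tables are partitioned by the business date column dt
-- and that dwd_log_info_di carries one row per page view together with a uid column.

-- Build the wide table that combines the processed log data and the basic user information.
INSERT OVERWRITE TABLE dws_user_info_all_di PARTITION (dt='${bizdate}')
SELECT COALESCE(a.uid, b.uid) AS uid,
       b.gender,
       b.age_range,
       b.zodiac,
       a.region,                      -- derived from the client IP address
       a.device                       -- terminal type parsed from $http_user_agent
FROM dwd_log_info_di a
LEFT OUTER JOIN ods_user_info_d b
  ON a.uid = b.uid
 AND b.dt = '${bizdate}'
WHERE a.dt = '${bizdate}';

-- Aggregate the wide table into the final user profile table.
INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='${bizdate}')
SELECT uid,
       MAX(region)    AS region,
       MAX(device)    AS device,
       COUNT(1)       AS pv,          -- page views per user per day
       MAX(gender)    AS gender,
       MAX(age_range) AS age_range,
       MAX(zodiac)    AS zodiac
FROM dws_user_info_all_di
WHERE dt = '${bizdate}'
GROUP BY uid;
```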