Data Acquisition with DataWorks

This is the second section of the DataWorks workshop. In this section of the workshop you will work on a massive log data analysis task. By doing so, you will learn how to synchronize data from different data sources to MaxCompute, how to quickly trigger task runs, and how to view task logs.

Prerequisites

Before you begin the steps outlined in this section of the workshop, you need to make sure you have an valid Alibaba Cloud account.

Activating MaxCompute

If you have already activated MaxCompute, you can skip this step and go on to create the project space directly.

1. Log in to the Alibaba Cloud website, click Log in in the upper-right corner to fill in your Alibaba Cloud account and password.

2. Select Products and then Analytics & Big Data and last MaxComputute, and then go to the MaxCompute product details page.

4. Click Start now.

5. Select Pay-As-You-Go, click Buy Now.

Creating a Workspace

1. Log on to the DataWorks console by using a primary account.

2. You can create a workspace in one of two ways.
(1). On the console Overview page, go to Commonly Used Features > Create Workspace.

(2). On the console Workspace page, select region, and then click Create Workspace in the upper right corner.

3. Fill in the configuration items in the Create Workspace dialog box. Select a region and a calculation engine service.
Note: If you have not purchase the relevant services in the region, it is directly display that there is no service available in the Region. The data analytics, O&M, and administration are selected by default.

4. Configure the basic information and advanced settings for the new project, and click Create Workspace.

Note the following:

(1). The workspace name needs to begin with a letter or underline, and can only contain letters, underscores, and numbers.
(2). The workspace name is globally unique, it is recommended that you use your own easy-to-distinguish name as the project space name for this lab.

5. Once the workspace has been created successfully, you can select the Workspace page to Data Analytics after viewing the workspace space.

Creating a Data Source

Before you begin, note that, based on the scenario simulated by this lab, you need to distribute to create both the OSS and ApsaraDB for RDS data sources.

Create a new OSS data source
Select the Data Integration > Data Source Page, and click Add Data Source.

Select the data source type as OSS, with other configuration items as follows.

Here are the above parameters you need to enter:

Endpoint: http://oss-cn-shanghai-internal.aliyuncs.com
bucket: dataworks-workshop

Click Test Connectivity, and after the connectivity test passes, click Finish to save the configuration.

Note: If the test connectivity fails, check your AK and the region in which the item is located. It is recommended to create the project in China (Shanghai), and other regions do not guarantee network access.

Add RDS Data Source

Select the Data Integration > Data SourcePage, and click Add Data Source.
Select the data source type as MySQL, and fill in the configuration information.

Here are the above parameters you need to enter:

Data source type: ApsaraDB for RDS
Data source name: rds_workshop_log
Data source description: RDS log data synchronization
RDS instance name: rm-2ev0681lc7042g16u
RDS instance buyer ID: 5600815724958382
Database name: workshop
Username/Password: workshop/workshop#2017

Click Test Connectivity, and after the connectivity test passes, click Finish to save the configuration.

Creating a Business Flow

1. Right-click Business Flow under Data Analytics, select Create Business Flow.

2. Fill in the Business Flow name and description.

3. Click Create to complete the creation of the Business Flow.

4. Enter the Business Flow Development Panel and drag a virtual node and two data sync nodes (oss_datasync and rds_datasync) into the Panel.

5. Drag the connection to set the workshop_start node to the upstream of both data synchronization nodes.

Configuring `workshop_start` Task

Since the new version sets the input and output nodes for each node, you need to set an input for the workshop_start node, the virtual node in the Business Flow can be set to the upstream node as the project root node, the project root node is generally named project name _ root.

You can configure it by clicking Schedule. When the task configuration is complete, click Save.

Creating a Table

1. Right-click Table and choose Create Table.

2. Type in Table Name (ods_raw_log_d and ods_user_info_d) for the OSS logs and RDS respectively.

3. Type in your Table Alias and choose Partitioned Table.

4. Type in the field and partition information, click Submit to Development Environment and Submit to Production Environment.

You can also click DDL Mode, use the following SQL statements to create tables.

//Create a target table for oss logs
CREATE TABLE IF NOT EXISTS  ods_raw_log_d (
  col STRING
)
PARTITIONED BY (
  dt STRING
);

//Creates a target table for RDS
CREATE TABLE IF NOT  EXISTS ods_user_info_d (
  uid STRING COMMENT 'User ID',
  gender STRING COMMENT 'Gender',
  age_range STRING COMMENT 'Age range',
  zodiac STRING COMMENT 'Zodiac'
)
PARTITIONED BY (
  dt STRING
);

5. Click Submit to Development Environment and Submit to Production Environment. You can configure both of the tables in this way.

Configuring the Data Synchronization Task

Configure the oss_datasync node

Double-click the oss_datasync node to go to the node configuration page.
Select a data source.

Select the data source you made in the OSS data source.

Here are the above parameters you need to enter:

Data source: oss_workshop_log
Object Prefix: /user_log.txt
Column Separator: |

Select data destination

Select the data destination is ods_raw_log_d in the odps_first data source. Both partition information and cleanup rules take the system default, the default configuration of the partition is ${bizdate}.

Configure the field mapping, connect the fields that you want to synchronize.

Configure Transmission Rate with a maximum operating rate of 10 Mb/s.

Verify that the current task is configured and can be modified. After the confirmation is correct, click Save in the upper left corner.
Closes the current task and returns to the Business Flow configuration panel.

Configure the rds_datasync node Node

Double-click the rds_datasync node to go to the node configuration page.
Select a data source.

Select the data source that is located in the MySQL data source rds_workshop_log, and the table is named as ods_user_info_d, the split key uses the default to generate columns.

Select data destination

Select the data destination ods_user_info_d in the data source named odps_first. Both partition information and cleanup rules take the system default, the default configuration of the partition is ${bizdate}.

Configure the field mapping, default in association with the name mapping.

Configure Transmission Rate with a maximum operating rate of 10 Mb/s.

Verify that the current task is configured and can be modified. After the confirmation is correct, click Save in the upper left corner.

Closes the current task and returns to the Business Flow configuration panel.

Submiting Business Flow Tasks

1. Click Submit to submit the current Business Flow.

2. Select the nodes in the submit dialog box, and check the Ignore Warning on I/O Inconsistency, click Submit.

Running Workflow Tasks

1. Click Run.

During a task run, you can view the run status.

2. Right-click the rds_datasync task and select View Log.

When the following words appear in the log, it indicates that the synchronization task runs successfully and synchronizes a batch of data.

3. Right-click the oss_datasync task and select View Log. The confirmation method is consistent with the rds_datasync task.

Checking If the Data Imported into MaxCompute

1. Click temporary query in the left-hand navigation bar.

2. Select New > ODPS SQL.

3. Write and execute SQL statement to check the entries imported into ods_raw_log_d.

4. Also write and execute SQL statements to view the number of imported ods_user_info_d records.

Note: The SQL statement is as follows, where the partition columns need to be updated to the business date, if the task runs on a date of 20180717, the business date is 20180716.

Check that data was written to MaxCompute

select count(*) from ods_raw_log_d where dt=business date;
select count(*) from ods_user_info_d where dt=business date;

Next step

Now that you've learned how to synchronize the log data, complete the data acquisition, you can continue with the next tutorial in this workshop, which is Data Processing with DataWorks .

Community

Data Acquisition with DataWorks

Prerequisites

Activating MaxCompute

Creating a Workspace

Creating a Data Source