DataWorks is a platform that can be used to process and analyze large amounts of data in offline mode. It uses MaxCompute as a computing and storage engine to offer fully hosted services for visual workflow development, scheduling, and O&M. In DataWorks, tasks can be hosted and scheduled by time or dependency. This topic describes how to use DataWorks to schedule DLA tasks.
Purpose
Task scheduling requires dependencies between tasks. In this topic, two DLA tasks are created in DataWorks to demonstrate the dependencies between tables and tasks.
Task 1
DataWorks executes the
o_orderstatus = 'F'
statement to filter the finished orders in the orders table and then writes the finished orders into thefinished_orders
table.Task 2
DataWorks executes the
o_totalprice > 10000
statement to filter the orders whose total price is greater than USD 10,000 from the finished_orders table, and then writes these orders into thehigh_value_finished_orders
table.
The orders table is stored in the oss://dlaossfile1/dla/
path, of which dla indicates the object and dlaossfile1 indicates the bucket. You can click the orders.txt file in the preceding path to download source data. The empty files finished_orders.txt and high_value_finished_orders.txt are stored in the oss://dlaossfile1/dla/finished_orders/
path.
Prerequisites
The DLA, DataWorks, and OSS services are activated and deployed in the same region. In this topic, the three services are all deployed in the China (Hangzhou) region.
A workspace is created. For more information, see Create a workspace. In this topic, the liujing_dataworks_test workspace is created.
An endpoint is created in DLA. For more information, see Set an endpoint.
Procedure
Create an OSS schema in DLA by using the following statement:
CREATE SCHEMA dataworks_demo with DBPROPERTIES( CATALOG = 'oss', LOCATION = 'oss://dlaossfile1/dla/');
Create the orders, finished_orders, and high_value_finished_orders table for the OSS file in DLA by using the following statements:
Statement to create the orders table
CREATE EXTERNAL TABLE IF NOT EXISTS orders ( O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE DATE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dlaossfile1/dla/';
Statement to create the finished_orders table
CREATE EXTERNAL TABLE IF NOT EXISTS finished_orders ( O_ORDERKEY INT, O_TOTALPRICE DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dlaossfile1/dla/finished_orders/';
Statement to create the high_value_finished_orders table
CREATE EXTERNAL TABLE IF NOT EXISTS high_value_finished_orders ( O_ORDERKEY INT, O_TOTALPRICE DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://dlaossfile1/dla/finished_orders/';
Add a DLA data source in DataWorks.
Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the page that appears, find your workspace and click Data Integration in the Actions column.
On the Welcome to Data Integration page, click the Connection icon. In the left-side navigation pane, click Data Source. On the page that appears, click New data source in the upper-right corner. In the Add data source dialog box, click the Data Lake Analytics(DLA) icon in Big Data Storage.
In the Add Data Lake Analytics(DLA) data source dialog box.
The following table describes the parameters that you must specify.
Data Source Name
The name of the data source. We recommend that you specify an informative name for easy management.
Description
The description of the data source. This parameter is optional.
Connection Url
The DLA endpoint in the
Address:Port
format. For more information about how to obtainAddress:Port
, see Create an endpoint.Database
The name of the Object Storage Service (OSS) database that you create in DLA. In this topic, set this parameter to dataworks_demo.
User name
The username that is used to access DLA.
Password
The password that is used to access DLA.
Modify the IP address whitelist for DLA in DataWorks.
DataWorks allows you to add only the DLA data sources whose IP addresses are specified in an IP address whitelist of DataWorks. Therefore, you must add the IP addresses or Classless Inter-Domain Routing (CIDR) blocks of your DLA region to the IP address whitelist.
China (Hangzhou)
100.64.0.0/8,11.193.102.0/24,11.193.215.0/24,11.194.110.0/24,11.194.73.0/24,118.31.157.0/24,47.97.53.0/24,11.196.23.0/24,47.99.12.0/24,47.99.13.0/24,114.55.197.0/24,11.197.246.0/24,11.197.247.0/24
China (Shanghai)
11.193.109.0/24,11.193.252.0/24,47.101.107.0/24,47.100.129.0/24,106.15.14.0/24,10.117.28.203,10.117.39.238,10.143.32.0/24,10.152.69.0/24,10.153.136.0/24,10.27.63.15,10.27.63.38,10.27.63.41,10.27.63.60,10.46.64.81,10.46.67.156,11.192.97.0/24,11.192.98.0/24,11.193.102.0/24,11.218.89.0/24,11.218.96.0/24,11.219.217.0/24,11.219.218.0/24,11.219.219.0/24,11.219.233.0/24,11.219.234.0/24,118.178.142.154,118.178.56.228,118.178.59.233,118.178.84.74,120.27.160.26,120.27.160.81,121.43.110.160,121.43.112.137,100.64.0.0/8
China (Shenzhen)
100.106.46.0/24,100.106.49.0/24,10.152.27.0/24,10.152.28.0/24,11.192.91.0/24,11.192.96.0/24,11.193.103.0/24,100.64.0.0/8,120.76.104.0/24,120.76.91.0/24,120.78.45.0/24
China (Hong Kong)
10.152.162.0/24,11.192.196.0/24,11.193.11.0/24,100.64.0.0/8,11.192.196.0/24,47.89.61.0/24,47.91.171.0/24,11.193.118.0/24,47.75.228.0/24
Singapore (Singapore)
100.106.10.0/24,100.106.35.0/24,10.151.234.0/24,10.151.238.0/24,10.152.248.0/24,11.192.153.0/24,11.192.40.0/24,11.193.8.0/24,100.64.0.0/8,100.106.10.0/24,100.106.35.0/24,10.151.234.0/24,10.151.238.0/24,10.152.248.0/24,11.192.40.0/24,47.88.147.0/24,47.88.235.0/24,11.193.162.0/24,11.193.163.0/24,11.193.220.0/24,11.193.158.0/24,47.74.162.0/24,47.74.203.0/24,47.74.161.0/24,11.197.188.0/24
Australia (Sydney) Closing Down
11.192.100.0/24,11.192.134.0/24,11.192.135.0/24,11.192.184.0/24,11.192.99.0/24,100.64.0.0/8,47.91.49.0/24,47.91.50.0/24,11.193.165.0/24,47.91.60.0/24
China (Beijing)
100.106.48.0/24,10.152.167.0/24,10.152.168.0/24,11.193.50.0/24,11.193.75.0/24,11.193.82.0/24,11.193.99.0/24,100.64.0.0/8,47.93.110.0/24,47.94.185.0/24,47.95.63.0/24,11.197.231.0/24,11.195.172.0/24,47.94.49.0/24,182.92.144.0/24
US (Silicon Valley)
10.152.160.0/24,100.64.0.0/8,47.89.224.0/24,11.193.216.0/24,47.88.108.0/24
US (Virginia)
11.193.203.0/24,11.194.68.0/24,11.194.69.0/24,100.64.0.0/8,47.252.55.0/24,47.252.88.0/24
Malaysia (Kuala Lumpur)
11.193.188.0/24,11.221.205.0/24,11.221.206.0/24,11.221.207.0/24,100.64.0.0/8,11.214.81.0/24,47.254.212.0/24,11.193.189.0/24
Germany (Frankfurt)
11.192.116.0/24,11.192.168.0/24,11.192.169.0/24,11.192.170.0/24,11.193.106.0/24,100.64.0.0/8,11.192.116.14,11.192.116.142,11.192.116.160,11.192.116.75,11.192.170.27,47.91.82.22,47.91.83.74,47.91.83.93,47.91.84.11,47.91.84.110,47.91.84.82,11.193.167.0/24,47.254.138.0/24
Japan (Tokyo)
100.105.55.0/24,11.192.147.0/24,11.192.148.0/24,11.192.149.0/24,100.64.0.0/8,47.91.12.0/24,47.91.13.0/24,47.91.9.0/24,11.199.250.0/24,47.91.27.0/24
UAE (Dubai)
11.192.107.0/24,11.192.127.0/24,11.192.88.0/24,11.193.246.0/24,47.91.116.0/24,100.64.0.0/8
India (Mumbai)
11.194.10.0/24,11.246.70.0/24,11.246.71.0/24,11.246.73.0/24,11.246.74.0/24,100.64.0.0/8,149.129.164.0/24,11.194.11.0/24
UK (London)
11.199.93.0/24,100.64.0.0/8
Indonesia (Jakarta)
11.194.49.0/24,11.200.93.0/24,11.200.95.0/24,11.200.97.0/24,100.64.0.0/8,149.129.228.0/24,10.143.32.0/24,11.194.50.0/24
China North 2 Ali Gov
If the CIDR block 11.194.116.0/24,100.64.0.0/8 cannot be added, add the following IP addresses: 11.194.116.160, 11.194.116.161, 11.194.116.162, 11.194.116.163, 11.194.116.164, 11.194.116.165, 11.194.116.167, 11.194.116.169, 11.194.116.170, 11.194.116.171, 11.194.116.172, 11.194.116.173, 11.194.116.174, and 11.194.116.175.
After you specify the preceding parameters, click Test connectivity. After the connectivity test is passed, click Complete.
Create a DLA scheduling task in DataWorks.
Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the page that appears, find your workspace and click Data Analytics in the Actions column.
In the left-side pane, right-click Business process and select New business process. In this topic, the dla_test_1 business process is created.
Create a DLA task. In this topic, the finished_orders task is created.
In the left-side pane, unfold Business process and click the finished_orders task. On the finished_orders tab, unfold Customize and click Data Lake Analytics.
NoteYou can repeat this step to create multiple tasks. In this topic, the finished_orders and high_value_finished_orders tasks are created.
Run a DLA scheduling task in DataWorks.
Task 1: DataWorks uses
o_orderstatus = 'F'
to filter the finished orders in the orders table and then writes the finished orders into the finished_orders table.insert into finished_orders select O_ORDERKEY, O_TOTALPRICE from orders where O_ORDERSTATUS = 'F';
Task 2: DataWorks uses
o_totalprice > 10000
to filter the orders whose total price is greater than USD 10,000 from the finished_orders table and writes these orders into the high_value_finished_orders table.insert into high_value_finished_orders select * from finished_orders where O_TOTALPRICE > 10000;
What to do next
Task configuration
In DataWorks, you can configure tasks to be triggered by time or dependency. You can also configure multiple tasks to be executed at a specified time based on specified dependencies.
To run the finished_orders task at 02:00 every day.
To run the high_value_finished_orders task after the finished_orders task is running normally.
Task publish
After you configure tasks, you can publish and maintain these tasks. For more information, see Deploy nodes.