
DataWorks:Develop tasks based on LDPS

Last Updated: Sep 19, 2024

Lindorm Distributed Processing System (LDPS) is compatible with Cloudera's Distribution Including Apache Hadoop (CDH). You can register a CDH cluster to DataWorks and configure the connection information of LDPS for the cluster to perform operations such as interactive SQL queries, SQL task development, and JAR task execution in DataWorks based on LDPS. This topic describes how to register a CDH cluster to DataWorks to access LDPS, and how to develop, schedule, and perform O&M on various types of tasks based on LDPS.

Background information

LDPS is a distributed computing service that is built on a cloud-native architecture. It supports open source community computing models, is compatible with the Spark API, and is deeply integrated with the features of the Lindorm storage engine. LDPS makes full use of the underlying storage and indexing capabilities to complete distributed tasks in an efficient manner. LDPS is suitable for scenarios such as large-scale data production, interactive analytics, machine learning, and graph computing.

Prerequisites

Before you develop tasks in DataWorks based on LDPS, make sure that the following operations are performed:

  • A Lindorm instance is created and LDPS is activated for the instance. This is required before you can develop tasks in DataWorks based on LDPS. For more information about how to activate LDPS, see Activate LDPS and modify the configurations.

  • A CDH cluster is created and registered to DataWorks. DataWorks allows you to access LDPS by registering a CDH cluster to DataWorks. For more information about how to register a CDH cluster to DataWorks, see Register a CDH or CDP cluster to DataWorks.

    When you register a CDH cluster, you must specify the connection information of LDPS for the CDH cluster and set the version of the CDH cluster to 6.3.2. In this case, you need to configure only the HiveServer2 and Metastore parameters. Other parameters can be left empty.

  • A workflow is created. Development operations in different types of compute engines are performed based on workflows in DataStudio. You can orchestrate tasks on nodes in workflows based on your business requirements to view dependencies between the tasks. For information about how to create a workflow, see Create a workflow.

Step 1: Develop tasks based on LDPS

This section describes how to develop tasks by executing SQL statements or using a JAR package.

Develop tasks by executing SQL statements

Orchestrate tasks on nodes

Double-click the created workflow to go to the configuration tab of the workflow. Drag the required node types to the canvas on the right side, configure the basic information about the nodes, and then draw lines to connect the nodes to plan node dependencies. The following figure shows how to orchestrate tasks on nodes.

  • Zero load node (virtual node): serves as the start node of the workflow and is used to schedule the tasks in the workflow in a centralized manner.

  • CDH Hive node: is used to run SQL tasks based on LDPS.

Write task code

In this example, CDH Hive nodes are used to develop SQL tasks based on Spark SQL syntax.

Note

You can use serverless resource groups or old-version exclusive resource groups for scheduling to run tasks on CDH Hive nodes. We recommend that you use serverless resource groups.

  1. Double-click each CDH Hive node in the workflow to go to the configuration tab of the node.

    In this example, double-click the created CDH Hive nodes A, B, and C in sequence and configure properties for tasks on the nodes.

  2. Optional. Configure parameters for Spark jobs of LDPS.

    LDPS allows you to configure common parameters for Spark jobs, including parameters related to resources, execution, and monitoring, for example, SET spark.executor.cores=2;. Configure the parameters based on your business requirements. For more information, see Configure parameters for jobs. A combined sketch that places the parameter statements before the task SQL is provided at the end of this section.

    Note

    Statements for parameter configuration of LDPS must be written before statements for SQL task development.

  3. Write task code.

    Develop SQL code: Simple example

    In the code editor on the configuration tab of each CDH Hive node, write task code. The following sample code provides an example. For information about the SQL syntax supported by Lindorm, see Spark SQL, DataFrames and Datasets Guide.

    CREATE TABLE test
    (
        id    INT
        ,name STRING
    );
    
    INSERT INTO test VALUES
            (1,'jack');
    
    SELECT * FROM test;

    Develop SQL code: Use scheduling parameters

    DataWorks provides scheduling parameters whose values are dynamically replaced in the code of a task based on the configurations of the scheduling parameters in periodic scheduling scenarios. You can define variables in the task code in the ${Variable} format and assign values to the variables in the Scheduling Parameter section of the Properties tab.

    select '${var}'; -- You can assign a specific scheduling parameter to the var variable.

    Sample code for tasks on nodes

    • Sample code for a task on Node A

      CREATE TABLE IF NOT EXISTS tableA (
        id INT,
        name STRING,
        data STRING
      )
      USING parquet
      PARTITIONED BY (partition_date DATE);
      
      INSERT OVERWRITE TABLE tableA PARTITION (partition_date='${var}')
      VALUES (1, 'Alice', 'Sample data 1'), (2, 'Bob', 'Sample data 2');
    • Sample code for a task on Node B

      CREATE TABLE IF NOT EXISTS tableB (
        id INT,
        name STRING,
        data STRING
      )
      USING parquet
      PARTITIONED BY (partition_date DATE);

      INSERT OVERWRITE TABLE tableB PARTITION (partition_date)
      SELECT * FROM tableA WHERE partition_date='${var}';
    • Sample code for a task on Node C

      CREATE TABLE IF NOT EXISTS tableC
      (
          id    INT
          ,name STRING
          ,data STRING
      )
      USING parquet
      PARTITIONED BY (partition_date DATE);
      
      INSERT OVERWRITE TABLE tableC PARTITION (partition_date)
      SELECT * FROM tableB WHERE partition_date='${var}';
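
Combined example: job parameters and task SQL

As described in the note in step 2, statements that configure LDPS job parameters must be written before the SQL statements of a task. The following minimal sketch shows how the two parts can be combined in a single CDH Hive node. The table name sample_orders and the spark.executor.memory value are illustrative assumptions that are not taken from this topic; replace them with values that suit your workload.

-- Job parameter statements come first. spark.executor.cores is the parameter used as an
-- example in this topic; spark.executor.memory is an additional, assumed parameter.
SET spark.executor.cores=2;
SET spark.executor.memory=4g;

-- The task SQL follows the parameter statements.
CREATE TABLE IF NOT EXISTS sample_orders (
  id INT,
  name STRING,
  data STRING
)
USING parquet
PARTITIONED BY (partition_date DATE);

INSERT OVERWRITE TABLE sample_orders PARTITION (partition_date='${var}')
VALUES (1, 'Alice', 'Sample data 1');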

Develop tasks by using a JAR package

In this example, a DataWorks Shell node is used to submit JAR jobs to the Lindorm Spark compute engine by running curl commands.

Template for submitting a JAR job

# Submit a job to Lindorm. 
curl --location --request POST 'http://ld-uf6y6d74hooeb****-proxy-ldps.lindorm.aliyuncs.com:10099/api/v1/lindorm/jobs/xxxxxx' --header 'Content-Type:application/json' --data '{
  "owner": "root",
  "name": "LindormSQL",
  "mainResourceKind": "jar",
  "mainClass": "your_project_main_class",
  "mainResource": "hdfs:///ldps-user-resource/ldps-lindorm-spark-examples-1.0-SNAPSHOT.jar",
  "mainArgs": [],
  "conf": {
  }
}'

The following list describes the key parameters.

  • URL: the endpoint for submitting JAR jobs in LDPS. You can obtain the endpoint in the Lindorm console. For more information, see View endpoints. The URL is in the format of http://ld-uf6y6d74hooeb****-proxy-ldps.lindorm.aliyuncs.com:10099/api/v1/lindorm/jobs/xxxxxx.

    • http://ld-uf6y6d74hooeb****-proxy-ldps.lindorm.aliyuncs.com:10099: the virtual private cloud (VPC) endpoint.

    • xxxxxx: the token.

  • mainClass: the class that is used as the entry point of your program in the JAR job.

  • mainResource: the path in Hadoop Distributed File System (HDFS) where the JAR package is stored.

  • mainArgs: the arguments that are passed to the class specified by mainClass.

  • conf: the Spark system parameters. For more information, see Configure parameters for jobs.

JAR job example

  1. Develop a JAR job.

    You can follow the instructions described in Create a job in Java to develop a JAR job. The system also provides a prebuilt JAR package named ldps-lindorm-spark-examples-1.0-SNAPSHOT.jar, which you can directly use for testing.

  2. Upload the JAR package to HDFS.

    Log on to the Lindorm console and upload the JAR package to HDFS. For more information, see Upload files in the Lindorm console.

  3. Submit the job and view job details in DataWorks.

    1. Create a Shell node.

      Right-click the name of the created workflow and choose Create Node > General > Shell. In the Create Node dialog box, configure the Name parameter and click Confirm.

    2. Write task code on the configuration tab of the node.

      • Run a cURL command to submit the job. Sample command:

        curl --location --request POST http://ld-bp19xymdrwxxxxx-proxy-ldps-pub.lindorm.aliyuncs.com:10099/api/v1/lindorm/jobs/xxxx --header "Content-Type:application/json" --data '{
        "owner":"root",
        "name":"LindormSQL",
        "mainResourceKind":"jar",
        "mainClass":"com.aliyun.lindorm.ldspark.examples.SimpleWordCount",
        "mainResource":"hdfs:///ldps-user-resource/ldps-lindorm-spark-examples-1.0-SNAPSHOT.jar",
        "mainArgs":[],
        "conf":{
        }
        }'
      • View job details. Sample command:

        curl --request GET 'http://ld-bp19xymdrwxxxxx-proxy-ldps-pub.lindorm.aliyuncs.com:10099/api/v1/lindorm/jobs/xxxx'

Step 2: Configure task scheduling properties

If you want to periodically run tasks on created nodes, click Properties in the right-side navigation pane of the node configuration tab to configure the scheduling information of the nodes based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

Step 3: Debug task code

You can perform the following operations to check whether a task is configured as expected.

Note

You can also click the Run icon on the configuration tab of a workflow to debug the code of all tasks in the workflow.

  1. Optional. Select a resource group and assign custom parameters to variables.

    • Click the Advanced Run icon in the top toolbar of the configuration tab of a node. In the Parameters dialog box, select the resource group for scheduling that you want to use to debug and run the task code.

    • If you use scheduling parameters in your task code, assign values to the variables for debugging. For more information about the value assignment logic of scheduling parameters, see Debugging procedure.

  2. Save and run task code.

    In the top toolbar, click the Save icon to save the task code, and then click the Run icon to run the task code.

  3. Optional. Perform smoke testing.

    You can perform smoke testing on the task in the development environment to check whether the task is run as expected when you commit the task or after you commit the task. For more information, see Perform smoke testing.

Step 4: Commit and deploy tasks

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

Note

You can also click the Commit icon on the configuration tab of a workflow to commit all tasks in the workflow.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Commit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

What to do next

Task O&M

After you commit and deploy a task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.

Data management

In DataWorks Data Map, you can collect the metadata of tables that are created by using LDPS to DataWorks for management in a centralized manner.

  1. Collect metadata: Before you can view metadata in Data Map, you must first create a metadata crawler. For more information, see Create and manage CDH Hive sampling crawlers.

  2. View metadata: In Data Map, you can view the basic information and field information of metadata. For more information, see the "View the details of a table" section of the MaxCompute table data topic.