Lindorm Distributed Processing System (LDPS) is compatible with Cloudera's Distribution Including Apache Hadoop (CDH). You can register a CDH cluster to DataWorks and configure the connection information of LDPS for the cluster. This way, you can perform operations such as interactive SQL queries, SQL task development, and JAR task execution in DataWorks based on LDPS. This topic describes how to register a CDH cluster to DataWorks to access LDPS and how to develop, schedule, and perform O&M on tasks based on LDPS.
Background information
LDPS is a distributed computing service that is provided based on a cloud-native architecture. It supports community-compatible computing models, is compatible with the Spark API, and is deeply integrated with the features of the Lindorm storage engine. LDPS makes full use of the underlying storage features and indexing capabilities to complete distributed computing tasks in an efficient manner. LDPS is suitable for scenarios such as the production of large amounts of data, interactive analytics, machine learning, and graph computing.
Prerequisites
Before you develop tasks in DataWorks based on LDPS, make sure that the following operations are performed:
A Lindorm instance is created and LDPS is activated for the instance. This is required because tasks in DataWorks are developed based on the compute capabilities of LDPS. For more information about how to activate LDPS, see Activate LDPS and modify the configurations.
A CDH cluster is created and registered to DataWorks. DataWorks allows you to access LDPS by registering a CDH cluster to DataWorks. For more information about how to register a CDH cluster to DataWorks, see Register a CDH or CDP cluster to DataWorks.
When you register the CDH cluster, you must set the version of the CDH cluster to 6.3.2 and specify the connection information of LDPS for the cluster. In this case, you need to configure only the HiveServer2 and Metastore parameters. Other parameters can be left empty. The general format of these parameters is sketched after this list.
A workflow is created. Development operations in different types of compute engines are performed based on workflows in DataStudio. You can orchestrate tasks on nodes in workflows based on your business requirements to view dependencies between the tasks. For information about how to create a workflow, see Create a workflow.
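The following sample values are a sketch of the general format of the LDPS connection parameters that you specify when you register the CDH cluster. The endpoint, port, and database name are placeholders, not real values. Obtain the actual values from the Lindorm console as described in the preceding topics.

HiveServer2: jdbc:hive2://<LDPS endpoint>:<port>/<database>
Metastore: thrift://<LDPS endpoint>:<port>

The jdbc:hive2:// and thrift:// prefixes are the standard URI formats for HiveServer2 and Hive Metastore connections.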
Step 1: Develop tasks based on LDPS
This section describes how to develop tasks by executing SQL statements or using a JAR package.
Develop tasks by executing SQL statements
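The following Spark SQL statements are a minimal sketch of an SQL task that runs on LDPS. The database name ldps_demo and the table name user_log are hypothetical and are used only for illustration.

-- Create a hypothetical database and table, write one row, and run a simple aggregation.
CREATE DATABASE IF NOT EXISTS ldps_demo;
CREATE TABLE IF NOT EXISTS ldps_demo.user_log (
  user_id STRING,
  event STRING,
  ds STRING
);
INSERT INTO ldps_demo.user_log VALUES ('u001', 'click', '20240101');
SELECT ds, COUNT(*) AS pv
FROM ldps_demo.user_log
GROUP BY ds;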
Develop tasks by using a JAR package
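The following Scala code is a minimal sketch of a Spark application that can be packaged into a JAR package and run as a JAR task. The object name LdpsDemoJob and the table ldps_demo.user_log are hypothetical. How the JAR package is uploaded and referenced depends on your node configuration.

import org.apache.spark.sql.SparkSession

// A minimal Spark application that is packaged into a JAR package and run as a JAR task.
object LdpsDemoJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LdpsDemoJob")
      .enableHiveSupport() // Read tables that are managed by the configured metastore.
      .getOrCreate()

    // Query the hypothetical table that is used in the SQL example.
    val result = spark.sql("SELECT ds, COUNT(*) AS pv FROM ldps_demo.user_log GROUP BY ds")
    result.show()

    spark.stop()
  }
}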
Step 2: Configure task scheduling properties
If you want to periodically run tasks on created nodes, click Properties in the right-side navigation pane of the node configuration tab to configure the scheduling information of the nodes based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.
Step 3: Debug task code
You can perform the following operations to check whether a task is configured as expected.
You can also click the Run icon on the configuration tab of a workflow to debug the code of all tasks in the workflow.
Optional. Select a resource group and assign custom parameters to variables.
Click the Run with Parameters icon in the top toolbar of the configuration tab of a node. In the Parameters dialog box, select the resource group for scheduling that you want to use to debug and run the task code.
If you use scheduling parameters in your task code, assign values to the variables of the scheduling parameters for code debugging. For more information about the value assignment logic of scheduling parameters, see Debugging procedure. An example of how a scheduling parameter can be referenced in task code is provided after this procedure.
Save and run task code.
In the top toolbar, click the Save icon to save the task code and click the Run icon to run the task code.
Optional. Perform smoke testing.
You can perform smoke testing on the task in the development environment to check whether the task is run as expected when you commit the task or after you commit the task. For more information, see Perform smoke testing.
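For example, the following Spark SQL snippet is a sketch of how a scheduling parameter can be referenced in task code. The variable name bizdate and the table ldps_demo.user_log are hypothetical. Declare the variable on the Properties tab, for example, in the form of bizdate=$bizdate, before you debug the code.

-- ${bizdate} is replaced with the value that is assigned to the bizdate variable.
SELECT ds, COUNT(*) AS pv
FROM ldps_demo.user_log
WHERE ds = '${bizdate}'
GROUP BY ds;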
Step 4: Commit and deploy tasks
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
You can also click the Submit icon on the configuration tab of a workflow to commit all tasks in the workflow.
Click the Save icon in the top toolbar to save the task.
Click the Submit icon in the top toolbar to commit the task.
In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you can commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
Task O&M
After you commit and deploy a task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.
Data management
You can collect the metadata of tables that are created by using LDPS to DataWorks Data Map and manage the metadata in a centralized manner.
Collect metadata: Before you can view metadata in Data Map, you must create a metadata crawler. For more information, see Create and manage CDH Hive sampling crawlers.
View metadata: In Data Map, you can view the basic information and field information of metadata. For more information, see the "View the details of a table" section of the MaxCompute table data topic.