Spark SQL nodes allow you to use a distributed SQL query engine to process structured data, which improves job execution efficiency. DataWorks provides CDH Spark SQL nodes that you can use to develop and periodically schedule CDH Spark SQL tasks and integrate the tasks with other types of tasks. This topic describes how to create and use a CDH Spark SQL node.
Prerequisites
A workflow is created in DataStudio.
Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.
An Alibaba Cloud CDH cluster is created and registered to DataWorks.
Before you create a CDH node and use the CDH node to develop a CDH task in DataWorks, you must register a CDH cluster to a DataWorks workspace. For more information, see Register a CDH or CDP cluster to DataWorks.
A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.
Limits
Tasks on CDH Spark SQL nodes can be run on serverless resource groups or old-version exclusive resource groups for scheduling. We recommend that you run tasks on serverless resource groups.
Step 1: Create a CDH Spark SQL node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose the entry for DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
On the DataStudio page, find the desired workflow, right-click the workflow name, and then choose the option to create a CDH Spark SQL node. In the Create Node dialog box, configure the Name parameter and click Confirm. Then, you can use the created node to develop and configure CDH Spark SQL tasks.
Step 2: Develop a CDH Spark SQL task
(Optional) Select a CDH cluster
If multiple Cloudera's Distribution including Apache Hadoop (CDH) clusters are registered to the current workspace, you must select a cluster from the Engine Instance CDH drop-down list based on your business requirements. If only one CDH cluster is registered to the current workspace, the CDH cluster is automatically used for development.
Develop SQL code
Develop SQL code: Simple example
In the code editor on the configuration tab of the CDH Spark SQL node, write task code.
In this example, you can create the test_lineage_table_f1 and test_lineage_table_t2 tables in the test_spark database and copy data from the test_lineage_table_f1 table to the test_lineage_table_t2 table. Sample code:
This example is for reference only. You can write code based on your business requirements.
-- Create the source table, which is partitioned by the ds column.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_f1 (`id` BIGINT, `name` STRING)
PARTITIONED BY (`ds` STRING);
-- Create the destination table based on the schema and data of the source table.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_t2 AS SELECT * FROM test_spark.test_lineage_table_f1;
-- Copy data from the source table to the destination table.
INSERT INTO test_spark.test_lineage_table_t2 SELECT * FROM test_spark.test_lineage_table_f1;
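After the task runs, you can verify the copy by querying the destination table. The following statement is a minimal check that you can run in the same node:
-- Count the rows in the destination table to confirm that data was copied.
SELECT COUNT(*) FROM test_spark.test_lineage_table_t2;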
Develop SQL code: Use scheduling parameters
DataWorks provides scheduling parameters whose values are dynamically replaced in the code of a task based on the configurations of the scheduling parameters in periodic scheduling scenarios. You can define variables in the task code in the ${Variable} format and assign values to the variables in the Scheduling Parameter section of the Properties tab. For information about the supported formats of scheduling parameters and how to configure scheduling parameters, see Supported formats of scheduling parameters and Configure and use scheduling parameters.
Sample code:
SELECT '${var}'; -- You can assign a specific scheduling parameter to the var variable.
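As a slightly fuller sketch that builds on the tables created earlier, you could read and write a partition based on the scheduling date. This example assumes that you define a variable named ds_var on the Properties tab and assign it a date expression such as $[yyyymmdd-1]:
-- ds_var is a hypothetical variable that is assigned a date expression, such as $[yyyymmdd-1], in the Scheduling Parameter section.
INSERT INTO test_spark.test_lineage_table_f1 PARTITION (ds = '${ds_var}') VALUES (1, 'test');
SELECT * FROM test_spark.test_lineage_table_f1 WHERE ds = '${ds_var}';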
(Optional) Configure advanced parameters
In the code editor on the configuration tab of the CDH Spark SQL node, click Advanced Settings in the right-side navigation pane to configure advanced parameters. The following examples show how advanced parameters can be configured:
"spark.driver.memory": "2g": specifies the memory size allocated to the Spark driver.
"spark.yarn.queue": "haha": specifies the YARN queue to which the application is submitted.
For more information about how to configure advanced parameters, see Spark Configuration.
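In addition to the Advanced Settings panel, Spark SQL itself provides a SET statement for session-scoped SQL configurations. The following sketch assumes that session-level SET statements are passed through to your CDH cluster; properties that must be set before the driver starts, such as spark.driver.memory, still need to be configured in Advanced Settings:
-- Session-scoped Spark SQL configuration. Whether this takes effect depends on how your cluster runs the SQL.
SET spark.sql.shuffle.partitions=200;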
Step 3: Configure task scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
Step 4: Debug task code
You can perform the following operations to check whether the task is configured as expected based on your business requirements:
Optional. Select a resource group and assign custom parameters to variables.
Click the icon in the top toolbar of the configuration tab of the node. In the Parameters dialog box, select a resource group for scheduling that you want to use to debug and run task code.
If you use scheduling parameters in your task code, assign values to the corresponding variables for debugging. For more information about the value assignment logic of scheduling parameters, see Debugging procedure.
Save and run task code.
In the top toolbar, click the icon to save task code. Then, click the icon to run task code.
Optional. Perform smoke testing.
You can perform smoke testing on the task in the development environment to check whether the task is run as expected when you commit the task or after you commit the task. For more information, see Perform smoke testing.
Step 5: Commit and deploy the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the icon in the top toolbar to save the task.
Click the icon in the top toolbar to commit the task.
In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy tasks.
What to do next
Task O&M: After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.
View lineages: After you commit and deploy the task, you can view the lineages of the task on the Data Map page. For example, you can view the source of the original data and the database to which the table data flows. Then, you can analyze the impacts of different levels of lineages based on your business requirements. For more information, see View lineages.