AnalyticDB: Use a Spark distributed SQL engine in DataWorks

Last Updated: Aug 30, 2024

DataWorks allows you to register Cloudera's Distribution Including Apache Hadoop (CDH) clusters and develop Spark SQL jobs on the registered clusters. By using the same registration entry, you can register a Spark distributed SQL engine of AnalyticDB for MySQL as an SQL-supported execution engine and then use the engine to develop Spark SQL jobs in DataWorks.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster and a DataWorks workspace are created. For more information, see Create a Data Lakehouse Edition cluster and Create a workspace.

  • A job resource group is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster.

  • A general-purpose resource group or an old-version exclusive resource group for scheduling is purchased. General-purpose resource groups are the new version of resource groups, and we recommend that you purchase this type.

    By default, a DataWorks resource group is not connected to the networks of other cloud services after it is purchased. You must connect the CDH cluster to a specific resource group before you can use the cluster.

    Note
    • DataWorks provides a new version of resource groups for general purposes, and we recommend that you purchase this type of resource group. General-purpose resource groups are suitable for scenarios in which different task types are used, such as data synchronization and task scheduling. For information about how to purchase a general-purpose resource group, see Create and use a general-purpose resource group. Users who have not activated DataWorks of any edition in the current region can purchase only general-purpose resource groups.

    • If you have purchased an old-version exclusive resource group for scheduling, you can also use the resource group to run CDH or CDP tasks. For more information, see Create and use an exclusive resource group for scheduling.

Background information

DataWorks can be connected to CDH clusters. You can register a CDH cluster to DataWorks and perform data development and governance operations, such as task development, task scheduling, data map management (metadata management), and data quality management. You can register a Spark distributed SQL engine of AnalyticDB for MySQL as an SQL-supported execution engine and use the Spark distributed SQL engine to develop Spark SQL jobs in DataWorks.

An AnalyticDB for MySQL cluster allows you to start multiple Spark distributed SQL engines. Based on your business requirements, you can start engines that use different configurations and register each engine to DataWorks.
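For example, the following sketch shows two SQLEngine jobs that you might submit separately in the Spark editor that is described in Step 1. The application names, resource specifications, and executor counts are illustrative assumptions, not values that are required by this topic.

    -- Hedged sketch: submit each of the following CONF blocks as a separate SQLEngine job.
    -- Engine 1 (hypothetical): a small engine for ad-hoc queries that uses the default TTL.
    CONF spark.app.name=SQLEngine-adhoc;
    CONF spark.driver.resourceSpec=small;
    CONF spark.executor.instances=1;
    CONF spark.executor.resourceSpec=small;
    CONF spark.sql.hive.metastore.version=adb;

    -- Engine 2 (hypothetical): a larger engine for scheduled jobs that must stay available,
    -- so a long TTL is configured to prevent the engine from being destroyed between runs.
    CONF spark.app.name=SQLEngine-scheduled;
    CONF spark.driver.resourceSpec=medium;
    CONF spark.executor.instances=4;
    CONF spark.executor.resourceSpec=medium;
    CONF spark.sql.hive.metastore.version=adb;
    CONF spark.adb.sessionTTLSeconds=2147483647;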

Procedure

Step 1: Start a Spark distributed SQL engine

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. In the upper part of the Spark editor, select a job resource group and the SQLEngine job type.

  4. Run the following code in the Spark editor:

    Important

    DataWorks does not support Spark distributed SQL engines for which usernames and passwords are configured. When you start a Spark distributed SQL engine, do not specify the spark.kubernetes.driverEnv.HIVE_SERVER2_USER and spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD parameters.

    -- This is just an example of starting a Spark SQLEngine. Modify the content before you run your Spark job.
    CONF spark.driver.resourceSpec=medium;
    CONF spark.executor.instances=1;
    CONF spark.executor.resourceSpec=small;
    CONF spark.app.name=Spark SQLEngine;
    CONF spark.sql.hive.metastore.version=adb;
    CONF spark.adb.sessionTTLSeconds=2147483647;

    The following list describes the parameters.

    spark.app.name
    Required: Yes
    Description: The name of the Spark application.

    spark.sql.hive.metastore.version
    Required: Yes
    Description: The version of the metastore service. Valid values:
    • adb: the version of AnalyticDB for MySQL.
    • <hive_version>: the version of the Hive Metastore.

    Note
    • For information about the Hive versions that are supported by Apache Spark, see Spark Configuration.
    • To access a self-managed Hive Metastore, you can replace the default configuration with the standard Apache Spark configuration. For more information, see Spark Configuration and the configuration sketch at the end of this step.

    spark.adb.sessionTTLSeconds
    Required: No
    Description: The time to live (TTL) of the Spark distributed SQL engine. Unit: seconds. Default value: 1200. The default value specifies that the Spark distributed SQL engine is automatically destroyed 1,200 seconds after the last SQL code block is executed.

    Important
    • When you restart a Spark distributed SQL engine that was destroyed, a new connection URL is generated for the engine.
    • If you want to use the engine in DataWorks for a long period of time, set the spark.adb.sessionTTLSeconds parameter to 2147483647 to prevent the engine from being automatically destroyed after an extended period of idle time.

    For information about other parameters, see Spark application configuration parameters.
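    The following sketch shows how the startup configuration might look if you want the engine to use a self-managed Hive Metastore instead of the AnalyticDB for MySQL metadata service. The Hive Metastore version (2.3.9) and the thrift address are placeholders that you must replace with the values of your own environment, and additional parameters (for example, network settings) might be required depending on your environment.

    -- Hedged sketch: connect the engine to a self-managed Hive Metastore.
    -- 2.3.9 and thrift://192.168.XX.XX:9083 are placeholder values.
    CONF spark.driver.resourceSpec=medium;
    CONF spark.executor.instances=1;
    CONF spark.executor.resourceSpec=small;
    CONF spark.app.name=Spark SQLEngine;
    CONF spark.sql.hive.metastore.version=2.3.9;
    CONF spark.hadoop.hive.metastore.uris=thrift://192.168.XX.XX:9083;
    CONF spark.adb.sessionTTLSeconds=2147483647;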

Step 2: Obtain the endpoint-based URL of the Spark distributed SQL engine

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. On the Applications tab, find the Spark application that you want to manage and click Details in the Actions column to obtain the endpoint-based URL of the Spark distributed SQL engine, which is the value of the Spark JDBC Public Connect URL parameter.

Step 3: Register the Spark distributed SQL engine to DataWorks

  1. Go to the Management Center page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Management Center in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box that appears, click CDH to go to the cluster registration page.

  3. On the Register CDH Cluster page, configure the parameters that are described in the following list.

    Display Name Of Cluster
    Description: The name of the cluster in DataWorks. The name must be unique within the current tenant.

    Cluster Version
    Description: The version of the cluster that you want to register. Select CDH 6.3.2.

    Cluster Name
    Description: The name of the cluster that you want to register. This parameter determines the source of the configuration information that is required when you register the cluster. You can select a cluster that is registered to another DataWorks workspace or create a cluster.
    • If you select a cluster that is registered to another DataWorks workspace, you can reference the configuration information of that cluster.
    • If you create a cluster, you must configure the cluster before you can register it.

    Version
    Description: The version of Hive. Select 2.1.1.

    HiveServer2
    Description: The URL of HiveServer2. Enter the endpoint-based URL of the Spark distributed SQL engine that you obtained in the "Step 2: Obtain the endpoint-based URL of the Spark distributed SQL engine" section of this topic.
    Important: Replace {database name} in the endpoint-based URL with an actual database name.

    Metastore
    Description: The URL of Hive Metastore. Enter the endpoint-based URL of the Spark distributed SQL engine that you obtained in the "Step 2: Obtain the endpoint-based URL of the Spark distributed SQL engine" section of this topic.
    Important: Replace {database name} in the endpoint-based URL with an actual database name.

    Default Access Identity
    Description: The account that you want to use to access the CDH cluster in DataWorks. For more information, see the "Configure the default access identity for the cluster" section of the Register a CDH or CDP cluster to DataWorks topic.

  4. Click Complete.

Step 4: Use the Spark distributed SQL engine to execute SQL statements in DataWorks

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Modeling and Development > DataStudio in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create a workflow.

    You can use one of the following methods to create a workflow:

    • Method 1: Move the pointer over the Create icon and click Create Workflow.

    • Method 2: Right-click Business Flow in the Scheduled Workflow pane and select Create Workflow.

  3. In the Create Workflow dialog box, specify a workflow name and a description and click Create.

    In this example, a workflow named Sparktest is created.

  4. Create a node.

    1. In the left-side navigation pane, choose Business Flow > Sparktest > CDH to expand the CDH directory.

    2. Right-click Data Analytics and choose Create Node > CDH Hive from the shortcut menu.

    3. In the Create Node dialog box, configure parameters such as the engine instance and node name.

    4. Click Confirm.

  5. After you create the node, enter an SQL statement in the right-side code editor and click the Run icon to execute the statement.

    SHOW TABLES;

    Sample result:

    +-----------+-----------+-------------+
    | namespace | tableName | isTemporary |
    +-----------+-----------+-------------+
    |    db     |   test    |    false    |
    +-----------+-----------+-------------+
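
    If the query succeeds, you can continue to develop Spark SQL jobs in the same node. The following sketch shows follow-up statements that you might run; the database name adb_demo and the table name test_tbl are hypothetical, and depending on your environment a LOCATION clause that points to an OSS path may be required when you create the database or table.

    -- Hedged sketch of follow-up statements; adb_demo and test_tbl are hypothetical names.
    -- Depending on your environment, CREATE DATABASE/TABLE may require a LOCATION clause.
    CREATE DATABASE IF NOT EXISTS adb_demo;
    CREATE TABLE IF NOT EXISTS adb_demo.test_tbl (id INT, name STRING);
    INSERT INTO adb_demo.test_tbl VALUES (1, 'spark');
    SELECT * FROM adb_demo.test_tbl;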