
AnalyticDB: Use a Spark distributed SQL engine to develop Spark SQL jobs

Last Updated: Aug 30, 2024

If you want to analyze data in real time or use the Java Database Connectivity (JDBC) API to develop Spark SQL jobs, you can use a Spark distributed SQL engine that is provided by AnalyticDB for MySQL. Spark distributed SQL engines allow you to use SQL statements to analyze, process, and query data and improve SQL execution efficiency. This topic describes how to start and connect to a Spark distributed SQL engine.

Overview

AnalyticDB for MySQL Spark provides a managed service for the open source Spark distributed SQL engine, which you can use to develop Spark SQL jobs. A Spark distributed SQL engine is a distributed SQL execution engine that is built into Spark and supports the JDBC API. You can use the JDBC API to connect to a Spark distributed SQL engine from clients that support the standard Hive ThriftServer.

You can start multiple Spark distributed SQL engines. The engines run in mutually isolated execution environments and have independent URLs, but they share the same network bandwidth. For this reason, multiple Spark distributed SQL engines are not suitable for importing large amounts of data. For more information about Spark distributed SQL engines, see Distributed SQL Engine.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • A job resource group is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster.

  • A Java 8 development environment is installed. It is required to run the Beeline client and the Java application that are used in this topic.

  • The IP address of the client is added to a whitelist of the AnalyticDB for MySQL cluster. For more information, see Configure a whitelist.

Start a Spark distributed SQL engine

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. In the upper part of the Spark editor, select a job resource group and the SQLEngine job type.

  4. Run the following code in the Spark editor:

    -- This is only an example of how to start a Spark SQL engine. Modify the configuration values before you run the job.
    CONF spark.adb.eni.enabled=true;
    CONF spark.adb.eni.vswitchId=vsw-bp1wh6lp1abs3fi0****;
    CONF spark.adb.eni.securityGroupId=sg-bp15ncw8v0624xe****;
    CONF spark.driver.resourceSpec=medium;
    CONF spark.executor.instances=1;
    CONF spark.executor.resourceSpec=small;
    CONF spark.app.name=Spark SQLEngine;
    CONF spark.sql.hive.metastore.version=adb;
    CONF spark.kubernetes.driverEnv.HIVE_SERVER2_USER=AdbSpark14****;
    CONF spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD=Spark23****;
    CONF spark.adb.sessionTTLSeconds=1200;

    The following table describes the parameters.

    | Parameter | Required | Description |
    | --- | --- | --- |
    | spark.adb.eni.enabled | Yes, if specific conditions are met | Specifies whether to enable the Elastic Network Interface (ENI) feature. If you use an IP address-based URL to connect to the Spark distributed SQL engine, set this parameter to true. If you use an endpoint-based URL, you do not need to configure this parameter. |
    | spark.adb.eni.vswitchId | Yes, if specific conditions are met | The ID of the vSwitch that is associated with the ENI. If you use an IP address-based URL, set this parameter to the vSwitch ID of the virtual private cloud (VPC) in which the client resides. If you use an endpoint-based URL, you do not need to configure this parameter. |
    | spark.adb.eni.securityGroupId | Yes, if specific conditions are met | The ID of the security group that is associated with the ENI. If you use an IP address-based URL, set this parameter to the security group ID of the VPC in which the client resides. If you use an endpoint-based URL, you do not need to configure this parameter. |
    | spark.app.name | Yes | The name of the Spark application. |
    | spark.sql.hive.metastore.version | Yes | The version of the metastore service. Valid values: adb (the built-in metastore of AnalyticDB for MySQL) and <hive_version> (the version of a Hive Metastore). For information about the Hive versions that are supported by Apache Spark, see Spark Configuration. To access a self-managed Hive Metastore, you can replace the default configuration with the standard Apache Spark configuration. For more information, see Spark Configuration. |
    | spark.kubernetes.driverEnv.HIVE_SERVER2_USER | No | The username of the Spark distributed SQL engine. The username must contain uppercase letters, lowercase letters, and digits. No limit is imposed on the length of the username. |
    | spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD | No | The password of the Spark distributed SQL engine. The password must contain uppercase letters, lowercase letters, and digits. No limit is imposed on the length of the password. |
    | spark.adb.sessionTTLSeconds | No | The time to live (TTL) of the Spark distributed SQL engine. Unit: seconds. Default value: 1200. By default, the engine is automatically destroyed 1,200 seconds after the last SQL code block finishes. After the engine is destroyed, the status of the Spark JAR job changes to Completed. |

    For information about other parameters, see Spark application configuration parameters.
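
    If you plan to connect only by using an endpoint-based URL, the ENI-related parameters can be omitted. The following block is a minimal sketch under that assumption; the resource specifications, username, and password are placeholder values that you must replace with your own.

    CONF spark.driver.resourceSpec=medium;
    CONF spark.executor.instances=1;
    CONF spark.executor.resourceSpec=small;
    CONF spark.app.name=Spark SQLEngine;
    CONF spark.sql.hive.metastore.version=adb;
    CONF spark.kubernetes.driverEnv.HIVE_SERVER2_USER=TestUser14****;
    CONF spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD=TestPass23****;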

  5. Click Run Now.

    Important

    You can view the status of the Spark JAR job on the Applications tab. You can connect to the Spark distributed SQL engine only when the Spark JAR job is in the Running state.

Connect to a Spark distributed SQL engine

You can connect to a Spark distributed SQL engine from clients that support the standard Hive ThriftServer by using an IP address-based URL or an endpoint-based URL. For information about the clients, see HiveServer2 Clients.

Usage notes for connecting to a Spark distributed SQL engine by using an IP address-based URL or an endpoint-based URL:

  • (Recommended) Connect to a Spark distributed SQL engine by using an IP address-based URL: If you set the spark.adb.eni.enabled parameter to true to enable the ENI feature, and the client resides in the same VPC as the ENI, you can use an IP address-based URL to connect to the Spark distributed SQL engine. In this case, you must obtain the IP address-based URL of the engine. If the client resides in a different VPC, change the vSwitch ID that is specified by the spark.adb.eni.vswitchId parameter to a vSwitch in that VPC and restart the Spark distributed SQL engine.

  • Connect to a Spark distributed SQL engine by using an endpoint-based URL: If the client and the AnalyticDB for MySQL cluster reside in the same VPC or you want to establish a connection over the Internet, you can use an endpoint-based URL to connect to the Spark distributed SQL engine. In this case, you must obtain the endpoint-based URL of the Spark distributed SQL engine.

Step 1: Obtain the connection URL of the Spark distributed SQL engine

Obtain the IP address-based URL of the Spark distributed SQL engine

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. On the Applications tab, find the Spark application that you want to manage and click Log in the Actions column to obtain the IP address-based URL of the Spark distributed SQL engine.

    jdbc:hive2://172.XX.XX.XX:10001/<db>;transportMode=http;httpPath=s202401301607hz**********.cliservice

Obtain the endpoint-based URL of the Spark distributed SQL engine

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. On the Applications tab, find the Spark application that you want to manage and click Details in the Actions column to obtain the endpoint-based URL of the Spark distributed SQL engine.

    • If you want to connect to the Spark distributed SQL engine over the Internet, you must use the endpoint-based URL that corresponds to the Spark JDBC Public Connect URL parameter.

    • If you want to connect to the Spark distributed SQL engine over a VPC, you must use the endpoint-based URL that corresponds to the Spark JDBC VPC Connect URL parameter.

    Important
    • The first time you start a Spark distributed SQL engine, the system requires approximately 1 minute to initialize the network. After the network is initialized, you can obtain the endpoint-based URL.

    • An endpoint-based URL cannot be generated while a public endpoint is being applied for or released. If no endpoint-based URL is generated after you restart the Spark distributed SQL engine multiple times, submit a ticket.

Step 2: Connect to the Spark distributed SQL engine

In this example, a Beeline client and a Java application are used to connect to the Spark distributed SQL engine.

Use a Beeline client to connect to the Spark distributed SQL engine

  1. Connect to the Spark distributed SQL engine.

    Syntax:

    !connect <Connection URL> <Username> <Password>

    • Connection URL: the connection URL of the Spark distributed SQL engine that is obtained in Step 1.

    • Username: the username of the Spark distributed SQL engine.

    • Password: the password of the Spark distributed SQL engine.

    Example:

    !connect jdbc:hive2://amv-bp1c3em7b2e****-spark.ads.aliyuncs.com:10001/adb_test;transportMode=http;httpPath=s202401301607hz**********.cliservice AdbSpark14**** Spark23****

    Important
    • If you set the spark.sql.hive.metastore.version parameter to adb when you start the Spark distributed SQL engine, you must include an existing database of the AnalyticDB for MySQL cluster in the connection URL. If you do not include an existing database in the connection URL, the default database is used and the connection fails.

    • When network jitter occurs, the Thrift server may return HTTP status code 509, which indicates that access is denied, and it does not automatically retry the connection. If this occurs, rerun the connection command.

    • You cannot modify or delete the transportMode=http;httpPath=s202401301607hzdf852**********.cliservice parameter that is automatically generated. You can append additional configuration parameters to the end of the connection URL.

    Sample result:

    Connected to: Spark SQL (version 3.2.0)
    Driver: Hive JDBC (version 2.3.9)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
  2. Use the Spark distributed SQL engine to execute a Spark SQL statement.

    SHOW TABLES;
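
    In addition to SHOW TABLES, you can run any Spark SQL statement through the same connection. The following statements are an illustrative sketch; the adb_test database and test table are hypothetical names that assume these objects already exist in your cluster.

    SHOW DATABASES;
    SELECT * FROM adb_test.test LIMIT 10;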

Use a Java application to connect to the Spark distributed SQL engine

  1. Configure environment variables.

    export HIVEURL="jdbc:hive2://amv-bp1c3em7b2e****-spark.ads.aliyuncs.com:10001/adb_test;transportMode=http;httpPath=s202401301607hz**********.cliservice"
    export HIVEUSER="AdbSpark14****"
    export HIVEPASSWORD="Spark23****"
    • HIVEURL: the connection URL of the Spark distributed SQL engine that is obtained in Step 1.

    • HIVEUSER: the username of the Spark distributed SQL engine.

    • HIVEPASSWORD: the password of the Spark distributed SQL engine.

  2. Add Maven dependencies to the pom.xml file. Sample code:

        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-jdbc</artifactId>
          <version>2.3.9</version>
        </dependency>
  3. Connect to the Spark distributed SQL engine and execute a Spark SQL statement.

    package com.aliyun;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TestHive {
        public static void main(String[] args) throws Exception {
            String hiveUrl = System.getenv("HIVEURL");
            String username = System.getenv("HIVEUSER");
            String password = System.getenv("HIVEPASSWORD");
            // Load the Hive JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Connect to the Spark distributed SQL engine by using the connection URL that is obtained in Step 1.
            Connection con = DriverManager.getConnection(hiveUrl, username, password);
            Statement stmt = con.createStatement();
            // Execute the SHOW TABLES statement.
            ResultSet tables = stmt.executeQuery("SHOW TABLES");
            while (tables.next()) {
                System.out.println(tables.getString("tableName"));
            }
            // Query table data. JDBC result set columns are 1-based.
            ResultSet rows = stmt.executeQuery("SELECT * FROM test");
            while (rows.next()) {
                System.out.println(rows.getString(1));
            }
        }
    }
    Important
    • If you set the spark.sql.hive.metastore.version parameter to adb when you start the Spark distributed SQL engine, you must include an existing database of the AnalyticDB for MySQL cluster in the connection URL. If you do not include an existing database in the connection URL, the default database is used and the connection fails.

    • When network jitter occurs, the Thrift server may return HTTP status code 509, which indicates that access is denied, and it does not automatically retry the connection. If this occurs, re-establish the connection from your application.

    • You cannot modify or delete the transportMode=http;httpPath=s202401301607hzdf852**********.cliservice parameter that is automatically generated. You can append additional configuration parameters to the end of the connection URL.
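
    The Hive JDBC URL syntax also allows additional configuration parameters to be appended after the generated httpPath value. The following URL is a hedged sketch only: spark.sql.shuffle.partitions is used purely as an illustrative parameter, and whether a given configuration takes effect at the session level depends on the engine.

    jdbc:hive2://amv-bp1c3em7b2e****-spark.ads.aliyuncs.com:10001/adb_test;transportMode=http;httpPath=s202401301607hz**********.cliservice?spark.sql.shuffle.partitions=200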