If you want to analyze data in real time or use the Java Database Connectivity (JDBC) API to develop Spark SQL jobs, you can use a Spark distributed SQL engine provided by AnalyticDB for MySQL. Spark distributed SQL engines allow you to analyze, process, and query data by using SQL statements and improve SQL execution efficiency. This topic describes how to start and connect to a Spark distributed SQL engine.
Overview
AnalyticDB for MySQL Spark provides a managed service for the open source Spark distributed SQL engine that you can use to develop Spark SQL jobs. A Spark distributed SQL engine is a distributed SQL execution engine that is built into Spark and supports the JDBC API. You can use the JDBC API to connect to a Spark distributed SQL engine from clients that support the standard Hive ThriftServer.
You can start multiple Spark distributed SQL engines. The engines have mutually isolated execution environments and independent URLs, but they share the same network bandwidth. Because of the shared bandwidth, running multiple Spark distributed SQL engines is not suitable for importing large amounts of data. For more information about Spark distributed SQL engines, see Distributed SQL Engine.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster.
A Java 8 development environment is installed. It is required to run the Beeline client and the Java application that are used in this topic.
The IP address of the client is added to a whitelist of the AnalyticDB for MySQL cluster. For more information, see Configure a whitelist.
Start a Spark distributed SQL engine
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
In the upper part of the Spark editor, select a job resource group and the SQLEngine job type.
Run the following code in the Spark editor:
-- This is only an example of how to start a Spark SQL engine. Modify the configuration and then run your Spark program.
CONF spark.adb.eni.enabled=true;
CONF spark.adb.eni.vswitchId=vsw-bp1wh6lp1abs3fi0****;
CONF spark.adb.eni.securityGroupId=sg-bp15ncw8v0624xe****;
CONF spark.driver.resourceSpec=medium;
CONF spark.executor.instances=1;
CONF spark.executor.resourceSpec=small;
CONF spark.app.name=Spark SQLEngine;
CONF spark.sql.hive.metastore.version=adb;
CONF spark.kubernetes.driverEnv.HIVE_SERVER2_USER=AdbSpark14****;
CONF spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD=Spark23****;
CONF spark.adb.sessionTTLSeconds=1200;
The following table describes the parameters.
Parameter
Required
Description
spark.adb.eni.enabled
Yes if specific conditions are met
Specifies whether to enable the Elastic Network Interface (ENI) feature.
If you use an IP address-based URL to connect to the Spark distributed SQL engine, set this parameter to true.
If you use an endpoint-based URL to connect to the Spark distributed SQL engine, you do not need to configure this parameter.
spark.adb.eni.vswitchId
Yes if specific conditions are met
The ID of the vSwitch that is associated with an ENI.
If you use an IP address-based URL to connect to the Spark distributed SQL engine, set this parameter to the vSwitch ID of the virtual private cloud (VPC) in which the client resides.
If you use an endpoint-based URL to connect to the Spark distributed SQL engine, you do not need to configure this parameter.
spark.adb.eni.securityGroupId
Yes if specific conditions are met
The ID of the security group that is associated with an ENI.
If you use an IP address-based URL to connect to the Spark distributed SQL engine, set this parameter to the security group ID of the VPC in which the client resides.
If you use an endpoint-based URL to connect to the Spark distributed SQL engine, you do not need to configure this parameter.
spark.app.name
Yes
The name of the Spark application.
spark.sql.hive.metastore.version
Yes
The version of the metastore service. Valid values:
adb: the version of AnalyticDB for MySQL.
<hive_version>: the version of the Hive Metastore.
Note: For information about the Hive versions that are supported by Apache Spark, see Spark Configuration.
To access a self-managed Hive Metastore, you can replace the default configuration with the standard Apache Spark configuration. For more information, see Spark Configuration.
spark.kubernetes.driverEnv.HIVE_SERVER2_USER
No
The username of the Spark distributed SQL engine. The username must contain uppercase letters, lowercase letters, and digits. No limit is imposed on the length of the username.
spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD
No
The password of the Spark distributed SQL engine. The password must contain uppercase letters, lowercase letters, and digits. No limit is imposed on the length of the password.
spark.adb.sessionTTLSeconds
No
The time to live (TTL) of the Spark distributed SQL engine. Unit: seconds. Default value: 1200. The default value specifies that the Spark distributed SQL engine is automatically destroyed 1,200 seconds after the last SQL code block is executed.
Note: After the Spark distributed SQL engine is destroyed, the status of the Spark JAR job changes to Completed.
For information about other parameters, see Spark application configuration parameters.
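The preceding example enables the ENI feature for IP address-based connections. If you plan to connect only by using an endpoint-based URL, the three spark.adb.eni.* parameters can be omitted. The following configuration is a minimal sketch of such a setup; all values are placeholders that you must replace with your own:
-- A minimal sketch for endpoint-based connections. The ENI parameters are omitted.
-- All resource specifications and credentials below are placeholder values.
CONF spark.driver.resourceSpec=medium;
CONF spark.executor.instances=1;
CONF spark.executor.resourceSpec=small;
CONF spark.app.name=Spark SQLEngine;
CONF spark.sql.hive.metastore.version=adb;
CONF spark.kubernetes.driverEnv.HIVE_SERVER2_USER=AdbSpark14****;
CONF spark.kubernetes.driverEnv.HIVE_SERVER2_PASSWORD=Spark23****;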
Click Run Now.
Important: You can view the status of the Spark JAR job on the Applications tab. You can connect to the Spark distributed SQL engine only when the Spark JAR job is in the Running state.
Connect to a Spark distributed SQL engine
You can connect to a Spark distributed SQL engine from clients that support the standard Hive ThriftServer by using an IP address-based URL or an endpoint-based URL. For information about the clients, see HiveServer2 Clients.
Usage notes for connecting to a Spark distributed SQL engine by using an IP address-based URL or an endpoint-based URL:
(Recommended) Connect to a Spark distributed SQL engine by using an IP address-based URL: If you set the spark.adb.eni.enabled parameter to true to enable the ENI feature and the client resides in the same VPC as the ENI, you can use an IP address-based URL to connect to the Spark distributed SQL engine. In this case, you must obtain the IP address-based URL of the Spark distributed SQL engine. If the client resides in a different VPC, you must change the vSwitch ID that is specified by the spark.adb.eni.vswitchId parameter to a vSwitch in the client's VPC and restart the Spark distributed SQL engine.
Connect to a Spark distributed SQL engine by using an endpoint-based URL: If the client and the AnalyticDB for MySQL cluster reside in the same VPC or you want to establish a connection over the Internet, you can use an endpoint-based URL to connect to the Spark distributed SQL engine. In this case, you must obtain the endpoint-based URL of the Spark distributed SQL engine.
Step 1: Obtain the connection URL of the Spark distributed SQL engine
Obtain the IP address-based URL of the Spark distributed SQL engine
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > Spark JAR Development.
On the Applications tab, find the Spark application that you want to manage and click Log in the Actions column to obtain the IP address-based URL of the Spark distributed SQL engine. Example:
jdbc:hive2://172.XX.XX.XX:10001/<db>;transportMode=http;httpPath=s202401301607hz**********.cliservice
Obtain the endpoint-based URL of the Spark distributed SQL engine
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > Spark JAR Development.
On the Applications tab, find the Spark application that you want to manage and click Details in the Actions column to obtain the endpoint-based URL of the Spark distributed SQL engine.
If you want to connect to the Spark distributed SQL engine over the Internet, use the endpoint-based URL that corresponds to the Spark JDBC Public Connect URL parameter.
If you want to connect to the Spark distributed SQL engine over a VPC, use the endpoint-based URL that corresponds to the Spark JDBC VPC Connect URL parameter.
Important: The first time you start a Spark distributed SQL engine, the system requires approximately 1 minute to initialize the network. After the network is initialized, you can obtain the endpoint-based URL.
While a public endpoint is being applied for or released, an endpoint-based URL cannot be created. If an endpoint-based URL still cannot be created after you start the Spark distributed SQL engine multiple times, submit a ticket.
Step 2: Connect to the Spark distributed SQL engine
In this example, a Beeline client and a Java application are used to connect to the Spark distributed SQL engine.
Use a Beeline client to connect to the Spark distributed SQL engine
Connect to the Spark distributed SQL engine.
Syntax:
!connect <Connection URL> <Username> <Password>
Connection URL: the connection URL of the Spark distributed SQL engine that is obtained in Step 1.
Username: the username of the Spark distributed SQL engine.
Password: the password of the Spark distributed SQL engine.
Example:
!connect jdbc:hive2://amv-bp1c3em7b2e****-spark.ads.aliyuncs.com:10001/adb_test;transportMode=http;httpPath=s202401301607hz**********.cliservice AdbSpark14**** Spark23****
Important: If you set the spark.sql.hive.metastore.version parameter to adb when you start the Spark distributed SQL engine, you must include an existing database of the AnalyticDB for MySQL cluster in the connection URL. If you do not include an existing database, the default database is used and the connection fails.
When network jitter occurs, the ThriftServer HTTP protocol returns the HTTP status code 509, which indicates that access is denied, and does not automatically retry the connection. You must rerun the command to connect to the Spark distributed SQL engine.
You cannot modify or delete the transportMode=http;httpPath=s202401301607hzdf852**********.cliservice parameters that are automatically generated. You can append additional configuration parameters to the end of the connection URL.
Sample result:
Connected to: Spark SQL (version 3.2.0)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Use the Spark distributed SQL engine to execute a Spark SQL statement.
SHOW TABLES;
Use a Java application to connect to the Spark distributed SQL engine
Configure environment variables.
export HIVEURL="jdbc:hive2://amv-bp1c3em7b2e****-spark.ads.aliyuncs.com:10001/adb_test;transportMode=http;httpPath=s202401301607hz**********.cliservice"
export HIVEUSER="AdbSpark14****"
export HIVEPASSWORD="Spark23****"
HIVEURL: the connection URL of the Spark distributed SQL engine that is obtained in Step 1.
HIVEUSER: the username of the Spark distributed SQL engine.
HIVEPASSWORD: the password of the Spark distributed SQL engine.
Add Maven dependencies to the pom.xml file. Sample code:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.3.9</version>
</dependency>
Connect to the Spark distributed SQL engine and execute a Spark SQL statement.
package com.aliyun;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TestHive {
    public static void main(String[] args) throws Exception {
        // Read the connection information from the environment variables.
        String hiveUrl = System.getenv("HIVEURL");
        String username = System.getenv("HIVEUSER");
        String password = System.getenv("HIVEPASSWORD");
        // Load the Hive JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to the Spark distributed SQL engine by using the connection URL that is obtained in Step 1.
        Connection con = DriverManager.getConnection(hiveUrl, username, password);
        Statement stmt = con.createStatement();
        // Execute the SHOW TABLES statement.
        ResultSet tables = stmt.executeQuery("SHOW TABLES");
        while (tables.next()) {
            System.out.println(tables.getString("tableName"));
        }
        // Query table data. JDBC result set columns are 1-indexed.
        ResultSet rows = stmt.executeQuery("SELECT * FROM test");
        while (rows.next()) {
            System.out.println(rows.getString(1));
        }
        stmt.close();
        con.close();
    }
}
Important: If you set the spark.sql.hive.metastore.version parameter to adb when you start the Spark distributed SQL engine, you must include an existing database of the AnalyticDB for MySQL cluster in the connection URL. If you do not include an existing database, the default database is used and the connection fails.
When network jitter occurs, the ThriftServer HTTP protocol returns the HTTP status code 509, which indicates that access is denied, and does not automatically retry the connection. You must re-establish the connection to the Spark distributed SQL engine.
You cannot modify or delete the transportMode=http;httpPath=s202401301607hzdf852**********.cliservice parameters that are automatically generated. You can append additional configuration parameters to the end of the connection URL.
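As the preceding note states, the ThriftServer HTTP protocol does not automatically retry the connection after network jitter. Long-running applications can therefore wrap the connection attempt in a simple retry loop. The following Java code is a minimal sketch of this approach and is not part of any AnalyticDB for MySQL API; the class name, retry count, and backoff interval are arbitrary placeholder choices:

package com.aliyun;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class TestHiveRetry {
    // Open a JDBC connection, retrying a few times because the server does not
    // automatically retry when network jitter triggers the HTTP status code 509.
    static Connection connectWithRetry(String url, String user, String password)
            throws SQLException, InterruptedException {
        int maxAttempts = 3; // Arbitrary retry budget for this sketch.
        SQLException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return DriverManager.getConnection(url, user, password);
            } catch (SQLException e) {
                lastError = e;
                // Wait briefly before the next attempt.
                Thread.sleep(1000L * attempt);
            }
        }
        throw lastError;
    }

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = connectWithRetry(
                System.getenv("HIVEURL"),
                System.getenv("HIVEUSER"),
                System.getenv("HIVEPASSWORD"))) {
            System.out.println("Connected: " + !con.isClosed());
        }
    }
}

The linearly increasing sleep keeps the sketch simple; production code may prefer exponential backoff and may retry only on errors that indicate transient network failures.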