
AnalyticDB: Use spark-submit to develop Spark applications

Last Updated: Sep 30, 2024

AnalyticDB for MySQL provides the spark-submit command-line tool. If you connect to an AnalyticDB for MySQL cluster from a client for Spark development, you can use this tool to submit Spark applications. This topic describes how to use the spark-submit command-line tool of AnalyticDB for MySQL to develop Spark applications.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.

  • An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

  • Java Development Kit (JDK) 1.8 or later is installed.

Usage notes

When you use the spark-submit command-line tool to develop jobs, you can submit only Spark JAR applications. You cannot submit Spark SQL applications.

Download and install the spark-submit command-line tool

  1. Run the following command to download the installation package of the spark-submit command-line tool named adb-spark-toolkit-submit-0.0.1.tar.gz:

    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/adb-spark-toolkit-submit-0.0.1.tar.gz
  2. Run the following command to decompress the installation package and install the spark-submit command-line tool:

    tar zxvf adb-spark-toolkit-submit-0.0.1.tar.gz

Spark application configuration parameters

After you decompress the installation package of the spark-submit command-line tool, go to the adb-spark-toolkit-submit/conf directory and run the vim spark-defaults.conf command to modify the configuration parameters in the spark-defaults.conf file. The spark-submit script automatically reads the configuration file. The configuration parameters take effect on all Spark applications.

The following list describes the Spark application configuration parameters.

keyId (required)

  The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user that has permissions to access AnalyticDB for MySQL resources. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

  Example: LTAI5tFosFYFgrt3NzBX****

secretId (required)

  The AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access AnalyticDB for MySQL resources. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

  Example: 1BvoSGRT4FV7GB07VVgrRGUty****

regionId (required)

  The region ID of the AnalyticDB for MySQL cluster.

  Example: cn-hangzhou

clusterId (required)

  The ID of the AnalyticDB for MySQL cluster.

  Example: amv-bp1908350u5****

rgName (required)

  The name of the job resource group that is used to run Spark applications.

  Example: test

ossKeyId (optional)

  If the JAR packages that are required for Spark JAR applications are stored on your on-premises device, you must specify the ossKeyId, ossSecretId, and ossUploadPath parameters.

  • The ossKeyId and ossSecretId parameters specify the AccessKey ID and AccessKey secret of the Alibaba Cloud account or RAM user that you use. The RAM user must have the AliyunOSSFullAccess permission.

  • The ossUploadPath parameter specifies the OSS path to which you want to upload the on-premises JAR packages.

  Example: LTAI5tFosFYFgrt3NzBX****

ossSecretId (optional)

  The AccessKey secret that is used together with the ossKeyId parameter. For more information, see the description of the ossKeyId parameter.

  Example: 1BvoSGRT4FV7GB07VVgrRGUty****

ossUploadPath (optional)

  The OSS path to which you want to upload the on-premises JAR packages. For more information, see the description of the ossKeyId parameter.

  Example: oss://testBucketname/jars/test1.jar

conf parameters (optional)

  The configuration parameters that are required for Spark applications, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.
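
For reference, the following snippet sketches what the authentication and cluster settings in spark-defaults.conf might look like when they are filled in with the sample values from the list above. The key names come from the preceding list; the whitespace-separated key-value layout is an assumption, so follow the layout of the template file that ships in the adb-spark-toolkit-submit/conf directory and substitute your own values.

    # Illustrative placeholder values only; replace them with your own
    # credentials and cluster settings.
    keyId        LTAI5tFosFYFgrt3NzBX****
    secretId     1BvoSGRT4FV7GB07VVgrRGUty****
    regionId     cn-hangzhou
    clusterId    amv-bp1908350u5****
    rgName       test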

Submit a Spark application

  1. Upload the JAR packages that are required for Spark applications to OSS. For more information, see Simple upload.

  2. Run the following command to go to the directory of the spark-submit command-line tool:

    cd adb-spark-toolkit-submit
  3. Submit a Spark application in the following format:

    ./bin/spark-submit \
    --class com.aliyun.spark.oss.SparkReadOss \
    --verbose \
    --name Job1 \
    --jars oss://testBucketname/jars/test.jar,oss://testBucketname/jars/search.jar \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://testBucketname/jars/test1.jar args0 args1
    Note

    After you submit a Spark application, one of the following return codes is returned:

    • 0: The application ran successfully.

    • 255: The application failed to run.

    • 143: The application was terminated.

    The following list describes the parameters.

    --class

      The entry class of the Java or Scala application. The entry class is not required for a Python application.

    --verbose

      Displays the logs that are generated when you submit the Spark application.

    --name

      The name of the Spark application.

    --jars

      The absolute paths of the JAR packages that are required for the Spark application. Separate multiple paths with commas (,).

      If you specify on-premises paths, take note of the following items:

      • The RAM user that you use must have the AliyunOSSFullAccess permission.

      • Make sure that the ossUploadPath parameter is configured in the conf/spark-defaults.conf file to specify the OSS path to which you want to upload the on-premises JAR packages.

      • While an on-premises JAR package is being uploaded, the system verifies the package based on the hash value that is generated by using the Message Digest 5 (MD5) algorithm. If a JAR package that has the same name and MD5 value already exists in the specified OSS path, the upload is canceled.

      • If you manually update a JAR package in the OSS path, you must delete the corresponding MD5 file of the package.

    --conf

      The configuration parameters of the Spark application, which are highly similar to those of the open source spark-submit command-line tool. For information about the parameters that are different from open source spark-submit and specific to the spark-submit command-line tool of AnalyticDB for MySQL, see the "Differences between AnalyticDB for MySQL spark-submit and open source spark-submit" section of this topic.

      Note: Specify multiple parameters in the following format: --conf key1=value1 --conf key2=value2.

    oss://testBucketname/jars/test1.jar

      The absolute path of the main file of the Spark application. The main file can be a JAR package that contains the entry class or an executable file that serves as the entry point for a Python application.

      Note: You must store the main files of Spark applications in OSS.

    args0 args1

      The arguments that are required for the JAR packages. Separate multiple arguments with spaces.
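
    Because the return code is reported when the command finishes, submissions can be scripted. The following shell snippet is a minimal sketch, not part of the tool itself, that reuses the placeholder paths from the example above and branches on the return codes that are described in the preceding note.

    ./bin/spark-submit \
    --class com.aliyun.spark.oss.SparkReadOss \
    --name Job1 \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://testBucketname/jars/test1.jar args0 args1
    ret=$?
    # 0: success. 255: failure. 143: terminated.
    if [ $ret -eq 0 ]; then
        echo "Spark application finished successfully."
    else
        echo "Spark application failed or was terminated. Return code: $ret"
    fi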

Query a list of Spark applications

./bin/spark-submit --list --clusterId <cluster_Id> --rgName <ResourceGroup_name> --pagenumber 1 --pagesize 3

Parameters:

  • cluster_Id: the ID of the AnalyticDB for MySQL cluster.

  • ResourceGroup_name: the name of the job resource group that is used to run Spark applications.
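
For example, the following command uses the sample cluster ID and resource group name from the configuration list above to query the first page of applications, three per page. Substitute your own values before you run the command.

./bin/spark-submit --list --clusterId amv-bp1908350u5**** --rgName test --pagenumber 1 --pagesize 3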

Query the status of a Spark application

./bin/spark-submit --status <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.

Query the submission parameters and the Spark UI URL of a Spark application

./bin/spark-submit --detail <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.

The Spark WEB UI parameter in the command output indicates the Spark UI URL.

Query the logs of a Spark application

./bin/spark-submit --get-log <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.

Terminate a Spark application

./bin/spark-submit --kill <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.
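
The query and management commands are typically used together after a submission. The following sketch is illustrative only: the appId value is a made-up placeholder, so replace it with a real application ID taken from the --list output or from the AnalyticDB for MySQL console.

# Placeholder appId for illustration; obtain the real value from --list or the console.
APP_ID=s202409301530hz****
./bin/spark-submit --status ${APP_ID}      # Query the status of the application.
./bin/spark-submit --detail ${APP_ID}      # Query the submission parameters and the Spark UI URL.
./bin/spark-submit --get-log ${APP_ID}     # Query the logs of the application.
./bin/spark-submit --kill ${APP_ID}        # Terminate the application if it is still running.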

Differences between AnalyticDB for MySQL spark-submit and open source spark-submit

Parameters specific to AnalyticDB for MySQL spark-submit

--api-retry-times

  The maximum number of retries allowed if the spark-submit command-line tool of AnalyticDB for MySQL fails to run a command. Default value: 3.

  Commands for submitting Spark applications are not retried because a submission is not an idempotent operation. A submission that fails due to a reason such as a network timeout may have actually succeeded in the background. In this case, retrying the submission command may result in duplicate submissions. To check whether an application is successfully submitted, you can use the --list parameter to obtain a list of submitted applications, or log on to the AnalyticDB for MySQL console to view the application list.

--time-out-seconds

  The timeout period after which the spark-submit command-line tool of AnalyticDB for MySQL retries a failed command. Unit: seconds. Default value: 10.

--enable-inner-endpoint

  Enables the internal network connection. If you submit a Spark application from an Elastic Compute Service (ECS) instance, you can specify this parameter to allow the spark-submit command-line tool of AnalyticDB for MySQL to access services within a virtual private cloud (VPC).

--list

  Queries a list of applications. In most cases, this parameter is used together with the --pagenumber and --pagesize parameters. For example, to query five applications on the first page, specify the following parameters:

  --list --pagenumber 1 --pagesize 5

--pagenumber

  The page number. Default value: 1.

--pagesize

  The maximum number of applications to return on each page. Default value: 10.

--kill

  Terminates an application.

--get-log

  Queries the logs of an application.

--status

  Queries the status of an application.
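
For example, the tool-specific parameters can be added to a query command. The following sketch combines them with the --list command by using the sample cluster ID and resource group name from the configuration list above. Whether --enable-inner-endpoint takes a value is not documented here, so it is shown as a bare switch, which is an assumption; adjust the values to your own environment.

# Assumes the command is run from an ECS instance in the same VPC as the cluster.
./bin/spark-submit --list --clusterId amv-bp1908350u5**** --rgName test \
--pagenumber 1 --pagesize 5 \
--api-retry-times 5 \
--time-out-seconds 30 \
--enable-inner-endpoint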

Parameters specific to open source spark-submit

The spark-submit command-line tool of AnalyticDB for MySQL does not support some configuration parameters of the open source spark-submit command-line tool. For more information, see the "Configuration parameters not supported by AnalyticDB for MySQL" section of the Spark application configuration parameters topic.