AnalyticDB for MySQL provides the spark-submit command-line tool. If you use a client to connect to an AnalyticDB for MySQL cluster for Spark development, you can use the spark-submit command-line tool to submit Spark applications. This topic describes how to use the spark-submit command-line tool of AnalyticDB for MySQL to develop Spark applications.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
Java Development Kit (JDK) 1.8 or later is installed.
Usage notes
When you use the spark-submit command-line tool to develop jobs, you can submit only Spark JAR applications. You cannot submit Spark SQL applications.
Download and install the spark-submit command-line tool
Run the following command to download the installation package of the spark-submit command-line tool named adb-spark-toolkit-submit-0.0.1.tar.gz:
wget https://dla003.oss-cn-hangzhou.aliyuncs.com/adb-spark-toolkit-submit-0.0.1.tar.gz
Run the following command to decompress the installation package and install the spark-submit command-line tool:
tar zxvf adb-spark-toolkit-submit-0.0.1.tar.gz
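The tool is extracted to the adb-spark-toolkit-submit directory. If the execute permission on the spark-submit script is not preserved during extraction, you can grant it manually. This step is only a precaution and may not be required in your environment:
chmod +x adb-spark-toolkit-submit/bin/spark-submit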
Spark application configuration parameters
After you decompress the installation package of the spark-submit command-line tool, go to the adb-spark-toolkit-submit/conf directory and run the vim spark-defaults.conf command to modify the configuration parameters in the spark-defaults.conf file. The spark-submit script automatically reads the configuration file, and the configuration parameters take effect on all Spark applications.
The following table describes the Spark application configuration parameters.
Parameter | Required | Description | Example |
keyId | Yes | The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user that has permissions to access AnalyticDB for MySQL resources. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions. | LTAI5tFosFYFgrt3NzBX**** |
secretId | Yes | The AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access AnalyticDB for MySQL resources. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions. | 1BvoSGRT4FV7GB07VVgrRGUty**** |
regionId | Yes | The region ID of the AnalyticDB for MySQL cluster. | cn-hangzhou |
clusterId | Yes | The ID of the AnalyticDB for MySQL cluster. | amv-bp1908350u5**** |
rgName | Yes | The name of the job resource group that is used to run Spark applications. | test |
ossKeyId | No | The AccessKey ID that is used to upload on-premises JAR packages to OSS. If the JAR packages that are required for Spark JAR applications are stored on your on-premises device, you must specify the ossKeyId, ossSecretId, and ossUploadPath parameters. | LTAI5tFosFYFgrt3NzBX**** |
ossSecretId | No | The AccessKey secret that is used to upload on-premises JAR packages to OSS. | 1BvoSGRT4FV7GB07VVgrRGUty**** |
ossUploadPath | No | The OSS path to which on-premises JAR packages are uploaded. | oss://testBucketname/jars/test1.jar |
conf parameters | No | The configuration parameters that are required for Spark applications, which are similar to those of Apache Spark. The parameters must be in the key=value format. | spark.driver.resourceSpec=medium |
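For reference, a populated spark-defaults.conf might look similar to the following sketch. The property names follow the table above, the values are placeholders taken from the example column, and the exact key-value syntax is an assumption that should be checked against the template entries that ship with the file:
keyId = LTAI5tFosFYFgrt3NzBX****
secretId = 1BvoSGRT4FV7GB07VVgrRGUty****
regionId = cn-hangzhou
clusterId = amv-bp1908350u5****
rgName = test
ossKeyId = LTAI5tFosFYFgrt3NzBX****
ossSecretId = 1BvoSGRT4FV7GB07VVgrRGUty****
ossUploadPath = oss://testBucketname/jars/test1.jar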
Submit a Spark application
Upload the JAR packages that are required for Spark applications to OSS. For more information, see Simple upload.
Run the following command to go to the directory of the spark-submit command-line tool:
cd adb-spark-toolkit-submit
Submit a Spark application in the following format:
./bin/spark-submit \
--class com.aliyun.spark.oss.SparkReadOss \
--verbose \
--name Job1 \
--jars oss://testBucketname/jars/test.jar,oss://testBucketname/jars/search.jar \
--conf spark.driver.resourceSpec=medium \
--conf spark.executor.instances=1 \
--conf spark.executor.resourceSpec=medium \
oss://testBucketname/jars/test1.jar args0 args1
Note: After you submit a Spark application, one of the following return codes is returned:
255: The application failed to run.
0: The application is run successfully.
143: The application is terminated.
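If you submit applications from a shell script, you can branch on the return code. The following Bash sketch reuses the sample submission above and only inspects the exit status of the command:
./bin/spark-submit \
--class com.aliyun.spark.oss.SparkReadOss \
--name Job1 \
--conf spark.driver.resourceSpec=medium \
--conf spark.executor.instances=1 \
--conf spark.executor.resourceSpec=medium \
oss://testBucketname/jars/test1.jar args0 args1
case $? in
  0)   echo "The application ran successfully." ;;
  143) echo "The application was terminated." ;;
  *)   echo "The application failed." ;;
esac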
The following table describes the parameters.
Parameter | Description |
--class | The entry class of the Java or Scala application. The entry class is not required for a Python application. |
--verbose | Displays the logs that are generated when you submit the Spark application. |
--name | The name of the Spark application. |
--jars | The absolute paths of the JAR packages that are required for the Spark application. Separate multiple paths with commas (,). If you specify on-premises paths, take note of the following items: The RAM user that you use must have the AliyunOSSFullAccess permission. Make sure that the ossUploadPath parameter is configured in the conf/spark-defaults.conf file to specify the OSS path to which you want to upload the on-premises JAR packages. While an on-premises JAR package is being uploaded, the system verifies the package based on the hash value that is generated by using the Message Digest 5 (MD5) algorithm. If a JAR package that has the same name and MD5 value already exists in the specified OSS path, the upload is canceled. If you manually update a JAR package in the OSS path, you must delete the corresponding MD5 file of the package. |
--conf | The configuration parameters of the Spark application, which are highly similar to those of the open source spark-submit command-line tool. For information about the configuration parameters that are specific to the spark-submit command-line tool of AnalyticDB for MySQL, see the "Differences between AnalyticDB for MySQL spark-submit and open source spark-submit" section of this topic. Note: Specify multiple parameters in the following format: --conf key1=value1 --conf key2=value2. |
oss://testBucketname/jars/test1.jar | The absolute path of the main file of the Spark application. The main file can be a JAR package that contains the entry class or an executable file that serves as the entry point for a Python application. Note: You must store the main files of Spark applications in OSS. |
args | The arguments that are required for the JAR packages. Separate multiple arguments with spaces. |
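For example, if a dependency JAR package resides on your on-premises device and the ossUploadPath parameter is configured, you can reference the local path in --jars and the tool uploads the package to OSS before submission. The entry class and the local path in the following sketch are hypothetical:
./bin/spark-submit \
--class com.example.Main \
--name Job2 \
--jars /home/user/libs/dependency.jar \
--conf spark.driver.resourceSpec=medium \
--conf spark.executor.resourceSpec=medium \
oss://testBucketname/jars/test1.jar args0 args1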
Query a list of Spark applications
./bin/spark-submit --list --clusterId <cluster_Id> --rgName <ResourceGroup_name> --pagenumber 1 --pagesize 3
Parameters:
cluster_Id: the ID of the AnalyticDB for MySQL cluster.
ResourceGroup_name: the name of the job resource group that is used to run Spark applications.
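For example, to query the first page of applications by using the sample cluster ID and resource group name from the parameter table above:
./bin/spark-submit --list --clusterId amv-bp1908350u5**** --rgName test --pagenumber 1 --pagesize 10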
Query the status of a Spark application
./bin/spark-submit --status <appId>
You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.
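If you want to track an application from a shell script, a simple approach is to poll its status periodically. The following sketch assumes that <appId> is replaced with an actual application ID:
while true; do
  ./bin/spark-submit --status <appId>
  sleep 30
done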
Query the submission parameters and the Spark UI URL of a Spark application
./bin/spark-submit --detail <appId>
You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.
The Spark WEB UI parameter in the command output indicates the Spark UI URL.
Query the logs of a Spark application
./bin/spark-submit --get-log <appId>
You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.
Terminate a Spark application
./bin/spark-submit --kill <appId>
You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.
Differences between AnalyticDB for MySQL spark-submit and open source spark-submit
Parameters specific to AnalyticDB for MySQL spark-submit
Parameter | Description |
--api-retry-times | The maximum number of retries that is allowed if the spark-submit command-line tool of AnalyticDB for MySQL fails to run a command. Default value: 3. Commands for submitting Spark applications are not retried because a submission is not an idempotent operation. Submissions that fail due to reasons such as network timeouts may have actually succeeded in the background. In this case, retrying application submission commands may result in duplicate submissions. To check whether an application is successfully submitted, you can use the --list parameter to query the list of applications. |
--time-out-seconds | The timeout period after which the spark-submit command-line tool of AnalyticDB for MySQL retries a failed command. Unit: seconds. Default value: 10. |
--enable-inner-endpoint | Enables the internal network connection. If you submit a Spark application from an Elastic Compute Service (ECS) instance, you can specify this parameter to allow the spark-submit command-line tool of AnalyticDB for MySQL to access services within a virtual private cloud (VPC). |
--list | Queries a list of applications. In most cases, this parameter is used together with the --pagenumber and --pagesize parameters. For example, if you want to query five applications on the first page, you can specify the following parameters: --list --pagenumber 1 --pagesize 5. |
--pagenumber | The page number. Default value: 1. |
--pagesize | The maximum number of applications to return on each page. Default value: 10. |
--kill | Terminates the application. |
--get-log | Queries the logs of the application. |
--status | Queries the details about the application. |
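These parameters can be combined with the commands shown earlier in this topic. For example, the following sketch queries the status of an application from an ECS instance over the internal network and increases the retry settings; whether every parameter applies to every command is an assumption, and <appId> is a placeholder:
./bin/spark-submit --status <appId> --enable-inner-endpoint --api-retry-times 5 --time-out-seconds 20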
Parameters specific to open source spark-submit
The spark-submit command-line tool of AnalyticDB for MySQL does not support specific configuration parameters of the open source spark-submit command-line tool. For more information, see the "Configuration parameters not supported by AnalyticDB for MySQL" section of the Spark application configuration parameters topic.