This topic describes how to use AnalyticDB for MySQL spark-submit to develop Spark applications. In the example, an Elastic Compute Service (ECS) instance is connected to AnalyticDB for MySQL.
Background information
When you connect to an AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster from a client and develop Spark applications, you must use AnalyticDB for MySQL spark-submit to submit Spark applications.
AnalyticDB for MySQL spark-submit can be used to submit Spark batch applications, not Spark SQL applications.
Prerequisites
Java Development Kit (JDK) V1.8 or later is installed.
Download and install AnalyticDB for MySQL spark-submit
Run the following command to download the installation package of AnalyticDB for MySQL spark-submit. The file name is adb-spark-toolkit-submit-0.0.1.tar.gz.

wget https://dla003.oss-cn-hangzhou.aliyuncs.com/adb-spark-toolkit-submit-0.0.1.tar.gz
Run the following command to extract the package and install AnalyticDB for MySQL spark-submit.
tar zxvf adb-spark-toolkit-submit-0.0.1.tar.gz
Configure parameters
You can modify the configuration parameters of an AnalyticDB for MySQL Spark batch application in the conf/spark-defaults.conf file or in the CLI command that submits the job. We recommend that you modify configuration parameters in the conf/spark-defaults.conf file. After the modification, AnalyticDB for MySQL spark-submit automatically reads the configuration file. If you specify a configuration parameter in the CLI command, the parameter in the configuration file is not overwritten, but the value specified in the command takes precedence.
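The precedence rule above can be sketched as a small shell helper. This is an illustrative sketch, not the tool's internal logic: the `conf_get` and `resolve` functions are hypothetical names, and the key/value syntax assumed is the `key = value` format shown in this topic.

```shell
#!/bin/sh
# Sketch: CLI values take precedence over spark-defaults.conf values.
# conf_get and resolve are illustrative helpers, not part of the real tool.

# Read a key from a spark-defaults.conf-style file ("key = value" or "key=value").
conf_get() {  # conf_get <file> <key>
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -1
}

# Resolve a parameter: use the CLI value if one was given, else fall back to the file.
resolve() {  # resolve <file> <key> [cli_value]
  if [ -n "$3" ]; then
    printf '%s\n' "$3"        # value from the command wins
  else
    conf_get "$1" "$2"        # fall back to the configuration file
  fi
}
```

For example, with `regionId = cn-hangzhou` in the file, `resolve conf regionId cn-beijing` returns `cn-beijing`, while `resolve conf regionId` returns `cn-hangzhou`.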
After you install AnalyticDB for MySQL spark-submit, run the following command to open the adb-spark-toolkit-submit/conf/spark-defaults.conf file:

vim adb-spark-toolkit-submit/conf/spark-defaults.conf
Configure parameters in the key=value format. Example:

keyId = yourAkId
secretId = yourAkSec
regionId = cn-hangzhou
clusterId = amv-bp15f9q95p****
rgName = sg-default
ossUploadPath = oss://<bucket_name>/sparkjars/
spark.driver.resourceSpec = medium
spark.executor.instances = 2
spark.executor.resourceSpec = medium
spark.adb.roleArn = arn:1234567/adbsparkprocessrole
spark.adb.eni.vswitchId = vsw-defaultswitch
spark.adb.eni.securityGroupId = sg-defaultgroup
spark.app.log.rootPath = oss://<bucket_name>/sparklogs/

Important: You must replace the sample values with the actual values.
In the example, the keyId, secretId, regionId, clusterId, rgName, ossKeyId, and ossUploadPath parameters are supported only by AnalyticDB for MySQL spark-submit, not by Apache Spark. For more information about the parameters, see the "Parameters" section of this topic.
You can configure the parameters in the command that submits Spark jobs in the --key1 value1 --key2 value2 format. You can also use AnalyticDB for MySQL SparkConf to configure the parameters. For more information, see the "Configuration compatibility" section of this topic. For information about the other parameters in the sample code or about all configuration parameters that are supported by AnalyticDB for MySQL spark-submit, see Spark application configuration parameters.
Table 1. Parameters

Parameter | Description | Required |
keyId | The AccessKey ID of the Alibaba Cloud account or Resource Access Management (RAM) user that is used to run Spark jobs. For information about how to view the AccessKey ID, see Accounts and permissions. | Yes |
secretId | The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run Spark jobs. | Yes |
regionId | The ID of the region where the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster resides. | Yes |
clusterId | The ID of the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. | Yes |
rgName | The resource group of the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. | Yes |
ossKeyId | The AccessKey ID of the Alibaba Cloud account or RAM user that is used to create the Object Storage Service (OSS) bucket. You can specify an OSS directory in the configuration, so that JAR packages in your on-premises storage can be automatically uploaded to the OSS directory. | No |
ossSecretId | The AccessKey secret of the Alibaba Cloud account or RAM user that is used to create the OSS bucket. | No |
ossEndpoint | The internal endpoint of the OSS bucket. For information about the mappings between OSS regions and endpoints, see Regions and endpoints. | No |
ossUploadPath | The OSS directory to which JAR packages are uploaded. | No |
Configuration compatibility
To ensure compatibility with open source spark-submit, the keyId, secretId, regionId, clusterId, rgName, ossKeyId, and ossUploadPath parameters can also be configured by using AnalyticDB for MySQL SparkConf in the following format:
--conf spark.adb.access.key.id=<value>
--conf spark.adb.access.secret.id=<value>
--conf spark.adb.regionId=<value>
--conf spark.adb.clusterId=<value>
--conf spark.adb.rgName=<value>
--conf spark.adb.oss.akId=<value>
--conf spark.adb.oss.akSec=<value>
--conf spark.adb.oss.endpoint=<value>
--conf spark.adb.oss.uploadPath=<value>
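The mapping above can be captured in a small helper that translates the spark-submit-specific keys into their SparkConf-style --conf flags. This is an illustrative sketch: the `to_conf_flags` function is a hypothetical name, and the mapping is taken directly from the list above.

```shell
#!/bin/sh
# Sketch: translate an AnalyticDB for MySQL spark-submit-specific key into
# the equivalent open source compatible --conf flag (mapping from this topic).
to_conf_flags() {  # to_conf_flags <key> <value>
  case "$1" in
    keyId)         echo "--conf spark.adb.access.key.id=$2" ;;
    secretId)      echo "--conf spark.adb.access.secret.id=$2" ;;
    regionId)      echo "--conf spark.adb.regionId=$2" ;;
    clusterId)     echo "--conf spark.adb.clusterId=$2" ;;
    rgName)        echo "--conf spark.adb.rgName=$2" ;;
    ossKeyId)      echo "--conf spark.adb.oss.akId=$2" ;;
    ossSecretId)   echo "--conf spark.adb.oss.akSec=$2" ;;
    ossEndpoint)   echo "--conf spark.adb.oss.endpoint=$2" ;;
    ossUploadPath) echo "--conf spark.adb.oss.uploadPath=$2" ;;
    *)             return 1 ;;  # not a spark-submit-specific key
  esac
}
```

For example, `to_conf_flags regionId cn-hangzhou` prints `--conf spark.adb.regionId=cn-hangzhou`.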
Submit a Spark job
Run the following command to open the directory of AnalyticDB for MySQL spark-submit:
cd adb-spark-toolkit-submit
Submit a job in the following format:
./bin/spark-submit \
--class com.aliyun.spark.oss.SparkReadOss \
--verbose \
--name <your_job_name> \
--jars oss://<bucket-name>/jars/xxx.jar,oss://<bucket-name>/jars/xxx.jar \
--conf spark.driver.resourceSpec=medium \
--conf spark.executor.instances=1 \
--conf spark.executor.resourceSpec=medium \
oss://<bucket-name>/jars/xxx.jar args0 args1
Note: One of the following response codes is returned after you submit the Spark job:
255: The job failed.
0: The job is successfully executed.
143: The job was terminated.
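The three response codes above can be handled in a wrapper script. This is a minimal sketch: the `describe_exit` function is a hypothetical helper, and the real submission command is only referenced in a comment.

```shell
#!/bin/sh
# Sketch: map the documented spark-submit response codes to human-readable
# outcomes (0 success, 255 failure, 143 terminated).
describe_exit() {  # describe_exit <code>
  case "$1" in
    0)   echo "success" ;;
    143) echo "terminated" ;;
    255) echo "failed" ;;
    *)   echo "unknown ($1)" ;;
  esac
}

# Example wrapper (illustrative):
#   ./bin/spark-submit ...; describe_exit $?
```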
The following table describes the parameters.
Parameter | Example | Description |
--class | <class_name> | The entry point of the Java or Scala application. The entry point is not required for a Python application. |
--verbose | None | Displays the logs that are generated during the submission of the Spark job. |
--name | <spark_name> | The name of the Spark application. |
--jars | <jar_name> | The absolute paths of the JAR packages that are required for the Spark application. Separate multiple JAR packages with commas (,). Note: Make sure that you have the permissions to access the OSS directory. You can log on to the RAM console and grant the AliyunOSSFullAccess permission to the RAM user on the Users page. |
--conf | <key=value> | The configuration parameters of the Spark application. The configuration of AnalyticDB for MySQL spark-submit is highly similar to that of open source spark-submit. For information about the parameters that are different and the parameters that are specific to AnalyticDB for MySQL spark-submit, see the "Parameters specific to AnalyticDB for MySQL spark-submit" section of this topic and Spark application configuration parameters. |
oss://<bucket-name>/jars/xxx.jar | <oss_path> | The absolute path of the main file of the Spark application. The main file can be a JAR package that contains the entry point or an executable file that serves as the entry point for the Python program. Note You must store the main files of Spark applications in OSS. |
args | <args0 args1> | The parameters that are required for the JAR packages. Separate multiple parameters with spaces. |
Query a list of Spark jobs
./bin/spark-submit --list --clusterId <cluster_Id> --rgName <ResourceGroup_name> --pagenumber 1 --pagesize 3
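When a cluster has more jobs than fit on one page, you can walk the pages in a loop. This is a sketch under assumptions: `list_page` is a mock stand-in for the real `./bin/spark-submit --list --clusterId <cluster_Id> --rgName <ResourceGroup_name> --pagenumber N --pagesize S` call, and the stop condition (an empty page means no more jobs) is an assumption for illustration.

```shell
#!/bin/sh
# Sketch: fetch every page of the Spark job list until an empty page is seen.
# list_page is a mock; replace it with the real spark-submit --list call:
#   ./bin/spark-submit --list --clusterId <id> --rgName <rg> \
#     --pagenumber "$1" --pagesize "$2"
list_page() {  # mock: pretend there are two jobs, one per page
  if [ "$1" -le 2 ]; then echo "job-$1"; fi
}

list_all_jobs() {  # list_all_jobs <pagesize>
  page=1
  while :; do
    out=$(list_page "$page" "$1") || break
    [ -z "$out" ] && break      # empty page: assume no more jobs
    printf '%s\n' "$out"
    page=$((page + 1))
  done
}
```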
Query the status of a Spark job
./bin/spark-submit --status <appId>
You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.
For more information about the statuses of Spark jobs, see SparkAppInfo.
Query the parameters and Spark UI of a submitted Spark job
./bin/spark-submit --detail <appId>
You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.
The Spark WEB UI field in the returned results indicates the Spark UI address.
Query the logs of a Spark job
./bin/spark-submit --get-log <appId>
You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.
Terminate a Spark job
./bin/spark-submit --kill <appId>
You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.
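A common pattern is to poll a job's status until it reaches a terminal state before acting on it. This is a sketch under assumptions: `get_status` is a mock stand-in for extracting the state from the output of `./bin/spark-submit --status <appId>`, and the state names COMPLETED, FAILED, and KILLED are illustrative, not confirmed output of the tool.

```shell
#!/bin/sh
# Sketch: poll a Spark job until it reaches a terminal state.
# get_status is a mock; replace it with parsing of
#   ./bin/spark-submit --status <appId>
# The terminal-state names below are illustrative assumptions.
get_status() {  # mock: always reports COMPLETED
  echo "COMPLETED"
}

wait_for_job() {  # wait_for_job <appId> <max_polls>
  i=0
  while [ "$i" -lt "$2" ]; do
    state=$(get_status "$1")
    case "$state" in
      COMPLETED|FAILED|KILLED) echo "$state"; return 0 ;;
    esac
    i=$((i + 1))
    # sleep 10   # poll interval for a real job
  done
  echo "TIMEOUT"
  return 1
}
```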
Differences between AnalyticDB for MySQL spark-submit and open source spark-submit
Parameters specific to AnalyticDB for MySQL spark-submit
Parameter | Description |
--api-retry-times | The maximum number of retries allowed when AnalyticDB for MySQL spark-submit fails to run a command. Commands for submitting Spark jobs are not retried, because job submission is not an idempotent operation. Submissions that fail due to reasons such as network timeout may have actually succeeded in the background, so retrying a submission command may result in duplicate submissions. To check whether a job has been successfully submitted, query the list of Spark jobs. |
--time-out-seconds | The timeout period in AnalyticDB for MySQL spark-submit after which a command is retried. Unit: seconds. Default value: 10. |
--enable-inner-endpoint | When you submit a Spark job from an ECS instance, you can specify this parameter, so that AnalyticDB for MySQL spark-submit can access services within the virtual private cloud (VPC). Important Only the following regions support service access within VPCs: China (Hangzhou), China (Shanghai), and China (Shenzhen). |
--list | Obtains a list of jobs. In most cases, this parameter is used together with the --pagenumber and --pagesize parameters. For example, if you want to view five jobs on the first page, you can configure the parameters as follows: --list --pagenumber 1 --pagesize 5 |
--pagenumber | The page number. Default value: 1. |
--pagesize | The maximum number of jobs to return on each page. Default value: 10. |
--kill | Terminates the job. |
--get-log | Queries the logs of the application. |
--status | Queries the details about the application. |
Parameters specific to open source spark-submit
AnalyticDB for MySQL spark-submit does not support some configuration parameters of open source spark-submit. For more information, see the "Configuration parameters that are not supported for AnalyticDB for MySQL" section of the Spark application configuration parameters topic.