
AnalyticDB: Use spark-submit to develop Spark applications

Last Updated: Sep 30, 2024

AnalyticDB for MySQL provides the spark-submit command-line tool. If you connect to an AnalyticDB for MySQL cluster from a client for Spark development, you can use this tool to submit Spark applications. This topic describes how to use the spark-submit command-line tool of AnalyticDB for MySQL to develop Spark applications.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.

  • An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

  • Java Development Kit (JDK) 1.8 or later is installed.

Usage notes

When you use the spark-submit command-line tool to develop jobs, you can submit only Spark JAR applications. You cannot submit Spark SQL applications.

Download and install the spark-submit command-line tool

  1. Run the following command to download the installation package of the spark-submit command-line tool named adb-spark-toolkit-submit-0.0.1.tar.gz:

    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/adb-spark-toolkit-submit-0.0.1.tar.gz
  2. Run the following command to decompress the installation package and install the spark-submit command-line tool:

    tar zxvf adb-spark-toolkit-submit-0.0.1.tar.gz

Spark application configuration parameters

After you decompress the installation package of the spark-submit command-line tool, go to the adb-spark-toolkit-submit/conf directory and run the vim spark-defaults.conf command to modify the configuration parameters in the spark-defaults.conf file. The spark-submit script automatically reads the configuration file. The configuration parameters take effect on all Spark applications.

The following list describes the Spark application configuration parameters.

keyId (required)

  The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user that has permissions to access AnalyticDB for MySQL resources. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

  Example: LTAI5tFosFYFgrt3NzBX****

secretId (required)

  The AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access AnalyticDB for MySQL resources. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

  Example: 1BvoSGRT4FV7GB07VVgrRGUty****

regionId (required)

  The region ID of the AnalyticDB for MySQL cluster.

  Example: cn-hangzhou

clusterId (required)

  The ID of the AnalyticDB for MySQL cluster.

  Example: amv-bp1908350u5****

rgName (required)

  The name of the job resource group that is used to run Spark applications.

  Example: test

ossKeyId (optional)

  If the JAR packages that are required for Spark JAR applications are stored on your on-premises device, you must specify the ossKeyId, ossSecretId, and ossUploadPath parameters.

  • The ossKeyId and ossSecretId parameters specify the AccessKey ID and AccessKey secret of the Alibaba Cloud account or RAM user that you use. The RAM user must have the AliyunOSSFullAccess permission.

  • The ossUploadPath parameter specifies the OSS path to which you want to upload the on-premises JAR packages.

  Example: LTAI5tFosFYFgrt3NzBX****

ossSecretId (optional)

  The AccessKey secret that is used together with the ossKeyId parameter. For more information, see the description of the ossKeyId parameter.

  Example: 1BvoSGRT4FV7GB07VVgrRGUty****

ossUploadPath (optional)

  The OSS path to which you want to upload the on-premises JAR packages. For more information, see the description of the ossKeyId parameter.

  Example: oss://testBucketname/jars/test1.jar

conf parameters (optional)

  The configuration parameters that are required for Spark applications, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.
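
For reference, the following snippet sketches what the authentication and cluster settings in spark-defaults.conf might look like when they are filled in with the sample values from the list above. The key names come from the preceding list; the whitespace-separated key-value layout is an assumption, so follow the layout of the template file that ships in the adb-spark-toolkit-submit/conf directory and substitute your own values.

    # Illustrative placeholder values only; replace them with your own
    # credentials and cluster settings.
    keyId        LTAI5tFosFYFgrt3NzBX****
    secretId     1BvoSGRT4FV7GB07VVgrRGUty****
    regionId     cn-hangzhou
    clusterId    amv-bp1908350u5****
    rgName       test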

Submit a Spark application

  1. Upload the JAR packages that are required for Spark applications to OSS. For more information, see Simple upload.

  2. Run the following command to go to the directory of the spark-submit command-line tool:

    cd adb-spark-toolkit-submit
  3. Submit a Spark application in the following format:

    ./bin/spark-submit \
    --class com.aliyun.spark.oss.SparkReadOss \
    --verbose \
    --name Job1 \
    --jars oss://testBucketname/jars/test.jar,oss://testBucketname/jars/search.jar \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://testBucketname/jars/test1.jar args0 args1
    Note

    After you submit a Spark application, one of the following return codes is returned:

    • 0: The application ran successfully.

    • 255: The application failed to run.

    • 143: The application was terminated.

    The following list describes the parameters.

    --class

      The entry class of the Java or Scala application. The entry class is not required for a Python application.

    --verbose

      Displays the logs that are generated when you submit the Spark application.

    --name

      The name of the Spark application.

    --jars

      The absolute paths of the JAR packages that are required for the Spark application. Separate multiple paths with commas (,).

      If you specify on-premises paths, take note of the following items:

      • The RAM user that you use must have the AliyunOSSFullAccess permission.

      • Make sure that the ossUploadPath parameter is configured in the conf/spark-defaults.conf file to specify the OSS path to which you want to upload the on-premises JAR packages.

      • While an on-premises JAR package is being uploaded, the system verifies the package based on the hash value that is generated by using the Message Digest 5 (MD5) algorithm. If a JAR package that has the same name and MD5 value already exists in the specified OSS path, the upload is canceled.

      • If you manually update a JAR package in the OSS path, you must delete the corresponding MD5 file of the package.

    --conf

      The configuration parameters of the Spark application, which are highly similar to those of the open source spark-submit command-line tool. For information about the parameters that are different from open source spark-submit and specific to the spark-submit command-line tool of AnalyticDB for MySQL, see the "Differences between AnalyticDB for MySQL spark-submit and open source spark-submit" section of this topic.

      Note: Specify multiple parameters in the following format: --conf key1=value1 --conf key2=value2.

    oss://testBucketname/jars/test1.jar

      The absolute path of the main file of the Spark application. The main file can be a JAR package that contains the entry class or an executable file that serves as the entry point for a Python application.

      Note: You must store the main files of Spark applications in OSS.

    args0 args1

      The arguments that are required for the JAR packages. Separate multiple arguments with spaces.
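
    Because the return code is reported when the command finishes, submissions can be scripted. The following shell snippet is a minimal sketch, not part of the tool itself, that reuses the placeholder paths from the example above and branches on the return codes that are described in the preceding note.

    ./bin/spark-submit \
    --class com.aliyun.spark.oss.SparkReadOss \
    --name Job1 \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://testBucketname/jars/test1.jar args0 args1
    ret=$?
    # 0: success. 255: failure. 143: terminated.
    if [ $ret -eq 0 ]; then
        echo "Spark application finished successfully."
    else
        echo "Spark application failed or was terminated. Return code: $ret"
    fi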

Query a list of Spark applications

./bin/spark-submit --list --clusterId <cluster_Id> --rgName <ResourceGroup_name> --pagenumber 1 --pagesize 3

Parameters:

  • cluster_Id: the ID of the AnalyticDB for MySQL cluster.

  • ResourceGroup_name: the name of the job resource group that is used to run Spark applications.
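
For example, the following command uses the sample cluster ID and resource group name from the configuration list above to query the first page of applications, three per page. Substitute your own values before you run the command.

./bin/spark-submit --list --clusterId amv-bp1908350u5**** --rgName test --pagenumber 1 --pagesize 3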

Query the status of a Spark application

./bin/spark-submit --status <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.

Query the submission parameters and the Spark UI URL of a Spark application

./bin/spark-submit --detail <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.

The Spark WEB UI parameter in the command output indicates the Spark UI URL.

Query the logs of a Spark application

./bin/spark-submit --get-log <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.

Terminate a Spark application

./bin/spark-submit --kill <appId>

You can obtain the appId of an application from the information about Spark applications. For more information, see the "Query a list of Spark applications" section of this topic.
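
The query and management commands are typically used together after a submission. The following sketch is illustrative only: the appId value is a made-up placeholder, so replace it with a real application ID taken from the --list output or from the AnalyticDB for MySQL console.

# Placeholder appId for illustration; obtain the real value from --list or the console.
APP_ID=s202409301530hz****
./bin/spark-submit --status ${APP_ID}      # Query the status of the application.
./bin/spark-submit --detail ${APP_ID}      # Query the submission parameters and the Spark UI URL.
./bin/spark-submit --get-log ${APP_ID}     # Query the logs of the application.
./bin/spark-submit --kill ${APP_ID}        # Terminate the application if it is still running.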

Differences between AnalyticDB for MySQL spark-submit and open source spark-submit

Parameters specific to AnalyticDB for MySQL spark-submit

--api-retry-times

  The maximum number of retries allowed if the spark-submit command-line tool of AnalyticDB for MySQL fails to run a command. Default value: 3.

  Commands for submitting Spark applications are not retried because a submission is not an idempotent operation. A submission that fails due to a reason such as a network timeout may have actually succeeded in the background. In this case, retrying the submission command may result in duplicate submissions. To check whether an application is successfully submitted, you can use the --list parameter to obtain a list of submitted applications, or log on to the AnalyticDB for MySQL console to view the application list.

--time-out-seconds

  The timeout period after which the spark-submit command-line tool of AnalyticDB for MySQL retries a failed command. Unit: seconds. Default value: 10.

--enable-inner-endpoint

  Enables the internal network connection. If you submit a Spark application from an Elastic Compute Service (ECS) instance, you can specify this parameter to allow the spark-submit command-line tool of AnalyticDB for MySQL to access services within a virtual private cloud (VPC).

--list

  Queries a list of applications. In most cases, this parameter is used together with the --pagenumber and --pagesize parameters. For example, to query five applications on the first page, specify the following parameters:

  --list --pagenumber 1 --pagesize 5

--pagenumber

  The page number. Default value: 1.

--pagesize

  The maximum number of applications to return on each page. Default value: 10.

--kill

  Terminates an application.

--get-log

  Queries the logs of an application.

--status

  Queries the status of an application.
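
For example, the tool-specific parameters can be added to a query command. The following sketch combines them with the --list command by using the sample cluster ID and resource group name from the configuration list above. Whether --enable-inner-endpoint takes a value is not documented here, so it is shown as a bare switch, which is an assumption; adjust the values to your own environment.

# Assumes the command is run from an ECS instance in the same VPC as the cluster.
./bin/spark-submit --list --clusterId amv-bp1908350u5**** --rgName test \
--pagenumber 1 --pagesize 5 \
--api-retry-times 5 \
--time-out-seconds 30 \
--enable-inner-endpoint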

Parameters specific to open source spark-submit

The spark-submit command-line tool of AnalyticDB for MySQL does not support some configuration parameters of the open source spark-submit command-line tool. For more information, see the "Configuration parameters not supported by AnalyticDB for MySQL" section of the Spark application configuration parameters topic.