
E-MapReduce:Use the spark-submit CLI to submit a Spark job

Last Updated: Sep 13, 2024

This topic describes how to use the spark-submit CLI to submit a Spark job after E-MapReduce (EMR) Serverless Spark is connected to Elastic Compute Service (ECS).

Prerequisites

  • Java Development Kit (JDK) V1.8 or later is installed.

  • If you want to use a RAM user to submit Spark jobs, make sure that the RAM user is added to a Serverless Spark workspace as a member and assigned the developer role or a role that has higher permissions. For more information, see Manage users and roles.

Procedure

Step 1: Download and install the spark-submit tool for EMR Serverless

  1. Click emr-serverless-spark-tool-0.1.0-bin.zip to download the installation package.

  2. Upload the installation package to an ECS instance. For more information, see Upload and download files.

  3. Run the following command to decompress the installation package and install the spark-submit tool:

    unzip emr-serverless-spark-tool-0.1.0-bin.zip
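
    After decompression, the tool directory contains the bin/spark-submit script and the conf/connection.properties file that are used in the following steps. A quick check, assuming the package was decompressed in the current directory:

    # Confirm that the script and the configuration file are in place.
    ls emr-serverless-spark-tool-0.1.0/bin/spark-submit emr-serverless-spark-tool-0.1.0/conf/connection.properties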

Step 2: Configure parameters

Important

If the SPARK_CONF_DIR environment variable is configured in the environment in which the spark-submit tool is installed, you must store the configuration file in the directory specified by the SPARK_CONF_DIR environment variable. Otherwise, an error is reported. For EMR clusters, this directory is /etc/taihao-apps/spark-conf in most cases.
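
For example, you can copy the configuration file that you edit in the next step into that directory. The following is a minimal sketch, assuming the tool was decompressed in the current directory:

    # Only needed when SPARK_CONF_DIR is set in the current environment.
    if [ -n "${SPARK_CONF_DIR}" ]; then
      cp emr-serverless-spark-tool-0.1.0/conf/connection.properties "${SPARK_CONF_DIR}/"
    fi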

  1. Run the following command to modify the configuration of the connection.properties file:

    vim emr-serverless-spark-tool-0.1.0/conf/connection.properties
  2. Configure parameters in the file based on the following sample code. The parameters are specified in the key=value format.

    accessKeyId=yourAccessKeyId
    accessKeySecret=yourAccessKeySecret
    # securityToken=yourSecurityToken
    regionId=cn-hangzhou
    endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
    workspaceId=w-xxxxxxxxxxxx
    resourceQueueId=dev_queue
    # networkServiceId=xxxxxx
    releaseVersion=esr-2.1 (Spark 3.3.1, Scala 2.12)

    The following list describes the parameters:

    • accessKeyId (required): The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job.

    • accessKeySecret (required): The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job.

    • securityToken (optional): The Security Token Service (STS) token of the RAM user. This parameter is required only for STS authentication.

    • regionId (required): The region ID. In this example, the China (Hangzhou) region is used.

    • endpoint (required): The endpoint of EMR Serverless Spark. For more information, see Endpoints. In this example, the public endpoint in the China (Hangzhou) region, emr-serverless-spark.cn-hangzhou.aliyuncs.com, is used. If the ECS instance cannot access the Internet, use the virtual private cloud (VPC) endpoint of EMR Serverless Spark instead.

    • workspaceId (required): The ID of the EMR Serverless Spark workspace.

    • resourceQueueId (optional): The name of the queue. Default value: dev_queue.

    • networkServiceId (optional): The name of the network connection. This parameter is required only if the Spark job needs to access VPC resources. For more information, see Network connection between EMR Serverless Spark and other VPCs.

    • releaseVersion (optional): The version of EMR Serverless Spark. Example: esr-2.1 (Spark 3.3.1, Scala 2.12).
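
    If you use temporary credentials issued by STS for a RAM user, uncomment the securityToken line. The temporary AccessKey pair and the security token are issued together by STS and must come from the same response. A minimal sketch with placeholder values:

    # STS authentication: temporary AccessKey pair plus the matching security token.
    accessKeyId=yourStsAccessKeyId
    accessKeySecret=yourStsAccessKeySecret
    securityToken=yourSecurityToken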

Step 3: Submit a Spark job

  1. Run the following command to go to the directory of the spark-submit tool:

    cd emr-serverless-spark-tool-0.1.0
  2. Submit the Spark job.

    Spark job launched from Java or Scala

    In this example, the test JAR package spark-examples_2.12-3.3.1.jar is used. You can click spark-examples_2.12-3.3.1.jar to download the package and then upload it to Object Storage Service (OSS). The JAR package is a simple example provided by Spark that calculates the value of pi (π).
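
    You can upload the package with any OSS client. The following is a minimal sketch that uses the ossutil CLI, assuming ossutil is installed and configured; the bucket and path are placeholders that you replace with your own:

    # Upload the test JAR package to OSS (replace the bucket and path with your own).
    ossutil cp spark-examples_2.12-3.3.1.jar oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar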

    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --class org.apache.spark.examples.SparkPi \
    oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
    10000

    Spark job launched from PySpark

    In this example, the test files DataFrame.py and employee.csv are used. You can click DataFrame.py and employee.csv to download the test files and then upload them to OSS.

    Note
    • The DataFrame.py file contains code that uses Apache Spark to process data stored in OSS.

    • The employee.csv file contains data such as employee names, departments, and salaries.

    ./bin/spark-submit --name PySpark \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --conf spark.tags.key=value \
    --files oss://<yourBucket>/path/to/employee.csv \
    oss://<yourBucket>/path/to/DataFrame.py \
    10000
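
    If the PySpark job depends on additional Python modules, you can distribute them with the --py-files parameter that is described in the lists below. A hedged variant of the preceding command, assuming a hypothetical helper module utils.py has also been uploaded to OSS:

    ./bin/spark-submit --name PySpark \
    --queue dev_queue \
    --py-files oss://<yourBucket>/path/to/utils.py \
    --files oss://<yourBucket>/path/to/employee.csv \
    oss://<yourBucket>/path/to/DataFrame.py \
    10000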

    The following lists describe the parameters:

    • Supported parameters of the open source spark-submit tool

      • --class (example: org.apache.spark.examples.SparkPi): The entry class of the Spark job. This parameter is required only if the Spark job is launched from Java or Scala.

      • --num-executors (example: 10): The number of executors of the Spark job.

      • --driver-cores (example: 1): The number of driver cores of the Spark job.

      • --driver-memory (example: 4g): The size of the driver memory of the Spark job.

      • --executor-cores (example: 1): The number of executor cores of the Spark job.

      • --executor-memory (example: 1024m): The size of the executor memory of the Spark job.

      • --files (example: oss://<yourBucket>/file1,oss://<yourBucket>/file2): The resource files used by the Spark job. Only resource files stored in OSS are supported. Separate multiple files with commas (,).

      • --py-files (example: oss://<yourBucket>/file1.py,oss://<yourBucket>/file2.py): The Python scripts used by the Spark job. Only Python scripts stored in OSS are supported. Separate multiple scripts with commas (,). This parameter is required only if the Spark job is launched from PySpark.

      • --jars (example: oss://<yourBucket>/file1.jar,oss://<yourBucket>/file2.jar): The JAR packages used by the Spark job. Only JAR packages stored in OSS are supported. Separate multiple packages with commas (,).

      • --archives (example: oss://<yourBucket>/archive.tar.gz#env,oss://<yourBucket>/archive2.zip): The archive packages used by the Spark job. Only archive packages stored in OSS are supported. Separate multiple packages with commas (,).

      • --queue (example: root_queue): The name of the queue on which the Spark job runs. The queue name must be the same as the queue name in the EMR Serverless Spark workspace.

      • --conf (example: spark.tags.key=value): A custom Spark configuration of the Spark job.

      • --status (example: jr-8598aa9f459d****): Queries the status of the Spark job with the specified job run ID.

      • --kill (example: jr-8598aa9f459d****): Terminates the Spark job with the specified job run ID.

    • Other supported parameters

      • --detach (no value): Exits the spark-submit tool immediately after the Spark job is submitted, without waiting for the tool to return the job status. For an example, see the sketch after these lists.

      • --detail (example: jr-8598aa9f459d****): Queries the details of the Spark job with the specified job run ID.

    • Unsupported parameters of the open source spark-submit tool

      • --deploy-mode

      • --master

      • --proxy-user

      • --repositories

      • --keytab

      • --principal

      • --total-executor-cores

      • --driver-library-path

      • --driver-class-path

      • --supervise

      • --verbose
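
    In practice, you can combine --detach and --status: submit the job, let the tool exit immediately, and check the status later. The following is a minimal sketch, assuming the tool reports the job run ID (in the jr-xxxx format) when the job is submitted:

    # Submit the job and return immediately instead of waiting for its final status.
    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --detach \
    --class org.apache.spark.examples.SparkPi \
    oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
    10000

    # Query the status later with the reported job run ID (placeholder value).
    ./bin/spark-submit --status jr-xxxxxxxxxxxx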

Step 4: Query the Spark job

Use the CLI

Query the status of the Spark job

cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --status <jr-8598aa9f459d****>

View the details of the Spark job

cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --detail <jr-8598aa9f459d****>

Use the UI

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Job History.

  2. On the Job History page, click the Development Job Runs tab. Then, you can view all submitted jobs.


(Optional) Step 5: Terminate the Spark job

cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --kill <jr-8598aa9f459d****>
Note

You can terminate only a job that is in the Running state.