Spark applications in AnalyticDB for MySQL Data Lakehouse Edition (V3.0) are described in the JSON format. The configuration script contains all information about applications, including the application name, JAR package path, and configuration parameters. This topic describes how to configure a Spark application.

Precautions

Data Lakehouse Edition (V3.0) is now in canary release. To request a trial of Data Lakehouse Edition (V3.0), submit a ticket.

Batch applications

Spark supports batch, streaming, and SQL applications. To run a batch application, you must provide the JAR package or Python file that contains the entry class. Depending on the requirements of your business, you may also need to specify the dependency JAR packages, the Python sandboxes, and the entry class parameters.

You can configure a Spark batch application in AnalyticDB for MySQL by using parameters similar to the command-line parameters of the spark-submit tool.

Example of batch application configuration

The following example shows a typical Spark batch application that reads data from Object Storage Service (OSS). The configuration script includes parameters such as the application name, the JAR packages that contain the entry class, the entry class and its parameters, and the execution parameters. The configuration script is written in the JSON format. Example:
{
  "args": ["oss://${testBucketName}/data/test/test.csv"],
  "name": "spark-oss-test",
  "file": "oss://${testBucketName}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.adb.connectors": "oss"
  }
}
The following list describes the parameters used for batch application configuration.

args
  Required: No
  Example: "args": ["args0", "args1"]
  Description: The input parameters of the Spark application. Separate multiple parameters with commas (,).

name
  Required: No
  Example: "name": "your_job_name"
  Description: The name of the Spark application.

file
  Required: Yes for applications that are written in Python, Java, or Scala
  Example: "file": "oss://testBucketName/path/to/your/jar"
  Description: The path that stores the main file of the Spark application. The main file can be a JAR package that contains the entry class or an entry execution file of Python.
  Note: The main files of Spark applications must be stored in OSS.

className
  Required: Yes for applications that are written in Java or Scala
  Example: "className": "com.aliyun.spark.oss.SparkReadOss"
  Description: The entry class of the Java or Scala program. The entry class is not required for Python.

sqls
  Required: Yes for SQL applications
  Example: "sqls": ["select * from xxxx", "show databases"]
  Description: The SQL statements that can be used to directly submit SQL batch applications without the need to specify JAR packages or Python files. This parameter cannot be used together with the file, className, or args parameter. You can specify multiple SQL statements for a Spark application. Separate multiple SQL statements with commas (,). They are executed in the specified order. An example is provided after this list.

jars
  Required: No
  Example: "jars": ["oss://testBucketName/path/to/jar", "oss://testBucketName/path/to/jar"]
  Description: The JAR packages that are required for the Spark application. Separate multiple JAR packages with commas (,). When a Spark application runs, the JAR packages are added to the classpaths of the driver and executor Java virtual machines (JVMs).
  Note: All the JAR packages that are required for Spark applications must be stored in OSS.

files
  Required: No
  Example: "files": ["oss://testBucketName/path/to/files", "oss://testBucketName/path/to/files"]
  Description: The files that are required for the Spark application. These files are downloaded to the working directories of the driver and executors. Aliases can be configured for these files. Example: oss://testBucketName/test/test1.txt#test1. In this example, test1 is used as the file alias. You can access the file by specifying ./test1 or ./test1.txt. Separate multiple files with commas (,).
  Note:
  • If the log4j.properties file in the oss://<path/to>/ directory is specified for this parameter, the Spark application uses the log4j.properties file as the log configuration file.
  • All the files that are required for Spark applications must be stored in OSS.

archives
  Required: No
  Example: "archives": ["oss://testBucketName/path/to/archives", "oss://testBucketName/path/to/archives"]
  Description: The compressed packages that are required for the Spark application. The packages must be in the .TAR.GZ format. The packages are decompressed to the working directory of the Spark process. Aliases can be configured for the files contained in a package. Example: oss://testBucketName/test/test1.tar.gz#test1. In this example, test1 is used as the alias. Assume that test2.txt is a file contained in the test1.tar.gz package. You can access the file by specifying ./test1/test2.txt or ./test1.tar.gz/test2.txt. Separate multiple packages with commas (,).
  Note: All the compressed packages that are required for Spark applications must be stored in OSS. If a package fails to be decompressed, the task also fails.

pyFiles
  Required: No (for Python applications)
  Example: "pyFiles": ["oss://testBucketName/path/to/pyfiles", "oss://testBucketName/path/to/pyfiles"]
  Description: The Python files that are required for the PySpark application. The files must be in the ZIP, PY, or EGG format. If multiple Python files are required, we recommend that you use files in the ZIP or EGG format. You can reference Python files in Python code by using modules. Separate multiple packages with commas (,).
  Note: All the Python files that are required for a PySpark application must be stored in OSS.

conf
  Required: Yes
  Example: "conf": {"spark.driver.resourceSpec": "medium", "spark.executor.resourceSpec": "medium", "spark.executor.instances": 2, "spark.adb.connectors": "oss"}
  Description: The configuration parameters that are required for the Spark application, similar to those of Apache Spark. The parameters must be in the key: value format. Separate multiple parameters with commas (,). For more information about configuration parameters that are inconsistent with those of Apache Spark or configuration parameters that are specific to AnalyticDB for MySQL, see the "Description of conf parameters" section of this topic.

SQL applications

AnalyticDB for MySQL allows you to submit Spark SQL applications directly in the console without the need to package statements into a JAR file or write Python code. This helps data developers analyze data by using Spark. When you submit a Spark SQL application, you must set the application type to SQL.

Example of SQL application configuration

-- This is only an example of Spark SQL. Modify the content and run your Spark program.

conf spark.driver.resourceSpec=medium;
conf spark.executor.instances=2;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Test;
conf spark.adb.connectors=oss;

-- Here are your SQL statements
show databases;

Statement types supported by Spark SQL

You can edit SQL statements in the Spark editor. Each individual statement must be separated by a semicolon (;).

Spark SQL supports the following types of statements:
  • CONF statements
    • CONF statements are placed before SQL statements to specify Spark settings.
    • Each CONF statement specifies the value of a parameter for submitting a Spark application. Each individual CONF statement must be separated by a semicolon (;).
    • Do not enclose keys and values in the CONF statements with single quotation marks (') or double quotation marks (").
    • For more information about configuration parameters supported by CONF statements, see the "Description of conf parameters" section of this topic.
  • ADD JAR statements
    • ADD JAR statements are placed before SQL statements to add the JAR packages that are required for executing Spark SQL statements, such as the JAR packages of user-defined functions (UDFs) and JAR packages of various data source connectors. JAR packages must be specified in the OSS path format.
    • Each ADD JAR statement specifies a JAR package in the OSS path format. You do not need to enclose the path with single quotation marks (') or double quotation marks ("). Each individual ADD JAR statement must be separated by a semicolon (;).
  • DDL or DML statements supported by Spark SQL, including SELECT and INSERT.
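The following script is a minimal sketch that combines the preceding statement types: CONF statements, an ADD JAR statement, and SQL statements. The OSS path is a placeholder for a JAR package of your own, such as a UDF package, and the resource values are illustrative only.

-- CONF statements are placed before the SQL statements
conf spark.driver.resourceSpec=small;
conf spark.executor.resourceSpec=small;
conf spark.executor.instances=1;

-- ADD JAR statement; the OSS path is a placeholder
add jar oss://testBucketName/path/to/your/udf.jar;

-- SQL statements
show databases;
select * from xxxx;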

Switch the metadata service

By default, Spark SQL uses the metadata service provided by AnalyticDB for MySQL. You can use one of the following methods to switch to a different metadata service:
  • In-memory catalog
    CONF spark.sql.catalogImplementation = in-memory;
  • Hive metastore 2.3.7 or other versions built into Spark
    CONF spark.sql.catalogImplementation = hive;
    CONF spark.sql.hive.metastore.version = 2.3.7;
    Note To connect to a self-managed Hive metastore, you can replace the default configuration with the standard Apache Spark configuration. For more information about the standard Apache Spark configuration, see Spark Configuration.
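    For example, the following sketch connects to a self-managed Hive metastore by using the standard Apache Spark and Hive properties. The spark.hadoop.hive.metastore.uris value is a placeholder. Depending on your network environment, you may also need the VPC-related parameters, such as spark.adb.eni.vswitchId, that are described in the "Description of conf parameters" section of this topic.
    CONF spark.sql.catalogImplementation = hive;
    CONF spark.sql.hive.metastore.version = 2.3.7;
    CONF spark.hadoop.hive.metastore.uris = thrift://your-metastore-host:9083;
    show databases;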

Description of conf parameters

Spark configuration parameters of AnalyticDB for MySQL are similar to those of Apache Spark. The following sections describe the configuration parameters of AnalyticDB for MySQL that are inconsistent with those of Apache Spark and the configuration parameters that are specific to AnalyticDB for MySQL.

  • Driver and executor resource specifications
    Important The usage of the following parameters is different from that of Apache Spark.
    The spark.driver.resourceSpec and spark.executor.resourceSpec parameters accept the same set of values.
    spark.driver.resourceSpec
      Description: The resource specifications of the Spark driver. Each type corresponds to a distinct specification. For more information, see the Type column in the "Spark resource specifications" section of this topic.
      Example: CONF spark.driver.resourceSpec = c.small;. In this example, the Spark driver occupies resource specifications of 1 core and 2 GB memory.
      Corresponding parameters in Apache Spark: spark.driver.cores and spark.driver.memory
    spark.executor.resourceSpec
      Description: The resource specifications of each Spark executor. Each type corresponds to a distinct specification. For more information, see the Type column in the "Spark resource specifications" section of this topic.
      Example: CONF spark.executor.resourceSpec = c.small;. In this example, each Spark executor occupies resource specifications of 1 core and 2 GB memory.
      Corresponding parameters in Apache Spark: spark.executor.cores and spark.executor.memory
  • Spark UI
    spark.app.log.rootPath
      Default value: None
      Description: The directory in which the Spark UI event logs and the logs generated by AnalyticDB for MySQL Spark applications are stored. You must manually specify an OSS path for this parameter. Otherwise, you cannot access the Spark UI or view application logs.
  • RAM user O&M
    spark.adb.roleArn
      Default value: None
      Description: The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that is granted permissions to submit Spark applications in the RAM console. For more information, see RAM role overview. This parameter is required only when you want to submit Spark applications as a RAM user.
  • Built-in data source connectors in Spark of AnalyticDB for MySQL
    spark.adb.connectors
      Default value: None
      Description: The names of the built-in connectors in Spark of AnalyticDB for MySQL. Separate multiple connector names with commas (,). Valid values: oss, hbase1.x, and tablestore.
    spark.hadoop.io.compression.codec.snappy.native
      Default value: false
      Description: Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes Snappy files edited in Hadoop. If you set this parameter to true, the standard Snappy library is used for decompression. Otherwise, the default Snappy library of Hadoop is used for decompression.
  • VPC access and data source connection
    spark.adb.eni.vswitchId
      Default value: None
      Description: The ID of the vSwitch that is associated with an elastic network interface (ENI). This ID is used to access a virtual private cloud (VPC). If your Elastic Compute Service (ECS) instance can access a destination data source, you can set this parameter to the ID of the vSwitch to which your ECS instance is connected.
    spark.adb.eni.securityGroupId
      Default value: None
      Description: The ID of the security group that is associated with an ENI. This ID is used to access a VPC. If your ECS instance can access a destination data source, you can set this parameter to the ID of the security group to which your ECS instance belongs.
    spark.adb.eni.extraHosts
      Default value: None
      Description: The mappings between IP addresses and hostnames. This parameter allows Spark to resolve the hostnames of data sources. This parameter is required if you want to connect to a self-managed Hive data source.
      Note: Separate IP addresses and hostnames with spaces. Separate multiple groups of IP addresses and hostnames with commas (,). Example: "ip0 master0, ip1 master1".
  • Connection from Spark SQL to the metadata of AnalyticDB for MySQL
    spark.sql.hive.metastore.version
      Default value: None
      Description: The version of the Hive metastore. If you set this parameter to ADB, you can access the metadata of AnalyticDB for MySQL and read table data from AnalyticDB for MySQL.
  • Application retries
    spark.adb.maxAttempts
      Default value: 1
      Description: The maximum number of retries. The default value 1 indicates that the Spark application is not retried if it fails. If you set this parameter to 3, the Spark application is retried up to three times within a sliding window.
    spark.adb.attemptFailuresValidityInterval
      Default value: Long.MAX
      Description: The duration of the sliding window. Unit: milliseconds. For example, if you set this parameter to 6000, the system counts the number of retries within the last 6,000 milliseconds after a retry fails. If the number of retries is less than the value of spark.adb.maxAttempts, the system continues to retry the application.

  • Resource configuration
    spark.adb.driver.cpu-vcores-ratio
      Default value: 1
      Description: The ratio of vCPUs to actual CPU cores used by the driver. For example, if the driver uses the medium resource specifications (2 cores and 8 GB memory) and you set this parameter to 2, the driver can run 4 vCPUs in parallel. You can also set spark.driver.cores to 4 to achieve the same performance.
    spark.adb.executor.cpu-vcores-ratio
      Default value: 1
      Description: The ratio of vCPUs to actual CPU cores used by an executor. If the CPU utilization of a task is low, you can configure this parameter to increase CPU utilization. For example, if an executor uses the medium resource specifications (2 cores and 8 GB memory) and you set this parameter to 2, the executor can run 4 vCPUs in parallel, which means that 4 tasks are scheduled in parallel. You can also set spark.executor.cores to 4 to achieve the same performance.
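The following snippet is a sketch of a conf block that combines several of the preceding AnalyticDB for MySQL-specific parameters in a batch application. The OSS path, RAM role ARN, vSwitch ID, and security group ID are placeholders, and the values are illustrative only.
"conf": {
  "spark.driver.resourceSpec": "medium",
  "spark.executor.resourceSpec": "medium",
  "spark.executor.instances": 2,
  "spark.app.log.rootPath": "oss://testBucketName/spark-logs/",
  "spark.adb.roleArn": "acs:ram::<yourAccountId>:role/<yourRoleName>",
  "spark.adb.eni.vswitchId": "vsw-xxxxxxxx",
  "spark.adb.eni.securityGroupId": "sg-xxxxxxxx",
  "spark.adb.maxAttempts": 3,
  "spark.adb.attemptFailuresValidityInterval": 6000,
  "spark.adb.executor.cpu-vcores-ratio": 2
}
With these settings, a failed application is retried up to three times within a 6,000-millisecond sliding window, and each medium executor schedules 4 tasks in parallel.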

Spark resource specifications

Type         CPU cores    Memory (GB)
c.small      1            2
small        1            4
m.small      1            8
c.medium     2            4
medium       2            8
m.medium     2            16
c.large      4            8
large        4            16
m.large      4            32
c.xlarge     8            16
xlarge       8            32
m.xlarge     8            64
c.2xlarge    16           32
2xlarge      16           64
m.2xlarge    16           128
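For example, the following sketch (values are illustrative only) requests an xlarge driver and two large executors based on the preceding specifications.
"conf": {
  "spark.driver.resourceSpec": "xlarge",
  "spark.executor.resourceSpec": "large",
  "spark.executor.instances": 2
}
Based on the table, this application uses an 8-core, 32 GB driver and two 4-core, 16 GB executors, which amounts to 16 cores and 64 GB of memory in total.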