AnalyticDB: Spark application configuration parameters

Last Updated: Sep 27, 2024

The configuration parameters of AnalyticDB for MySQL Spark are similar to those of Apache Spark. This topic describes the configuration parameters of AnalyticDB for MySQL Spark that are different from those of Apache Spark.

Usage notes

Spark application configuration parameters adjust the behavior and performance of Spark applications. The parameter format varies based on the development tool that you use, as described in the following table.

| Development tool | Configuration parameter format | Example |
| --- | --- | --- |
| SQL editor | set key=value; | set spark.sql.hive.metastore.version=adb; |
| Spark JAR editor | "key": "value" | "spark.sql.hive.metastore.version": "adb" |
| Notebook editor | "key": "value" | "spark.sql.hive.metastore.version": "adb" |
| spark-submit command-line tool | key=value | spark.sql.hive.metastore.version=adb |
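
For example, several of the parameters described in this topic might appear together as follows in the Spark JAR editor or notebook editor format. The values are illustrative assumptions, not recommendations:

    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.adb.connectors": "oss"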

Specify the Spark driver and executor resources

| Parameter | Required | Description | Corresponding parameter in Apache Spark |
| --- | --- | --- | --- |
| spark.driver.resourceSpec | Yes | The resource specifications of the Spark driver. Each type corresponds to distinct specifications. For more information, see the Type column in the "Spark resource specifications" table of this topic. Example: CONF spark.driver.resourceSpec=c.small;. In this example, the Spark driver provides 1 core and 2 GB of memory. | spark.driver.cores and spark.driver.memory |
| spark.executor.resourceSpec | Yes | The resource specifications of each Spark executor. Each type corresponds to distinct specifications. For more information, see the Type column in the "Spark resource specifications" table of this topic. Example: CONF spark.executor.resourceSpec=c.small;. In this example, each Spark executor provides 1 core and 2 GB of memory. | spark.executor.cores and spark.executor.memory |
| spark.adb.driverDiskSize | No | The size of the additional disk storage that is mounted on the Spark driver to meet large disk storage requirements. By default, the additional disk storage is mounted on the /user_data_dir directory. Unit: GiB. Valid values: (0,100]. Example: spark.adb.driverDiskSize=50Gi. In this example, 50 GiB of additional disk storage is mounted on the Spark driver. | N/A |
| spark.adb.executorDiskSize | No | The size of the additional disk storage that is mounted on each Spark executor to meet the requirements of shuffle operations. By default, the additional disk storage is mounted on the /shuffle_volume directory. Unit: GiB. Valid values: (0,100]. Example: spark.adb.executorDiskSize=50Gi. In this example, 50 GiB of additional disk storage is mounted on each Spark executor. | N/A |

Important

When you submit Spark applications, you can also use the corresponding Apache Spark parameters and configure them based on the cores and memory values that are described in the "Spark resource specifications" table of this topic.
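
For example, the following set statements (SQL editor format) request a medium driver, large executors, and 50 GiB of additional executor disk storage. The values are illustrative only:

    set spark.driver.resourceSpec=medium;
    set spark.executor.resourceSpec=large;
    set spark.adb.executorDiskSize=50Gi;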

Spark resource specifications

Important

You can use reserved resources or elastic resources to execute Spark jobs. If you use the on-demand elastic resources of a job resource group to execute Spark jobs, the system calculates the number of used AnalyticDB compute units (ACUs) based on the Spark resource specifications and the CPU-to-memory ratio by using the following formulas:

  • 1:2 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 0.8.

  • 1:4 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 1.

  • 1:8 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 1.5.

For information about the prices of on-demand elastic resources, see Pricing for Data Lakehouse Edition.
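
For example, an illustrative job that uses one medium driver (2 cores, 1:4 CPU-to-memory ratio) and four large executors (4 cores each, 1:4 CPU-to-memory ratio) uses 2 × 1 + 4 × 4 × 1 = 18 ACUs, which matches the Used ACUs column of the following table.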

| Type | CPU cores | Memory (GB) | Disk storage¹ (GB) | Used ACUs |
| --- | --- | --- | --- | --- |
| c.small | 1 | 2 | 20 | 0.8 |
| small | 1 | 4 | 20 | 1 |
| m.small | 1 | 8 | 20 | 1.5 |
| c.medium | 2 | 4 | 20 | 1.6 |
| medium | 2 | 8 | 20 | 2 |
| m.medium | 2 | 16 | 20 | 3 |
| c.large | 4 | 8 | 20 | 3.2 |
| large | 4 | 16 | 20 | 4 |
| m.large | 4 | 32 | 20 | 6 |
| c.xlarge | 8 | 16 | 20 | 6.4 |
| xlarge | 8 | 32 | 20 | 8 |
| m.xlarge | 8 | 64 | 20 | 12 |
| c.2xlarge | 16 | 32 | 20 | 12.8 |
| 2xlarge | 16 | 64 | 20 | 16 |
| m.2xlarge | 16 | 128 | 20 | 24 |
| m.4xlarge | 32 | 256 | 20 | 48 |
| m.8xlarge | 64 | 512 | 20 | 96 |

Note

¹ Disk storage: The system is expected to occupy approximately 1% of the disk storage. The actual available disk storage may be less than 20 GB.

Specify priorities for Spark jobs

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.adb.priority | No | NORMAL | The priority of a Spark job. If resources are insufficient to execute all submitted Spark jobs, queued jobs that have higher priorities are executed first. Valid values: HIGH, NORMAL, LOW, and LOWEST. |

Important

We recommend that you set this parameter to HIGH for all streaming Spark jobs.
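
For example, the following set statement (SQL editor format) raises the priority of a streaming job:

    set spark.adb.priority=HIGH;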

Access the metadata

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.sql.catalogImplementation | No | Spark SQL jobs: hive. Non-Spark SQL jobs: in-memory. | The type of the metadata to be accessed. Valid values: hive (the metadata in the built-in Hive Metastore of Apache Spark) and in-memory (the metadata in the temporary directory). |
| spark.sql.hive.metastore.version | No | Spark SQL jobs: adb. Non-Spark SQL jobs: <hive_version>. | The version of the metastore service. Valid values: adb (the version of AnalyticDB for MySQL) and <hive_version> (the version of the Hive Metastore). |

Note

  • For information about the Hive versions that are supported by Apache Spark, see Spark Configuration.

  • To access a self-managed Hive Metastore, you can replace the default configuration with the standard Apache Spark configuration. For more information, see Spark Configuration.

Examples

  • Configure the following setting to access the metadata in AnalyticDB for MySQL:

    spark.sql.hive.metastore.version=adb;
  • Configure the following settings to access the metadata in the built-in Hive Metastore of Apache Spark:

    spark.sql.catalogImplementation=hive;
    spark.sql.hive.metastore.version=2.1.3;
  • Configure the following setting to access the metadata in the temporary directory:

    spark.sql.catalogImplementation=in-memory;

Configure the Spark UI

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.app.log.rootPath | No | oss://<aliyun-oa-adb-spark-Alibaba Cloud account ID-oss-Zone ID>/<Cluster ID>/<Spark application ID> | The directory in which AnalyticDB for MySQL Spark job logs and the output data of the Linux OS are stored. By default, the folder that is named after the Spark application ID contains the following content: a file named <Spark application ID>-000X that stores the Spark event logs used for Spark UI rendering, a driver folder and numbered folders that store the logs of the corresponding nodes, and stdout and stderr folders that store the output data of the Linux OS. |
| spark.adb.event.logUploadDuration | No | false | Specifies whether to record the duration of event log uploads. |
| spark.adb.buffer.maxNumEvents | No | 1000 | The maximum number of events that are cached by the driver. |
| spark.adb.payload.maxNumEvents | No | 10000 | The maximum number of events that can be uploaded to Object Storage Service (OSS) at a time. |
| spark.adb.event.pollingIntervalSecs | No | 0.5 | The interval between two uploads of events to OSS. Unit: seconds. For example, a value of 0.5 indicates that events are uploaded every 0.5 seconds. |
| spark.adb.event.maxPollingIntervalSecs | No | 60 | The maximum retry interval after an event upload to OSS fails. Unit: seconds. The interval between a failed upload and the retry must be within the range from the spark.adb.event.pollingIntervalSecs value to the spark.adb.event.maxPollingIntervalSecs value. |
| spark.adb.event.maxWaitOnEndSecs | No | 10 | The maximum wait time for an event upload to OSS. Unit: seconds. The maximum wait time is the interval between the start and completion of an upload. If an upload is not complete within the maximum wait time, the upload is retried. |
| spark.adb.event.waitForPendingPayloadsSleepIntervalSecs | No | 1 | The wait time before the system retries an upload that is not complete within the spark.adb.event.maxWaitOnEndSecs value. Unit: seconds. |
| spark.adb.eventLog.rolling.maxFileSize | No | 209715200 | The maximum size of each event log file in OSS. Unit: bytes. Event logs are stored in OSS as multiple files, such as Eventlog.0 and Eventlog.1. You can specify the file size. |
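
The following sketch shows how the event log upload behavior might be tuned in the Spark JAR editor format. The values are illustrative assumptions, not recommendations:

    "spark.adb.eventLog.rolling.maxFileSize": "104857600",
    "spark.adb.event.pollingIntervalSecs": "1",
    "spark.adb.event.maxPollingIntervalSecs": "120"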

Grant permissions to RAM users

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.adb.roleArn | No | N/A | The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that you attach to a RAM user in the RAM console to grant the RAM user the permissions to submit Spark applications. For more information, see RAM role overview. If you submit Spark applications as a RAM user, you must specify this parameter. If you submit Spark applications by using an Alibaba Cloud account, you do not need to specify this parameter. |

Note

If you have granted permissions to the RAM user in the RAM console, you do not need to specify this parameter. For more information, see Perform authorization.
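
For example, a RAM user might submit a Spark application with a conf entry similar to the following one (Spark JAR editor format). The ARN is a placeholder that you replace with the ARN of your own RAM role:

    "spark.adb.roleArn": "acs:ram::<Alibaba Cloud account ID>:role/<RAM role name>"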

Enable the built-in data source connectors

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.adb.connectors | No | N/A | The names of the built-in connectors of AnalyticDB for MySQL Spark that you want to enable. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, external_hive, and jindo. |
| spark.hadoop.io.compression.codec.snappy.native | No | false | Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes the Snappy files that are edited in Hadoop. If you set this parameter to true, the standard Snappy library is used for decompression. If you set this parameter to false, the default Snappy library of Hadoop is used for decompression. |
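
For example, the following set statement (SQL editor format) enables the OSS and Hudi connectors for a job:

    set spark.adb.connectors=oss,hudi;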

Enable VPC access and data source access

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.adb.eni.enabled | No | false | Specifies whether to enable an elastic network interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. |
| spark.adb.eni.vswitchId | No | N/A | The ID of the vSwitch that is associated with the ENI. If you connect to AnalyticDB for MySQL from an Elastic Compute Service (ECS) instance over a virtual private cloud (VPC), you must specify a vSwitch ID for the VPC. Note: If you enable VPC access, you must set the spark.adb.eni.enabled parameter to true. |
| spark.adb.eni.securityGroupId | No | N/A | The ID of the security group that is associated with the ENI. If you connect to AnalyticDB for MySQL from an ECS instance over a VPC, you must specify a security group ID. Note: If you enable VPC access, you must set the spark.adb.eni.enabled parameter to true. |
| spark.adb.eni.extraHosts | No | N/A | The mappings between IP addresses and hostnames. This parameter allows Spark to resolve the hostnames of data sources. If you want to access a self-managed Hive data source, you must specify this parameter. Separate an IP address and a hostname with a space. Separate multiple groups of IP addresses and hostnames with commas (,). Example: ip0 master0,ip1 master1. Note: If you enable data source access, you must set the spark.adb.eni.enabled parameter to true. |
| spark.adb.eni.adbHostAlias.enabled | No | false | Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to the mapping table of domain names and IP addresses. Valid values: true and false. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true. |
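
The following sketch shows an ENI configuration in the Spark JAR editor format. The vSwitch ID, security group ID, and host mappings are placeholders for illustration only:

    "spark.adb.eni.enabled": "true",
    "spark.adb.eni.vswitchId": "vsw-bp1********",
    "spark.adb.eni.securityGroupId": "sg-bp1********",
    "spark.adb.eni.extraHosts": "192.168.XX.XX master0,192.168.XX.XX master1"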

Configure application retries

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.adb.maxAttempts | No | 1 | The maximum number of attempts that are allowed for an application. The default value is 1, which specifies that no retry attempts are allowed. If you set this parameter to 3 for a Spark application, the system attempts to run the application up to three times within a sliding window. |
| spark.adb.attemptFailuresValidityInterval | No | Integer.MAX | The duration of the sliding window within which the system attempts to rerun an application. Unit: seconds. For example, if you set this parameter to 6000 for a Spark application, the system counts the number of attempts within the last 6,000 seconds after a failed run. If the number of attempts is less than the value of the spark.adb.maxAttempts parameter, the system reruns the application. |
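
For example, the following settings (spark-submit format) allow up to three attempts within a 6,000-second sliding window:

    spark.adb.maxAttempts=3
    spark.adb.attemptFailuresValidityInterval=6000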

Specify a runtime environment for Spark jobs

The following table describes the configuration parameters that are used when you package a Python environment by using the Virtual Environments technology and submit Spark jobs.

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.pyspark.python | No | N/A | The path of the Python interpreter on your on-premises device. |
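
For example, the interpreter path might be specified as follows (spark-submit format). The path is a hypothetical placeholder that you replace with the actual path of your packaged Python interpreter:

    spark.pyspark.python=/home/admin/venv/bin/python3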

Specify the Spark version

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| spark.adb.version | No | 3.2 | The Spark version. Valid values: 2.4, 3.2, 3.3, and 3.5. |
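
For example, the following set statement (SQL editor format) switches a job to Spark 3.5:

    set spark.adb.version=3.5;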

Configuration parameters not supported by AnalyticDB for MySQL

AnalyticDB for MySQL Spark does not support the following Apache Spark configuration parameters and spark-submit options. They do not take effect on AnalyticDB for MySQL Spark.

Useless options(these options will be ignored):
  --deploy-mode
  --master
  --packages, please use `--jars` instead
  --exclude-packages
  --proxy-user
  --repositories
  --keytab
  --principal
  --queue
  --total-executor-cores
  --driver-library-path
  --driver-class-path
  --supervise
  -S,--silent
  -i <filename>