The configuration parameters of AnalyticDB for MySQL Spark are similar to those of Apache Spark. This topic describes the configuration parameters of AnalyticDB for MySQL Spark that are different from those of Apache Spark.
Usage notes
Spark application configuration parameters are used to configure and adjust the behavior and performance of Spark applications. The format of these parameters varies based on the development tool that you use, as described in the following table.
Development tool | Configuration parameter format | Example |
SQL editor | set key=value; | set spark.sql.hive.metastore.version=adb; |
Spark JAR editor | "key": "value" | "spark.sql.hive.metastore.version": "adb" |
Notebook editor | "key": "value" | "spark.sql.hive.metastore.version": "adb" |
spark-submit command-line tool | key=value | spark.sql.hive.metastore.version=adb |
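For example, when you use the Spark JAR editor, configuration parameters are specified as "key": "value" pairs in the job description. The following snippet is a minimal sketch that assumes the job description uses the file, className, and conf fields; the OSS path, class name, and resource specifications are placeholders that you must replace with your own values:

{
    "file": "oss://<bucket_name>/path/to/spark-job.jar",
    "className": "com.example.SparkApp",
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.resourceSpec": "medium",
        "spark.executor.instances": "2"
    }
}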
Specify the Spark driver and executor resources
Parameter | Required | Description | Corresponding parameter in Apache Spark |
spark.driver.resourceSpec | Yes | The resource specifications of the Spark driver. Each type corresponds to distinct specifications. For more information, see the Type column in the "Spark resource specifications" table of this topic. Important: When you submit a Spark application, you can also use the corresponding Apache Spark parameters, provided that the values match the cores and memory of a type in the "Spark resource specifications" table of this topic. For example, spark.driver.resourceSpec=medium specifies a Spark driver that has 2 CPU cores and 8 GB of memory. | spark.driver.cores and spark.driver.memory |
spark.executor.resourceSpec | Yes | The resource specifications of each Spark executor. Each type corresponds to distinct specifications. For more information, see the Type column in the "Spark resource specifications" table of this topic. Important: When you submit a Spark application, you can also use the corresponding Apache Spark parameters, provided that the values match the cores and memory of a type in the "Spark resource specifications" table of this topic. For example, spark.executor.resourceSpec=medium specifies Spark executors that each have 2 CPU cores and 8 GB of memory. | spark.executor.cores and spark.executor.memory |
spark.adb.driverDiskSize | No | The size of additional disk storage that is mounted to the Spark driver to meet large disk storage requirements. By default, the additional disk storage is mounted to the /user_data_dir directory. Unit: GiB. Valid values: (0,100]. Example: spark.adb.driverDiskSize=50Gi. In this example, 50 GiB of additional disk storage is mounted to the Spark driver. | N/A |
spark.adb.executorDiskSize | No | The size of additional disk storage that is mounted to a Spark executor to meet the requirements of shuffle operations. By default, the additional disk storage is mounted to the /shuffle_volume directory. Unit: GiB. Valid values: (0,100]. Example: spark.adb.executorDiskSize=50Gi. In this example, 50 GiB of additional disk storage is mounted to each Spark executor. | N/A |
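For example, the following settings request a medium Spark driver (2 CPU cores and 8 GB of memory), medium Spark executors, and 50 GiB of additional disk storage for each executor. This is only a sketch; choose the resource types that fit your workload:

spark.driver.resourceSpec=medium;
spark.executor.resourceSpec=medium;
spark.adb.executorDiskSize=50Gi;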
Spark resource specifications
You can use reserved resources or elastic resources to execute Spark jobs. If you use the on-demand elastic resources of a job resource group to execute Spark jobs, the system calculates the number of used AnalyticDB compute units (ACUs) based on the Spark resource specifications and the CPU-to-memory ratio by using the following formulas:
1:2 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 0.8.
1:4 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 1.
1:8 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 1.5.
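For example, a medium specification (2 CPU cores and 8 GB of memory, which is a 1:4 CPU-to-memory ratio) uses 2 × 1 = 2 ACUs, and an m.medium specification (2 CPU cores and 16 GB of memory, which is a 1:8 ratio) uses 2 × 1.5 = 3 ACUs. These values match the Used ACUs column in the following table.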
For information about the prices of on-demand elastic resources, see Pricing for Data Lakehouse Edition.
Spark resource specifications
Type | CPU cores | Memory (GB) | Disk storage¹ (GB) | Used ACUs |
c.small | 1 | 2 | 20 | 0.8 |
small | 1 | 4 | 20 | 1 |
m.small | 1 | 8 | 20 | 1.5 |
c.medium | 2 | 4 | 20 | 1.6 |
medium | 2 | 8 | 20 | 2 |
m.medium | 2 | 16 | 20 | 3 |
c.large | 4 | 8 | 20 | 3.2 |
large | 4 | 16 | 20 | 4 |
m.large | 4 | 32 | 20 | 6 |
c.xlarge | 8 | 16 | 20 | 6.4 |
xlarge | 8 | 32 | 20 | 8 |
m.xlarge | 8 | 64 | 20 | 12 |
c.2xlarge | 16 | 32 | 20 | 12.8 |
2xlarge | 16 | 64 | 20 | 16 |
m.2xlarge | 16 | 128 | 20 | 24 |
m.4xlarge | 32 | 256 | 20 | 48 |
m.8xlarge | 64 | 512 | 20 | 96 |
¹ Disk storage: The system is expected to occupy approximately 1% of the disk storage. As a result, the actual available disk storage may be less than 20 GB.
Specify priorities for Spark jobs
Parameter | Required | Default value | Description |
spark.adb.priority | No | NORMAL | The priority of a Spark job. If resources are insufficient to execute all submitted Spark jobs, queued jobs that have higher priorities are executed first. Valid values include HIGH and NORMAL. Important: We recommend that you set this parameter to HIGH for all streaming Spark jobs. |
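For example, to ensure that a streaming Spark job is scheduled ahead of queued jobs that have the default priority, configure the following setting:

spark.adb.priority=HIGH;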
Access the metadata
Parameter | Required | Default value | Description |
spark.sql.catalogImplementation | No | | The type of the metadata to be accessed. Valid values: hive, which accesses the metadata in the built-in Hive Metastore of Apache Spark, and in-memory, which accesses the metadata in the temporary directory. |
spark.sql.hive.metastore.version | No | | The version of the metastore service. Valid values: adb, which accesses the metadata in AnalyticDB for MySQL, or a Hive Metastore version number such as 2.1.3. |
Examples
Configure the following setting to access the metadata in AnalyticDB for MySQL:
spark.sql.hive.metastore.version=adb;
Configure the following settings to access the metadata in the built-in Hive Metastore of Apache Spark:
spark.sql.catalogImplementation=hive;
spark.sql.hive.metastore.version=2.1.3;
Configure the following setting to access the metadata in the temporary directory:
spark.sql.catalogImplementation=in-memory;
Configure the Spark UI
Parameter | Required | Default value | Description |
spark.app.log.rootPath | No | | The directory in which AnalyticDB for MySQL Spark job logs and the output of the Linux OS are stored. A folder that is named after the Spark application ID is created in this directory for each Spark application. |
spark.adb.event.logUploadDuration | No | false | Specifies whether to record the duration of an event log upload. |
spark.adb.buffer.maxNumEvents | No | 1000 | The maximum number of events that are cached by the driver. |
spark.adb.payload.maxNumEvents | No | 10000 | The maximum number of events that can be uploaded to Object Storage Service (OSS) at a time. |
spark.adb.event.pollingIntervalSecs | No | 0.5 | The interval between two uploads of events to OSS. Unit: seconds. For example, a value of 0.5 indicates that events are uploaded every 0.5 seconds. |
spark.adb.event.maxPollingIntervalSecs | No | 60 | The maximum retry interval when an event upload to OSS fails. Unit: seconds. The interval between a failed upload and a retry must be within the range from the value of the spark.adb.event.pollingIntervalSecs parameter to the value of this parameter. |
spark.adb.event.maxWaitOnEndSecs | No | 10 | The maximum wait time for uploading events to OSS. Unit: seconds. The maximum wait time is the interval between the start and completion of an upload. If an upload is not complete within the maximum wait time, the upload is retried. |
spark.adb.event.waitForPendingPayloadsSleepIntervalSecs | No | 1 | The wait time before the system retries an upload that is not complete within the period that is specified by the spark.adb.event.maxWaitOnEndSecs parameter. Unit: seconds. |
spark.adb.eventLog.rolling.maxFileSize | No | 209715200 | The maximum file size of event logs in OSS. Unit: bytes. Event logs are stored in OSS in the form of multiple files, such as Eventlog.0 and Eventlog.1. You can specify the file size. |
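For example, to store job logs in a custom OSS directory, configure the following setting. The OSS path is a placeholder that you must replace with a directory that your job is authorized to access:

spark.app.log.rootPath=oss://<bucket_name>/<log_directory>/;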
Grant permissions to RAM users
Parameter | Required | Default value | Description |
spark.adb.roleArn | No | N/A | The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that grants a RAM user the permissions to submit Spark applications. For more information, see RAM role overview. If you submit Spark applications as a RAM user, you must specify this parameter. If you submit Spark applications with an Alibaba Cloud account, you do not need to specify this parameter. Note: If you have already granted the required permissions to the RAM user in the RAM console, you do not need to specify this parameter. For more information, see Perform authorization. |
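For example, if a RAM user submits Spark applications by assuming a RAM role, the setting is similar to the following one. The account ID and role name are placeholders that you must replace with your own values:

spark.adb.roleArn=acs:ram::<account_id>:role/<role_name>;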
Enable the built-in data source connectors
Parameter | Required | Default value | Description |
spark.adb.connectors | No | N/A | The names of the built-in connectors of AnalyticDB for MySQL Spark that you want to enable. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, external_hive, and jindo. |
spark.hadoop.io.compression.codec.snappy.native | No | false | Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes Snappy files that are written by Hadoop. If you set this parameter to true, the standard Snappy library is used for decompression. If you set this parameter to false, the default Snappy library of Hadoop is used for decompression. |
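For example, to read data from OSS and Hudi tables in the same Spark job, enable both connectors:

spark.adb.connectors=oss,hudi;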
Enable VPC access and data source access
Parameter | Required | Default value | Description |
spark.adb.eni.enabled | No | false | Specifies whether to enable Elastic Network Interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false (default). |
spark.adb.eni.vswitchId | No | N/A | The ID of the vSwitch that is associated with an ENI. If you connect to AnalyticDB for MySQL from an Elastic Compute Service (ECS) instance over a virtual private cloud (VPC), you must specify a vSwitch ID for the VPC. Note If you have enabled VPC access, you must set the spark.adb.eni.enabled parameter to true. |
spark.adb.eni.securityGroupId | No | N/A | The ID of the security group that is associated with an ENI. If you connect to AnalyticDB for MySQL from an ECS instance over a VPC, you must specify a security group ID. Note If you have enabled VPC access, you must set the spark.adb.eni.enabled parameter to true. |
spark.adb.eni.extraHosts | No | N/A | The mappings between IP addresses and hostnames. This parameter allows Spark to resolve the hostnames of data sources. If you want to access a self-managed Hive data source, you must specify this parameter. |
spark.adb.eni.adbHostAlias.enabled | No | false | Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values: true and false (default). If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true. |
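For example, to access a data source in a VPC over an ENI, configure settings similar to the following ones. The vSwitch ID and security group ID are placeholders that you must replace with your own resource IDs:

spark.adb.eni.enabled=true;
spark.adb.eni.vswitchId=vsw-bp1********;
spark.adb.eni.securityGroupId=sg-bp1********;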
Configure application retries
Parameter | Required | Default value | Description |
spark.adb.maxAttempts | No | 1 | The maximum number of attempts that are allowed to run an application. The default value is 1, which specifies that no retry attempts are allowed. If you set this parameter to 3 for a Spark application, the system attempts to run the application up to three times within a sliding window. |
spark.adb.attemptFailuresValidityInterval | No | Integer.MAX | The duration of the sliding window within which the system attempts to rerun an application. Unit: seconds. For example, if you set this parameter to 6000 for a Spark application, the system counts the number of attempts within the last 6,000 seconds after a failed run. If the number of attempts is less than the value of the spark.adb.maxAttempts parameter, the system retries to run the application. |
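For example, the following settings allow the system to attempt to run a job up to three times within a 6,000-second sliding window:

spark.adb.maxAttempts=3;
spark.adb.attemptFailuresValidityInterval=6000;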
Specify a runtime environment for Spark jobs
The following table describes the configuration parameter that is required when you use virtual environments to package a Python environment and submit Spark jobs.
Parameter | Required | Default value | Description |
spark.pyspark.python | No | N/A | The path of the Python interpreter on your on-premises device. |
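For example, if you package a Python virtual environment and want Spark to use the interpreter from that environment, the setting is similar to the following one. The path is a hypothetical example; replace it with the actual path of your Python interpreter:

spark.pyspark.python=/path/to/venv/bin/python3;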
Specify the Spark version
Parameter | Required | Default value | Description |
spark.adb.version | No | 3.2 | The Spark version that is used to run the Spark job. |
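For example, to explicitly run a job on the default Spark version, configure the following setting:

spark.adb.version=3.2;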
Configuration parameters not supported by AnalyticDB for MySQL
AnalyticDB for MySQL Spark does not support the following configuration parameters of Apache Spark. These configuration parameters do not take effect on AnalyticDB for MySQL Spark.
Useless options (these options will be ignored):
--deploy-mode
--master
--packages, please use `--jars` instead
--exclude-packages
--proxy-user
--repositories
--keytab
--principal
--queue
--total-executor-cores
--driver-library-path
--driver-class-path
--supervise
-S,--silent
-i <filename>