This topic provides answers to some frequently asked questions about Spark on MaxCompute.
The FAQ in this topic are grouped into two categories: development based on Spark and job errors.
How do I perform a self-check on my project?
We recommend that you check the following items:
How do I run an ODPS Spark node in DataWorks?
Modify and package the Spark code in an on-premises Python environment. Make sure that the version of the on-premises Python environment is Python 2.7.
Upload the resource package to DataWorks. For more information, see Create and use MaxCompute resources.
Create an ODPS Spark node in DataWorks. For more information, see Create an ODPS Spark node.
Write code and run the node. Then, view the execution result in the DataWorks console.
How do I debug Spark on MaxCompute in an on-premises environment?
Use IntelliJ IDEA to debug Spark on MaxCompute in an on-premises environment. For more information, see Set up a Linux development environment.
How do I reference a JAR file as a resource?
Use the spark.hadoop.odps.cupid.resources parameter to specify the resources that you want to reference. Resources can be shared by multiple projects. We recommend that you configure the relevant permissions to ensure data security. The following configuration shows an example:
spark.hadoop.odps.cupid.resources = projectname.xx0.jar,projectname.xx1.jar
How do I pass parameters by using Spark on MaxCompute?
For more information about how to pass parameters by using Spark on MaxCompute, see Spark on DataWorks.
How do I write the DataHub data that is read by Spark in streaming mode to MaxCompute?
For the sample code, visit DataHub on GitHub.
How do I migrate open source Spark code to Spark on MaxCompute?
Select one of the following migration solutions based on your job scenarios:
How do I use Spark on MaxCompute to process data in a MaxCompute table?
Use Spark on MaxCompute to process data in a MaxCompute table in local, cluster, or DataWorks mode. For configuration differences among the three modes, see Running modes.
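For reference, hedged submission sketches for the local and cluster modes are shown below (the script, class, and JAR names are placeholders; in DataWorks mode, you configure an ODPS Spark node instead of running spark-submit):
## Local mode (placeholder script name)
./bin/spark-submit --master local your_script.py
## Cluster mode (placeholder class and JAR names)
./bin/spark-submit --master yarn-cluster --class com.example.SparkJob your_job.jar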
How do I configure the resource parallelism for Spark on MaxCompute?
The resource parallelism of Spark on MaxCompute is determined by the number of executors and the number of CPU cores on each executor. The maximum number of tasks that can run in parallel is calculated by using the following formula: Number of executors × Number of CPU cores on each executor.
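For example, a hedged configuration sketch (the values are illustrative):
spark.executor.instances = 10
spark.executor.cores = 2
## Up to 10 × 2 = 20 tasks can run in parallel.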
How do I resolve OOM issues?
Common errors:
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: GC overhead limit exceeded
Cannot allocate memory
The job has been killed by "OOM Killer", please check your job's memory usage
Solutions:
Configure the memory size for each executor.
Parameter: spark.executor.memory.
Parameter description: This parameter specifies the memory size of each executor. A ratio of 1:4 between spark.executor.cores and spark.executor.memory is recommended. For example, if spark.executor.cores is set to 1, you can set spark.executor.memory to 4 GB. If the error message java.lang.OutOfMemoryError is reported for an executor, you need to increase the parameter value.
Configure the off-heap memory for each executor.
Parameter: spark.executor.memoryOverhead.
Parameter description: This parameter specifies the additional memory size of each executor. The additional memory is mainly used for the overheads of Java virtual machines (JVMs), strings, and NIO buffers. The default memory size is calculated by using the following formula: spark.executor.memory × 0.1. The minimum size is 384 MB. In most cases, you do not need to change the default value. If the error message Cannot allocate memory or OOM Killer is logged for an executor, you need to increase the parameter value.
Configure the driver memory.
Parameter: spark.driver.memory.
Parameter description: This parameter specifies the memory size of the driver. A ratio of 1:4 between spark.driver.cores and spark.driver.memory is recommended. If the driver needs to collect a large amount of data or the error message java.lang.OutOfMemoryError is reported, you need to increase the parameter value.
Configure the off-heap memory for the driver.
Parameter: spark.driver.memoryOverhead.
Parameter description: This parameter specifies the additional memory size of the driver. The default size is calculated by using the following formula: spark.driver.memory × 0.1. The minimum size is 384 MB. If the error message Cannot allocate memory is logged for the driver, you need to increase the parameter value.
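The following is a hedged sketch that combines the memory-related parameters described above (the values are illustrative and follow the recommended 1:4 core-to-memory ratio; tune them for your job):
spark.executor.cores = 2
spark.executor.memory = 8g
spark.executor.memoryOverhead = 1g
spark.driver.cores = 2
spark.driver.memory = 8g
spark.driver.memoryOverhead = 1g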
What do I do if the disk space is insufficient?
How do I reference resources in MaxCompute projects?
Use one of the following methods to access resources in MaxCompute:
How do I use Spark on MaxCompute to access a VPC?
Use one of the following methods to access services in Alibaba Cloud VPCs:
Reverse access
ENI-based access
How do I use Spark on MaxCompute to access the Internet?
Use one of the following methods to access the Internet:
SmartNAT-based access
In this example, you need to access https://aliyundoc.com:443. Perform the following steps:
Submit a ticket or search for the DingTalk group (ID: 11782920) and join the MaxCompute developer community. Then, ask MaxCompute technical support engineers to add https://aliyundoc.com:443 to odps.security.outbound.internetlist.
Use the following settings to configure a whitelist for access over the Internet and enable SmartNAT for your Spark job.
spark.hadoop.odps.cupid.internet.access.list=aliyundoc.com:443
spark.hadoop.odps.cupid.smartnat.enable=true
ENI-based access
Create an ENI by following the instructions in Access instances in a VPC from Spark on MaxCompute.
Confirm that the VPC can access the Internet by using an ENI. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
Use the following settings to configure a whitelist for access over the Internet and enable the ENI for your Spark job. Replace region with the actual region ID and vpcid with the actual VPC ID.
spark.hadoop.odps.cupid.internet.access.list=aliyundoc.com:443
spark.hadoop.odps.cupid.eni.enable=true
spark.hadoop.odps.cupid.eni.info=[region]:[vpcid]
How do I use Spark on MaxCompute to access OSS?
Spark on MaxCompute allows you to use Jindo SDK to access Alibaba Cloud OSS. You must configure the following information:
How do I reference a third-party Python library?
Problem description: When a PySpark job is running, the error message No module named 'xxx' is reported.
Possible causes: PySpark jobs depend on third-party Python libraries. However, the third-party Python libraries are not installed in the default Python environment of the current MaxCompute platform.
Solutions: Use one of the following solutions to add third-party library dependencies.
Directly use the MaxCompute Python public environment.
You only need to add the following configurations to the DataWorks parameters or the spark-defaults.conf file. The following code shows the configurations for different Python versions.
Python 2
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
For the list of third-party libraries included in the Python 2 public environment, see https://odps-repo.oss-cn-hangzhou.aliyuncs.com/pyspark/py27/py27-default_req.txt.txt
Python 3
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
For the list of third-party libraries included in the Python 3 public environment, see https://odps-repo.oss-cn-hangzhou.aliyuncs.com/pyspark/py37/py37-default_req.txt
Upload a single wheel package.
This solution is suitable for scenarios where a small number of third-party Python library dependencies are required and the dependencies are relatively simple. The following code shows an example.
## Rename the wheel package to a ZIP file. For example, rename the pymysql wheel package to pymysql.zip.
## Upload the pymysql.zip file as a resource of the archive type.
## Reference the archive file on the DataWorks Spark node.
## Add the following configurations to the spark-defaults.conf file or DataWorks parameters and perform the import operation.
## Add the configurations.
spark.executorEnv.PYTHONPATH=pymysql
spark.yarn.appMasterEnv.PYTHONPATH=pymysql
## In the code that you upload, import the library.
import pymysql
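If you run the job with spark-submit instead of a DataWorks node, a hedged sketch of an equivalent submission is shown below. This assumes the standard spark-submit --archives option with a # alias; pymysql.zip is the file from the steps above, and your_job.py is a placeholder:
./bin/spark-submit --master yarn-cluster \
  --archives pymysql.zip#pymysql \
  --conf spark.executorEnv.PYTHONPATH=pymysql \
  --conf spark.yarn.appMasterEnv.PYTHONPATH=pymysql \
  your_job.py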
Upload a complete custom Python environment.
This solution is suitable for scenarios where dependencies are complex or a custom Python version is required. You need to use a Docker container to package and upload the complete Python environment. For more information, see the "Upload required packages" section in Develop a Spark on MaxCompute application by using PySpark.
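After the packaged environment is uploaded as an archive resource, it is referenced in the same way as the public environments above. The following is a hedged sketch; the resource name and the path to the Python binary inside the archive are placeholders that depend on how you packaged the environment:
spark.hadoop.odps.cupid.resources = projectname.python-3.7-custom.tar.gz
spark.pyspark.python = ./projectname.python-3.7-custom.tar.gz/python-3.7-custom/bin/python3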
How do I resolve JAR dependency conflicts?
Problem description: The error message NoClassDefFoundError or NoSuchMethodError is reported during runtime.
Possible causes: The versions of third-party dependencies in your JAR files conflict with the versions of the Spark dependencies. Check the main JAR file that you upload and its third-party dependency libraries, and identify the dependency that causes the version conflict.
Solutions:
How do I debug Spark on MaxCompute in local mode?
Spark 2.3.0
Add the following configurations to the spark-defaults.conf file.
spark.hadoop.odps.project.name =<Yourprojectname>
spark.hadoop.odps.access.id =<YourAccessKeyID>
spark.hadoop.odps.access.key =<YourAccessKeySecret>
spark.hadoop.odps.end.point =<endpoint>
Run your job in local mode.
./bin/spark-submit --master local spark_sql.py
Spark 2.4.5/Spark 3.1.1
Create a file named odps.conf and add the following configurations to the file.
odps.access.id=<YourAccessKeyID>
odps.access.key=<YourAccessKeySecret>
odps.end.point=<endpoint>
odps.project.name=<Yourprojectname>
Add an environment variable that points to the path of the odps.conf file.
export ODPS_CONF_FILE=/path/to/odps.conf
Run your job in local mode.
./bin/spark-submit --master local spark_sql.py
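Both commands above submit a script named spark_sql.py, which is not shown in this topic. The following is a minimal hedged sketch of such a script, intended only to verify that the local setup works:
## spark_sql.py: minimal PySpark script for a local smoke test.
from pyspark.sql import SparkSession

## Create a SparkSession. The master is passed on the spark-submit command line.
spark = SparkSession.builder.appName("local_debug_test").getOrCreate()

## Run a trivial query to confirm that Spark SQL works.
spark.sql("SELECT 1 AS test_col").show()

spark.stop()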
Common errors
Error 1:
Error 2:
Error message: Cannot create CupidSession with empty CupidConf.
Possible causes: Spark 2.4.5 or Spark 3.1.1 cannot read information such as odps.access.id.
Solutions: Create the odps.conf file, add the required configurations to the file, set the ODPS_CONF_FILE environment variable to point to the file, and then run your job.
Error 3:
Error message: java.util.NoSuchElementException: odps.access.id.
Possible causes: Spark 2.3.0 cannot read information such as odps.access.id.
Solutions: Add configuration information such as spark.hadoop.odps.access.id to the spark-defaults.conf file.
What do I do if the error message "User signature does not match" is reported when I run a Spark job?
Problem description
The following error message is reported when a Spark job is running:
Stack:
com.aliyun.odps.OdpsException: ODPS-0410042:
Invalid signature value - User signature does not match
Possible causes
The identity authentication failed. The AccessKey ID or AccessKey secret is invalid.
Solutions
Check whether the AccessKey ID and AccessKey secret in the spark-defaults.conf file are the same as those in User Management in the Alibaba Cloud console. If they are different, update the values in the file.
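The credential settings to check in spark-defaults.conf are the same as the ones shown in the local-mode configuration earlier in this topic:
spark.hadoop.odps.access.id=<YourAccessKeyID>
spark.hadoop.odps.access.key=<YourAccessKeySecret>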
What do I do if the error message "You have NO privilege" is reported when I run a Spark job?
Problem description
The following error message is reported when a Spark job is running:
Stack:
com.aliyun.odps.OdpsException: ODPS-0420095:
Access Denied - Authorization Failed [4019], You have NO privilege 'odps:CreateResource' on {acs:odps:*:projects/*}
Possible causes
You do not have the required permissions.
Solutions
Ask the project owner to grant the Read and Create permissions on the resource to your account. For more information about authorization, see MaxCompute permissions.
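The exact authorization statements depend on your project. The following is a hedged sketch of ACL statements that the project owner might run; the project name, resource name, and account are placeholders, so verify the syntax against MaxCompute permissions:
grant CreateResource on project your_project to user ALIYUN$user@example.com;
grant Read on resource your_resource.jar to user ALIYUN$user@example.com;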
What do I do if the error message "Access Denied" is reported when I run a Spark job?
What do I do if the error message "No space left on device" is reported when I run a Spark job?
Spark on MaxCompute uses disks for local storage. Both shuffled data and data that spills from the BlockManager are stored on disk. You can specify the disk size by using the spark.hadoop.odps.cupid.disk.driver.device_size parameter. The default value is 20 GB, and the maximum value is 100 GB. If the issue persists after you increase the disk size to 100 GB, analyze the job further. The most common cause is data skew: data is concentrated in specific blocks during shuffling and caching. In this case, decrease the value of the spark.executor.cores parameter to reduce the number of CPU cores on a single executor, and increase the value of the spark.executor.instances parameter to increase the number of executors.
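For example, a hedged sketch of the adjustments described above (the values are illustrative):
## Increase the local disk size (default 20 GB, maximum 100 GB).
spark.hadoop.odps.cupid.disk.driver.device_size = 50g
## If data skew is the cause, use fewer cores per executor and more executors.
spark.executor.cores = 1
spark.executor.instances = 20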
What do I do if the error message "Table or view not found" is reported when I run a Spark job?
What do I do if the error message "Shutdown hook called before final status was reported" is reported when I run a Spark job?
Problem description
The following error message is reported when a Spark job is running:
App Status: SUCCEEDED, diagnostics: Shutdown hook called before final status was reported.
Possible causes
The main method that runs in the cluster does not request cluster resources through ApplicationMaster. For example, a SparkContext is not created, or spark.master is set to local in the code.
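A hedged illustration of the second cause: if the master is hard-coded to local in the code, the job cannot request cluster resources through ApplicationMaster. Remove the hard-coded master and let the submission command or the DataWorks node set it:
from pyspark.sql import SparkSession

## Problematic: the master is hard-coded to local, so no cluster resources are requested.
## spark = SparkSession.builder.master("local[*]").appName("my_app").getOrCreate()

## Preferred: do not set the master in code.
spark = SparkSession.builder.appName("my_app").getOrCreate()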
What do I do if a JAR package version conflict occurs when I run a Spark job?
What do I do if the error message "ClassNotFound" is reported when I run a Spark job?
What do I do if the error message "The task is not in release range" is reported when I run a Spark job?
Problem description
The following error message is reported when a Spark job is running:
The task is not in release range: CUPID
Possible causes
The Spark on MaxCompute service is not activated in the region where the project resides.
Solutions
Select a region where the Spark on MaxCompute service is activated.
What do I do if the error message "java.io.UTFDataFormatException" is reported when I run a Spark job?
Problem description
The following error message is reported when a Spark job is running:
java.io.UTFDataFormatException: encoded string too long: 2818545 bytes
Solutions
Change the value of the spark.hadoop.odps.cupid.disk.driver.device_size parameter in the spark-defaults.conf file. The default value is 20 GB, and the maximum value is 100 GB.
What do I do if garbled Chinese characters are printed when I run a Spark job?
Add the following configurations:
"--conf" "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
"--conf" "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
What do I do if an error message is reported when Spark on MaxCompute calls a third-party task over the Internet?
By default, Spark on MaxCompute has no Internet connectivity, so calls to third-party services over the Internet fail.
To resolve the issue, build an NGINX reverse proxy in a VPC and access the Internet by using the proxy. Spark on MaxCompute supports direct access to a VPC. For more information, see Access instances in a VPC from Spark on MaxCompute.