FAQ about Spark on MaxCompute

Updated at: 2025-01-25 09:05

This topic provides answers to some frequently asked questions about Spark on MaxCompute.

The questions in this topic are organized into two categories: development based on Spark and job errors.

How do I perform a self-check on my project?

We recommend that you check the following items:

  • pom.xml file

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope> <!-- The scope of the spark-xxxx_${scala.binary.version} dependencies must be set to provided. -->
    </dependency>
  • Main class spark.master

    val spark = SparkSession
          .builder()
          .appName("SparkPi")
          .config("spark.master", "local[4]") // If the local[N] configuration is contained in code when you submit a job in yarn-cluster mode, an error is reported. 
          .getOrCreate()
  • Main class Scala code

    object SparkPi { // An object must be defined. If you write a class instead of an object when you create the file in IntelliJ IDEA, the main function cannot be loaded. 
      def main(args: Array[String]) {
        val spark = SparkSession
          .builder()
          .appName("SparkPi")
          .getOrCreate()
      }
    }
  • Main class code configuration

    val spark = SparkSession
          .builder()
          .appName("SparkPi")
          .config("key1", "value1")
          .config("key2", "value2")
          .config("key3", "value3")
          ...  // If MaxCompute configurations are hardcoded in code during a local test, specific configurations cannot take effect. 
          .getOrCreate()
    Note

    We recommend that you add all configuration items to the spark-defaults.conf file when you submit jobs in yarn-cluster mode.

How do I run an ODPS Spark node in DataWorks?

  1. Modify and package the Spark code in an on-premises Python environment. Make sure that the version of the on-premises Python environment is Python 2.7.

  2. Upload the resource package to DataWorks. For more information, see Create and use MaxCompute resources.

  3. Create an ODPS Spark node in DataWorks. For more information, see Create an ODPS Spark node.

  4. Write code and run the node. Then, view the execution result in the DataWorks console.

How do I debug Spark on MaxCompute in an on-premises environment?

Use IntelliJ IDEA to debug Spark on MaxCompute in an on-premises environment. For more information, see Set up a Linux development environment.

How do I use Spark on MaxCompute to access services in a VPC?

For more information about how to use Spark on MaxCompute to access services in a virtual private cloud (VPC), see Access instances in a VPC from Spark on MaxCompute.

How do I reference a JAR file as a resource?

Use the spark.hadoop.odps.cupid.resources parameter to specify the resource that you want to reference. Resources can be shared by multiple projects. We recommend that you configure relevant permissions to ensure data security. The following configuration shows an example:

spark.hadoop.odps.cupid.resources = projectname.xx0.jar,projectname.xx1.jar 

How do I pass parameters by using Spark on MaxCompute?

For more information about how to pass parameters by using Spark on MaxCompute, see Spark on DataWorks.

How do I write the DataHub data that is read by Spark in streaming mode to MaxCompute?

For the sample code, visit DataHub on GitHub.

How do I migrate open source Spark code to Spark on MaxCompute?

Select one of the following migration solutions based on your job scenarios:

How do I use Spark on MaxCompute to process data in a MaxCompute table?

Use Spark on MaxCompute to process data in a MaxCompute table in local, cluster, or DataWorks mode. For configuration differences among the three modes, see Running modes.
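
For reference, the following minimal Scala sketch reads a MaxCompute table by using Spark SQL. The table name mc_demo_table is a placeholder; replace it with a table in your project.

import org.apache.spark.sql.SparkSession

object ReadTableDemo {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession without hardcoding spark.master, so the same code can be submitted in local, cluster, or DataWorks mode.
    val spark = SparkSession
      .builder()
      .appName("ReadTableDemo")
      .getOrCreate()

    // mc_demo_table is a placeholder table name.
    val df = spark.sql("SELECT * FROM mc_demo_table LIMIT 100")
    df.show()

    spark.stop()
  }
}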

How do I configure the resource parallelism for Spark on MaxCompute?

The resource parallelism of Spark on MaxCompute is determined by the number of executors and the number of CPU cores on each executor. The maximum number of tasks that can run in parallel is calculated as follows: Number of executors × Number of CPU cores on each executor. A configuration example follows the parameter descriptions below.

  • Number of executors

    • Parameter: spark.executor.instances.

    • Parameter description: This parameter specifies the number of executors requested by a job.

  • Number of CPU cores on each executor

    • Parameter: spark.executor.cores.

    • Parameter description: This parameter specifies the number of CPU cores on each executor process. This parameter determines the ability of each executor process to run tasks in parallel. Each CPU core can run only one task at a time. In most cases, we recommend that you set the number of CPU cores on each executor to 2, 3, or 4.
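
For example, the following illustrative settings (not a recommendation for a specific workload) request 10 executors with 2 CPU cores each, which allows up to 10 × 2 = 20 tasks to run in parallel:

## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
## Illustrative values: 10 executors x 2 cores each = up to 20 tasks in parallel.
spark.executor.instances=10
spark.executor.cores=2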

How do I resolve OOM issues?

  • Common errors:

    • java.lang.OutOfMemoryError: Java heap space

    • java.lang.OutOfMemoryError: GC overhead limit exceeded

    • Cannot allocate memory

    • The job has been killed by "OOM Killer", please check your job's memory usage

  • Solutions (a combined configuration sketch follows this list):

    • Configure the memory size for each executor.

      • Parameter: spark.executor.memory.

      • Parameter description: This parameter specifies the memory size of each executor. A ratio of 1:4 between spark.executor.cores and spark.executor.memory is recommended. For example, if spark.executor.cores is set to 1, you can set spark.executor.memory to 4 GB. If the error message java.lang.OutOfMemoryError is reported for an executor, you need to increase the parameter value.

    • Configure the off-heap memory for each executor.

      • Parameter: spark.executor.memoryOverhead.

      • Parameter description: This parameter specifies the additional memory size of each executor. The additional memory is mainly used for the overheads of Java virtual machines (JVMs), strings, and NIO buffers. The default memory size is calculated by using the following formula: spark.executor.memory × 0.1. The minimum size is 384 MB. In most cases, you do not need to change the default value. If the error message Cannot allocate memory or OOM Killer is logged for an executor, you need to increase the parameter value.

    • Configure the driver memory.

      • Parameter: spark.driver.memory.

      • Parameter description: This parameter specifies the memory size of the driver. A ratio of 1:4 between spark.driver.cores and spark.driver.memory is recommended. If the driver needs to collect a large amount of data or the error message java.lang.OutOfMemoryError is reported, you need to increase the parameter value.

    • Configure the off-heap memory for the driver.

      • Parameter: spark.driver.memoryOverhead.

      • Parameter description: This parameter specifies the additional memory size of the driver. The default size is calculated by using the following formula: spark.driver.memory × 0.1. The minimum size is 384 MB. If the error message Cannot allocate memory is logged for the driver, you need to increase the parameter value.
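
The following sketch combines the preceding parameters with illustrative values that follow the recommended 1:4 core-to-memory ratio; adjust the values for your workload:

## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
## Illustrative values only.
spark.executor.cores=2
spark.executor.memory=8g
spark.executor.memoryOverhead=1g
spark.driver.cores=1
spark.driver.memory=4g
spark.driver.memoryOverhead=1g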

What do I do if the disk space is insufficient?

  • Problem description

    The error message No space left on device is reported.

  • Possible causes: The local disk space on a specific executor is insufficient. This results in the exit of the executor.

  • Solutions (a configuration sketch follows this list):

    • Increase the disk size.

      • Parameter: spark.hadoop.odps.cupid.disk.driver.device_size.

      • Default value: 20 GB.

      • Parameter description: By default, a 20-GB local disk is separately provided for the driver and each executor. If the local disk space is insufficient, you can increase the parameter value. Take note that this parameter takes effect only after you add this parameter to the spark-defaults.conf file or DataWorks parameters.

    • Increase the number of executors.

      If this error persists after you resize the local disk to 100 GB, the shuffled data of a single executor exceeds the upper limit. This may be caused by data skew. In this case, try to repartition data. If a huge amount of data is stored on the local disk, reconfigure the spark.executor.instances parameter to increase the number of executors.
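
The following sketch increases the local disk size to 50 GB. The lowercase g suffix used for the value is an assumption; confirm the accepted format for your environment.

## Add the following configuration to the DataWorks parameters or the spark-defaults.conf file.
## The 50g value and its unit suffix are illustrative assumptions; the maximum supported size is 100 GB.
spark.hadoop.odps.cupid.disk.driver.device_size=50g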

How do I reference resources in MaxCompute projects?

Use one of the following methods to access resources in MaxCompute:

  • Method 1: Directly reference MaxCompute resources by configuring a specified parameter.

    • Parameter: spark.hadoop.odps.cupid.resources.

    • Parameter format: <projectname>.<resourcename>[:<newresourcename>].

    • Parameter description: This parameter specifies the MaxCompute resources that are required for running a task. For more information, see Resource operations. The specified resources are downloaded to the current working directories of the driver and executors. A task can reference multiple resources; separate them with commas (,). After a resource is downloaded to the working directories, its default name is in the <projectname>.<resourcename> format. You can rename a resource by specifying it in the <projectname>.<resourcename>:<newresourcename> format. Take note that this parameter takes effect only after you add it to the spark-defaults.conf file or the DataWorks parameters.

    • Example:

      ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
      
      ## Reference multiple resources at the same time: Reference both public.python-python-2.7-ucs4.zip and public.myjar.jar.
      spark.hadoop.odps.cupid.resources=public.python-python-2.7-ucs4.zip,public.myjar.jar
      
      ## Rename a resource: Reference public.myjar.jar and rename public.myjar.jar as myjar.jar.
      spark.hadoop.odps.cupid.resources=public.myjar.jar:myjar.jar
  • Method 2: Reference resources in DataWorks.

    • Add resources in MaxCompute to a business workflow on the DataWorks DataStudio page. For more information, see Manage MaxCompute resources.

    • Select JAR files, files, and archive files in the ODPS Spark nodes on DataWorks as resources.

    Note

    When you use this method, resources are uploaded during the task runtime. For large amounts of resources, we recommend that you reference resources by using Method 1.

How do I use Spark on MaxCompute to access a VPC?

Use one of the following methods to access services in Alibaba Cloud VPCs:

  • Reverse access

    • Limits

      You can use Spark on MaxCompute to access only Alibaba Cloud VPCs in the same region as MaxCompute.

    • Procedure

      1. Add an IP address whitelist to the service that you want to access to allow access from the CIDR block 100.104.0.0/16.

      2. Configure the spark.hadoop.odps.cupid.vpc.domain.list parameter for your job.

        This parameter describes the network information of the instances that you need to access. The parameter value is in the JSON format, and you must convert the JSON value into a single line. In the following example, replace the values of regionId, vpcId, domain, and port with the actual region ID, VPC ID, domain name, and port number.

        ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
        
        spark.hadoop.odps.cupid.vpc.domain.list={"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh*****","zones":[{"urls":[{"domain":"dds-2ze3230cfea0*****.mongodb.rds.aliyuncs.com","port":3717},{"domain":"dds-2ze3230cfea0*****.mongodb.rds.aliyuncs.com","port":3717}]}]}]}
  • ENI-based access

    • Limits

      You can connect Spark on MaxCompute to a VPC in the same region as MaxCompute by using an elastic network interface (ENI). If your job needs to access multiple VPCs at the same time, you can connect the VPC that is connected by using the ENI to another VPC.

    • Procedure

      1. Create an ENI by following the instructions in Access instances in a VPC from Spark on MaxCompute.

      2. Add a whitelist to the service that you want to access and authorize the MaxCompute security group created in Step i to access a specific port.

        For example, if you need to access an ApsaraDB RDS instance, you need to add a security group rule to the instance to allow access from the security group created in Step i. If you cannot add a security group to the service that you need to access, you need to add the vSwitch CIDR block used in Step i.

      3. Configure the spark.hadoop.odps.cupid.eni.info and spark.hadoop.odps.cupid.eni.enable parameters for your job.

        In the following example, replace regionid with the actual region ID and vpcid with the actual VPC ID.

        ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
        
        spark.hadoop.odps.cupid.eni.enable = true
        spark.hadoop.odps.cupid.eni.info = [regionid]:[vpcid]

How do I use Spark on MaxCompute to access the Internet?

Use one of the following methods to access the Internet:

  • SmartNAT-based access

    In this example, you need to access https://aliyundoc.com:443. Perform the following steps:

    1. Submit a ticket or search for the DingTalk group (ID: 11782920) and join the MaxCompute developer community. Then, ask MaxCompute technical support engineers to add https://aliyundoc.com:443 to odps.security.outbound.internetlist.

    2. Use the following settings to configure a whitelist for access over the Internet and enable SmartNAT for your Spark job.

      ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
      spark.hadoop.odps.cupid.internet.access.list=aliyundoc.com:443
      spark.hadoop.odps.cupid.smartnat.enable=true
  • ENI-based access

    1. Create an ENI by following the instructions in Access instances in a VPC from Spark on MaxCompute.

    2. Confirm that the VPC can access the Internet by using an ENI. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.

    3. Use the following settings to configure a whitelist for access over the Internet and enable the ENI for your Spark job. Replace region with the actual region ID and vpcid with the actual VPC ID.

      ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
      spark.hadoop.odps.cupid.internet.access.list=aliyundoc.com:443
      spark.hadoop.odps.cupid.eni.enable=true
      spark.hadoop.odps.cupid.eni.info=[region]:[vpcid]

How do I use Spark on MaxCompute to access OSS?

Spark on MaxCompute allows you to use Jindo SDK to access Alibaba Cloud OSS. You must configure the following information:

  • Configure the Jindo SDK and an OSS endpoint.

    The following code shows an example.

    ## Reference the Jindo SDK JAR file. Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
    spark.hadoop.odps.cupid.resources=public.jindofs-sdk-3.7.2.jar
    
    ## Specify the OSS implementation classes. 
    spark.hadoop.fs.AbstractFileSystem.oss.impl=com.aliyun.emr.fs.oss.OSS
    spark.hadoop.fs.oss.impl=com.aliyun.emr.fs.oss.JindoOssFileSystem
    
    ## Specify an OSS endpoint.
    spark.hadoop.fs.oss.endpoint=oss-[YourRegionId]-internal.aliyuncs.com
    
    ## In most cases, you do not need to configure an OSS endpoint whitelist. If the network connection is disconnected during job runtime, you can configure a whitelist by using the following parameter. 
    ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
    spark.hadoop.odps.cupid.trusted.services.access.list=[YourBucketName].oss-[YourRegionId]-internal.aliyuncs.com
    Note

    When Spark on MaxCompute runs in cluster mode, only OSS internal endpoints are supported. OSS public endpoints are not supported. For more information about the mappings between OSS regions and endpoints, see Regions and endpoints.

  • Configure OSS authentication information. Jindo SDK supports the following authentication methods:

    • Use AccessKey pairs for authentication. Sample configurations:

      val conf = new SparkConf()
        .setAppName("jindo-sdk-demo")
        // Configure parameters for AccessKey pair-based authentication.
        .set("spark.hadoop.fs.oss.accessKeyId", "<YourAccessKeyId>")
        .set("spark.hadoop.fs.oss.accessKeySecret", "<YourAccessKeySecret>")
    • Use security token service (STS) tokens for authentication by performing the following steps:

      1. Go to the Cloud Resource Access Authorization page and click Confirm Authorization Policy. Then, the MaxCompute project can access OSS resources of the current Alibaba Cloud account by using an STS token.

        Note

        You can perform this operation only when the owner of the MaxCompute project is an Alibaba Cloud account that owns the OSS resources to be accessed.

      2. Enable the local HTTP service.

        The following code shows an example.

        ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
        spark.hadoop.odps.cupid.http.server.enable = true
      3. Configure authentication information.

        The following code shows an example.

        val conf = new SparkConf()
          .setAppName("jindo-sdk-demo")
          // Configure a RAM role of the Alibaba Cloud account for authentication.
          // ${aliyun-uid} specifies the unique ID of the Alibaba Cloud account.
          // ${role-name} specifies the role name.
          .set("spark.hadoop.fs.jfs.cache.oss.credentials.provider", "com.aliyun.emr.fs.auth.CustomCredentialsProvider")
          .set("spark.hadoop.aliyun.oss.provider.url", "http://localhost:10011/sts-token-info?user_id=${aliyun-uid}&role=${role-name}")

How do I reference a third-party Python library?

  • Problem description: When a PySpark job is running, the error message No module named 'xxx' is reported.

  • Possible causes: PySpark jobs depend on third-party Python libraries. However, the third-party Python libraries are not installed in the default Python environment of the current MaxCompute platform.

  • Solutions: Use one of the following solutions to add third-party library dependencies.

    • Directly use the MaxCompute Python public environment.

      You only need to add the following configurations to the DataWorks parameters or the spark-defaults.conf file. The following code shows the configurations for different Python versions.

      • Python 2

        ## Configuration of Python 2.7.13
        ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
        spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
        spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
        
        ## List of third-party libraries
        https://odps-repo.oss-cn-hangzhou.aliyuncs.com/pyspark/py27/py27-default_req.txt.txt
      • Python 3

        ## Configuration of Python 3.7.9
        ## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
        spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
        spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
        
        ## List of third-party libraries
        https://odps-repo.oss-cn-hangzhou.aliyuncs.com/pyspark/py37/py37-default_req.txt
    • Upload a single wheel package.

      This solution is suitable for scenarios where a small number of third-party Python library dependencies are required and the dependencies are relatively simple. The following code shows an example.

      ## Rename the wheel package as a ZIP file. For example, rename the pymysql wheel package as pymysql.zip.
      ## Upload the pymysql.zip file as a resource of the archive type.
      ## Reference the archive file on the DataWorks Spark node.
      ## Add the following configurations to the spark-defaults.conf file or DataWorks parameters and perform the import operation.
      ## Add the configurations.
      spark.executorEnv.PYTHONPATH=pymysql
      spark.yarn.appMasterEnv.PYTHONPATH=pymysql
      
      ## In your job code, import the library.
      import pymysql
    • Upload a complete custom Python environment.

      This solution is suitable for scenarios where dependencies are complex or a custom Python version is required. You need to use a Docker container to package and upload the complete Python environment. For more information, see the "Upload required packages" section in Develop a Spark on MaxCompute application by using PySpark.

How do I resolve JAR dependency conflicts?

  • Problem description: The error message NoClassDefFoundError or NoSuchMethodError is reported during runtime.

  • Possible causes: The versions of third-party dependencies in your JAR files conflict with the versions of the Spark dependencies. Check the main JAR file that you upload and its third-party dependency libraries to identify the dependency that causes the version conflict.

  • Solutions:

    • Check the POM file.

      • Confirm that the scope of the Apache Spark dependency is set to provided.

      • Confirm that the scope of the Apache Hadoop dependency is set to provided.

      • Confirm that the scope of the Odps or Cupid dependency is set to provided.

    • Exclude the dependency that causes the version conflict.

    • Use the relocation feature provided by the Apache Maven Shade plug-in to resolve the issue, as shown in the sketch after this list.
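
The following pom.xml sketch shows how the relocation feature of the Apache Maven Shade plug-in can be used. The com.google.protobuf package is only an example of a conflicting package; replace the pattern with the package that conflicts in your job.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <!-- com.google.protobuf is an example package. Replace it with the package that conflicts in your job. -->
                    <relocation>
                        <pattern>com.google.protobuf</pattern>
                        <shadedPattern>shaded.com.google.protobuf</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>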

How do I debug Spark on MaxCompute in local mode?

  • Spark 2.3.0

    1. Add the following configurations to the spark-defaults.conf file.

      spark.hadoop.odps.project.name=<Yourprojectname>
      spark.hadoop.odps.access.id=<YourAccessKeyID>
      spark.hadoop.odps.access.key=<YourAccessKeySecret>
      spark.hadoop.odps.end.point=<endpoint>
    2. Run your job in local mode.

      ./bin/spark-submit --master local spark_sql.py
  • Spark 2.4.5/Spark 3.1.1

    1. Create a file named odps.conf and add the following configurations to the file.

      odps.access.id=<YourAccessKeyID>
      odps.access.key=<YourAccessKeySecret>
      odps.end.point=<endpoint>
      odps.project.name=<Yourprojectname>
    2. Add an environment variable to point to the path of the odps.conf file.

      export ODPS_CONF_FILE=/path/to/odps.conf
    3. Run your job in local mode.

      ./bin/spark-submit --master local spark_sql.py
  • Common errors

    • Error 1:

      • Error messages:

        • Incomplete config, no accessId or accessKey.

        • Incomplete config, no odps.service.endpoint.

      • Possible causes: Event logging is enabled in local mode.

      • Solutions: Delete the spark.eventLog.enabled=true configuration from the spark-defaults.conf file.

    • Error 2:

      • Error message: Cannot create CupidSession with empty CupidConf.

      • Possible causes: Spark 2.4.5 or Spark 3.1.1 cannot read information such as odps.access.id.

      • Solutions: Create the odps.conf file, set the ODPS_CONF_FILE environment variable to point to the file, and then run your job.

    • Error 3:

      • Error message: java.util.NoSuchElementException: odps.access.id.

      • Possible causes: Spark 2.3.0 cannot read information such as odps.access.id.

      • Solutions: Add configuration information such as spark.hadoop.odps.access.id to the spark-defaults.conf file.

What do I do if the error message "User signature does not match" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    Stack:
    com.aliyun.odps.OdpsException: ODPS-0410042:
    Invalid signature value - User signature does not match
  • Possible causes

    The identity authentication failed. The AccessKey ID or AccessKey secret is invalid.

  • Solutions

    Check whether the AccessKey ID and AccessKey secret in the spark-defaults.conf file are the same as the AccessKey ID and AccessKey secret in User Management in the Alibaba Cloud console. If they are not the same, modify the AccessKey ID and AccessKey secret in the file.

What do I do if the error message "You have NO privilege" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    Stack:
    com.aliyun.odps.OdpsException: ODPS-0420095: 
    Access Denied - Authorization Failed [4019], You have NO privilege 'odps:CreateResource' on {acs:odps:*:projects/*}
  • Possible causes

    You do not have the required permissions.

  • Solutions

    Ask the project owner to grant the Read and Create permissions on the resource to your account. For more information about authorization, see MaxCompute permissions.

What do I do if the error message "Access Denied" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    Exception in thread "main" org.apache.hadoop.yarn.exceptions.YarnException: com.aliyun.odps.OdpsException: ODPS-0420095: Access Denied - The task is not in release range: CUPID
  • Possible causes

    • Cause 1: The AccessKey ID and AccessKey secret configured in the spark-defaults.conf file are invalid.

    • Cause 2: Spark on MaxCompute is not available in the region where the project resides.

  • Solutions

    • Solution to Cause 1: Check the configuration information in the spark-defaults.conf file. If the AccessKey ID and AccessKey secret in the file are invalid, modify them. For more information, see Set up a Linux development environment.

    • Solution to Cause 2: Check whether Spark on MaxCompute is available in the region where the project resides or join the DingTalk group (ID: 21969532) for technical support.

What do I do if the error message "No space left on device" is reported when I run a Spark job?

Spark on MaxCompute uses disks for local storage. Both the shuffled data and the data that overflows from the BlockManager are stored on disks. You can specify the disk size by using the spark.hadoop.odps.cupid.disk.driver.device_size parameter. The default value is 20 GB, and the maximum value is 100 GB. If the issue persists after you increase the disk size to 100 GB, analyze the cause further. The most common cause is data skew, in which data is concentrated in specific blocks during the shuffling and caching processes. In this case, decrease the value of the spark.executor.cores parameter to reduce the number of CPU cores on a single executor, and increase the value of the spark.executor.instances parameter to increase the number of executors. A configuration sketch follows.
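
For example, the following illustrative settings reduce the number of CPU cores on each executor and increase the number of executors, which spreads the shuffled data across more local disks:

## Add the following configurations to the DataWorks parameters or the spark-defaults.conf file.
## Illustrative values; tune them for your workload.
spark.executor.cores=1
spark.executor.instances=20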

What do I do if the error message "Table or view not found" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    Table or view not found:xxx
  • Possible causes

    • Cause 1: The table or view does not exist.

    • Cause 2: The catalog configuration of Hive is enabled.

  • Solutions

    • Solution to Cause 1: Create a table.

    • Solution to Cause 2: Remove the catalog configuration. For example, in the following configuration, remove enableHiveSupport().

      spark = SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()

What do I do if the error message "Shutdown hook called before final status was reported" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    App Status: SUCCEEDED, diagnostics: Shutdown hook called before final status was reported.
  • Possible causes

    The main method that is executed in the cluster does not request cluster resources through ApplicationMaster. For example, a SparkContext is not created in the code, or spark.master is set to local.

What do I do if a JAR package version conflict occurs when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    User class threw exception: java.lang.NoSuchMethodError
  • Possible causes

    A version conflict or class error occurs on the JAR package.

  • Solutions

    1. Find the JAR package that contains the abnormal class in the $SPARK_HOME/jars path.

    2. Run the following command to obtain the directory and version of the third-party library:

      grep <Abnormal class name> $SPARK_HOME/jars/*.jar
    3. Run the following command to view all dependencies of the project in the root directory of the Spark job:

      mvn dependency:tree
    4. Find the dependency that causes the version conflict and exclude it by declaring an exclusion for the conflicting dependency in the pom.xml file, or relocate it by using the Apache Maven Shade plug-in. A pom.xml sketch follows these steps.
    5. Recompile and commit the code.
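
The following pom.xml sketch shows how a conflicting transitive dependency can be excluded. The dependency coordinates and the excluded artifact are hypothetical examples; replace them with the dependency identified in the preceding steps.

<dependency>
    <!-- com.example:example-client is a hypothetical third-party library. -->
    <groupId>com.example</groupId>
    <artifactId>example-client</artifactId>
    <version>1.0.0</version>
    <exclusions>
        <!-- Exclude the artifact that conflicts with the Spark dependencies. -->
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>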

What do I do if the error message "ClassNotFound" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    java.lang.ClassNotFoundException: xxxx.xxx.xxxxx
  • Possible causes

    The class does not exist or the dependency is incorrectly configured.

  • Solutions

    1. Run the following command to check whether the JAR file that you submit contains the class definition:

      jar -tf <Job JAR file> | grep <Class name>
    2. Check whether the dependencies in the pom.xml file are correctly configured.

    3. Use the Apache Maven Shade plug-in to submit a JAR file.

What do I do if the error message "The task is not in release range" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    The task is not in release range: CUPID
  • Possible causes

    The Spark on MaxCompute service is not activated in the region where the project resides.

  • Solutions

    Select a region where the Spark on MaxCompute service is activated.

What do I do if the error message "java.io.UTFDataFormatException" is reported when I run a Spark job?

  • Problem description

    The following error message is reported when a Spark job is running:

    java.io.UTFDataFormatException: encoded string too long: 2818545 bytes 
  • Solutions

    Change the value of the spark.hadoop.odps.cupid.disk.driver.device_size parameter in the spark-defaults.conf file. The default value is 20 GB, and the maximum value is 100 GB.

What do I do if garbled Chinese characters are printed when I run a Spark job?

Add the following configurations:

"--conf" "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
"--conf" "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"

What do I do if an error message is reported when Spark on MaxCompute calls a third-party task over the Internet?

Spark on MaxCompute cannot directly call third-party services over the Internet because the cluster network is isolated from the Internet by default.

To resolve the issue, build an NGINX reverse proxy in a VPC and access the Internet by using the proxy. Spark on MaxCompute supports direct access to a VPC. For more information, see Access instances in a VPC from Spark on MaxCompute.
