This topic provides answers to some frequently asked questions about the serverless Spark engine of Data Lake Analytics (DLA).
Common issues
When a Spark job is running, the following error message appears: "The VirtualCluster's name is invalid or the VirtualCluster's is not in running state." What do I do?
Cause: The virtual cluster that is specified by the VcName parameter does not exist or is not in the Running state.
Solution: Enter the correct virtual cluster name.
If the error message appears on the Parameter Configuration page of the DLA console, create a virtual cluster and submit a job. For more information about how to create a virtual cluster, see Create a virtual cluster.
If you use the API, the spark-submit script, or JupyterLab to submit the job, check whether the virtual cluster name that is specified by the VcName parameter is correct. To obtain the correct virtual cluster name, perform the following steps:
Log on to the DLA console.
In the left-side navigation pane, click Virtual Cluster management.
In the cluster list, select a cluster that is in the Running state and obtain its name.
When a Spark job is running, the following error message appears: "User %s do not have right permission [ *** ] to resource [ *** ]." What do I do?
Cause: The Resource Access Management (RAM) user does not have permissions to call the API operation.
Solutions:
Grant the related permissions to the RAM user. For more information, see Grant permissions to a RAM user.
Log on to the RAM console and grant the permissions on the resource to the RAM user.
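For reference, the following is a minimal sketch of a custom RAM policy that grants broad DLA permissions. The openanalytics action prefix and the wildcard resource are assumptions for illustration only; check the RAM documentation for the exact actions and resources that your job requires, or attach the appropriate system policy instead.
{
    "Version": "1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "openanalytics:*",
            "Resource": "*"
        }
    ]
}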
When a Spark job is running, the error message "No space left on device" appears. What do I do?
Check whether this issue is caused by insufficient storage capacity of the local disk for an executor. To do so, query the standard error log of the executor on the Spark web UI. If the storage capacity of the local disk is insufficient, add the following parameter settings to increase it:
spark.k8s.shuffleVolume.enable: true
spark.dla.local.disk.size: 50Gi
The local disk is an enhanced SSD (ESSD). The ESSD is currently free of charge, but it will be billed in the future.
The maximum storage capacity of a local disk is 100 GiB.
If this issue persists after you configure a local disk with a storage capacity of 100 GiB, you can add executors.
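For reference, the following is a minimal sketch of a Spark job configuration that adds the two parameters described above. The job name, file path, class name, and executor settings are placeholders for illustration only, and the exact fields that your submission method requires may differ. For more information, see Configure a Spark job.
{
    "name": "your-spark-job",
    "file": "oss://your-bucket/path/to/your-spark-job.jar",
    "className": "com.example.YourMainClass",
    "conf": {
        "spark.executor.instances": "4",
        "spark.executor.resourceSpec": "medium",
        "spark.k8s.shuffleVolume.enable": "true",
        "spark.dla.local.disk.size": "50Gi"
    }
}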
What do I do if a Spark executor becomes dead?
Access the Spark web UI. If the failure reason of a failed stage contains Failed to connect to /xx.xx.xx.xx:xxxx, or an executor in the Dead state is displayed on the Executors page, the executor has exited abnormally.
For more information about how to access the Spark web UI, see Apache Spark web UI.
Solutions to different symptoms:
1. The following error message appears in the driver log: "ERROR TaskSchedulerImpl: Lost executor xx on xx.xx.xx.xx: The executor with id xx exited with exit code 137."
Cause: The overall memory usage of the Spark executor exceeds the memory threshold of its container. As a result, the system automatically runs the kill -9 command to forcibly terminate the executor process. The memory occupied by the Spark executor includes the memory used by the Java virtual machine (JVM) and the memory used by shuffle operations, caches, and Python user-defined functions (UDFs).
Solution: Increase the value of the spark.executor.memoryOverhead parameter. The unit of this parameter is megabytes. This parameter specifies the amount of memory in a container that can be used by processes other than the executor JVM. By default, this overhead can consume up to 30% of the total memory of the container. For example, if you select the medium specifications (2 CPU cores and 8 GB of memory) for the executor, the default value of the spark.executor.memoryOverhead parameter is 2457. In this case, you can set the parameter to 4000, as shown in the configuration sketch after this list.
spark.executor.memoryOverhead = 4000
2. The error message "java.lang.OutOfMemoryError" appears in the log.
Cause analysis: On the Executors page of the Spark web UI, find the executor that is in the Dead state and click the URL of its standard error log or standard output log to identify the specific cause.
Solutions:
Reduce the memory usage of the Spark executor.
Select higher resource specifications for the executor. For example, change the resource specifications from small to medium.
3. Other error messages appear.
Check the logs of the executor that is in the Dead state. If the error message shows that the issue is caused by errors in your business code, contact your business development personnel to resolve the issue. For other causes, search for solutions online.
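The following configuration sketch, referenced in the solution to symptom 1, shows how spark.executor.memoryOverhead might be increased for an executor with medium specifications. The value 4000 is taken from the example above, and the other fields of your job configuration are omitted for brevity:
{
    "conf": {
        "spark.executor.resourceSpec": "medium",
        "spark.executor.memoryOverhead": "4000"
    }
}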
What do I do if the serverless Spark engine cannot access the network of data sources?
By default, the serverless Spark engine of DLA cannot directly access the network of data sources, such as a virtual private cloud (VPC) or the Internet. It can access only the data sources provided by Alibaba Cloud, such as Object Storage Service (OSS), Tablestore, and MaxCompute.
You can attach an elastic network interface (ENI) of your VPC to the serverless Spark engine of DLA. This way, the serverless Spark engine can access the data sources in your VPC and access Internet services. After you attach the ENI to the serverless Spark engine, the network model of the serverless Spark engine is the same as that of an Elastic Compute Service (ECS) instance in the VPC. The serverless Spark engine supports all features of Alibaba Cloud Virtual Private Cloud, including internal network access, Internet access, and network access by using leased lines. For more information about how to attach an ENI to the serverless Spark engine, see Configure the network of data sources.
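The following is a minimal sketch of the ENI-related settings that might appear in the conf section of a Spark job after you complete the steps in Configure the network of data sources. The vSwitch ID and security group ID are placeholders, and the parameter names should be verified against that topic:
{
    "conf": {
        "spark.dla.eni.enable": "true",
        "spark.dla.eni.vswitch.id": "vsw-xxxxxxxx",
        "spark.dla.eni.security.group.id": "sg-xxxxxxxx"
    }
}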
What do I do if the error message "ClassNotFound" appears in Spark job logs?
Cause: The required class is not included in the JAR package that is used to configure a Spark job. Run the jar tvf xxx.jar | grep xxxx command to check whether the class is included in the package.
Solutions:
If a business-related class is not included in the package, generate the JAR package again and make sure that the class is included in the package.
If a third-party package is used, you can use one of the following methods to resolve this issue:
Use the Shade or Assembly plug-in of Maven to package the class and dependencies into a file. For more information about the Shade and Assembly plug-ins, see the related Maven documentation.
Use the jars parameter to specify a separate directory to which the third-party package is uploaded. For more information about the jars parameter, see Configure a Spark job.
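For reference, the following sketch shows how third-party packages that are uploaded to OSS might be referenced by using the jars parameter. The bucket, paths, and class name are placeholders for illustration only:
{
    "file": "oss://your-bucket/path/to/your-spark-job.jar",
    "className": "com.example.YourMainClass",
    "jars": [
        "oss://your-bucket/deps/third-party-a.jar",
        "oss://your-bucket/deps/third-party-b.jar"
    ]
}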
What do I do if the error message "NoSuchMethod" appears in Spark job logs?
Cause: A conflict occurs between JAR packages. This issue may occur if a third-party package that conflicts with the Spark package is uploaded to configure the Spark job.
Solution: Set the Maven scope of the conflicting dependency to provided, or use the relocation feature to rename the conflicting packages. Both are common methods whose details you can search for online. You can also write your POM file by referring to the POM file of alibabacloud-dla-demo.
After I execute the SHOW TABLES or SHOW DATABASES statement for a Spark SQL job, only some tables or databases are listed. Why?
Check whether the metadata service of DLA is used.
If the metadata service of DLA is used, check whether you are authorized to query all the tables or databases. If you are not authorized to query all the tables or databases, you can query only the databases and tables on which you have permissions. Other databases or tables cannot be queried.
If a self-managed metadata service is used, check whether you are granted the permissions on the self-managed metadata service and check whether the serverless Spark engine can access the self-managed metadata service.
The select * from db1.table1 statement is successfully executed in DLA SQL but fails to be executed in Spark SQL. Why?
Detailed error message:
java.lang.NullPointerException
at java.net.URI$Parser.parse(URI.java:3042)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.sql.hive.client.JianghuClientImpl$$anonfun$getTableOption
Causes:
The table can be stored only in an OSS directory or HDFS directory.
A hyphen (-) is included in the table name. The hyphen is a reserved word for Spark SQL and cannot be included in the table name.
The query result is returned after the select * from db1.table1 statement is executed in DLA SQL, but the query result is not returned after the statement is executed in Spark SQL. Why?
Causes:
The OSS or HDFS directories of the table are nested. For example, if you set LOCATION to oss://db/table/ and specify two partition key columns partition1 and partition2, your data file data.csv may be stored in a directory such as oss://db/table/partition1=a/partition2=b/. DLA SQL supports multi-layer nested directories, such as oss://db/table/partition1=a/partition2=b/extra_folder/data.csv. However, Spark SQL does not support such directories.
The syntax of DLA SQL is slightly different from that of Spark SQL. For more information about how to debug SQL, see Spark SQL.
Why does the error message "oss object 403" appear in Spark job logs?
Causes:
The serverless Spark engine of DLA cannot read JAR packages or files from an OSS bucket that is not in the same region as DLA.
The role specified by the spark.dla.rolearn parameter does not have permissions to read data from this OSS directory.
The OSS directory for saving the files is incorrectly entered.
Multiple files are not separated by commas (,) or are not specified as a JSON array.
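For reference, the following sketch shows a job configuration in which additional OSS files are listed as a JSON array and the spark.dla.rolearn parameter references a RAM role that has read permissions on the OSS directory. The files field, account ID, role name, and paths are placeholders for illustration only:
{
    "file": "oss://your-bucket/path/to/your-spark-job.jar",
    "files": [
        "oss://your-bucket/path/to/config-a.json",
        "oss://your-bucket/path/to/config-b.json"
    ],
    "conf": {
        "spark.dla.rolearn": "acs:ram::123456xxxxxxxxxx:role/your-dla-access-role"
    }
}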
When Spark SQL reads data from an external table in the JSON format (including the table that is automatically created for log shipping), the following error message appears: "ClassNotFoundException: org.apache.hadoop.hive.serde2.JsonSerDe." What do I do?
Solution:
Enter the URL https://repo1.maven.org/maven2/org/apache/hive/hive-serde/3.1.2/hive-serde-3.1.2.jar in the address bar of a browser and download the hive-serde-3.1.2.jar file. Then, upload the JAR file to OSS.
Place the following statement before the Spark SQL statement:
add jar oss://path/to/hive-serde-3.1.2.jar;
When a Spark SQL statement is executed, the following error message appears: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: oss." What do I do?
Solution: Place the following command before the statement:
set spark.dla.connectors=oss;
What do I do if a job runs slowly?
Symptoms:
The job runs abnormally. Use Method 1 or 2 to troubleshoot the issue.
The job is normal but runs slowly. Use Method 3, 4, or 5 to troubleshoot the issue.
1. Check whether the executor is in the Dead state.
To perform this operation, find your job and click SparkUI in the Actions column to access the Spark web UI. On the Executors page, query the Status field to check whether the executor is in the Dead state.
Solution: For more information, see What do I do if a Spark executor becomes dead?
2. Query the driver logs to check whether they contain an error message that indicates that the job is terminated and retried.
To perform this operation, find your job and click SparkUI in the Actions column to access the Spark web UI. On the Executors page, find the executor whose Executor ID is the ID of the driver and query its standard error log.
Solution: Identify the specific cause of this issue based on the error message. In most cases, this issue is caused by errors in the business logic. Check whether such errors exist. If they do, search for a solution online.
If this issue is caused by an out-of-memory (OOM) error, check whether the business logic occupies a large amount of memory, especially when the size of a single field is excessively large. If you require more memory resources, select higher specifications for the driver or executors.
3. Check whether this issue is caused by insufficient resources.
To perform this operation, find your job and click SparkUI in the Actions column to access the Spark web UI. On the Stages page, find the stage in which the job runs slowly and query the task parallelism in this stage based on Tasks: Succeeded/Total.
If the value of Total in the stage is greater than N, computing resources are insufficient. N is calculated by using the following formula: Number of executors × Number of CPU cores on each executor.
For example, if the value of Total is 100, the spark.executor.instances parameter is set to 5, and the spark.executor.resourceSpec parameter is set to medium (2 CPU cores and 8 GB of memory), a maximum of 10 tasks can run at the same time in the stage, and the tasks must be run in 10 batches.
In this case, increase the value of the spark.executor.instances or spark.executor.resourceSpec parameter. We recommend that N not exceed the total number of tasks in a stage. Otherwise, resources may be wasted.
4. Check whether this issue is caused by garbage collection (GC).
To perform this operation, find your job and click SparkUI in the Actions column to access the Spark web UI. On the Executors page, query the value of the Task Time (GC Time) field.
If the GC time is long for some executors, use one of the following methods to reduce it:
Optimize business logic.
Select higher specifications for executors or the driver.
5. Query the stack information of executors or the driver to identify the cause of this issue.
To perform this operation, find your job and click SparkUI in the Actions column to access the Spark web UI. On the Executors page, query the value of the Thread Dump field.
You can view the value of the Thread Dump field only if a job is in the Running state.
Solution: Refresh the thread dump multiple times and compare the results to identify the cause of this issue.
If the stacks show that the job spends most of its time in one or more specific function invocations, the logic that corresponds to those invocations may contain errors or be inefficient. In this case, optimize the logic.
If this issue is caused by errors in the Spark code logic, search for a solution online.