You can run Spark on MaxCompute tasks in local or cluster mode. You can also run offline Spark on MaxCompute tasks in cluster mode in DataWorks to integrate the tasks with other types of nodes for scheduling. This topic describes how to configure and schedule a Spark on MaxCompute task in DataWorks.
Prerequisites
An ODPS Spark node is created. For more information, see Create and manage ODPS nodes.
Limits
If an error is reported when you commit an ODPS Spark node that uses the Spark 3.X version, purchase a serverless resource group. For more information, see Create and use a serverless resource group.
Background information
Spark on MaxCompute is a computing service that is provided by MaxCompute and is compatible with open source Spark. Spark on MaxCompute provides a Spark computing framework based on unified computing resource and dataset permission systems. Spark on MaxCompute allows you to use your preferred development method to submit and run Spark tasks. Spark on MaxCompute can meet diverse data processing and analytics requirements. In DataWorks, you can use ODPS Spark nodes to schedule and run Spark on MaxCompute tasks and integrate Spark on MaxCompute tasks with other types of tasks.
Spark on MaxCompute allows you to use Java, Scala, or Python to develop tasks and run the tasks in local or cluster mode. Spark on MaxCompute also allows you to run offline Spark on MaxCompute tasks in cluster mode in DataWorks. For more information about the running modes of Spark on MaxCompute tasks, see Running modes.
Preparations
ODPS Spark nodes allow you to use Java, Scala, or Python to develop and run offline Spark on MaxCompute tasks. The operations and parameters that are required for developing the offline Spark on MaxCompute tasks vary based on the programming language that you use. You can select a programming language based on your business requirements.
Java/Scala
Before you run Java or Scala code in an ODPS Spark node, you must complete the development of code for a Spark on MaxCompute task on your on-premises machine and upload the code to DataWorks as a MaxCompute resource. You must perform the following steps:
Prepare a development environment.
You must prepare the development environment in which you want to run a Spark on MaxCompute task based on the operating system that you use. For more information, see Set up a Linux development environment or Set up a Windows development environment.
Develop Java or Scala code.
Before you run Java or Scala code in an ODPS Spark node, you must complete the development of code for a Spark on MaxCompute task on your on-premises machine or in the prepared development environment. We recommend that you use the sample project template provided by Spark on MaxCompute.
Package the developed code and upload the code to DataWorks.
After the code is developed, you must package the code and upload the package to DataWorks as a MaxCompute resource. For more information, see Create and use MaxCompute resources.
Programming language: Python (Use the default Python environment)
DataWorks allows you to develop a PySpark task by writing code in a Python resource online in DataWorks, and then committing and running the code by using an ODPS Spark node. For information about how to create a Python resource in DataWorks and view examples for developing Spark on MaxCompute applications by using PySpark, see Create and use MaxCompute resources and Develop a Spark on MaxCompute application by using PySpark.
You can use the default Python environment provided by DataWorks to develop code. If third-party packages that are supported by the default Python environment cannot meet the requirements of the PySpark task, you can refer to Programming language: Python (Use a custom Python environment) to prepare a custom Python environment. You can also use PyODPS 2 nodes or PyODPS 3 nodes, which support more Python resources for the development.
Programming language: Python (Use a custom Python environment)
If the default Python environment cannot meet your business requirements, you can perform the following steps to prepare a custom Python environment to run your Spark on MaxCompute task.
Prepare a Python environment on your on-premises machine.
You can refer to PySpark Python versions and supported dependencies to configure a Python environment based on your business requirements.
Package the code for the Python environment and upload the package to DataWorks.
You must package the code for the Python environment in the ZIP format and upload the package to DataWorks as a MaxCompute resource. This way, you can run the Spark on MaxCompute task in the environment. For more information, see Create and use MaxCompute resources.
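As a sketch of the last step (the resource name python-env.zip, the project name my_project, and the paths below are hypothetical, and the exact property names depend on your Spark on MaxCompute version; see PySpark Python versions and supported dependencies), the uploaded archive is then referenced in the configuration items of the ODPS Spark node so that the task runs with your custom interpreter:

```
spark.hadoop.odps.cupid.resources=my_project.python-env.zip
spark.pyspark.python=./python-env.zip/python-env/bin/python3
```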
Descriptions of parameters
You can run offline Spark on MaxCompute tasks in cluster mode in DataWorks. In this mode, you must specify the Main method as the entry point of a custom application. A Spark task ends when the Main method enters the Success or Fail state.
You must add the configuration items in the spark-defaults.conf file to the configurations of the ODPS Spark node, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point.
You do not need to upload the spark-defaults.conf file itself. Instead, you must add the configuration items in the spark-defaults.conf file to the configurations of the ODPS Spark node one by one.
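For example (a sketch only; the values below are illustrative and the endpoint is a placeholder that you must replace with the endpoint of your own region), the configuration items of the node might contain entries that would otherwise live in spark-defaults.conf:

```
spark.executor.instances=2
spark.executor.cores=2
spark.executor.memory=4g
spark.driver.memory=4g
spark.hadoop.odps.runtime.end.point=<your MaxCompute endpoint>
```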
| Parameter | Description | spark-submit command |
| --- | --- | --- |
| Spark Version | The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x. Note: If an error is reported when you commit an ODPS Spark node that uses the Spark 3.x version, purchase a serverless resource group. For more information, see Create and use a serverless resource group. | None |
| Language | The programming language. Valid values: Java/Scala and Python. You can select a programming language based on your business requirements. | None |
| Main JAR Resource / Main Python Resource | The main JAR or Python resource file. You must upload the required resource file to DataWorks and commit the resource file in advance. For more information, see Create and use MaxCompute resources. | app jar or python file |
| Configuration Items | The configuration items that are required to submit the Spark on MaxCompute task. | --conf PROP=VALUE |
| Main Class | The name of the main class. This parameter is required only if you set the Language parameter to Java/Scala. | --class CLASS_NAME |
| Parameters | You can add parameters based on your business requirements. Separate multiple parameters with spaces. DataWorks allows you to add scheduling parameters in the ${Variable name} format. After the parameters are added, you must click the Properties tab in the right-side navigation pane and assign values to the related variables in the Scheduling Parameter section. Note: For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters. | [app arguments] |
| Other Resources | Other resource types, such as JAR, Python, file, and archive resources, are also supported. You can select resource types based on your business requirements. You must upload the required resource files to DataWorks and commit them in advance. For more information, see Create and use MaxCompute resources. | Commands for different types of resources: --jars JARS, --py-files PY_FILES, --files FILES, --archives ARCHIVES |
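As a hedged illustration of how the Parameters field reaches your code (the values 20240101 and prod and the ${bizdate} variable below are hypothetical examples), a PySpark main script receives the space-separated values as ordinary command-line arguments in sys.argv, in the order they were entered:

```python
import sys


def parse_job_args(argv):
    """Return the arguments passed through the node's Parameters field.

    DataWorks passes the field's space-separated values to the main
    script as command-line arguments, so argv[1:] holds them in order.
    Scheduling variables such as ${bizdate} are resolved before the
    task starts, so the script only ever sees plain strings.
    """
    return argv[1:]


if __name__ == '__main__':
    # If the Parameters field contains "${bizdate} prod", argv might be:
    args = parse_job_args(['spark_job.py', '20240101', 'prod'])
    bizdate, env = args
    print(bizdate, env)
```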
Simple code editing example
This section provides a simple example to show how to use an ODPS Spark node to develop a Spark on MaxCompute task. In this example, a Spark on MaxCompute task is developed to determine whether a string can be converted into digits.
Create a resource.
On the DataStudio page of the DataWorks console, create a Python resource named spark_is_number.py. For more information, see Create and use MaxCompute resources. Sample code:
```python
# -*- coding: utf-8 -*-
import sys

from pyspark.sql import SparkSession

try:
    # for python 2
    reload(sys)
    sys.setdefaultencoding('utf8')
except:
    # python 3 not needed
    pass

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("spark sql") \
        .config("spark.sql.broadcastTimeout", 20 * 60) \
        .config("spark.sql.crossJoin.enabled", True) \
        .config("odps.exec.dynamic.partition.mode", "nonstrict") \
        .config("spark.sql.catalogImplementation", "odps") \
        .getOrCreate()

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            pass
        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass
        return False

    print(is_number('foo'))
    print(is_number('1'))
    print(is_number('1.3'))
    print(is_number('-1.37'))
    print(is_number('1e3'))
```
Save and commit the resource.
In the created ODPS Spark node, configure parameters and scheduling properties for the MaxCompute Spark task by referring to the Descriptions of parameters section in this topic, and save and commit the node.
| Parameter | Description |
| --- | --- |
| Spark Version | Select Spark2.x. |
| Language | Select Python. |
| Main Python Resource | Select the Python resource spark_is_number.py that you created. |
Go to Operation Center in the development environment to backfill data for the ODPS Spark node. For more information, see Backfill data and view data backfill instances (new version).
Note: DataWorks does not provide entry points for you to run ODPS Spark nodes in DataStudio. You must run ODPS Spark nodes in Operation Center in the development environment.
View the result.
After the data backfill instance is successfully run, click tracking URL in the run logs that are generated to view the result. The following information is returned:
False
True
True
True
True
Advanced code editing examples
For more information about the development of Spark on MaxCompute tasks in other scenarios, see the following topics:
What to do next
After you complete the development of the Spark on MaxCompute task, you can perform the following operations:
Configure scheduling properties: You can configure properties for periodic scheduling of the node. If you want the system to periodically schedule and run the task, you must configure items for the node, such as rerun settings and scheduling dependencies. For more information, see Overview.
Debug the node: You can debug and test the code of the node to check whether the code logic meets your expectations. For more information, see Debugging procedure.
Deploy the node: After you complete all development operations, you can deploy the node. After the node is deployed, the system periodically schedules the node based on the scheduling properties of the node. For more information, see Deploy nodes.
Enable the system to diagnose Spark tasks: MaxCompute provides the Logview tool and Spark Web UI. You can view the logs of Spark tasks to check whether the tasks are submitted and run as expected.