You can run Spark on MaxCompute Jobs in Local Mode or Cluster Mode. In DataWorks, you can also run them as Offline Jobs in Cluster Mode and integrate them with other Node types. This topic describes how to configure and schedule these Jobs.
Overview
Spark on MaxCompute is a computing service provided by MaxCompute that is compatible with open-source Spark. It provides the Spark computing framework on top of a unified computing resource and data permission system, letting you use familiar development methods for data processing and analysis. In DataWorks, the ODPS Spark Node lets you schedule these Jobs and integrate them with others.
Spark on MaxCompute Jobs can be developed in Java, Scala, or Python. When run as Offline Jobs in DataWorks, they are executed in Cluster Mode. For more information about Spark on MaxCompute run modes, see Runtime modes.
Limitations
If submission fails for an ODPS Spark Node that uses Spark 3.x, you must purchase and use a Serverless Resource Group. For more information, see Use serverless resource groups.
Prerequisites
ODPS Spark Nodes support running Spark on MaxCompute Offline Jobs using Java/Scala or Python. The development process and configuration options vary by language.
Java/Scala
To run Java or Scala code in an ODPS Spark Node, you must first develop the Job and upload the packaged code to DataWorks as a MaxCompute Resource.
Prepare the development environment.
Prepare your development environment according to your operating system. For more information, see Set up a Linux development environment and Set up a Windows development environment.
Develop the Java/Scala code.
Develop the Spark on MaxCompute code in your local environment. We recommend using the sample project template provided by Spark on MaxCompute.
Package the code and upload it to DataWorks.
After you finish developing the code, package it and upload it to DataWorks as a MaxCompute Resource. For more information, see Create and use MaxCompute resources.
Python (using the default environment)
In DataWorks, you can write PySpark code directly in a Python Resource and use an ODPS Spark Node to submit and run it. For more information about how to create a Python Resource in DataWorks, see Create and use MaxCompute resources. For a PySpark development example, see Develop a Spark on MaxCompute application by using PySpark.
This method uses the default Python environment provided by DataWorks, which has limited support for third-party packages. If your Job requires other dependencies, you can prepare a custom Python environment as described in the Python (using a custom environment) section. Alternatively, you can use the PyODPS 2 Node or PyODPS 3 Node, which offer better support for Python Resources.
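As a minimal illustration of this mode, the following sketch shows the general shape of a PySpark script that runs in the default Python environment. The application name and query are placeholders, and the odps catalog setting mirrors the one used in the simple example later in this topic.

# -*- coding: utf-8 -*-
# Minimal PySpark sketch for an ODPS Spark node that uses the default Python environment.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Build a SparkSession that uses the MaxCompute (odps) catalog.
    spark = SparkSession.builder \
        .appName("pyspark_default_env_demo") \
        .config("spark.sql.catalogImplementation", "odps") \
        .getOrCreate()

    # Placeholder logic: replace with your own SQL or DataFrame code.
    spark.sql("SELECT 1 AS smoke_test").show()
    spark.stop()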
Python (using a custom environment)
If the default Python environment does not meet your needs, prepare a custom one to run your Spark on MaxCompute Job.
Prepare a local Python environment.
See PySpark Python version and dependency support for instructions on configuring a suitable Python environment.
Package the environment and upload it to DataWorks.
Compress the Python environment into a ZIP package and upload it to DataWorks as a MaxCompute Resource. This package provides the execution environment for your Spark on MaxCompute Job. For more information, see Create and use MaxCompute resources.
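The uploaded archive is then referenced when you configure the ODPS Spark Node: select it under Select other resources and point the Job at the interpreter inside it through configuration items. The lines below are only a sketch; the resource name python-env.zip and the interpreter path are assumptions, and the exact keys and paths to use are described in PySpark Python version and dependency support.

# Assumed archive resource: python-env.zip, which unpacks to a directory that contains bin/python3.
spark.pyspark.python=./python-env.zip/bin/python3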
Parameters
DataWorks runs Spark on MaxCompute Offline Jobs in Cluster Mode. In this mode, you must specify a custom program entry point, Main. The Spark Job terminates and returns a Success or Fail status when the Main method completes. In addition, the settings from spark-defaults.conf must be added one by one to the Configuration Item section of the ODPS Spark Node, for example the number of executor instances, the executor memory, and spark.hadoop.odps.runtime.end.point.
Do not upload the spark-defaults.conf file. Instead, add each of its settings as a separate Configuration Item on the ODPS Spark Node.
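For example, instead of uploading spark-defaults.conf, you might add configuration items such as the following. The values are only illustrative; size them according to your workload and replace the endpoint placeholder with the endpoint of your region.

spark.executor.instances=4
spark.executor.cores=2
spark.executor.memory=4g
spark.driver.memory=4g
spark.hadoop.odps.runtime.end.point=<your MaxCompute endpoint>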

The following table describes the parameters of the ODPS Spark Node.

| Parameter | Description | spark-submit command |
| --- | --- | --- |
| Spark version | The Spark version used to run the Job. Available versions include Spark 1.x, Spark 2.x, and Spark 3.x. Note: If submission fails for an ODPS Spark Node that uses Spark 3.x, you must purchase and use a Serverless Resource Group. For more information, see Use serverless resource groups. | — |
| Language | Select Java/Scala or Python based on the development language of your Spark on MaxCompute Job. | — |
| Select main resource | Specify the main JAR Resource (for Java/Scala) or main Python Resource (for Python) of the Job. You must first upload and commit the Resource. | The application JAR file or Python file |
| Configuration Item | Specify the configuration items for submitting the Spark on MaxCompute Job. Add each setting from spark-defaults.conf as a separate item instead of uploading the file. | --conf PROP=VALUE |
| Main Class | Specify the name of the main class. This parameter is required only when the language is Java/Scala. | --class CLASS_NAME |
| Arguments | You can add arguments as needed, separated by spaces. DataWorks supports scheduling parameters in the format ${variable_name}. After configuration, assign a value to the variable in the scheduling configuration in the right-side pane. Note: For information about the supported formats for assigning values to scheduling parameters, see Supported formats for scheduling parameters. | Application arguments |
| Select other resources | You can select other Resources that the Job requires at runtime, such as JAR, Python, File, and Archive Resources. You must first upload and commit these Resources. | Different Resource types correspond to different spark-submit parameters, such as --jars, --py-files, --files, and --archives. |
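For readers who are used to spark-submit, the node fields map roughly to a command of the following shape. The class name, file names, and argument are placeholders; in DataWorks you fill in the corresponding node fields instead of running this command yourself, and the trailing argument could equally be a scheduling parameter such as ${bizdate}.

# Main Class             -> --class
# Configuration Item     -> --conf
# Select other resources -> --jars / --py-files / --files / --archives
# Select main resource   -> the application JAR or Python file
# Arguments              -> trailing application arguments
spark-submit \
    --class com.example.SparkMain \
    --conf spark.executor.instances=4 \
    --jars extra-dependency.jar \
    main-app.jar 20240101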
Simple example
This section demonstrates how to use an ODPS Spark Node with a simple example: checking if a string is numeric.
Create a resource.
On the Data Development page, create a new Python Resource and name it spark_is_number.py. For more information, see Create and use MaxCompute resources.

# -*- coding: utf-8 -*-
import sys
from pyspark.sql import SparkSession

try:
    # for python 2
    reload(sys)
    sys.setdefaultencoding('utf8')
except:
    # python 3 not needed
    pass

if __name__ == '__main__':
    spark = SparkSession.builder\
        .appName("spark sql")\
        .config("spark.sql.broadcastTimeout", 20 * 60)\
        .config("spark.sql.crossJoin.enabled", True)\
        .config("odps.exec.dynamic.partition.mode", "nonstrict")\
        .config("spark.sql.catalogImplementation", "odps")\
        .getOrCreate()

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            pass

        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass

        return False

    print(is_number('foo'))
    print(is_number('1'))
    print(is_number('1.3'))
    print(is_number('-1.37'))
    print(is_number('1e3'))

Save and commit the Resource.
In the ODPS Spark Node that you created, configure the Node parameters and the scheduling parameters as described in Parameters, then save and commit the Node.

| Parameter | Description |
| --- | --- |
| Spark version | Spark 2.x |
| Language | Python |
| Select main Python resource | From the drop-down list, select the Python Resource that you created: spark_is_number.py. |

Go to the Operation Center for the development environment and run a Data Backfill Job. For detailed instructions, see Data backfill instance O&M.
Note: Because ODPS Spark Nodes in Data Development cannot be run directly, you must run the Job from the Operation Center in the development environment.
View the results.
After the Data Backfill Instance runs successfully, go to its tracking URL in the Run Log to view the results:
False
True
True
True
True
Advanced examples
For more examples of developing Spark on MaxCompute Jobs for different use cases, see the Spark on MaxCompute development documentation.
Next steps
After you develop the Job, you can perform the following operations.
Scheduling: Configure the scheduling properties for the node. If the task must run periodically, you must configure properties such as rerun settings and scheduling dependencies. For more information, see Overview of task scheduling properties.
Task debugging: Test and run the code of the current node to verify that the code logic is correct. For more information, see Task debugging process.
Task deployment: After you complete all development-related operations, you must deploy all nodes. After deployment, the nodes run periodically based on their scheduling configurations. For more information, see Deploy tasks.
Diagnose Spark Jobs: MaxCompute provides the Logview tool and the Spark web UI for Spark Jobs. You can check the Run Log to verify correct submission and execution.