The Lindorm compute engine provides a RESTful API to submit Spark Python jobs. You can use this API to run streaming and batch tasks, machine learning tasks, and graph computing tasks. This topic describes how to develop and submit a Python job for the Lindorm compute engine.
Prerequisites
You have activated the Lindorm compute engine. For more information, see Activate the service.
Spark Python Job Development Process
Step 1: Define a Python-based Spark job
Click Sample Spark job to download the sample package.
Extract the downloaded package. The extracted folder is named lindorm-spark-examples. Go to the lindorm-spark-examples/python directory and review the Python directory structure. This section describes the project directory structure, assuming that your_project is the project root directory.

Create an empty file named __init__.py in the your_project directory.

Modify the entry file:

Add the path of the your_project directory to sys.path. For details, see the Notice1 section in lindorm-spark-examples/python/your_project/main.py.

# Notice1: You need to do the following steps to complete the code modification:
# Step1: Please add a "__init__.py" to your project directory, your project will act as a module of launcher.py
# Step2: Please add current dir to sys.path, you should add the following code to your main file
current_dir = os.path.abspath(os.path.dirname(__file__))
sys.path.append(current_dir)
print("current dir in your_project: %s" % current_dir)
print("sys.path: %s \n" % str(sys.path))

Encapsulate the entry logic into the main(argv) method. For details, see the Notice2 section in lindorm-spark-examples/python/your_project/main.py.

# Notice2: Move the code in the `if __name__ == "__main__":` branch to a newly defined main(argv) function,
# so that launcher.py in the parent directory just calls main(sys.argv)
def main(argv):
    print("Receive arguments: %s \n" % str(argv))
    print("current dir in main: %s \n" % os.path.abspath(os.path.dirname(__file__)))
    # Write your code here

if __name__ == "__main__":
    main(sys.argv)
Create an entry file to call the main(argv) method. In the parent directory of your_project (the directory that contains your_project), create a file named launcher.py. You can copy the code from lindorm-spark-examples/python/launcher.py.
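The sample file lindorm-spark-examples/python/launcher.py is the authoritative version of this entry file. For orientation only, the following is a minimal sketch of the pattern it implements, assuming that your_project/main.py defines main(argv) as described above:

# launcher.py (illustrative sketch; copy the sample file for real jobs)
import sys

# your_project is importable as a package because it contains __init__.py.
from your_project import main as project_main

if __name__ == "__main__":
    # Forward the job arguments to the main(argv) entry point in your_project/main.py.
    project_main.main(sys.argv)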
Step 2: Package the Python-based Spark job
Package the Python runtime environment and third-party libraries that your project depends on. We recommend using Conda or Virtualenv to package these dependencies into a tar file. For more information, see Python Package Management.
Important: Use the spark.archives parameter to pass tar files created by Conda or Virtualenv. All formats supported by spark.archives are valid. For more information, see spark.archives.
Complete this step in Linux to ensure the Lindorm compute engine recognizes Python binary files.
Package the project files. Compress the your_project directory into a .zip or .egg file.

Run the following command to create a .zip file:

zip -r project.zip your_project

To create a .egg file, see Building Eggs.
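For reference only, an .egg file can be built with setuptools. The following setup.py is a minimal, hypothetical sketch; the package name and version are placeholders, and Building Eggs remains the authoritative guide.

# setup.py (illustrative sketch; assumes a setuptools release that still supports bdist_egg)
# Place this file in the directory that contains your_project, then run:
#   python setup.py bdist_egg
# The resulting .egg file is written to the dist/ directory.
from setuptools import setup, find_packages

setup(
    name="your_project",       # placeholder project name
    version="0.1.0",           # placeholder version
    packages=find_packages(),  # picks up your_project because it contains __init__.py
)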
Step 3: Upload the files of the Python-based Spark job
Upload the project package (the .zip or .egg file), the Python environment tar file, and launcher.py to OSS. For more information, see Simple upload.
Step 4: Submit the Python-based Spark job
The Lindorm compute engine supports two ways to submit and manage jobs:
Submit jobs in the Lindorm console. For more information, see Manage jobs in the console.
Submit jobs using DMS. For more information, see Manage jobs using DMS.
Request parameters consist of the following two parts:
Parameters for the Python job runtime environment. Example:
{"spark.archives":"oss://testBucketName/pyspark_conda_env.tar.gz#environment", "spark.kubernetes.driverEnv.PYSPARK_PYTHON":"./environment/bin/python","spark.submit.pyFiles":"oss://testBucketName/your_project.zip"}When submitting project files (
.zip,.egg, or.py), set spark.submit.pyFiles in the configs parameter.When submitting the tar file containing the Python runtime environment and third-party libraries, set spark.archives and spark.kubernetes.driverEnv.PYSPARK_PYTHON in the configs parameter.
In the spark.archives parameter, use a number sign (#) after the file path to specify targetDir, the directory to which the archive is extracted (for example, #environment in the preceding configuration).

Set spark.kubernetes.driverEnv.PYSPARK_PYTHON to the path of the Python executable inside the extracted environment, for example ./environment/bin/python.
If you upload files to OSS, configure the following parameters in the configs parameter.
Table 1. Configs parameters
spark.hadoop.fs.oss.endpoint
Example: oss-cn-beijing-internal.aliyuncs.com
Description: The endpoint of the OSS bucket in which the Python files are stored.

spark.hadoop.fs.oss.accessKeyId
Example: testAccessKeyId
Description: The AccessKey ID that you create in the Alibaba Cloud Management Console. For more information, see Create an AccessKey pair.

spark.hadoop.fs.oss.accessKeySecret
Example: testAccessKeySecret
Description: The AccessKey secret that corresponds to the AccessKey ID.

spark.hadoop.fs.oss.impl
Example: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
Description: The class used to access OSS.
Note: For more parameters, see Parameters.
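For example, a complete configs value that combines the runtime environment parameters with the OSS access parameters might look like the following; the bucket name, endpoint, and AccessKey values are placeholders that you must replace with your own.

{
  "spark.archives": "oss://testBucketName/pyspark_conda_env.tar.gz#environment",
  "spark.kubernetes.driverEnv.PYSPARK_PYTHON": "./environment/bin/python",
  "spark.submit.pyFiles": "oss://testBucketName/your_project.zip",
  "spark.hadoop.fs.oss.endpoint": "oss-cn-beijing-internal.aliyuncs.com",
  "spark.hadoop.fs.oss.accessKeyId": "testAccessKeyId",
  "spark.hadoop.fs.oss.accessKeySecret": "testAccessKeySecret",
  "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
}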
Python Job Development Example
Click Sample Spark job to download and extract the file.
Open the your_project/main.py file and modify the entry point.
Add the your_project directory to sys.path.
current_dir = os.path.abspath(os.path.dirname(__file__))
sys.path.append(current_dir)
print("current dir in your_project: %s" % current_dir)
print("sys.path: %s \n" % str(sys.path))

Add the entry logic to the main.py file. The following example initializes a SparkSession. (A sketch that assembles these pieces into a complete main.py appears after this procedure.)

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonImportTest") \
    .getOrCreate()
print(spark.conf)
spark.stop()
In the Python directory, compress the your_project directory into a ZIP file.
zip -r your_project.zip your_project

In Linux, use Conda to package the Python runtime environment.

conda create -y -n pyspark_conda_env -c conda-forge numpy conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz

Upload your_project.zip, pyspark_conda_env.tar.gz, and launcher.py to OSS.
Submit the job using one of the following methods:
Submit the job in the Lindorm console. For more information, see Manage jobs in the console.
Submit the job using DMS. For more information, see Manage jobs using DMS.
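For reference, the following is a minimal sketch of what your_project/main.py might look like after the modifications in this example are applied; it simply combines the sys.path setup, the main(argv) entry point, and the SparkSession logic shown above.

# main.py (illustrative sketch that assembles the pieces from this example)
import os
import sys

# Notice1: make the your_project directory importable by launcher.py.
current_dir = os.path.abspath(os.path.dirname(__file__))
sys.path.append(current_dir)
print("current dir in your_project: %s" % current_dir)
print("sys.path: %s \n" % str(sys.path))

from pyspark.sql import SparkSession

def main(argv):
    # Notice2: the entry logic lives in main(argv) so that launcher.py can call main(sys.argv).
    print("Receive arguments: %s \n" % str(argv))
    spark = SparkSession \
        .builder \
        .appName("PythonImportTest") \
        .getOrCreate()
    print(spark.conf)
    spark.stop()

if __name__ == "__main__":
    main(sys.argv)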
Job diagnostics
After you submit a Python job, view its status and Spark UI address on the Jobs page. For more information, see View a job. If you encounter issues during submission, submit a ticket. Provide the job ID and Spark UI address to support staff.