You can write a Python script that contains your business logic and upload the script to develop PySpark batch jobs in a convenient manner. This topic provides an example of how to develop a PySpark batch job.
Prerequisites
An Alibaba Cloud account is created. You can create an Alibaba Cloud account on the Alibaba Cloud signup page.
The required roles are assigned to the Alibaba Cloud account. For more information, see Assign roles to an Alibaba Cloud account.
A workspace is created. For more information, see Create a workspace.
Procedure
Step 1: Prepare test files
You can develop Python files in an on-premises environment or on a standalone development platform and then submit the files to E-MapReduce (EMR) Serverless Spark for running. This topic provides test files to help you quickly get started with PySpark batch jobs. You can download the test files for use in subsequent steps.
Click DataFrame.py and employee.csv to download the test files.
The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) under the Apache Spark framework. An illustrative sketch of such a script is provided after this list.
The employee.csv file contains data such as employee names, departments, and salaries.
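The following is a minimal sketch of what a script like DataFrame.py might contain. It assumes that the OSS path of employee.csv is passed as the first command-line argument and that the CSV file has name, department, and salary columns; the column names and the aggregation are illustrative assumptions, not the exact contents of the downloaded test file.

```python
# Illustrative sketch only; the actual DataFrame.py test file may differ.
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    # The OSS path of employee.csv, for example oss://<yourBucketName>/employee.csv,
    # is expected as the first command-line argument (see Execution Parameters in Step 3).
    input_path = sys.argv[1]

    spark = SparkSession.builder.appName("PySparkBatchExample").getOrCreate()

    # Read the employee data from OSS as a DataFrame.
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(input_path)
    )

    # Example transformation: average salary per department (column names are assumed).
    df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()

    spark.stop()
```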
Step 2: Upload test files
Upload a Python file to EMR Serverless Spark.
Go to the Files page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, find the desired workspace and click the name of the workspace.
In the left-side navigation pane of the EMR Serverless Spark page, click Files.
On the Files page, click Upload File.
In the Upload File dialog box, click the dotted-line upload area to select a Python file, or drag a Python file directly to the area.
In this example, the DataFrame.py file is uploaded.
Upload the employee.csv file to an OSS bucket in the OSS console. For more information, see Simple upload.
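If you prefer to upload the file programmatically instead of in the console, you can use the OSS Python SDK (oss2). In the following minimal sketch, the endpoint, bucket name, and credential environment variables are placeholders that you must replace with your own values.

```python
# Minimal sketch of uploading employee.csv with the OSS Python SDK (oss2).
import os

import oss2

# Credentials are read from environment variables in this sketch.
auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])

# Replace the endpoint and bucket name with your own values.
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<yourBucketName>")

# Upload the local employee.csv file to the root of the bucket.
bucket.put_object_from_file("employee.csv", "employee.csv")
```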
Step 3: Develop and run a job
In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.
On the Development tab, click Create.
In the Create dialog box, configure the Name parameter, choose Batch > PySpark from the Type drop-down list, and then click OK.
In the upper-right corner of the configuration tab of the job, select a queue for the job.
For information about how to add a queue, see Manage resource queues.
Configure the following parameters for the job and click Run. You do not need to configure other parameters.
Main Python Resources: Select the Python file that you uploaded on the Files page in the previous step. In this example, the DataFrame.py file is selected.
Execution Parameters: Enter the path of the employee.csv file that you uploaded to OSS in Step 2. Example: oss://<yourBucketName>/employee.csv.
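The value of Execution Parameters is passed to the main Python file as a command-line argument when the job runs. The following minimal sketch shows how a script can read that value; the variable name is illustrative.

```python
import sys

# With Execution Parameters set to oss://<yourBucketName>/employee.csv,
# sys.argv is roughly ["DataFrame.py", "oss://<yourBucketName>/employee.csv"].
input_path = sys.argv[1]
print(f"Reading employee data from {input_path}")
```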
After you run the job, click Details in the Actions column of the job on the Execution Records tab.
On the Development Job Runs tab of the Job History page, view related logs.
Step 4: Publish the job
A published job can be run on a workflow node.
Confirm that the job runs as expected. Then, click Publish in the upper-right corner of the configuration tab of the job.
In the Publish dialog box, configure the Remarks parameter and click OK.
Step 5: View job information on the Spark UI
After the job runs as expected, you can view the running details of the job on the Spark UI.
In the left-side navigation pane of the EMR Serverless Spark page, click Job History.
On the Job History page, click the Development Job Runs tab.
On the Development Job Runs tab, find the desired job and click Details in the Actions column.
On the Overview tab, click Spark UI in the Spark UI field.
On the Spark Jobs page, view the running details of the job.
References
After a job is published, you can schedule the job in workflows. For more information, see Manage workflows. For information about a complete job development and orchestration process, see Get started with the development of Spark SQL jobs.
For information about how to develop a PySpark streaming job, see Use EMR Serverless Spark to submit a PySpark streaming job.