
E-MapReduce: Get started with the development of PySpark batch jobs

Last Updated: Oct 28, 2024

You can develop PySpark batch jobs in a convenient manner by writing a Python script that contains your business logic and uploading the script. This topic provides an example on how to develop and run a PySpark batch job.

Prerequisites

A workspace is created.

Procedure

Step 1: Prepare test files

You can develop Python files on an on-premises machine or a standalone development platform and then submit the files to E-MapReduce (EMR) Serverless Spark for running. This topic provides test files to help you quickly get started with PySpark batch jobs. You can download the test files for use in subsequent steps.

Click DataFrame.py and employee.csv to download the test files.

Note
  • The DataFrame.py file contains code that processes data in Object Storage Service (OSS) by using the Apache Spark framework. A minimal sketch of what such a script can contain is provided after this note.

  • The employee.csv file contains data such as employee names, departments, and salaries.
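
The exact contents of DataFrame.py are not reproduced in this topic. The following is a minimal sketch of what such a script might look like. It assumes that the CSV file has name, department, and salary columns and that the OSS path of the data file is passed to the script as its first command-line argument, which matches the Execution Parameters setting in Step 3. The column names and the aggregation are assumptions for illustration only.

  # Minimal sketch of a PySpark script similar to DataFrame.py.
  # The column names (name, department, salary) and the aggregation below
  # are assumptions for illustration only.
  import sys

  from pyspark.sql import SparkSession

  if __name__ == "__main__":
      # The value that you enter in Execution Parameters is passed to the
      # script as a command-line argument, for example
      # oss://<yourBucketName>/employee.csv.
      input_path = sys.argv[1]

      spark = SparkSession.builder.appName("employee_dataframe").getOrCreate()

      # Read the CSV file. header and inferSchema are assumptions about the
      # file layout.
      df = spark.read.csv(input_path, header=True, inferSchema=True)

      # Example transformation: average salary for each department.
      df.groupBy("department").avg("salary").show()

      spark.stop()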

Step 2: Upload test files

  1. Upload a Python file to EMR Serverless Spark.

    1. Go to the Files page.

      1. Log on to the EMR console.

      2. In the left-side navigation pane, choose EMR Serverless > Spark.

      3. On the Spark page, find the desired workspace and click the name of the workspace.

      4. In the left-side navigation pane of the EMR Serverless Spark page, click Files.

    2. On the Files page, click Upload File.

    3. In the Upload File dialog box, click the area in a dotted-line rectangle to select a Python file, or directly drag a Python file to the area.

      In this example, the DataFrame.py file is uploaded.

  2. Upload the employee.csv file to OSS by using the OSS console. For more information, see Simple upload. If you prefer to upload the file programmatically, see the sketch after this step.
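
    As an alternative to the console, you can upload the file with the OSS Python SDK (oss2). The following is a minimal sketch; the endpoint, bucket name, and credentials are placeholders that you must replace with your own values.

      # Minimal sketch: upload employee.csv to OSS with the oss2 SDK.
      # The endpoint, bucket name, and credentials are placeholders.
      import oss2

      auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
      bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<yourBucketName>")

      # Upload the local employee.csv file as the object employee.csv.
      bucket.put_object_from_file("employee.csv", "employee.csv")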

Step 3: Develop and run a job

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

  2. On the Development tab, click Create.

  3. In the Create dialog box, configure the Name parameter, choose Batch Job > PySpark from the Type drop-down list, and then click OK.

  4. In the upper-right corner of the configuration tab of the job, select a queue for the job.

    For information about how to add a queue, see Manage resource queues.

  5. Configure the parameters that are described in the following table for the job and click Run. You do not need to configure other parameters.

    Parameter: Main Python Resources
    Description: Select the Python file that you uploaded on the Files page in the previous step. In this example, the DataFrame.py file is selected.

    Parameter: Execution Parameters
    Description: Enter the path of the employee.csv file that you uploaded to the OSS console. Example: oss://<yourBucketName>/employee.csv.

  6. After you run the job, find the job on the Execution Records tab and click Details in the Actions column.

  7. On the Development Job Runs tab of the Job History page, view the related logs. If the job output is not what you expect, you can test the same logic locally, as shown in the sketch after this procedure.

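Before you resubmit a job that fails or returns unexpected results, it can help to verify the processing logic locally. The following is a minimal sketch that runs the same kind of aggregation against a local copy of employee.csv. It assumes that PySpark is installed on your development machine and that the column names match the assumptions in the earlier sketch.

  # Minimal local check: run the aggregation against a local copy of
  # employee.csv before resubmitting the job. Column names are assumptions.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local[*]").appName("local_check").getOrCreate()

  df = spark.read.csv("employee.csv", header=True, inferSchema=True)
  df.groupBy("department").avg("salary").show()

  spark.stop()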

Step 4: Publish the job

Important

A published job can be run on a workflow node.

  1. Confirm that the job runs as expected. Then, click Publish in the upper-right corner of the configuration tab of the job.

  2. In the Publish dialog box, configure the Remarks parameter and click OK.

Step 5: View job information on the Spark UI

After the job runs as expected, you can view the running details of the job on the Spark UI.

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Job History.

  2. On the Job History page, click the Development Job Runs tab.

  3. On the Development Job Runs tab, find the desired job and click Details in the Actions column.

  4. On the Overview tab, click Spark UI in the Spark UI field.


  5. On the Spark Jobs page, view the running details of the job.


References