E-MapReduce:Get started with notebook development

Last Updated: Sep 13, 2025

EMR Serverless Spark supports interactive development using notebooks. This topic describes how to create and run a notebook.

Prerequisites

A workspace is created. For more information, see Create a workspace.

Procedure

Step 1: Prepare a test file

This topic provides a test file to help you familiarize yourself with notebook jobs. You can download the test file to use in the following steps.

Click employee.csv to download the test file.

Note

The employee.csv file contains data for employee names, departments, and salaries.
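
For reference, the file should look similar to the following hypothetical excerpt. The name header and the sample rows are illustrative assumptions; the code in Step 3 relies only on the department and salary columns.

  name,department,salary
  Alice,Engineering,85000
  Bob,Sales,60000
  Carol,Engineering,92000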

Step 2: Upload the test file

In the Object Storage Service (OSS) console, upload the data file (employee.csv) to a bucket. For more information, see Upload files.
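
If you prefer to upload the file programmatically, you can use the OSS Python SDK (oss2). The following is a minimal sketch, not the procedure from this topic; the credentials, endpoint, bucket name, and object key are placeholders that you must replace with your own values.

  import oss2

  # Placeholders: replace with your own AccessKey pair, endpoint, and bucket name.
  auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
  bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")

  # Upload the local employee.csv file to the bucket as the object data/employee.csv.
  bucket.put_object_from_file("data/employee.csv", "employee.csv")

The resulting OSS path, for example oss://<your-bucket>/data/employee.csv, is the path that you reference in the notebook code in Step 3.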

Step 3: Develop and run a notebook

  1. On the EMR Serverless Spark page, click Data Development in the navigation pane on the left.

  2. Create a notebook.

    1. On the Development Folder tab, click the create icon.

    2. In the dialog box that appears, enter a name, select Interactive Development > Notebook as the Type, and then click OK.

  3. In the upper-right corner, select a running notebook session instance.

    You can also select Create Notebook Session from the drop-down list to create a new notebook session instance. For more information about notebook sessions, see Manage notebook sessions.

    Note

    Multiple notebooks can share a single notebook session instance. This lets you access and use the same session resources from multiple notebooks at the same time, without needing to create a new session instance for each notebook.

  4. Process and visualize data.

    Run a PySpark job

    1. Copy the following code to the Python cell of the new notebook. An alternative form of the aggregation is sketched after this procedure.

      # Read the test file into a DataFrame. Replace the OSS path with the path of the file that you uploaded in Step 2.
      df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")
      # Display the first few rows of the DataFrame.
      df.show(5)
      # Perform a simple aggregation to calculate the total salary of each department.
      sum_salary_per_department = df.groupBy("department").agg({"salary": "sum"})
      sum_salary_per_department.show()
    2. Click Run All Cells to run the notebook.

      You can also run a specific cell by clicking the run icon in front of the cell.

    3. (Optional) View the Spark UI.

      In the session drop-down list, move the pointer over the entry for the current notebook session and click Spark UI. You are redirected to the Spark Jobs page, where you can view the Spark job information.

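    The dictionary form of agg() used above is concise, but it does not let you name the result column. A minimal equivalent sketch that uses the pyspark.sql.functions API, assuming the same df DataFrame:

      from pyspark.sql import functions as F

      # Equivalent aggregation with an explicit alias for the result column.
      df.groupBy("department").agg(F.sum("salary").alias("total_salary")).show()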

    Perform visual analytics using third-party libraries

    Note

    Notebook sessions have the matplotlib, numpy, and pandas libraries pre-installed. For more information about how to use other third-party libraries, see Use third-party Python libraries in a notebook.

    1. Use the matplotlib library to visualize data. A variant that charts the Spark aggregation result is sketched after this procedure.

      import matplotlib.pyplot as plt

      # Generate a small list of numbers with Spark and collect it to the driver.
      l = sc.parallelize(range(20)).collect()
      # Plot the list as a simple line chart.
      plt.plot(l)
      plt.ylabel('some numbers')
      plt.show()
    2. Click Run All Cells to run the notebook.

      You can also run a specific cell by clicking the run icon in front of the cell.

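    To chart the Spark aggregation result itself rather than generated numbers, you can convert the small aggregated DataFrame to pandas and plot it. A minimal sketch, assuming the df DataFrame from the PySpark job above; sum(salary) is the column name that Spark assigns to the dictionary-style aggregation:

      import matplotlib.pyplot as plt

      # Aggregate in Spark, then bring the small result to the driver as a pandas DataFrame.
      pdf = df.groupBy("department").agg({"salary": "sum"}).toPandas()

      # Plot the total salary of each department as a bar chart.
      pdf.plot(kind="bar", x="department", y="sum(salary)", legend=False)
      plt.ylabel("total salary")
      plt.show()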

Step 4: Publish the notebook

  1. After the notebook finishes running, click Publish in the upper-right corner.

  2. In the Publish dialog box, configure the parameters and click OK to save the notebook as a new version.