
E-MapReduce: Get started with notebook development

Last Updated: Oct 28, 2024

E-MapReduce (EMR) Serverless Spark allows you to perform interactive development by using notebooks. This topic describes how to create and run a notebook.

Prerequisites

Procedure

Step 1: Prepare a test file

This topic provides a test file to help you quickly get started with notebook-based jobs. You can download the test file for use in subsequent steps.

Click employee.csv to download the test file.

Note

The employee.csv file contains data such as employee names, departments, and salaries.
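
The exact contents of the test file are not shown in this topic. Based on the columns that the code in Step 3 references (department and salary), the file is assumed to have a structure similar to the following. The header names and sample rows are illustrative only:

  name,department,salary
  Alice,Engineering,8500
  Bob,Sales,6000
  Carol,Engineering,9200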

Step 2: Upload the test file

Upload the employee.csv file to Object Storage Service (OSS) by using the OSS console. For more information, see Simple upload.
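
If you prefer to upload the file programmatically instead of in the console, the following is a minimal sketch that uses the OSS Python SDK (oss2). The endpoint, bucket name, and object path are placeholders that you must replace with your own values:

  import oss2

  # Credentials of an account that has write permissions on the bucket.
  auth = oss2.Auth('<yourAccessKeyId>', '<yourAccessKeySecret>')
  # Replace the endpoint and bucket name with your own values.
  bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', '<yourBucketName>')

  # Upload the local employee.csv file to the specified object path in the bucket.
  bucket.put_object_from_file('path/to/employee.csv', 'employee.csv')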

Step 3: Develop and run a notebook

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

  2. Create a notebook.

    1. On the Development tab, click Create.

    2. In the Create dialog box, configure the Name parameter, choose Python > Notebook from the Type drop-down list, and then click OK.

  3. In the upper-right corner of the configuration tab that appears, use the drop-down list to select a notebook session that you have created and started.

    You can also select Create Notebook Session from the drop-down list to create a notebook session. For more information about notebook sessions, see Manage notebook sessions.

    Note

    A notebook session can be used by only a single notebook at a time. If no notebook session is available, you can detach a notebook from an existing session in the notebook session drop-down list to free that session, or create a new notebook session.

  4. Perform data processing and visual analysis.

    Run a PySpark job

    1. Copy the following code to the Python cell of the created notebook. A variant of the aggregation that uses explicit column functions is sketched after this procedure.

      # Read the test file into a DataFrame. Replace the OSS path with the path of the file that you uploaded in Step 2.
      df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")
      # Display the first five rows of the DataFrame.
      df.show(5)
      # Perform a simple aggregation to calculate the total salary of each department.
      sum_salary_per_department = df.groupBy("department").agg({"salary": "sum"})
      sum_salary_per_department.show()
    2. Click Run All Cells to run the notebook.

      You can also run a single cell by clicking the run icon in front of the cell.


    3. Optional. View job information on the Spark UI.

      In the notebook session drop-down list, move the pointer over the desired notebook session and click Spark UI to go to the Spark Jobs page, where you can view information about the Spark job.

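    The dictionary form agg({"salary": "sum"}) used above names the result column sum(salary). The following is a minimal sketch of the same aggregation with explicit column functions, which lets you name and sort the result column. It assumes that the uploaded file contains department and salary columns and that df is the DataFrame created above:

      from pyspark.sql import functions as F

      # Cast salary to a numeric type, then sum it per department.
      sum_salary_per_department = (
          df.groupBy("department")
            .agg(F.sum(F.col("salary").cast("double")).alias("total_salary"))
            .orderBy(F.col("total_salary").desc())
      )
      sum_salary_per_department.show()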

    Perform visual analysis by using third-party libraries

    Note

    The Matplotlib, NumPy, and pandas libraries are pre-installed in notebook sessions. For information about how to use other third-party libraries, see Use third-party Python libraries in a notebook.

    1. Use the Matplotlib library to visualize data. A sketch that plots the aggregated salary data from the PySpark example follows this procedure.

      import matplotlib.pyplot as plt

      # Generate the numbers 0 to 19 with Spark and collect them to the driver.
      l = sc.parallelize(range(20)).collect()
      # Plot the collected values as a line chart.
      plt.plot(l)
      plt.ylabel('some numbers')
      plt.show()
    2. Click Run All Cells to run the notebook.

      You can also run a single cell by clicking the run icon in front of the cell.

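    To connect the two examples, you can also plot aggregated data from Spark instead of a synthetic sequence. The following is a minimal sketch that converts the aggregation result to a pandas DataFrame and draws a bar chart. It assumes that the sum_salary_per_department DataFrame from the explicit-functions sketch above, with its total_salary column, is available in the same session:

      import matplotlib.pyplot as plt

      # Collect the aggregated Spark DataFrame to the driver as a pandas DataFrame.
      pdf = sum_salary_per_department.toPandas()

      # Draw a bar chart of total salary per department.
      plt.bar(pdf["department"], pdf["total_salary"])
      plt.xlabel("department")
      plt.ylabel("total salary")
      plt.show()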

Step 4: Publish the notebook

  1. Confirm that the notebook runs as expected. Then, click Publish in the upper-right corner of the configuration tab of the notebook.

  2. In the Publish dialog box, configure the Remarks parameter and click OK.