E-MapReduce: Get started with notebook development

Last Updated: Sep 25, 2024

E-MapReduce (EMR) Serverless Spark allows you to perform interactive development by using notebooks. This topic describes how to create and run a notebook.

Prerequisites

Procedure

Step 1: Prepare a test file

This topic provides a test file to familiarize you with notebook-based jobs. You can download the test file for use in subsequent steps.

Click employee.csv to download the test file.

Note

The employee.csv file contains data such as employee names, departments, and salaries.
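
The downloaded file is not reproduced in this topic. As a rough illustration only, a file with this structure might look like the following. The rows below are made-up examples, not the actual contents of employee.csv; the department and salary columns are the ones referenced by the code in Step 3.

  name,department,salary
  Alice,Engineering,12000
  Bob,Sales,8000
  Cathy,Engineering,11000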

Step 2: Upload the test file

For information about how to upload the employee.csv file to the Object Storage Service (OSS) console, see Simple upload.
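
The console-based simple upload is sufficient for this tutorial. If you prefer to upload the file programmatically, the following is a minimal sketch that uses the oss2 Python SDK. The endpoint, bucket name, credentials, and object key are placeholders, not values from this tutorial.

  # Minimal sketch: upload employee.csv to OSS with the oss2 SDK.
  # The endpoint, bucket name, credentials, and object key are placeholders.
  import oss2

  auth = oss2.Auth('<your-access-key-id>', '<your-access-key-secret>')
  bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', '<your-bucket-name>')

  # Upload the local test file. The object key that you choose here is the path
  # that you reference in the oss:// URI in Step 3.
  bucket.put_object_from_file('path/to/employee.csv', 'employee.csv')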

Step 3: Develop and run a notebook

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

  2. Create a notebook.

    1. On the Development tab, click New.

    2. In the New dialog box, enter a name in the Name field, choose Python > Notebook from the Type drop-down list, and then click OK.

  3. In the upper-right corner of the Development page, select the notebook compute that you created and started from the Notebook Compute drop-down list.

    You can also select Create Notebook Compute from the Notebook Compute drop-down list to create a notebook compute. For more information about notebook computes, see Manage notebook computes.

    Note

    A notebook compute can be used only by a single notebook at a time. If no notebook compute is available, you can detach a notebook from a notebook compute in the Notebook Compute drop-down list to make the notebook compute available. You can also create a notebook compute.

  4. Perform data processing and visualized analysis.

    Run a PySpark job

    1. Copy the following code to the Python cell of the created notebook. A Spark SQL variant of the same aggregation is shown after this procedure.

      # Create a simple DataFrame object. Replace the OSS path with the path of the file that you uploaded in Step 2.
      df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")
      # Display the first few rows of the DataFrame object.
      df.show(5)
      # Perform a simple aggregation operation to calculate the total salary of each department.
      sum_salary_per_department = df.groupBy("department").agg({"salary": "sum"})
      sum_salary_per_department.show()
    2. Click Execute All Cells to run the notebook.

      You can also run an individual cell by clicking the icon in front of that cell.


    3. Optional. View the job information in the Spark UI.

      In the Notebook Compute drop-down list, move the pointer over the notebook compute and click Spark UI to go to the Spark Jobs page, where you can view information about the Spark job.

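    If you prefer SQL syntax, the aggregation in the preceding PySpark example can also be expressed with Spark SQL. The following is a minimal sketch that assumes the df DataFrame from the previous step; the temporary view name employee is an arbitrary choice.

      # Register the DataFrame created in the preceding step as a temporary view.
      df.createOrReplaceTempView("employee")
      # Compute the total salary of each department with Spark SQL.
      spark.sql(
          "SELECT department, SUM(salary) AS total_salary FROM employee GROUP BY department"
      ).show()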

    Perform visualized analysis by using third-party libraries

    Note

    The Matplotlib, NumPy, and pandas libraries are pre-installed in notebook computes. For information about how to use other third-party libraries, see Use third-party Python libraries in a notebook.
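
    To confirm that these libraries are available in the notebook compute, you can run a quick check in a cell, for example:

      # Optional check: confirm that the pre-installed libraries can be imported.
      import matplotlib
      import numpy
      import pandas
      print(matplotlib.__version__, numpy.__version__, pandas.__version__)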

    1. Use the Matplotlib library to perform visualized analysis of data. A variant that plots the per-department salary totals from the test file is shown after this procedure.

      import matplotlib.pyplot as plt

      # Generate sample data with Spark and collect it to the driver for plotting.
      l = sc.parallelize(range(20)).collect()
      plt.plot(l)
      plt.ylabel('some numbers')
      plt.show()
    2. Click Execute All Cells to run the notebook.

      You can also run an individual cell by clicking the icon in front of that cell.

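    You can combine the two preceding examples to plot the per-department salary totals from the test file. The following is a minimal sketch, not part of the original steps; it assumes the employee.csv schema from Step 1 (department and salary columns) and the same OSS path placeholder as the PySpark example.

      import matplotlib.pyplot as plt

      # Re-read the test file. Replace the OSS path with the path of the file that you uploaded in Step 2.
      df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")

      # Aggregate in Spark, rename the generated sum(salary) column, and convert
      # the small result to pandas for plotting.
      pdf = (
          df.groupBy("department")
          .agg({"salary": "sum"})
          .withColumnRenamed("sum(salary)", "total_salary")
          .toPandas()
      )

      # Plot the totals as a bar chart.
      plt.bar(pdf["department"], pdf["total_salary"])
      plt.xlabel("department")
      plt.ylabel("total salary")
      plt.show()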

Step 4: Publish the notebook

  1. Confirm that the job runs as expected. Then, in the upper-right corner of the page, click Publish.

  2. In the Publish dialog box, configure the Remarks parameter and click OK.