EMR Serverless Spark supports interactive development using notebooks. This topic describes how to create and run a notebook.
Prerequisites
You have an Alibaba Cloud account. For more information, see Alibaba Cloud account registration.
The required roles have been granted. For more information, see Grant roles to an Alibaba Cloud account.
A workspace and a notebook session instance have been created. For more information, see Create a workspace and Manage notebook sessions.
Procedure
Step 1: Prepare a test file
This topic provides a test file to help you familiarize yourself with notebook jobs. You can download the test file to use in the following steps.
Click employee.csv to download the test file.
The employee.csv file contains employee names, departments, and salaries.
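For reference, a file with this schema might look like the following. These rows are illustrative only; the actual contents of the downloaded file may differ.

name,department,salary
Alice,Engineering,8500
Bob,Sales,6200
Carol,Engineering,9100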
Step 2: Upload the test file
Upload the data file (employee.csv) to Object Storage Service (OSS) by using the OSS console. For more information, see Upload files.
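If you prefer to upload the file from a script instead of the console, the following sketch uses the OSS Python SDK (oss2). The endpoint, bucket name, object key, and environment variable names are placeholders that you must replace with your own values.

import os
import oss2

# Read credentials from environment variables rather than hard-coding them.
# The variable names here are assumptions; use whatever your environment provides.
auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])

# Replace the endpoint and bucket name with your own values.
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-bucket")

# Upload employee.csv to the object key that you will reference in the notebook.
bucket.put_object_from_file("path/to/employee.csv", "employee.csv")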
Step 3: Develop and run a notebook
On the EMR Serverless Spark page, click Data Development in the navigation pane on the left.
Create a notebook.
On the Development Folder tab, click the create icon. In the dialog box that appears, enter a name, select Notebook as the Type, and then click OK.
In the upper-right corner, select a running notebook session instance.
You can also select Create Notebook Session from the drop-down list to create a new notebook session instance. For more information about notebook sessions, see Manage notebook sessions.
Note: Multiple notebooks can share a single notebook session instance. This lets you access the same session resources from multiple notebooks at the same time, without creating a new session instance for each notebook.
Process and visualize data.
Run a PySpark job
Copy the following code to the Python cell of the new notebook.
# Read the test file into a DataFrame. Replace the OSS path with the path of the file that you uploaded in Step 2.
df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")

# Display the first few rows of the DataFrame.
df.show(5)

# Perform a simple aggregation to calculate the total salary of each department.
df.groupBy("department").agg({"salary": "sum"}).show()

Click Run All Cells to run the notebook.
You can also run a specific cell by clicking the run icon in front of the cell.
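If you prefer SQL over the DataFrame API, the same aggregation can also be expressed through a temporary view. The following is a sketch, not part of the original example; it assumes that the df DataFrame from the cell above is in scope.

# Alternative: run the same aggregation with Spark SQL.
# Assumes the df DataFrame from the previous cell is in scope.
df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT department, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department
    ORDER BY total_salary DESC
""").show()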
(Optional) View the Spark UI.
In the session drop-down list, hover over the icon for the current notebook session and click Spark UI. You are redirected to the Spark Jobs page, where you can view the Spark job information.
Perform visual analytics using third-party libraries
Note: Notebook sessions have the matplotlib, numpy, and pandas libraries pre-installed. For more information about how to use other third-party libraries, see Use third-party Python libraries in a notebook.
Use the matplotlib library to visualize data.
import matplotlib.pyplot as plt

# Distribute a range of numbers across the cluster, then collect it back to the driver.
l = sc.parallelize(range(20)).collect()

# Plot the collected values.
plt.plot(l)
plt.ylabel('some numbers')
plt.show()

Click Run All Cells to run the notebook.
You can also run a specific cell by clicking the run icon in front of the cell.
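To chart the actual aggregation results instead of a test sequence, you can aggregate in Spark, convert the small result set to pandas, and plot it with matplotlib. The following is a sketch that assumes the df DataFrame from the PySpark example above is in scope; the column names follow the employee.csv schema.

import matplotlib.pyplot as plt

# Assumes the df DataFrame from the PySpark example above.
# Aggregate in Spark, then bring the small result set to the driver as pandas.
pdf = (
    df.groupBy("department")
      .agg({"salary": "sum"})
      .withColumnRenamed("sum(salary)", "total_salary")
      .toPandas()
)

# Plot total salary per department as a bar chart.
pdf.plot(kind="bar", x="department", y="total_salary", legend=False)
plt.ylabel("total salary")
plt.tight_layout()
plt.show()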
Step 4: Publish the notebook
After the notebook finishes running, click Publish in the upper-right corner.
In the Publish dialog box, configure the parameters and click OK to save the notebook as a new version.