E-MapReduce (EMR) Serverless Spark allows you to perform interactive development by using notebooks. This topic describes how to create and run a notebook.
Prerequisites
An Alibaba Cloud account is created. You can create an Alibaba Cloud account on the Alibaba Cloud signup page.
The required roles are assigned to the Alibaba Cloud account. For more information, see Assign roles to an Alibaba Cloud account.
A workspace and a notebook session are created. For more information, see Create a workspace and Manage notebook sessions.
Procedure
Step 1: Prepare a test file
This topic provides a test file to help you quickly get started with notebook-based jobs. You can download the test file for use in subsequent steps.
Click employee.csv to download the test file.
The employee.csv file contains data such as employee names, departments, and salaries.
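Based on the column names that are referenced by the code in Step 3, the file is expected to begin with a header row similar to the following. The exact columns in your copy of the file may differ:
name,department,salary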
Step 2: Upload the test file
Upload the employee.csv file to Object Storage Service (OSS) by using the OSS console. For more information, see Simple upload.
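If you prefer the command line over the console, you can also upload the file by using the ossutil tool. The bucket name and path in the following command are placeholders; replace them with your own values:
ossutil cp employee.csv oss://your-bucket/path/to/employee.csv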
Step 3: Develop and run a notebook
In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.
Create a notebook.
On the Development tab, click Create.
In the Create dialog box, configure the Name parameter, set the Type parameter to Notebook, and then click OK.
On the configuration tab that appears, select the notebook session that you created and started from the drop-down list in the upper-right corner.
You can also select Create Notebook Session from the drop-down list to create a notebook session. For more information about notebook sessions, see Manage notebook sessions.
Note: A notebook session can be used by only one notebook at a time. If no notebook session is available, you can detach another notebook from its session in the notebook session drop-down list to make that session available, or create a new notebook session.
Perform data processing and visualized analysis.
Run a PySpark job
Copy the following code to the Python cell of the created notebook.
# Create a simple DataFrame object. Replace the OSS path with the path of the file that you uploaded in Step 2.
df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")

# Display the first few rows of the DataFrame object.
df.show(5)

# Perform a simple aggregation operation to calculate the total salary of each department.
sum_salary_per_department = df.groupBy("department").agg({"salary": "sum"})
sum_salary_per_department.show()
Click Run All Cells to run the notebook.
You can also run a single cell by clicking the run icon in front of that cell.
Optional: View job information in the Spark UI.
In the notebook session drop-down list, move the pointer over the desired notebook session and click Spark UI to go to the Spark Jobs page, where you can view information about the Spark jobs.
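The CSV reader loads every column as a string by default. The following optional sketch, which assumes the df DataFrame from the previous cell and the header row described in Step 1, casts the salary column to a numeric type and computes the average salary of each department:
from pyspark.sql import functions as F

# Cast the salary column from string to double before aggregating.
df_typed = df.withColumn("salary", F.col("salary").cast("double"))

# Compute the average salary of each department, sorted from highest to lowest.
df_typed.groupBy("department") \
    .agg(F.avg("salary").alias("avg_salary")) \
    .orderBy(F.col("avg_salary").desc()) \
    .show()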
Perform visualized analysis by using third-party libraries
Note: The Matplotlib, NumPy, and pandas libraries are pre-installed in notebook sessions. For information about how to use other third-party libraries, see Use third-party Python libraries in a notebook.
Use the Matplotlib library to perform visualized analysis of data.
import matplotlib.pyplot as plt

l = sc.parallelize(range(20)).collect()
plt.plot(l)
plt.ylabel('some numbers')
plt.show()
Click Run All Cells to run the notebook.
You can also run a single cell by clicking the run icon in front of that cell.
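You can also combine the two examples: aggregate the employee data with Spark, convert the small result set to pandas, and plot it with Matplotlib. The following sketch assumes that the df DataFrame from the PySpark example is still available in the notebook session:
import matplotlib.pyplot as plt

# Aggregate with Spark, then bring the small result set to the driver as a pandas DataFrame.
pdf = df.groupBy("department").agg({"salary": "sum"}).toPandas()
pdf.columns = ["department", "total_salary"]  # rename the generated sum(salary) column

# Plot the total salary of each department as a bar chart.
pdf.plot.bar(x="department", y="total_salary", legend=False)
plt.ylabel("total salary")
plt.tight_layout()
plt.show()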
Step 4: Publish the notebook
Confirm that the notebook runs as expected. Then, click Publish in the upper-right corner of the configuration tab of the notebook.
In the Publish dialog box, configure the Remarks parameter and click OK.