E-MapReduce (EMR) Serverless Spark allows you to perform interactive development by using notebooks. This topic describes how to create and run a notebook.
Prerequisites
An Alibaba Cloud account is created. You can create an Alibaba Cloud account on the Alibaba Cloud signup page.
The required roles are assigned to the Alibaba Cloud account. For more information, see Assign roles to an Alibaba Cloud account.
A workspace and a notebook compute are created. For more information, see Create a workspace and Manage notebook computes.
Procedure
Step 1: Prepare a test file
This topic provides a test file to familiarize you with notebook-based jobs. You can download the test file for use in subsequent steps.
Click employee.csv to download the test file.
The employee.csv file contains data such as employee names, departments, and salaries.
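The exact contents of the file are not reproduced here; a hypothetical sample with the columns described above (name, department, salary) might look like the following, parsed with Python's standard csv module. The column names and values are illustrative assumptions, not the actual file contents.

```python
import csv
import io

# Hypothetical rows illustrating the layout described above
# (employee names, departments, and salaries); the real
# employee.csv may contain different columns and values.
sample = """name,department,salary
Alice,Engineering,9000
Bob,Engineering,8000
Carol,Sales,7000
"""

# DictReader parses each data row into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0])
```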
Step 2: Upload the test file
For information about how to upload the employee.csv file to the Object Storage Service (OSS) console, see Simple upload.
Step 3: Develop and run a notebook
In the left-side navigation pane of the EMR Serverless Spark page, click Development.
Create a notebook.
On the Development tab, click New.
In the New dialog box, enter a name in the Name field, select the notebook type from the Type drop-down list, and then click OK.
In the upper-right corner of the Development page, select the notebook compute that you created and started from the Notebook Compute drop-down list.
You can also select Create Notebook Compute from the Notebook Compute drop-down list to create a notebook compute. For more information about notebook computes, see Manage notebook computes.
Note: A notebook compute can be used by only one notebook at a time. If no notebook compute is available, you can detach a notebook from its notebook compute in the Notebook Compute drop-down list to free that compute, or create a new notebook compute.
Perform data processing and visualized analysis.
Run a PySpark job
Copy the following code to the Python cell of the created notebook.
# Create a simple DataFrame object. Replace the OSS path with the path of the file that you uploaded in Step 2.
df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")

# Display the first few rows of the DataFrame object.
df.show(5)

# Perform a simple aggregation operation to calculate the total salary of each department.
df.groupBy("department").agg({"salary": "sum"}).show()
Click Execute All Cells to run the notebook.
You can also run an individual cell by clicking the run icon in front of that cell.
Optional. View the job information in the Spark UI.
In the Notebook Compute drop-down list, move the pointer over the icon of the notebook compute and click Spark UI to go to the Spark Jobs page. You can view the Spark job information on the page.
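If you want to sanity-check the aggregation logic before running it on a notebook compute, the same groupBy-and-sum can be sketched locally with pandas (which the note below states is pre-installed). The sample rows here are hypothetical stand-ins for employee.csv, not data from the actual test file.

```python
import pandas as pd

# Hypothetical sample rows standing in for employee.csv.
df = pd.DataFrame(
    [
        ("Alice", "Engineering", 9000),
        ("Bob", "Engineering", 8000),
        ("Carol", "Sales", 7000),
    ],
    columns=["name", "department", "salary"],
)

# pandas equivalent of df.groupBy("department").agg({"salary": "sum"}) in PySpark:
# group the rows by department and sum the salary column per group.
sum_salary_per_department = df.groupby("department")["salary"].sum()
print(sum_salary_per_department)
```

The result is a Series indexed by department, matching the per-department totals that the PySpark job prints with `show()`.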
Perform visualized analysis by using third-party libraries
Note: The Matplotlib, NumPy, and pandas libraries are pre-installed in notebook computes. For information about how to use other third-party libraries, see Use third-party Python libraries in a notebook.
Use the Matplotlib library to perform visualized analysis of data.
import matplotlib.pyplot as plt

# Collect a small range of numbers from a Spark RDD and plot it.
l = sc.parallelize(range(20)).collect()
plt.plot(l)
plt.ylabel('some numbers')
plt.show()
Click Execute All Cells to run the notebook.
You can also run an individual cell by clicking the run icon in front of that cell.
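As a further visualization sketch, the per-department salary totals from the PySpark example could be rendered as a bar chart with Matplotlib. The totals dictionary below is a hypothetical example value, and the Agg backend is used only so the sketch runs outside a notebook; in a notebook compute the inline backend renders the figure directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running outside a notebook
import matplotlib.pyplot as plt

# Hypothetical per-department totals, e.g. as produced by the
# groupBy aggregation in the PySpark example above.
totals = {"Engineering": 17000, "Sales": 7000}

# Draw one bar per department and label the value axis.
fig, ax = plt.subplots()
ax.bar(list(totals.keys()), list(totals.values()))
ax.set_ylabel("total salary")
ax.set_title("Total salary per department")
fig.savefig("salary_per_department.png")
```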
Step 4: Publish the notebook
Confirm that the job runs as expected. Then, in the upper-right corner of the page, click Publish.
In the Publish dialog box, configure the Remarks parameter and click OK.