
E-MapReduce:Use third-party Python libraries in a notebook

Last Updated: Sep 14, 2024

Third-party Python libraries are often used to enhance the data processing and analysis capabilities of interactive PySpark jobs that run in notebooks. This topic describes how to install third-party Python libraries in a notebook.

Background information

When you develop interactive PySpark jobs, you can use third-party Python libraries to enable more flexible and easier data processing and analysis. The following methods are available to use third-party Python libraries in a notebook. You can select a method based on your business requirements.

Method 1: Run the pip command to install Python libraries

  Scenario: Process variables that are not related to Spark in a notebook, such as the return values calculated by Spark or custom variables.

  Important: You must reinstall the libraries after you restart a notebook compute.

Method 2: Add Spark configurations to create a custom Python environment

  Scenario: Use third-party Python libraries to process data in PySpark jobs. For example, you can use third-party Python libraries to implement Spark distributed computing.
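
The difference between the two methods is where the library is available. With Method 1, the library is installed only for the notebook's driver Python process, so you can use it on values that have already been collected to the notebook. With Method 2, the packaged environment is also shipped to the executors, so the library can be used inside Spark transformations. The following minimal sketch illustrates the distinction; it assumes that scikit-learn has been installed as described in Method 1 and that the notebook provides a SparkContext named sc:

    from sklearn.preprocessing import MinMaxScaler  # installed with Method 1, available on the driver only

    # Works with Method 1: the data is collected to the driver first,
    # so the library runs in the notebook's local Python process.
    local_rows = sc.parallelize([[1.0], [2.0], [3.0]]).collect()
    print(MinMaxScaler().fit_transform(local_rows))

    # Requires Method 2: the lambda below runs in the executors' Python
    # environment, so the library must be part of the environment that is
    # shipped with spark.archives. (texts is an illustrative variable.)
    # sc.parallelize(texts).flatMap(lambda s: jieba.cut(s)).collect()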

Prerequisites

A workspace is created, a notebook is developed in the workspace, and a notebook compute is available for the notebook.

Procedure

Method 1: Run the pip command to install Python libraries

  1. Go to the Development page.

    1. Log on to the E-MapReduce (EMR) console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, find the desired workspace and click the name of the workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

    5. Double-click the notebook that you developed.

  2. In a Python cell of the notebook, enter the following command to install the scikit-learn library, and then click the Run icon to run the cell.

    pip install scikit-learn
  3. Add a new Python cell, enter the following code in the cell, and then click the Run icon to run the cell.

    # Import datasets from the scikit-learn library.
    from sklearn import datasets
    
    # Load a built-in dataset (the Iris dataset).
    iris = datasets.load_iris()
    X = iris.data    # The feature data.
    y = iris.target  # The target labels.
    
    # Divide datasets.
    from sklearn.model_selection import train_test_split
    
    # Divide the datasets into training sets and test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Use the support vector machine (SVM) model for training.
    from sklearn.svm import SVC
    
    # Create a classifier instance.
    clf = SVC(kernel='linear')  # Use a linear kernel.
    
    # Train the model.
    clf.fit(X_train, y_train)
    
    # Use the trained model to make predictions.
    y_pred = clf.predict(X_test)
    
    # Evaluate the model performance.
    from sklearn.metrics import classification_report, accuracy_score
    
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    

    The output of the cell shows the classification report and the accuracy score of the trained model.

Method 2: Add Spark configurations to create a custom Python environment

If you use this method, make sure that the versions of ipykernel and jupyter_client meet the requirements (the commands in Step 1 install ipykernel~=6.29 and jupyter_client~=8.6), that the Python version is 3.8 or later, and that the environment is packaged on Linux.
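
After you create the Conda environment in Step 1, you can optionally run the following check with the environment's interpreter to confirm that these requirements are met. This is a minimal sketch and not part of the procedure:

    # Run this with the Python interpreter of the activated Conda environment.
    import sys

    import ipykernel
    import jupyter_client

    assert sys.version_info >= (3, 8), "Python 3.8 or later is required"
    print("Python:", sys.version.split()[0])
    print("ipykernel:", ipykernel.__version__)
    print("jupyter_client:", jupyter_client.__version__)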

Step 1: Create and deploy a Conda environment

  1. Run the following commands to install Miniconda:

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    chmod +x Miniconda3-latest-Linux-x86_64.sh
    
    ./Miniconda3-latest-Linux-x86_64.sh -b
    source miniconda3/bin/activate
  2. Create a Conda environment that uses Python 3.8, install the required third-party libraries, and then package the environment. An optional sanity check of the generated package is shown after this step.

    # Create a Conda environment and activate the environment.
    conda create -y -n pyspark_conda_env python=3.8
    conda activate pyspark_conda_env
    # Install third-party libraries.
    pip install numpy \
        ipykernel~=6.29 \
        jupyter_client~=8.6 \
        jieba \
        conda-pack
    # Package the environment.
    conda pack -f -o pyspark_conda_env.tar.gz
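
Before you move on, you can optionally confirm that the archive produced by conda-pack contains the relocatable interpreter that spark.pyspark.python points to in Step 3. The following sketch uses Python's standard tarfile module:

    # Optional: inspect pyspark_conda_env.tar.gz before you upload it.
    import tarfile

    with tarfile.open("pyspark_conda_env.tar.gz", "r:gz") as tar:
        names = tar.getnames()

    print("bin/python present:", any(n.endswith("bin/python") for n in names))
    print("Total files:", len(names))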

Step 2: Upload the resource file to OSS

Upload the pyspark_conda_env.tar.gz package to Object Storage Service (OSS) and copy the complete OSS path of the package. For more information, see Simple upload.
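
If you prefer to upload the package with a script instead of the OSS console, you can use the OSS Python SDK (oss2) as shown in the following sketch. The credentials, endpoint, bucket name, and object path are placeholders that you must replace with your own values:

    # Prerequisite: pip install oss2
    import oss2

    # Placeholders: replace with your own credentials, endpoint, and bucket.
    auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<yourBucket>")

    # The object key becomes part of the OSS path that you configure
    # for spark.archives in Step 3.
    bucket.put_object_from_file("path/to/pyspark_conda_env.tar.gz", "pyspark_conda_env.tar.gz")
    print("Uploaded to oss://<yourBucket>/path/to/pyspark_conda_env.tar.gz")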

Step 3: Configure and start the notebook compute

Note

You must stop the compute before you edit it.

  1. Go to the Notebook Compute tab.

    1. In the left-side navigation pane of the EMR Serverless Spark page, choose Admin > Compute.

    2. On the Compute page, click the Notebook Compute tab.

  2. On the Notebook Compute tab, find the desired notebook compute and click Edit in the Actions column.

  3. On the Edit Notebook Compute page, add the following configurations to the Spark Configuration field and click Save Changes.

    spark.archives  oss://<yourBucket>/path/to/pyspark_conda_env.tar.gz#env
    spark.pyspark.python ./env/bin/python
    Note

    Replace oss://<yourBucket>/path/to/pyspark_conda_env.tar.gz with the complete OSS path that you copied in the previous step. The #env suffix specifies the directory name under which the archive is extracted, which is why spark.pyspark.python points to ./env/bin/python.

  4. Click Start in the upper-right corner of the page.
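
After the compute starts, you can optionally verify in a notebook cell that the executors use the Python interpreter that is unpacked from the archive. The following minimal sketch assumes that the notebook provides a SparkContext named sc, as in Step 4:

    import sys

    def executor_python(_):
        # Runs on an executor and returns the path of the interpreter used there.
        import sys
        return sys.executable

    print("Driver Python:", sys.executable)
    print("Executor Python:", sc.parallelize([0], 1).map(executor_python).collect())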

Step 4: Use the Jieba tool to process text data

Note

Jieba is a third-party Python library for Chinese text segmentation. For information about the license, see LICENSE.

  1. Go to the Development page.

    1. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

    2. Double-click the notebook that you developed.

  2. In a new Python cell, enter the following code to use the Jieba tool to segment the sample text, and then click the Run icon to run the cell.

    import jieba
    
    strs = [
        "EMR Serverless Spark is a fully managed serverless service for large-scale data processing and analysis.",
        "EMR Serverless Spark supports efficient end-to-end services, such as task development, debugging, scheduling, and O&M.",
        "EMR Serverless Spark supports resource scheduling and dynamic scaling based on job loads.",
    ]
    
    # Distribute the sentences and segment each one with Jieba on the executors.
    sc.parallelize(strs).flatMap(lambda s: jieba.cut(s, use_paddle=True)).collect()

    The output of the cell shows the list of segmented words.