E-MapReduce: Use third-party Python libraries in a notebook

Last Updated: Nov 05, 2024

Third-party Python libraries are often used to enhance the data processing and analysis capabilities of interactive PySpark jobs that run in notebooks. This topic describes how to install third-party Python libraries in a notebook.

Background information

When you develop interactive PySpark jobs, you can use third-party Python libraries to enable more flexible and easier data processing and analysis. The following table describes the methods to use third-party Python libraries in a notebook. You can select a method based on your business requirements.

Method 1: Run the pip command to install Python libraries

  Scenario: You want to process variables that are not related to Spark in a notebook, such as the return values calculated by Spark or custom variables.

  Important: You must reinstall the libraries after you restart a notebook session.

Method 2: Create a runtime environment to define a custom Python environment

  Scenario: You want to use third-party Python libraries to process data in PySpark jobs, and you want the libraries to be preinstalled each time a notebook session is started.

Method 3: Add Spark configurations to create a custom Python environment

  Scenario: You want to use third-party Python libraries to process data in PySpark jobs, for example, to implement Spark distributed computing.

Prerequisites

Procedure

Method 1: Run the pip command to install Python libraries

  1. Go to the configuration tab of a notebook.

    1. Log on to the E-MapReduce (EMR) console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, find the desired workspace and click the name of the workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

    5. Double-click the notebook that you developed.

  2. In a Python cell of the notebook, enter the following command to install the scikit-learn library, and then click the run icon.

    pip install scikit-learn
  3. Add a new Python cell, enter the following commands in the cell, and then click the run icon.

    # Import datasets from the scikit-learn library. 
    from sklearn import datasets
    
    # Load the built-in Iris dataset. 
    iris = datasets.load_iris()
    X = iris.data  # The feature data.
    y = iris.target  # The target labels.
    
    # Divide datasets. 
    from sklearn.model_selection import train_test_split
    
    # Divide the datasets into training sets and test sets. 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Use the support vector machine (SVM) model for training. 
    from sklearn.svm import SVC
    
    # Create a classifier instance. 
    clf = SVC(kernel='linear') # Use a linear kernel. 
    
    # Train the model. 
    clf.fit(X_train, y_train)
    
    # Use the trained model to make predictions. 
    y_pred = clf.predict(X_test)
    
    # Evaluate the model performance. 
    from sklearn.metrics import classification_report, accuracy_score
    
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    

    The results are displayed below the cell.
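
Because libraries that you install with pip exist only for the current notebook session (see the Important note in the Background information section), you can optionally confirm the installation in a new Python cell before you run the preceding example. The following check is a minimal sketch:

    # Confirm that scikit-learn can be imported and print its version.
    import sklearn
    print(sklearn.__version__)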

Method 2: Create a runtime environment to define a custom Python environment

Step 1: Create a runtime environment

  1. Go to the Runtime Environments page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, find the desired workspace and click the name of the workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Runtime Environments.

  2. Click Create Runtime Environment.

  3. On the Create Runtime Environment page, configure the Name parameter. Then, click Add Library in the Libraries section.

    For more information, see Manage runtime environments.

  4. In the Create Library dialog box, set the Source Type parameter to PyPI, configure the PyPI Package parameter, and then click OK.

    For the PyPI Package parameter, specify the library name, optionally followed by a version, for example, scikit-learn or scikit-learn==1.3.2. If you do not specify a version, the latest version of the library is installed.

  5. Click Create.

    The system starts to initialize the runtime environment after you click Create.

Step 2: Use the runtime environment

Note

You must stop a session before you can modify it.

  1. Go to the Notebook Sessions tab.

    1. In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Sessions.

    2. Click the Notebook Sessions tab.

  2. Find the desired session and click Edit in the Actions column.

  3. Select the created runtime environment from the Runtime Environment drop-down list and click Save Changes.

  4. In the upper-right corner of the page, click Start.

Step 3: Use the Scikit-learn library to classify data

  1. Go to the configuration tab of the desired notebook.

    1. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

    2. Double-click the notebook that you developed.

  2. Add a new Python cell, enter the following commands in the cell, and then click the run icon.

    # Import datasets from the scikit-learn library. 
    from sklearn import datasets
    
    # Load the built-in Iris dataset. 
    iris = datasets.load_iris()
    X = iris.data  # The feature data.
    y = iris.target  # The target labels.
    
    # Divide datasets. 
    from sklearn.model_selection import train_test_split
    
    # Divide the datasets into training sets and test sets. 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Use the SVM model for training. 
    from sklearn.svm import SVC
    
    # Create a classifier instance. 
    clf = SVC(kernel='linear') # Use a linear kernel. 
    
    # Train the model. 
    clf.fit(X_train, y_train)
    
    # Use the trained model to make predictions. 
    y_pred = clf.predict(X_test)
    
    # Evaluate the model performance. 
    from sklearn.metrics import classification_report, accuracy_score
    
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    

    The results are displayed below the cell.

Method 3: Add Spark configurations to create a custom Python environment

If you use this method, make sure that ipykernel and jupyter_client are installed at compatible versions (for example, ipykernel~=6.29 and jupyter_client~=8.6, as used in the following steps), that the Python version is 3.8 or later, and that the environment is packaged on Linux.

Step 1: Create and deploy a Conda environment

  1. Run the following commands to install Miniconda:

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    chmod +x Miniconda3-latest-Linux-x86_64.sh
    ./Miniconda3-latest-Linux-x86_64.sh -b
    source miniconda3/bin/activate
  2. Create a Conda environment that uses Python 3.8, install the required third-party libraries, and then package the environment.

    # Create a Conda environment and activate the environment.
    conda create -y -n pyspark_conda_env python=3.8
    conda activate pyspark_conda_env
    # Install third-party libraries.
    pip install numpy \
      ipykernel~=6.29 \
      jupyter_client~=8.6 \
      jieba \
      conda-pack
    # Package the environment.
    conda pack -f -o pyspark_conda_env.tar.gz
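
    You can optionally check the packed archive before you upload it. The following Python snippet is a minimal sketch that you can run on the machine where you built the environment; it only confirms that a Python interpreter is included in the package:

    import tarfile

    # List the entries of the packed Conda environment and confirm that
    # a Python interpreter (bin/python) is included in the archive.
    with tarfile.open("pyspark_conda_env.tar.gz", "r:gz") as tar:
        names = tar.getnames()
    print(any(name.endswith("bin/python") for name in names))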

Step 2: Upload the resource file to OSS

Upload the pyspark_conda_env.tar.gz package to Object Storage Service (OSS) and copy the complete OSS path of the package. For more information, see Simple upload.
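
If you prefer to upload the package programmatically instead of in the OSS console, the following snippet is a minimal sketch that uses the OSS Python SDK (oss2). The AccessKey pair, endpoint, bucket name, and object path are placeholders that you must replace with your own values:

    import oss2

    # Placeholders: replace with your own AccessKey pair, endpoint, and bucket name.
    auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<yourBucket>")

    # Upload the packed Conda environment to the path that spark.archives references.
    bucket.put_object_from_file("path/to/pyspark_conda_env.tar.gz", "pyspark_conda_env.tar.gz")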

Step 3: Configure and start the notebook session

Note

You must stop a session before you can modify it.

  1. Go to the Notebook Sessions tab.

    1. In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Sessions.

    2. Click the Notebook Sessions tab.

  2. Find the desired session and click Edit in the Actions column.

  3. On the Modify Notebook Session page, add the following configurations to the Spark Configuration field and click Save Changes.

    spark.archives  oss://<yourBucket>/path/to/pyspark_conda_env.tar.gz#env
    spark.pyspark.python ./env/bin/python

    Note

    Replace <yourBucket>/path/to in the code with the OSS path that you copied in the previous step. The #env suffix specifies the name of the directory to which Spark extracts the archive on each node, and spark.pyspark.python points to the Python interpreter in that directory.

  4. In the upper-right corner of the page, click Start.
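
After the session starts, you can optionally verify that the executors use the Python interpreter from the uploaded environment. The following check is a minimal sketch that runs in a Python cell of the notebook and assumes that the SparkContext is available as sc, as in the Jieba example in the next step:

    import sys

    # Print the Python interpreter used by the driver and by an executor.
    print(sys.executable)
    print(sc.parallelize([0]).map(lambda _: __import__("sys").executable).collect())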

Step 4: Use the Jieba tool to process text data

Note

Jieba is a third-party Python library for Chinese text segmentation. For information about the license, see LICENSE.

  1. Go to the configuration tab of the desired notebook.

    1. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

    2. Double-click the notebook that you developed.

  2. In a new Python cell, enter the following commands to use the Jieba tool to segment the sample text, and then click the run icon.

    import jieba
    
    strs = ["EMR Serverless Spark is a fully-managed serverless service for large-scale data processing and analysis.", "EMR Serverless Spark supports efficient end-to-end services, such as task development, debugging, scheduling, and O&M.", "EMR Serverless Spark supports resource scheduling and dynamic scaling based on job loads."]
    
    # Segment each sentence on the Spark executors and collect the tokens to the driver.
    sc.parallelize(strs).flatMap(lambda s: jieba.cut(s, use_paddle=True)).collect()

    The segmentation results are displayed below the cell.