
Develop Spark applications by using PAI DSW

Updated at: 2025-07-31 18:41

Data Science Workshop (DSW) is a cloud-based Integrated Development Environment (IDE) for machine learning provided by PAI. It supports multiple languages and development environments. You can connect to an AnalyticDB for MySQL cluster from a DSW instance and use IDEs, such as Notebook and Terminal, to write PySpark scripts and submit Spark jobs. This topic describes how to submit a Spark job from a DSW instance.

Prerequisites

  • An AnalyticDB for MySQL cluster is created, and a Job resource group is created in the cluster.

  • An Alibaba Cloud account, a RAM user, or a RAM role that has permissions to access AnalyticDB for MySQL is available.

  • An Object Storage Service (OSS) bucket is available if your Spark application reads data from or writes data to OSS, as in the example in Step 2.

Step 1: Create and configure a PAI DSW instance

  1. Activate PAI and create a workspace. For more information, see Activate PAI and Create and manage workspaces.

    The PAI workspace must be in the same region as the AnalyticDB for MySQL cluster.

  2. Create a DSW instance.

    You can use one of the following methods to create a DSW instance:

    • You can create a DSW instance in the console. For more information, see Create a DSW instance.

      You must set Image to Image URL and enter the Livy image URL for AnalyticDB for MySQL Spark: registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre. You can configure other parameters as needed.

    • In Tutorials, click Open In DSW, and then select a DSW instance that meets the requirements or create a new one. For more information, see Create a DSW instance.

      On the DSW instance creation page, the image URL and DSW instance type are pre-filled. You only need to enter an Instance Name and click OK to create the DSW instance.

  3. Access the DSW instance. For more information, see Access from the console.

  4. In the top menu bar, click Terminal and run the following command to start the Apache Livy proxy.

    cd /root/proxy
    python app.py --db <ClusterID> --rg <Resource Group Name> --e <URL> -i <AK> -k <SK> -t <STS> & 

    Parameters:

    • ClusterID (required): The ID of the AnalyticDB for MySQL cluster.

    • Resource Group Name (required): The name of the Job resource group in the AnalyticDB for MySQL cluster.

    • URL (required): The service endpoint of the AnalyticDB for MySQL cluster. For information about how to view the service endpoint of an AnalyticDB for MySQL cluster, see Service endpoints.

    • AK, SK (conditionally required): The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access AnalyticDB for MySQL. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

      Note: You need to specify AK and SK only when you use an Alibaba Cloud account or a RAM user.

    • STS (conditionally required): The temporary identity credential of a RAM role, which is a Security Token Service (STS) token. An authorized RAM user can use an AccessKey pair to call the AssumeRole operation to obtain an STS token of a RAM role and then use the STS token to access Alibaba Cloud resources.

      Note: You need to specify STS only when you use a RAM role.
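
    For example, a command in which the placeholders are replaced with hypothetical values (the cluster ID, resource group name, endpoint, and AccessKey pair below are illustrative only, not real values) might look like the following. The -t option is omitted because this example authenticates with an AccessKey pair rather than a RAM role.

    cd /root/proxy
    python app.py --db amv-bp1xxxxxxxxxxxxx --rg spark_job_rg --e adb.cn-hangzhou.aliyuncs.com -i LTAI5txxxxxxxxxxxxxxxx -k yourAccessKeySecret &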

    If the following information is returned, the proxy has started successfully:

    2024-11-15 11:04:52,125-ADB-INFO: ADB Client Init
    2024-11-15 11:04:52,125-ADB-INFO: Aliyun ADB Proxy is ready
  5. Check whether a process is listening on port 5000.

    After Step 4 is complete, run the netstat -anlp | grep 5000 command to check whether a process is listening on port 5000.
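
    Optionally, you can also send a request to the proxy to confirm that it responds. This check assumes that the proxy exposes the standard Livy REST API, which sparkmagic relies on, on port 5000 and without local authentication; if that assumption does not hold in your environment, rely on the netstat check above.

    curl http://localhost:5000/sessions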

Step 2: Develop a PySpark application

  1. Access the DSW instance. For more information, see Access from the console.

  2. In the top navigation bar, click Notebook to open the Notebook page.

  3. In the top menu bar, choose File > New > Notebook. In the Select Kernel dialog box, select Python 3 (ipykernel) and click Select.

  4. Run the following commands in sequence to install sparkmagic and load its magics.

    !pip install sparkmagic
    %load_ext sparkmagic.magics
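
    Depending on the image, the local Livy proxy that you started in Step 1 may already be configured as the default sparkmagic endpoint. If the Create Session tab in the next step does not offer a usable endpoint, the following sketch writes a minimal sparkmagic configuration file that points at the proxy. The file path and keys follow sparkmagic's documented defaults; the URL http://localhost:5000 is an assumption based on the port check in Step 1. Run the sketch before %manage_spark, or reload the extension after you write the file.

    import json, os

    # Minimal sparkmagic configuration that points the Python kernel credentials
    # at the local Livy proxy (assumed to listen on port 5000, see Step 1).
    config_path = os.path.expanduser("~/.sparkmagic/config.json")
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    config = {
        "kernel_python_credentials": {
            "username": "",
            "password": "",
            "url": "http://localhost:5000",
            "auth": "None",
        }
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)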
  5. Run the %manage_spark command.

    After you run the command, the Create Session tab appears.

  6. On the Create Session tab, set Language to Python and then click Create Session.

    Important

    Click Create Session only once. Do not click it repeatedly.

    After you click Create Session, the status at the bottom of the Notebook page changes to Busy. The session is created when the status changes to Idle and the session ID appears on the Manage Session tab.

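    If you create more than one session by mistake, or a session becomes unusable, sparkmagic provides line magics for inspecting and cleaning up sessions. These are standard sparkmagic commands rather than features specific to AnalyticDB for MySQL; run them in a regular notebook cell:

    # List the endpoints and sessions that sparkmagic currently manages.
    %spark info
    # Delete all sessions created by this notebook. You can then create a new session.
    %spark cleanup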

  7. Run the PySpark script.

    When you run the PySpark script, you must add the %%spark command at the beginning of the cell, before your application code, to specify that the code runs on the remote Spark session.

    %%spark
    db_sql = """
    CREATE DATABASE IF NOT EXISTS test_db COMMENT 'demo db'
    LOCATION 'oss://testBucketName/test'
    WITH DBPROPERTIES (k1='v1', k2='v2')
    """
    
    tb_sql = """
    CREATE TABLE IF NOT EXISTS test_db.test_tbl(id int, name string, age int)
    USING parquet
    LOCATION 'oss://testBucketName/test/test_tbl/'
    TBLPROPERTIES ('parquet.compress'='SNAPPY');
    """
    
    insert_sql = """
    INSERT INTO test_db.test_tbl VALUES(1, 'adb', 10);
    """
    
    select_sql = """
    SELECT * FROM test_db.test_tbl;
    """
    
    spark.sql(db_sql).show()
    spark.sql(tb_sql).show()
    spark.sql(insert_sql).show()
    spark.sql(select_sql).show()
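
    If you want to analyze query results in the local notebook kernel, the -o option of the %%spark magic copies a named Spark DataFrame from the remote session to the local kernel as a pandas DataFrame. The following is a minimal sketch based on the test_db.test_tbl table created above; result_df is a variable name chosen for this example.

    %%spark -o result_df
    # Assign the query result to a DataFrame variable. Because of the -o option,
    # sparkmagic copies the DataFrame named result_df back to the local notebook
    # kernel as a pandas DataFrame with the same name.
    result_df = spark.sql("SELECT * FROM test_db.test_tbl")

    In a subsequent cell that runs on the local kernel (without the %%spark magic), you can work with result_df as a regular pandas DataFrame.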
    
