
Develop Spark applications by using PAI DSW

Updated at: 2025-07-31 18:41

Data Science Workshop (DSW) is a cloud-based Integrated Development Environment (IDE) for machine learning provided by PAI. It supports multiple languages and development environments. You can connect to an AnalyticDB for MySQL cluster from a DSW instance and use IDEs, such as Notebook and Terminal, to write PySpark scripts and submit Spark jobs. This topic describes how to submit a Spark job from a DSW instance.

Prerequisites

  • An AnalyticDB for MySQL cluster is created, and a Job resource group is created in the cluster.

  • An Alibaba Cloud account, a RAM user, or a RAM role that has permissions to access AnalyticDB for MySQL is available.

  • An Object Storage Service (OSS) bucket is available if your Spark application reads data from or writes data to OSS, as in the example in Step 2.

Step 1: Create and configure a PAI DSW instance

  1. Activate PAI and create a workspace. For more information, see Activate PAI and Create and manage workspaces.

    The PAI workspace must be in the same region as the AnalyticDB for MySQL cluster.

  2. Create a DSW instance.

    You can use one of the following methods to create a DSW instance:

    • You can create a DSW instance in the console. For more information, see Create a DSW instance.

      You must set Image to Image URL and enter the Livy image URL for AnalyticDB for MySQL Spark: registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre. You can configure other parameters as needed.

    • In Tutorials, click Open In DSW, and then select a DSW instance that meets the requirements or create a new one. For more information, see Create a DSW instance.

      On the DSW instance creation page, the image URL and DSW instance type are pre-filled. You only need to enter an Instance Name and click OK to create the DSW instance.

  3. Access the DSW instance. For more information, see Access from the console.

  4. In the top menu bar, click Terminal and run the following command to start the Apache Livy proxy.

    cd /root/proxy
    python app.py --db <ClusterID> --rg <Resource Group Name> --e <URL> -i <AK> -k <SK> -t <STS> & 

    Parameters:

    • ClusterID (required): The ID of the AnalyticDB for MySQL cluster.

    • Resource Group Name (required): The name of the Job resource group in the AnalyticDB for MySQL cluster.

    • URL (required): The service endpoint of the AnalyticDB for MySQL cluster. For information about how to view the service endpoint of an AnalyticDB for MySQL cluster, see Service endpoints.

    • AK, SK (conditionally required): The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access AnalyticDB for MySQL. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

      Note: You need to specify AK and SK only when you use an Alibaba Cloud account or a RAM user.

    • STS (conditionally required): The temporary identity credential of a RAM role, which is a Security Token Service (STS) token. An authorized RAM user can use an AccessKey pair to call the AssumeRole operation to obtain an STS token of a RAM role and then use the STS token to access Alibaba Cloud resources.

      Note: You need to specify STS only when you use a RAM role.
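
    For example, a command in which the placeholders are replaced with hypothetical values (the cluster ID, resource group name, endpoint, and AccessKey pair below are illustrative only, not real values) might look like the following. The -t option is omitted because this example authenticates with an AccessKey pair rather than a RAM role.

    cd /root/proxy
    python app.py --db amv-bp1xxxxxxxxxxxxx --rg spark_job_rg --e adb.cn-hangzhou.aliyuncs.com -i LTAI5txxxxxxxxxxxxxxxx -k yourAccessKeySecret &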

    If the following information is returned, the proxy has started successfully:

    2024-11-15 11:04:52,125-ADB-INFO: ADB Client Init
    2024-11-15 11:04:52,125-ADB-INFO: Aliyun ADB Proxy is ready
  5. Check whether a process is listening on port 5000.

    After Step 4 is complete, run the netstat -anlp | grep 5000 command to check whether a process is listening on port 5000.
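
    Optionally, you can also send a request to the proxy to confirm that it responds. This check assumes that the proxy exposes the standard Livy REST API, which sparkmagic relies on, on port 5000 and without local authentication; if that assumption does not hold in your environment, rely on the netstat check above.

    curl http://localhost:5000/sessions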

Step 2: Develop a PySpark application

  1. Access the DSW instance. For more information, see Access from the console.

  2. In the top navigation bar, click Notebook to open the Notebook page.

  3. In the top menu bar, choose File > New > Notebook. In the Select Kernel dialog box, select Python 3 (ipykernel) and click Select.

  4. Run the following commands in sequence to install sparkmagic and load its magics.

    !pip install sparkmagic
    %load_ext sparkmagic.magics
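
    Depending on the image, the local Livy proxy that you started in Step 1 may already be configured as the default sparkmagic endpoint. If the Create Session tab in the next step does not offer a usable endpoint, the following sketch writes a minimal sparkmagic configuration file that points at the proxy. The file path and keys follow sparkmagic's documented defaults; the URL http://localhost:5000 is an assumption based on the port check in Step 1. Run the sketch before %manage_spark, or reload the extension after you write the file.

    import json, os

    # Minimal sparkmagic configuration that points the Python kernel credentials
    # at the local Livy proxy (assumed to listen on port 5000, see Step 1).
    config_path = os.path.expanduser("~/.sparkmagic/config.json")
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    config = {
        "kernel_python_credentials": {
            "username": "",
            "password": "",
            "url": "http://localhost:5000",
            "auth": "None",
        }
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)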
  5. Run the %manage_spark command.

    After you run the command, the Create Session tab appears.

  6. On the Create Session tab, set Language to Python and then click Create Session.

    Important

    Click Create Session only once. Do not click it repeatedly.

    After you click Create Session, the status at the bottom of the Notebook page changes to Busy. The session is created when the status changes to Idle and the session ID appears on the Manage Session tab.

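    If you create more than one session by mistake, or a session becomes unusable, sparkmagic provides line magics for inspecting and cleaning up sessions. These are standard sparkmagic commands rather than features specific to AnalyticDB for MySQL; run them in a regular notebook cell:

    # List the endpoints and sessions that sparkmagic currently manages.
    %spark info
    # Delete all sessions created by this notebook. You can then create a new session.
    %spark cleanup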

  7. Run the PySpark script.

    When you run the PySpark script, you must add the %%spark command at the beginning of the cell, before your application code, to specify that the code runs on the remote Spark session.

    %%spark
    db_sql = """
    CREATE DATABASE IF NOT EXISTS test_db COMMENT 'demo db'
    LOCATION 'oss://testBucketName/test'
    WITH DBPROPERTIES (k1='v1', k2='v2')
    """
    
    tb_sql = """
    CREATE TABLE IF NOT EXISTS test_db.test_tbl(id int, name string, age int)
    USING parquet
    LOCATION 'oss://testBucketName/test/test_tbl/'
    TBLPROPERTIES ('parquet.compress'='SNAPPY');
    """
    
    insert_sql = """
    INSERT INTO test_db.test_tbl VALUES(1, 'adb', 10);
    """
    
    select_sql = """
    SELECT * FROM test_db.test_tbl;
    """
    
    spark.sql(db_sql).show()
    spark.sql(tb_sql).show()
    spark.sql(insert_sql).show()
    spark.sql(select_sql).show()
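
    If you want to analyze query results in the local notebook kernel, the -o option of the %%spark magic copies a named Spark DataFrame from the remote session to the local kernel as a pandas DataFrame. The following is a minimal sketch based on the test_db.test_tbl table created above; result_df is a variable name chosen for this example.

    %%spark -o result_df
    # Assign the query result to a DataFrame variable. Because of the -o option,
    # sparkmagic copies the DataFrame named result_df back to the local notebook
    # kernel as a pandas DataFrame with the same name.
    result_df = spark.sql("SELECT * FROM test_db.test_tbl")

    In a subsequent cell that runs on the local kernel (without the %%spark magic), you can work with result_df as a regular pandas DataFrame.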
    
