AnalyticDB for MySQL Spark allows you to access Object Storage Service (OSS) data within an Alibaba Cloud account or across Alibaba Cloud accounts. This topic describes how to access OSS data in both scenarios.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
An AnalyticDB for MySQL cluster is created in the same region as an Object Storage Service (OSS) bucket.
A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
Authorization is complete. For more information, see Perform authorization.
Important: To access OSS data within an Alibaba Cloud account, you must have the AliyunADBSparkProcessingDataRole permission. To access OSS data across Alibaba Cloud accounts, you must perform authorization for other Alibaba Cloud accounts.
Step 1: Prepare data
Prepare a text file and upload it to an OSS bucket. In this example, a file named readme.txt is used. For more information, see Upload objects.
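If you do not already have a test file, the following sketch writes a small readme.txt locally before you upload it to the bucket. The file content below is only an assumed placeholder; any text file works.

```python
# Create a sample readme.txt locally before uploading it to OSS.
# The content is a placeholder; any multi-line text file works.
sample_lines = [
    "AnalyticDB for MySQL Spark OSS access test",
    "second line",
    "third line",
]

with open("readme.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")

# Verify that the file was written as expected.
with open("readme.txt", encoding="utf-8") as f:
    print(f.readline().rstrip())
```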
Write Python code and upload the code file to the OSS bucket. In this example, a Python code file named example.py is used. The code reads the readme.txt file, counts its lines, and prints the first line.

import sys
from pyspark.sql import SparkSession

# Initialize a Spark application.
spark = SparkSession.builder.appName('OSS Example').getOrCreate()

# Read the specified text file. The file path is passed in through the args parameter.
textFile = spark.sparkContext.textFile(sys.argv[1])

# Count and display the number of lines in the text file.
print("File total lines: " + str(textFile.count()))

# Display the first line of the text file.
print("First line is: " + textFile.first())
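The job's logic can be sanity-checked locally without a Spark cluster. The plain-Python equivalent below is an illustrative sketch, not part of the job itself: it performs the same line count and first-line read on a local file that stands in for the OSS object.

```python
def summarize_text_file(path):
    """Mirror the example.py logic without Spark: count lines and return the first line."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return len(lines), lines[0] if lines else ""

# Demo on a small local file (a stand-in for the OSS object).
with open("demo_readme.txt", "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")

total, first = summarize_text_file("demo_readme.txt")
print("File total lines: " + str(total))  # File total lines: 3
print("First line is: " + first)          # First line is: first line
```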
Step 2: Access OSS data
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > Spark JAR Development. In the upper part of the editor, select a job resource group and a Spark application type. In this example, the Batch type is selected.
Run the following Spark code in the editor to display the total number of lines and the content of the first line in the text file.
Access OSS data within an Alibaba Cloud account
{
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "spark-oss-test",
    "file": "oss://testBucketName/data/example.py",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
    }
}
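The batch job spec above is plain JSON. As a hedged sketch, you can assemble it in Python and run basic sanity checks before pasting it into the editor; the field names are taken from the example above, and the bucket name and paths are placeholders.

```python
import json

# Assemble the within-account batch job spec (bucket name and paths are placeholders).
job_spec = {
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "spark-oss-test",
    "file": "oss://testBucketName/data/example.py",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1,
    },
}

# Basic sanity checks before submitting the spec in the console editor.
assert job_spec["file"].startswith("oss://"), "main file must be stored in OSS"
assert all(arg.startswith("oss://") for arg in job_spec["args"])

print(json.dumps(job_spec, indent=2))
```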
The following table describes the parameters.

args: The arguments that are passed to the Spark application. Separate multiple arguments with commas (,). In this example, the OSS path of the text file is assigned to textFile.

name: The name of the Spark application.

file: The path of the main file of the Spark application. The main file can be a JAR package that contains the entry class or an executable file that serves as the entry point for the Python program. Important: You must store the main files of Spark applications in OSS.

spark.adb.roleArn: The Resource Access Management (RAM) role that is used to access an external data source across Alibaba Cloud accounts. Separate multiple roles with commas (,). Specify the parameter in the acs:ram::<testAccountID>:role/<testUserName> format. <testAccountID>: the ID of the Alibaba Cloud account that owns the external data source. <testUserName>: the name of the RAM role that is created when you perform authorization across Alibaba Cloud accounts. For more information, see Perform authorization. Note: You do not need to specify this parameter for OSS access within an Alibaba Cloud account.

conf: The configuration parameters that are required for the Spark application, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For information about the configuration parameters that are different from those of Apache Spark or the configuration parameters that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.

Access OSS data across Alibaba Cloud accounts
{
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "CrossAccount",
    "file": "oss://testBucketName/data/example.py",
    "conf": {
        "spark.adb.roleArn": "acs:ram::<testAccountID>:role/<testUserName>",
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
    }
}
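The spark.adb.roleArn value follows the fixed acs:ram::<testAccountID>:role/<testUserName> pattern shown above. A small helper (hypothetical, for illustration only) can build it from the data-source owner's account ID and role name so the acs:ram:: prefix and role/ segment are never mistyped.

```python
def build_role_arn(account_id: str, role_name: str) -> str:
    """Build a RAM role ARN in the acs:ram::<accountID>:role/<roleName> format."""
    if not account_id.isdigit():
        raise ValueError("account_id should be the numeric Alibaba Cloud account ID")
    return f"acs:ram::{account_id}:role/{role_name}"

# Placeholder values; replace with the data-source owner's account ID and role name.
print(build_role_arn("123456789012", "testUserName"))
# acs:ram::123456789012:role/testUserName
```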
The parameters are the same as those described in the table in the previous section. Note that the spark.adb.roleArn parameter is required for OSS access across Alibaba Cloud accounts.

Click Run Now.
After you run the Spark code, you can click Log in the Actions column on the Applications tab of the Spark JAR Development page to view log information. For more information, see Spark editor.
References
For information about Spark application development, see Overview of Spark application development.
For information about the configuration parameters of Spark applications, see Spark application configuration parameters.