AnalyticDB for MySQL Spark allows you to access Object Storage Service (OSS) data within an Alibaba Cloud account or across Alibaba Cloud accounts. This topic describes how to access OSS data in both scenarios.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
An AnalyticDB for MySQL cluster is created in the same region as an Object Storage Service (OSS) bucket.
A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
Authorization is complete. For more information, see Perform authorization.
Important: To access OSS data within an Alibaba Cloud account, you must have the AliyunADBSparkProcessingDataRole permission. To access OSS data across Alibaba Cloud accounts, you must perform authorization for other Alibaba Cloud accounts.
Step 1: Prepare data
Prepare a text file and upload it to an OSS bucket. In this example, a file named readme.txt is used. For more information, see Upload objects.
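If you do not already have a test file, the following sketch writes a small readme.txt locally before you upload it to the bucket. The file content below is only an assumed placeholder; any text file works.

```python
# Create a sample readme.txt locally before uploading it to OSS.
# The content is a placeholder; any multi-line text file works.
sample_lines = [
    "AnalyticDB for MySQL Spark OSS access test",
    "second line",
    "third line",
]

with open("readme.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")

# Verify that the file was written as expected.
with open("readme.txt", encoding="utf-8") as f:
    print(f.readline().rstrip())
```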
Write Python code and upload the code file to the OSS bucket. In this example, a Python code file named example.py is used. The code reads the readme.txt file, counts its lines, and prints the first line.

import sys
from pyspark.sql import SparkSession

# Initialize a Spark application.
spark = SparkSession.builder.appName('OSS Example').getOrCreate()

# Read the specified text file. The file path is passed in through the args parameter.
textFile = spark.sparkContext.textFile(sys.argv[1])

# Count and display the number of lines in the text file.
print("File total lines: " + str(textFile.count()))

# Display the first line of the text file.
print("First line is: " + textFile.first())
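The job's logic can be sanity-checked locally without a Spark cluster. The plain-Python equivalent below is an illustrative sketch, not part of the job itself: it performs the same line count and first-line read on a local file that stands in for the OSS object.

```python
def summarize_text_file(path):
    """Mirror the example.py logic without Spark: count lines and return the first line."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return len(lines), lines[0] if lines else ""

# Demo on a small local file (a stand-in for the OSS object).
with open("demo_readme.txt", "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")

total, first = summarize_text_file("demo_readme.txt")
print("File total lines: " + str(total))  # File total lines: 3
print("First line is: " + first)          # First line is: first line
```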
Step 2: Access OSS data
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > Spark JAR Development. In the upper part of the editor, select a job resource group and a Spark application type. In this example, the Batch type is selected.
Run the following Spark code in the editor to display the total number of lines and the content of the first line in the text file.
Access OSS data within an Alibaba Cloud account
{
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "spark-oss-test",
    "file": "oss://testBucketName/data/example.py",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
    }
}
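The batch job spec above is plain JSON. As a hedged sketch, you can assemble it in Python and run basic sanity checks before pasting it into the editor; the field names are taken from the example above, and the bucket name and paths are placeholders.

```python
import json

# Assemble the within-account batch job spec (bucket name and paths are placeholders).
job_spec = {
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "spark-oss-test",
    "file": "oss://testBucketName/data/example.py",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1,
    },
}

# Basic sanity checks before submitting the spec in the console editor.
assert job_spec["file"].startswith("oss://"), "main file must be stored in OSS"
assert all(arg.startswith("oss://") for arg in job_spec["args"])

print(json.dumps(job_spec, indent=2))
```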
The following table describes the parameters.

args: The arguments that are passed to the Spark application. Separate multiple arguments with commas (,). In this example, the OSS path of the text file is assigned to textFile.

name: The name of the Spark application.

file: The path of the main file of the Spark application. The main file can be a JAR package that contains the entry class or an executable file that serves as the entry point for the Python program. Important: You must store the main files of Spark applications in OSS.

spark.adb.roleArn: The Resource Access Management (RAM) role that is used to access an external data source across Alibaba Cloud accounts. Separate multiple roles with commas (,). Specify the parameter in the acs:ram::<testAccountID>:role/<testUserName> format. <testAccountID>: the ID of the Alibaba Cloud account that owns the external data source. <testUserName>: the name of the RAM role that is created when you perform authorization across Alibaba Cloud accounts. For more information, see Perform authorization. Note: You do not need to specify this parameter for OSS access within an Alibaba Cloud account.

conf: The configuration parameters that are required for the Spark application, which are similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For information about the configuration parameters that are different from those of Apache Spark or the configuration parameters that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.

Access OSS data across Alibaba Cloud accounts
{
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "CrossAccount",
    "file": "oss://testBucketName/data/example.py",
    "conf": {
        "spark.adb.roleArn": "acs:ram::<testAccountID>:role/<testUserName>",
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
    }
}
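The spark.adb.roleArn value follows the fixed acs:ram::<testAccountID>:role/<testUserName> pattern shown above. A small helper (hypothetical, for illustration only) can build it from the data-source owner's account ID and role name so the acs:ram:: prefix and role/ segment are never mistyped.

```python
def build_role_arn(account_id: str, role_name: str) -> str:
    """Build a RAM role ARN in the acs:ram::<accountID>:role/<roleName> format."""
    if not account_id.isdigit():
        raise ValueError("account_id should be the numeric Alibaba Cloud account ID")
    return f"acs:ram::{account_id}:role/{role_name}"

# Placeholder values; replace with the data-source owner's account ID and role name.
print(build_role_arn("123456789012", "testUserName"))
# acs:ram::123456789012:role/testUserName
```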
The parameters are the same as those described in the table in the previous section. Note that the spark.adb.roleArn parameter is required for OSS access across Alibaba Cloud accounts.

Click Run Now.
After you run the Spark code, you can click Log in the Actions column on the Applications tab of the Spark JAR Development page to view log information. For more information, see Spark editor.
References
For information about Spark application development, see Overview of Spark application development.
For information about the configuration parameters of Spark applications, see Spark application configuration parameters.