
AnalyticDB: Access OSS

Last Updated: Jul 05, 2024

AnalyticDB for MySQL Spark allows you to access Object Storage Service (OSS) data within an Alibaba Cloud account or across Alibaba Cloud accounts. This topic describes how to access OSS data in both scenarios.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • The AnalyticDB for MySQL cluster resides in the same region as the Object Storage Service (OSS) bucket.

  • A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL cluster.

  • Authorization is complete. For more information, see Perform authorization.

    Important

    To access OSS data within an Alibaba Cloud account, you must be granted the AliyunADBSparkProcessingDataRole role. To access OSS data across Alibaba Cloud accounts, you must also perform authorization for the other Alibaba Cloud accounts.

Step 1: Prepare data

  1. Prepare a text file and upload it to an OSS bucket. In this example, a file named readme.txt that contains the following two lines is used. For more information, see Upload objects.

    AnalyticDB for MySQL
    Database service
  2. Compile Python code and upload the code file to the OSS bucket. In this example, a Python code file named example.py is used. The code reads the readme.txt file, counts the number of lines, and displays the first line. An alternative sketch that uses the DataFrame API is provided after the code.

    import sys

    from pyspark.sql import SparkSession

    # Initialize a Spark session.
    spark = SparkSession.builder.appName('OSS Example').getOrCreate()
    # Read the text file whose OSS path is passed as the first command-line
    # argument (the first value of the args parameter in the job configuration).
    textFile = spark.sparkContext.textFile(sys.argv[1])
    # Count and display the number of lines in the text file.
    print("File total lines: " + str(textFile.count()))
    # Display the first line of the text file.
    print("First line is: " + textFile.first())
    # Release the resources that are held by the Spark session.
    spark.stop()
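
    The preceding code uses the RDD API. If you prefer the DataFrame API, the following sketch performs the same task. It is an illustrative alternative, not part of the original example, and assumes the same command-line argument.

    from pyspark.sql import SparkSession
    import sys

    # Initialize a Spark session.
    spark = SparkSession.builder.appName('OSS Example').getOrCreate()
    # Read the text file into a DataFrame that contains a single "value" column.
    df = spark.read.text(sys.argv[1])
    # Count and display the number of lines in the text file.
    print("File total lines: " + str(df.count()))
    # Display the first line. Row objects expose columns by name.
    print("First line is: " + df.first().value)
    # Release the resources that are held by the Spark session.
    spark.stop()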

Step 2: Access OSS data

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. In the upper part of the editor, select a job resource group and a Spark application type. In this example, the Batch type is selected.

  4. Run the following Spark application code in the editor to display the total number of lines and the content of the first line of the text file.

    Access OSS data within an Alibaba Cloud account

    {
      "args": ["oss://testBucketName/data/readme.txt"],
      "name": "spark-oss-test",
      "file": "oss://testBucketName/data/example.py",
      "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
      }
    }

    The following list describes the parameters.

    • args: the arguments that are passed to the Spark application. Separate multiple arguments with commas (,). In this example, the OSS path of the readme.txt file is passed as the first argument, which the Python code reads by using sys.argv[1]. For a sketch of how arguments map to sys.argv, see the example after this list.

    • name: the name of the Spark application.

    • file: the path of the main file of the Spark application. The main file can be a JAR package that contains the entry class or an executable file that serves as the entry point of the Python program.

      Important: You must store the main files of Spark applications in OSS.

    • spark.adb.roleArn: the Resource Access Management (RAM) role that is used to access an external data source across Alibaba Cloud accounts. Separate multiple roles with commas (,). Specify the parameter in the acs:ram::<testAccountID>:role/<testUserName> format.

      • <testAccountID>: the ID of the Alibaba Cloud account that owns the external data source.

      • <testUserName>: the name of the RAM role that is created when you perform authorization across Alibaba Cloud accounts. For more information, see Perform authorization.

      Note: You do not need to specify this parameter for OSS access within an Alibaba Cloud account.

    • conf: the configuration parameters that are required for the Spark application, which are similar to those of Apache Spark. Specify the parameters in the key:value format and separate multiple parameters with commas (,). For information about the configuration parameters that differ from those of Apache Spark or that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.
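
    Each value in the args array is passed to the main file as a command-line argument, in order. The following minimal sketch is hypothetical and is not part of example.py; it shows how a job that passes two arguments would consume them.

    import sys

    # sys.argv[0] is the script path. Job arguments start at sys.argv[1],
    # in the same order as the values in the args array.
    input_path = sys.argv[1]   # for example, oss://testBucketName/data/readme.txt
    second_arg = sys.argv[2]   # hypothetical second argument
    print("input_path: " + input_path)
    print("second_arg: " + second_arg)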

    Access OSS data across Alibaba Cloud accounts

    {
      "args": ["oss://testBucketName/data/readme.txt"],
      "name": "CrossAccount",
      "file": "oss://testBucketName/data/example.py",
      "conf": {
        "spark.adb.roleArn": "acs:ram::testAccountID:role/<testUserName>",
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1  
      }
    }

    The parameters are the same as those described in the preceding list. The only difference is that the spark.adb.roleArn parameter is required when you access OSS data across Alibaba Cloud accounts.

  5. Click Run Now.

    After you run the Spark code, you can click Log in the Actions column on the Applications tab of the Spark JAR Development page to view log information. For more information, see Spark editor.
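
    Based on the two-line readme.txt file that is used in this example, the driver log of a successful run contains output similar to the following:

    File total lines: 2
    First line is: AnalyticDB for MySQL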
