
AnalyticDB: Access OSS-HDFS

Last Updated: Apr 09, 2024

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark allows you to access OSS-HDFS. This topic describes how to use Spark to access OSS-HDFS.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created in the same region as an Object Storage Service (OSS) bucket.

  • A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

  • AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources. For more information, see Perform authorization.

Read and write OSS-HDFS data in Spark JAR mode

  1. Write a program that is used to access OSS-HDFS. Then, compile the program into a JAR package that is required for the Spark job. In this example, the JAR package is named oss_hdfs_demo.jar. Sample code:

    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHDFS {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HDFS TEST")
          .getOrCreate()
    
        val welcome = "hello, adb-spark"
    
        // Specify the Hadoop Distributed File System (HDFS) directory to store required data.
    val hdfsPath = args(0)
        // Store the welcome string to the specified HDFS directory.
        sparkSession.sparkContext.parallelize(Seq(welcome)).saveAsTextFile(hdfsPath)
        // Read data from the specified HDFS directory and display the data.
        sparkSession.sparkContext.textFile(hdfsPath).collect.foreach(println)
      }
    }
  2. Upload the oss_hdfs_demo.jar package to OSS-HDFS. For more information, see Use Hadoop Shell commands to access OSS-HDFS.
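
    The following example shows what the upload might look like when you use the Hadoop Shell. This is a sketch that assumes the Hadoop client is already configured for OSS-HDFS access as described in the referenced topic; replace testBucketName and the path with your own values.

    # Upload the compiled JAR package to the example OSS-HDFS path used in this topic.
    hadoop fs -put oss_hdfs_demo.jar oss://testBucketName/data/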

  3. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  4. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  5. In the upper part of the editor, select a job resource group and a Spark application type. In this example, the Batch type is selected.

  6. Run the following Spark code in the editor. The job writes the sample string to the specified OSS-HDFS directory, reads the data back, and prints it.

    {
        "args": ["oss://testBucketName/data/oss_hdfs"],
        "file": "oss://testBucketName/data/oss_hdfs_demo.jar",
        "name": "spark-on-hdfs",
        "className": "com.aliyun.spark.SparkHDFS",
        "conf": {
            "spark.driver.resourceSpec": "medium",
            "spark.executor.instances": 1,
            "spark.executor.resourceSpec": "medium",
            "spark.adb.connectors": "jindo"
        }
    }

    The following list describes the parameters.

    • args: The arguments that are required to run the Spark JAR job. In this example, you must specify the OSS-HDFS path in the args parameter. Example: oss://testBucketName/data/oss_hdfs.

    • file: The OSS-HDFS path of the JAR package. Example: oss://testBucketName/data/oss_hdfs_demo.jar.

    • name: The name of the Spark application.

    • conf: The configuration parameters that are required for the Spark application, which are similar to the configuration parameters of Apache Spark. Specify the parameters in the key: value format and separate multiple parameters with commas (,). For information about the configuration parameters that differ from those of Apache Spark or that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.

    • spark.adb.connectors: The connector that is used to access OSS-HDFS data. This parameter is specified in conf. In this example, jindo is used.

  7. Click Run Now. After you run the Spark code, you can click Log in the Actions column on the Applications tab of the Spark JAR Development page to view log information. For more information, see Spark editor.
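
    After the job succeeds, you can also inspect the output that the sample program wrote to OSS-HDFS. The following Hadoop Shell command is a sketch that assumes the example path used in this topic; saveAsTextFile stores the data as part files in the target directory.

    # Print the text that the Spark job stored in the example OSS-HDFS directory.
    hadoop fs -cat oss://testBucketName/data/oss_hdfs/part-*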

Read and write OSS-HDFS data in Spark SQL mode

  1. Create a database path and a table path on OSS-HDFS. For more information, see Use Hadoop Shell commands to access OSS-HDFS. In this example, the following paths are created:

    • Database path: oss://{bucket}/jindo_test
    • Table path: oss://{bucket}/jindo_test/tbl
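
    For example, assuming that the Hadoop client is configured for OSS-HDFS access as described in the referenced topic, a command similar to the following creates both paths because -p also creates the parent directory. {bucket} is a placeholder for your bucket name.

    # Create the table path; the parent database path is created automatically.
    hadoop fs -mkdir -p oss://{bucket}/jindo_test/tbl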

  2. Write the Spark SQL statements that are used to access OSS-HDFS.

    SET spark.driver.resourceSpec=small;
    SET spark.executor.instances=1;
    SET spark.executor.resourceSpec=small;
    SET spark.adb.connectors=jindo;
    
    CREATE DATABASE IF NOT EXISTS jindo_test LOCATION 'oss://{bucket}/jindo_test';
    USE jindo_test;
    CREATE TABLE IF NOT EXISTS tbl(id int, name string) LOCATION 'oss://{bucket}/jindo_test/tbl';
    INSERT INTO tbl VALUES (1, 'aaa');
    SELECT * FROM tbl;
  3. Click Run Now.
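
    After the statements are executed, you can check that the table data was written to OSS-HDFS. The following Hadoop Shell command is a sketch that assumes the example table path created in the first step; {bucket} is a placeholder for your bucket name.

    # List the data files that Spark SQL wrote to the table path.
    hadoop fs -ls oss://{bucket}/jindo_test/tbl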
