
AnalyticDB: Access OSS-HDFS

Last Updated: Apr 09, 2024

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark allows you to access OSS-HDFS. This topic describes how to use Spark to access OSS-HDFS.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created in the same region as an Object Storage Service (OSS) bucket.

  • A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

  • AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources. For more information, see Perform authorization.

Read and write OSS-HDFS data in Spark JAR mode

  1. Write a program that is used to access OSS-HDFS. Then, compile the program into a JAR package that is required for the Spark job. In this example, the JAR package is named oss_hdfs_demo.jar. Sample code:

    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkHDFS {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("Spark HDFS TEST")
          .getOrCreate()
    
        val welcome = "hello, adb-spark"
    
        // Specify the Hadoop Distributed File System (HDFS) directory to store required data.
    val hdfsPath = args(0)
        // Store the welcome string to the specified HDFS directory.
        sparkSession.sparkContext.parallelize(Seq(welcome)).saveAsTextFile(hdfsPath)
        // Read data from the specified HDFS directory and display the data.
        sparkSession.sparkContext.textFile(hdfsPath).collect.foreach(println)
      }
    }
  2. Upload the oss_hdfs_demo.jar package to OSS-HDFS. For more information, see Use Hadoop Shell commands to access OSS-HDFS.
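
    The following example shows what the upload might look like when you use the Hadoop Shell. This is a sketch that assumes the Hadoop client is already configured for OSS-HDFS access as described in the referenced topic; replace testBucketName and the path with your own values.

    # Upload the compiled JAR package to the example OSS-HDFS path used in this topic.
    hadoop fs -put oss_hdfs_demo.jar oss://testBucketName/data/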

  3. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  4. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  5. In the upper part of the editor, select a job resource group and a Spark application type. In this example, the Batch type is selected.

  6. Run the following Spark code in the editor. The job writes the sample string to the specified OSS-HDFS directory, reads the data back, and prints it.

    {
        "args": ["oss://testBucketName/data/oss_hdfs"],
        "file": "oss://testBucketName/data/oss_hdfs_demo.jar",
        "name": "spark-on-hdfs",
        "className": "com.aliyun.spark.SparkHDFS",
        "conf": {
            "spark.driver.resourceSpec": "medium",
            "spark.executor.instances": 1,
            "spark.executor.resourceSpec": "medium",
            "spark.adb.connectors": "jindo"
        }
    }

    The following list describes the parameters.

    • args: The arguments that are required to run the Spark JAR job. In this example, you must specify the OSS-HDFS path in the args parameter. Example: oss://testBucketName/data/oss_hdfs.

    • file: The OSS-HDFS path of the JAR package. Example: oss://testBucketName/data/oss_hdfs_demo.jar.

    • name: The name of the Spark application.

    • conf: The configuration parameters that are required for the Spark application, which are similar to the configuration parameters of Apache Spark. Specify the parameters in the key: value format and separate multiple parameters with commas (,). For information about the configuration parameters that differ from those of Apache Spark or that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.

    • spark.adb.connectors: The connector that is used to access OSS-HDFS data. This parameter is specified in conf. In this example, jindo is used.

  7. Click Run Now. After you run the Spark code, you can click Log in the Actions column on the Applications tab of the Spark JAR Development page to view log information. For more information, see Spark editor.
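
    After the job succeeds, you can also inspect the output that the sample program wrote to OSS-HDFS. The following Hadoop Shell command is a sketch that assumes the example path used in this topic; saveAsTextFile stores the data as part files in the target directory.

    # Print the text that the Spark job stored in the example OSS-HDFS directory.
    hadoop fs -cat oss://testBucketName/data/oss_hdfs/part-*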

Read and write OSS-HDFS data in Spark SQL mode

  1. Create a database path and a table path on OSS-HDFS. For more information, see Use Hadoop Shell commands to access OSS-HDFS. In this example, the following paths are created:

    • Database path: oss://{bucket}/jindo_test
    • Table path: oss://{bucket}/jindo_test/tbl
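
    For example, assuming that the Hadoop client is configured for OSS-HDFS access as described in the referenced topic, a command similar to the following creates both paths because -p also creates the parent directory. {bucket} is a placeholder for your bucket name.

    # Create the table path; the parent database path is created automatically.
    hadoop fs -mkdir -p oss://{bucket}/jindo_test/tbl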

  2. Write the Spark SQL statements that are used to access OSS-HDFS.

    SET spark.driver.resourceSpec=small;
    SET spark.executor.instances=1;
    SET spark.executor.resourceSpec=small;
    SET spark.adb.connectors=jindo;
    
    CREATE DATABASE IF NOT EXISTS jindo_test LOCATION 'oss://{bucket}/jindo_test';
    USE jindo_test;
    CREATE TABLE IF NOT EXISTS tbl(id int, name string) LOCATION 'oss://{bucket}/jindo_test/tbl';
    INSERT INTO tbl VALUES (1, 'aaa');
    SELECT * FROM tbl;
  3. Click Run Now.
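
    After the statements are executed, you can check that the table data was written to OSS-HDFS. The following Hadoop Shell command is a sketch that assumes the example table path created in the first step; {bucket} is a placeholder for your bucket name.

    # List the data files that Spark SQL wrote to the table path.
    hadoop fs -ls oss://{bucket}/jindo_test/tbl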
