This topic describes how to use AnalyticDB for MySQL Spark to access data in ApsaraDB for MongoDB.
Prerequisites
Your AnalyticDB for MySQL cluster's product edition is Enterprise Edition, Basic Edition, or Data Lakehouse Edition.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
You have created an ApsaraDB for MongoDB instance in the same region as your AnalyticDB for MySQL cluster. You have also created a database and a collection in the ApsaraDB for MongoDB instance and written data to the database. For more information, see Getting started.
You have added the IP address of the vSwitch that is associated with the ApsaraDB for MongoDB instance to a whitelist for the instance.
Note: You can view the vSwitch ID on the Basic Information page of the ApsaraDB for MongoDB instance in the ApsaraDB for MongoDB console. To view the IP address of the vSwitch, log on to the Virtual Private Cloud (VPC) console.
You have added the ApsaraDB for MongoDB instance to a security group. The inbound and outbound rules of this security group allow access requests on the ports of the ApsaraDB for MongoDB instance. For more information, see Add a security group and Add a security group rule.
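If the collection does not contain data yet, the prerequisite test data can be written with a short standalone program based on the MongoDB Java sync driver (mongodb-driver-sync), which is also one of the JAR packages downloaded later in this topic. The following is a minimal sketch; the connection string, database name, collection name, and document fields are placeholders that you must replace with your own values.

```scala
import com.mongodb.client.MongoClients
import org.bson.Document

// Minimal sketch: write one test document to an ApsaraDB for MongoDB collection.
// The connection string, database name, and collection name are placeholders.
object SeedMongoDB {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create(
      "mongodb://<username>:<password>@<host>:<port>/<database_name>")
    try {
      val collection = client
        .getDatabase("<database_name>")
        .getCollection("<collection_name>")
      // Insert a sample document that the Spark job can later read back.
      collection.insertOne(new Document("name", "test").append("value", 1))
    } finally {
      client.close()
    }
  }
}
```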
Procedure
Download the required JAR packages for AnalyticDB for MySQL Spark to access ApsaraDB for MongoDB: mongo-spark-connector_2.12-10.1.1.jar, mongodb-driver-sync-4.8.2.jar, bson-4.8.2.jar, bson-record-codec-4.8.2.jar, and mongodb-driver-core-4.8.2.jar.
Add the following dependency to the pom.xml file.
<dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_2.12</artifactId>
    <version>10.1.1</version>
</dependency>

Write, compile, and package a program to access ApsaraDB for MongoDB. In this topic, the generated JAR package is named spark-mongodb.jar.

package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkOnMongoDB {
  def main(args: Array[String]): Unit = {
    // The VPC endpoint of the ApsaraDB for MongoDB instance. You can view the endpoint on the Database Connection page in the ApsaraDB for MongoDB console.
    val connectionUri = args(0)
    // The name of the database.
    val database = args(1)
    // The name of the collection.
    val collection = args(2)
    val spark = SparkSession.builder()
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.read.connection.uri", connectionUri)
      .config("spark.mongodb.write.connection.uri", connectionUri)
      .getOrCreate()
    val df = spark.read.format("mongodb")
      .option("database", database)
      .option("collection", collection)
      .load()
    df.show()
    spark.stop()
  }
}

Note: For more information about Spark configurations for MongoDB, see Configuration Options. For more code examples, see Write to MongoDB and Read from MongoDB.
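The sample program only reads data. If you also need to write a DataFrame back to ApsaraDB for MongoDB, the same connector provides a write path. The following is a minimal sketch that assumes a SparkSession configured with spark.mongodb.write.connection.uri as shown above; the database and collection names are placeholders.

```scala
// Minimal write sketch: append the rows of an existing DataFrame df
// to a MongoDB collection through the same connector.
df.write
  .format("mongodb")
  .mode("append")
  .option("database", "<database_name>")
  .option("collection", "<collection_name>")
  .save()
```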
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the navigation pane on the left, choose .
In the left-side navigation pane, choose .
In the editor, enter the following job content.
Important: You can use AnalyticDB for MySQL Spark to access ApsaraDB for MongoDB over a VPC or the Internet. We recommend that you use a VPC for access.
{
    "args": [
        -- The VPC endpoint of the ApsaraDB for MongoDB instance. You can view the endpoint on the Database Connection page in the ApsaraDB for MongoDB console.
        "mongodb://<username>:<password>@<host1>:<port1>,<host2>:<port2>,...,<hostN>:<portN>/<database_name>",
        -- The name of the database.
        "<database_name>",
        -- The name of the collection.
        "<collection_name>"
    ],
    "file": "oss://<bucket_name>/spark-mongodb.jar",
    "jars": [
        "oss://<bucket_name>/mongo-spark-connector_2.12-10.1.1.jar",
        "oss://<bucket_name>/mongodb-driver-sync-4.8.2.jar",
        "oss://<bucket_name>/bson-4.8.2.jar",
        "oss://<bucket_name>/bson-record-codec-4.8.2.jar",
        "oss://<bucket_name>/mongodb-driver-core-4.8.2.jar"
    ],
    "name": "MongoSparkConnectorIntro",
    "className": "com.aliyun.spark.SparkOnMongoDB",
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 2,
        "spark.executor.resourceSpec": "medium",
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp14pj8h0****",
        "spark.adb.eni.securityGroupId": "sg-bp11m93k021tp****"
    }
}

The following table describes the parameters.
args: The arguments that are required by the JAR package. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

file: The OSS path of the sample program spark-mongodb.jar.

jars: The OSS paths of the JAR packages on which Spark depends to access MongoDB.

name: The name of the Spark job.

className: The entry class of the Java or Scala program. The entry class is not required for a Python application.

spark.adb.eni.enabled: Specifies whether to enable Elastic Network Interface (ENI) access. You must enable ENI access when you use Data Lakehouse Edition Spark to access the MongoDB data source.

spark.adb.eni.vswitchId: The vSwitch ID. You can view the vSwitch ID on the Basic Information page of the ApsaraDB for MongoDB instance in the ApsaraDB for MongoDB console.

spark.adb.eni.securityGroupId: The ID of the security group to which the ApsaraDB for MongoDB instance is added. If no security group is added to the instance, see Add a security group.

Other conf parameters: The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key: value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.

Click Execute Now.
After the status of the application in the Application List changes to Completed, click Log in the Actions column to view the data in the ApsaraDB for MongoDB collection.
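Beyond checking the driver log, you can also spot-check the data from another Spark job by reading back only the documents that match a condition. The following is a hedged sketch that assumes the connector's aggregation.pipeline read option (available in Mongo Spark Connector 10.x) to push a filter down to the server; the pipeline, database, and collection values are illustrative placeholders.

```scala
// Optional verification sketch: read back only matching documents by
// pushing a MongoDB aggregation pipeline down to the server.
// The pipeline string and names below are illustrative placeholders.
val checked = spark.read.format("mongodb")
  .option("database", "<database_name>")
  .option("collection", "<collection_name>")
  .option("aggregation.pipeline", "[{ $match: { name: \"test\" } }]")
  .load()
checked.show()
```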