This topic describes how to use AnalyticDB for MySQL Spark to access data in ApsaraDB for MongoDB.
Prerequisites
Your AnalyticDB for MySQL cluster's product edition is Enterprise Edition, Basic Edition, or Data Lakehouse Edition.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
You have created an ApsaraDB for MongoDB instance in the same region as your AnalyticDB for MySQL cluster. You have also created a database and a collection in the ApsaraDB for MongoDB instance and written data to the database. For more information, see Getting started.
You have added the IP address of the vSwitch that is associated with the ApsaraDB for MongoDB instance to a whitelist for the instance.
Note: You can view the vSwitch ID on the Basic Information page of the ApsaraDB for MongoDB instance in the ApsaraDB for MongoDB console. To view the IP address of the vSwitch, log on to the Virtual Private Cloud (VPC) console.
You have added the ApsaraDB for MongoDB instance to a security group. The inbound and outbound rules of this security group allow access requests on the ports of the ApsaraDB for MongoDB instance. For more information, see Add a security group and Add a security group rule.
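If the collection does not contain data yet, the prerequisite test data can be written with a short standalone program based on the MongoDB Java sync driver (mongodb-driver-sync), which is also one of the JAR packages downloaded later in this topic. The following is a minimal sketch; the connection string, database name, collection name, and document fields are placeholders that you must replace with your own values.

```scala
import com.mongodb.client.MongoClients
import org.bson.Document

// Minimal sketch: write one test document to an ApsaraDB for MongoDB collection.
// The connection string, database name, and collection name are placeholders.
object SeedMongoDB {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create(
      "mongodb://<username>:<password>@<host>:<port>/<database_name>")
    try {
      val collection = client
        .getDatabase("<database_name>")
        .getCollection("<collection_name>")
      // Insert a sample document that the Spark job can later read back.
      collection.insertOne(new Document("name", "test").append("value", 1))
    } finally {
      client.close()
    }
  }
}
```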
Procedure
Download the required JAR packages for AnalyticDB for MySQL Spark to access ApsaraDB for MongoDB: mongo-spark-connector_2.12-10.1.1.jar, mongodb-driver-sync-4.8.2.jar, bson-4.8.2.jar, bson-record-codec-4.8.2.jar, and mongodb-driver-core-4.8.2.jar.
Add the following dependency to the pom.xml file.
<dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_2.12</artifactId>
    <version>10.1.1</version>
</dependency>

Write, compile, and package a program to access ApsaraDB for MongoDB. In this topic, the generated JAR package is named spark-mongodb.jar.

package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkOnMongoDB {
  def main(args: Array[String]): Unit = {
    // The VPC endpoint of the ApsaraDB for MongoDB instance. You can view the endpoint on the Database Connection page in the ApsaraDB for MongoDB console.
    val connectionUri = args(0)
    // The name of the database.
    val database = args(1)
    // The name of the collection.
    val collection = args(2)
    val spark = SparkSession.builder()
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.read.connection.uri", connectionUri)
      .config("spark.mongodb.write.connection.uri", connectionUri)
      .getOrCreate()
    val df = spark.read.format("mongodb")
      .option("database", database)
      .option("collection", collection)
      .load()
    df.show()
    spark.stop()
  }
}

Note: For more information about Spark configurations for MongoDB, see Configuration Options. For more code examples, see Write to MongoDB and Read from MongoDB.
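The sample program only reads data. If you also need to write a DataFrame back to ApsaraDB for MongoDB, the same connector provides a write path. The following is a minimal sketch that assumes a SparkSession configured with spark.mongodb.write.connection.uri as shown above; the database and collection names are placeholders.

```scala
// Minimal write sketch: append the rows of an existing DataFrame df
// to a MongoDB collection through the same connector.
df.write
  .format("mongodb")
  .mode("append")
  .option("database", "<database_name>")
  .option("collection", "<collection_name>")
  .save()
```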
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the navigation pane on the left, choose .
In the left-side navigation pane, choose .
In the editor, enter the following job content.
Important: You can use AnalyticDB for MySQL Spark to access ApsaraDB for MongoDB over a VPC or the Internet. We recommend that you use a VPC for access.
{
    "args": [
        -- The VPC endpoint of the ApsaraDB for MongoDB instance. You can view the endpoint on the Database Connection page in the ApsaraDB for MongoDB console.
        "mongodb://<username>:<password>@<host1>:<port1>,<host2>:<port2>,...,<hostN>:<portN>/<database_name>",
        -- The name of the database.
        "<database_name>",
        -- The name of the collection.
        "<collection_name>"
    ],
    "file": "oss://<bucket_name>/spark-mongodb.jar",
    "jars": [
        "oss://<bucket_name>/mongo-spark-connector_2.12-10.1.1.jar",
        "oss://<bucket_name>/mongodb-driver-sync-4.8.2.jar",
        "oss://<bucket_name>/bson-4.8.2.jar",
        "oss://<bucket_name>/bson-record-codec-4.8.2.jar",
        "oss://<bucket_name>/mongodb-driver-core-4.8.2.jar"
    ],
    "name": "MongoSparkConnectorIntro",
    "className": "com.aliyun.spark.SparkOnMongoDB",
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 2,
        "spark.executor.resourceSpec": "medium",
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": "vsw-bp14pj8h0****",
        "spark.adb.eni.securityGroupId": "sg-bp11m93k021tp****"
    }
}

The following table describes the parameters.
args: The arguments that are required by the JAR package. Specify the arguments based on your business requirements. Separate multiple arguments with commas (,).

file: The OSS path of the sample program spark-mongodb.jar.

jars: The OSS paths of the JAR packages on which Spark depends to access MongoDB.

name: The name of the Spark job.

className: The entry class of the Java or Scala program. The entry class is not required for a Python application.

spark.adb.eni.enabled: Specifies whether to enable Elastic Network Interface (ENI) access. You must enable ENI access when you use Data Lakehouse Edition Spark to access the MongoDB data source.

spark.adb.eni.vswitchId: The vSwitch ID. You can view the vSwitch ID on the Basic Information page of the ApsaraDB for MongoDB instance in the ApsaraDB for MongoDB console.

spark.adb.eni.securityGroupId: The ID of the security group to which the ApsaraDB for MongoDB instance is added. If no security group is added to the instance, see Add a security group.

Other conf parameters: The configuration parameters that are required for the Spark job, which are similar to those of Apache Spark. Specify the parameters in the key: value format. Separate multiple parameters with commas (,). For more information, see Spark application configuration parameters.

Click Execute Now.
After the status of the application in the Application List changes to Completed, click Log in the Actions column to view the data in the ApsaraDB for MongoDB collection.
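Beyond checking the driver log, you can also spot-check the data from another Spark job by reading back only the documents that match a condition. The following is a hedged sketch that assumes the connector's aggregation.pipeline read option (available in Mongo Spark Connector 10.x) to push a filter down to the server; the pipeline, database, and collection values are illustrative placeholders.

```scala
// Optional verification sketch: read back only matching documents by
// pushing a MongoDB aggregation pipeline down to the server.
// The pipeline string and names below are illustrative placeholders.
val checked = spark.read.format("mongodb")
  .option("database", "<database_name>")
  .option("collection", "<collection_name>")
  .option("aggregation.pipeline", "[{ $match: { name: \"test\" } }]")
  .load()
checked.show()
```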