E-MapReduce:Configure network connectivity between EMR Serverless Spark and a data source across VPCs

Last Updated: Sep 14, 2024

The network connection feature of E-MapReduce (EMR) Serverless Spark allows you to access data sources across virtual private clouds (VPCs). This topic describes how to connect to Hive Metastore across VPCs by using a Spark SQL job or a JAR job.

Prerequisites

A data source is prepared. In this example, a DataLake cluster that contains the Hive service is created. Built-in MySQL is selected for Metadata when the cluster is created. For more information, see Create a cluster.

Limits

  • Only one network connection can be created in each workspace.

  • You can use only vSwitches in specific zones to create network connections. The following table lists the zones.

    Region name            Region ID         Zone name
    Germany (Frankfurt)    eu-central-1      Frankfurt Zone A, Frankfurt Zone B
    Indonesia (Jakarta)    ap-southeast-5    Jakarta Zone A

Step 1: Create a network connection

  1. Go to the Network Connection page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, find the desired workspace and click the name of the workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, choose Admin > Network Connection.

  2. On the Network Connection page, click Create Network Connection.

  3. In the Create Network Connection dialog box, configure the following parameters and click OK.

    • Name: The name of the network connection.

    • VPC: The VPC in which your EMR cluster resides. If no VPC is available, click Create VPC to create a VPC in the VPC console. For more information, see Create and manage a VPC.

    • vSwitch: The vSwitch to which the EMR cluster is connected. If no vSwitch is available in the current zone, click Create vSwitch to create a vSwitch in the VPC console. For more information, see Create and manage a vSwitch.

      Important: You can select only a vSwitch in specific zones. For more information, see the Limits section of this topic.

    If Succeeded is displayed in the Status column of the connection, the network connection is created.


Step 2: Configure security group rules for the EMR cluster

  1. Obtain the CIDR block of the vSwitch that you specified when you created the network connection.

    You can log on to the VPC console and obtain the CIDR block of the desired vSwitch on the vSwitch page.


  2. Configure security group rules.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. On the EMR on ECS page, find the desired cluster and click the name of the cluster.

    3. In the Security section of the Basic Information tab, click the link to the right of Cluster Security Group.

    4. On the Security Group Details tab of the page that appears, click Add Rule, configure the Port Range and Authorization Object parameters, and then click Save in the Actions column.

      • Port Range: The port number. Enter 9083.

      • Authorization Object: The CIDR block of the vSwitch that you obtained in the previous step.

        Important: To prevent attacks from external users, we recommend that you do not set the Authorization Object parameter to 0.0.0.0/0.
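    If you prefer to script this step, you can query the vSwitch CIDR block and add the security group rule with the Alibaba Cloud CLI (aliyun). The following is a minimal sketch, not the exact procedure in this topic; the region ID, vSwitch ID, security group ID, and CIDR block are placeholders that you must replace with your own values.

      # Look up the CIDR block of the vSwitch that is used by the network connection.
      aliyun vpc DescribeVSwitchAttributes \
          --RegionId eu-central-1 \
          --VSwitchId vsw-xxxxxxxxxxxxxxxxx

      # Allow that CIDR block to reach Hive Metastore on port 9083.
      aliyun ecs AuthorizeSecurityGroup \
          --RegionId eu-central-1 \
          --SecurityGroupId sg-xxxxxxxxxxxxxxxxx \
          --IpProtocol tcp \
          --PortRange 9083/9083 \
          --SourceCidrIp 192.168.0.0/24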

(Optional) Step 3: Connect to Hive and query data in a Hive table

If you have created and configured a Hive table, skip this step.

  1. Log on to the master node of the DataLake cluster in SSH mode. For more information, see Log on to a cluster.

  2. Run the following command to open the Hive CLI:

    hive
  3. Execute the following statement to create a table:

    CREATE TABLE my_table (id INT, name STRING);
  4. Execute the following statements to insert data into the table:

    INSERT INTO my_table VALUES (1, 'John'); 
    INSERT INTO my_table VALUES (2, 'Jane');
  5. Execute the following statement to query data in the table:

    SELECT * FROM my_table;

(Optional) Step 4: Prepare and upload a resource file

If you want to use a JAR job, you must prepare a resource file in advance. If you use a Spark SQL job, you can skip this step.

  1. Create an on-premises Maven project.

    Sample code of the project

    package com.example;
    
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    
    public class DataFrameExample {
        public static void main(String[] args) {
            // Create a Spark session. 
            SparkSession spark = SparkSession.builder()
                    .appName("HMSQueryExample")
                    .enableHiveSupport()
                    .getOrCreate();
    
            // Execute a query statement. 
            Dataset<Row> result = spark.sql("SELECT * FROM default.my_table");
    
            // Print the query results. 
            result.show();
    
            // Close the Spark session. 
            spark.stop();
        }
    }

    Sample code of the POM file

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.example</groupId>
        <artifactId>sparkDataFrame</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <maven.compiler.source>8</maven.compiler.source>
            <maven.compiler.target>8</maven.compiler.target>
            <spark.version>3.3.1</spark.version>
            <scala.binary.version>2.12</scala.binary.version>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_${scala.binary.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_${scala.binary.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-hive_${scala.binary.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
        </dependencies>
    </project>
  2. Run the mvn package command to compile and package the project into the sparkDataFrame-1.0-SNAPSHOT.jar package. An example build command is provided after this procedure.

  3. In the left-side navigation pane of the EMR Serverless Spark page, click Artifacts.

  4. On the Artifacts page, click Upload File.

  5. Upload the sparkDataFrame-1.0-SNAPSHOT.jar package.
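The exact build command depends on your environment. The following is a minimal sketch that assumes the sample POM file above, which by default writes the package to the target directory.

    # Build the project from the directory that contains pom.xml.
    mvn clean package

    # Verify that the JAR package exists before you upload it.
    ls target/sparkDataFrame-1.0-SNAPSHOT.jar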

Step 5: Create and run a job

JAR job

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

  2. Click New.

  3. In the New dialog box, enter a name in the Name field, choose Application (Batch) > JAR from the Type drop-down list, and then click OK.

  4. Configure the following parameters and click Run.

    • Main JAR Resource: The path of the resource file that you uploaded in the previous step. In this example, enter the path of the sparkDataFrame-1.0-SNAPSHOT.jar package.

    • Main Class: The main class that is specified when the Spark job is submitted. In this example, enter com.example.DataFrameExample.

    • Spark Configuration: The Spark configurations. The following code provides an example:

      spark.hadoop.hive.metastore.uris thrift://*.*.*.*:9083
      spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory
      spark.emr.serverless.network.service.name <yourConnectionName>

    Replace the following information based on your business requirements:

    • Replace *.*.*.* with the internal IP address of Hive Metastore. In this example, the internal IP address of the master node of the EMR cluster is used. You can view the internal IP address on the Nodes tab of the EMR cluster.

    • Replace <yourConnectionName> with the network connection name that you specified in Step 1.

  5. After you run the job, on the Execution Records tab, click Details in the Actions column of the job.

  6. View the logs. In the left-side navigation pane of the EMR Serverless Spark page, click Job History. On the Job History page, click the Development Job Runs tab. Then, click the Log Exploration tab.

Spark SQL job

  1. Create and start an SQL compute. For more information, see Manage SQL computes.

    You must configure the following parameters for Spark Configuration.

    spark.hadoop.hive.metastore.uris thrift://*.*.*.*:9083
    spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory
    spark.emr.serverless.network.service.name <yourConnectionName>

    Replace the following information based on your business requirements:

    • Replace *.*.*.* with the internal IP address of Hive Metastore. In this example, the internal IP address of the master node of the EMR cluster is used. You can view the internal IP address on the Nodes tab of the EMR cluster.

    • Replace <yourConnectionName> with the network connection name that you specified in Step 1.

  2. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

  3. Click New.

  4. In the New dialog box, enter a name in the Name field, choose SQL > SparkSQL from the Type drop-down list, and then click OK.

  5. Select a catalog from the Default Catalog drop-down list, a database from the Default Database drop-down list, and a started SQL compute from the SQL Compute drop-down list. Then, enter the following command in the code editor and click Run.

    SELECT * FROM default.my_table;

    Note

    If you want to deploy external Metastore-based SQL code to a workflow, make sure that the table name is specified in the db.table_name format in your SQL statement. You must also select a catalog in the catalog_id.default format from the Default Catalog drop-down list in the upper-right corner of the Development page.

    You can view the returned information on the Execution Results tab in the lower part of the page.
