The network connection feature of E-MapReduce (EMR) Serverless Spark allows you to access data sources across virtual private clouds (VPCs). This topic describes how to connect to Hive Metastore across VPCs by using a Spark SQL job or a JAR job.
Prerequisites
A data source is prepared. In this example, a DataLake cluster that contains the Hive service is created. Built-in MySQL is selected for Metadata when the cluster is created. For more information, see Create a cluster.
Limits
Only one network connection can be created.
You can use only vSwitches in specific zones to create network connections. The following table lists the zones.
| Region name | Region ID | Zone name |
| --- | --- | --- |
| Germany (Frankfurt) | eu-central-1 | Frankfurt Zone A, Frankfurt Zone B |
| Indonesia (Jakarta) | ap-southeast-5 | Jakarta Zone A |
| Singapore | ap-southeast-1 | Singapore Zone B, Singapore Zone C |
| US (Virginia) | us-east-1 | Virginia Zone A, Virginia Zone B |
Step 1: Create a network connection
Go to the Network Connections page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, find the desired workspace and click the name of the workspace.
In the left-side navigation pane of the EMR Serverless Spark page, click Network Connections.
On the Network Connections page, click Create Network Connection.
In the Create Network Connection dialog box, configure parameters and click OK. The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| Name | The name of the network connection. |
| VPC | The VPC in which your EMR cluster resides. If no VPC is available, click Create VPC to create a VPC in the VPC console. For more information, see Create and manage a VPC. |
| vSwitch | The vSwitch to which the EMR cluster is connected. If no vSwitch is available in the current zone, click Create vSwitch to create a vSwitch in the VPC console. For more information, see Create and manage a vSwitch. Important: You can select only a vSwitch in specific zones. For more information, see the Limits section of this topic. |
If Succeeded is displayed in the Status column of the connection, the network connection is created.
Step 2: Configure security group rules for the EMR cluster
Obtain the CIDR block of the vSwitch that you specified when you created the network connection.
You can log on to the VPC console and obtain the CIDR block of the desired vSwitch on the vSwitch page.
Configure security group rules.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
On the EMR on ECS page, find the desired cluster and click the name of the cluster.
In the Security section of the Basic Information tab, click the link to the right of Cluster Security Group.
On the Security Group Details tab of the page that appears, click Add Rule, configure the Port Range and Authorization Object parameters, and then click Save in the Actions column.
| Parameter | Description |
| --- | --- |
| Port Range | The port number. Enter 9083, which is the default port of Hive Metastore. |
| Authorization Object | The CIDR block of the vSwitch that you obtained in the previous step. Important: To prevent attacks from external users, we recommend that you do not set the Authorization Object parameter to 0.0.0.0/0. |
(Optional) Step 3: Connect to Hive and query data in a Hive table
If you have created and configured a Hive table, skip this step.
Log on to the master node of the DataLake cluster in SSH mode. For more information, see Log on to a cluster.
Run the following command to open the Hive CLI:
hive
Execute the following statement to create a table:
CREATE TABLE my_table (id INT, name STRING);
Execute the following statements to insert data into the table:
INSERT INTO my_table VALUES (1, 'John');
INSERT INTO my_table VALUES (2, 'Jane');
Execute the following statement to query data in the table:
SELECT * FROM my_table;
(Optional) Step 4: Prepare and upload a resource file
If you want to use a JAR job, you must prepare a resource file in advance. If you use a Spark SQL job, you can skip this step.
Create an on-premises Maven project.
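The project must declare the Spark dependencies so that the job compiles. The following is a minimal sketch of the dependency section of the pom.xml file; the artifact and version values shown here are assumptions, and you must align them with the Spark engine version and Scala version of your workspace:

```xml
<!-- Hypothetical example: adjust the versions to match your workspace. -->
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.1</version>
        <!-- "provided" keeps Spark classes out of the job JAR because the runtime supplies them. -->
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
```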
Run the mvn package command to compile and package the files into the sparkDataFrame-1.0-SNAPSHOT.jar package.
In the left-side navigation pane of the EMR Serverless Spark page, click Files.
On the Files page, click Upload File.
Upload the sparkDataFrame-1.0-SNAPSHOT.jar package.
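For reference, the main class that is specified in Step 5, com.example.DataFrameExample, might look like the following minimal sketch. This is an illustrative example, not the exact source code of the sample package; it assumes the my_table table that is created in Step 3:

```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameExample {
    public static void main(String[] args) {
        // enableHiveSupport lets Spark resolve tables through the Hive Metastore
        // that is configured in the Spark Configuration parameter of the job.
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameExample")
                .enableHiveSupport()
                .getOrCreate();

        // Query the Hive table created in Step 3 and print the rows to the job log.
        Dataset<Row> df = spark.sql("SELECT * FROM default.my_table");
        df.show();

        spark.stop();
    }
}
```

The job can only run inside the Spark runtime of the workspace, because the metastore address and the network connection are supplied through the job's Spark configurations rather than hard-coded in the class.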
Step 5: Create and run a job
JAR job
In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.
Click Create.
In the Create dialog box, enter a name in the Name field, choose JAR from the Type drop-down list, and then click OK.
Configure the parameters that are described in the following table and click Run.
| Parameter | Description |
| --- | --- |
| Main JAR Resource | Enter the path of the resource file that you uploaded in the previous step. In this example, enter the path of the sparkDataFrame-1.0-SNAPSHOT.jar package. |
| Main Class | The main class that is specified when the Spark job is submitted. In this example, enter com.example.DataFrameExample. |
| Spark Configuration | Enter the Spark configurations. The following code provides an example of Spark configurations: |

spark.hadoop.hive.metastore.uris thrift://*.*.*.*:9083
spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory
spark.emr.serverless.network.service.name <yourConnectionName>
Replace the following information based on your business requirements:
Replace *.*.*.* with the internal IP address of Hive Metastore. In this example, the internal IP address of the master node of the EMR cluster is used. You can view the internal IP address on the Nodes tab of the EMR cluster.
Replace <yourConnectionName> with the network connection name that you specified in Step 1.
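For example, if the internal IP address of the master node is 192.168.0.100 and the network connection is named connection_to_emr (both values are hypothetical and used only for illustration), the Spark configurations would read:

```
spark.hadoop.hive.metastore.uris thrift://192.168.0.100:9083
spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory
spark.emr.serverless.network.service.name connection_to_emr
```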
After you run the job, on the Execution Records tab, click Details in the Actions column of the job.
View the logs. In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Job History. On the Job History page, click the Development Job Runs tab. Then, find the desired job run and click the name of the job run. On the page that appears, click the Log Exploration tab.
Spark SQL job
Create and start an SQL session. For more information, see Manage SQL sessions.
You must configure the following parameters for Spark Configuration.
spark.hadoop.hive.metastore.uris thrift://*.*.*.*:9083
spark.hadoop.hive.imetastoreclient.factory.class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory
spark.emr.serverless.network.service.name <yourConnectionName>
Replace the following information based on your business requirements:
Replace *.*.*.* with the internal IP address of Hive Metastore. In this example, the internal IP address of the master node of the EMR cluster is used. You can view the internal IP address on the Nodes tab of the EMR cluster.
Replace <yourConnectionName> with the network connection name that you specified in Step 1.
In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.
Click Create.
In the Create dialog box, enter a name in the Name field, choose SparkSQL from the Type drop-down list, and then click OK.
Select a catalog from the Default Catalog drop-down list, a database from the Default Database drop-down list, and a started SQL session from the SQL Sessions drop-down list. Then, enter the following command in the code editor and click Run.
SELECT * FROM default.my_table;
Note: If you want to deploy external Metastore-based SQL code to a workflow, make sure that the table name is specified in the db.table_name format in your SQL statement. You must also select a catalog in the catalog_id.default format from the Default Catalog drop-down list in the upper-right corner of the Development page.
You can view the returned information on the Execution Results tab in the lower part of the page.