AnalyticDB for MySQL allows you to use Spark SQL to access Lindorm data. This topic describes how to use Spark SQL to access data in Hive tables and wide tables of Lindorm.
Prerequisites
A single-zone Lindorm instance is created. For more information, see Create an instance.
Important: You must activate LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS for the Lindorm instance. For information about how to activate LDPS, see Activate LDPS and modify the configurations.
Object Storage Service (OSS) is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS, Create a bucket, and Create a directory.
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster.
Preparations
Log on to the Lindorm console and go to the Instance Details page to view the vSwitch ID of the Lindorm instance.
Log on to the Lindorm console and go to the Access Control page. Click the Security Groups tab to view the ID of the security group to which the Lindorm instance belongs. For information about how to add a security group, see Add a security group.
Access the data of a Hive table in Lindorm
Obtain the hdfs-site configuration information of your Hadoop Distributed File System (HDFS) client.
Note: To obtain the hdfs-site configuration information, contact Lindorm technical support (DingTalk ID: s0s3eg3).
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and a job resource group.
On the SQLConsole tab, enter the following code:
SET spark.adb.connectors = oss,external_hive;
SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
SET spark.sql.catalog.lindorm_catalog = org.apache.spark.sql.hive.V2HiveExternalCatalog;
SET spark.adb.eni.adbHostAlias.enabled=true;
SET spark.sql.catalogImplementation = hive;
-- Configure HDFS-related parameters.
SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083;
SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;
SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
SET spark.hadoop.dfs.ha.automatic-failover.enabled=true;
SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2;
SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:8020;
SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020;
SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071;
SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071;
-- Create an external database and an external table in OSS.
CREATE DATABASE external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
CREATE TABLE external_oss_db.test(id int, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/test';
-- Insert the data of a Hive table into the OSS external table.
INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test;
-- Read the data of the OSS external table.
SELECT * FROM external_oss_db.test;
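To confirm that the catalog and network settings are correct before you run the full statement block, you can execute only the SET statements together with a read-only query. The following is a minimal sanity check, assuming that the spark_test.test Hive table from the example exists:
-- List the tables that are visible through the Hive catalog.
SHOW TABLES IN lindorm_catalog.spark_test;
-- Preview a few rows of the source Hive table.
SELECT * FROM lindorm_catalog.spark_test.test LIMIT 10;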
The following table describes the parameters.

spark.adb.connectors
Required: Yes
Description: The names of the built-in connectors of AnalyticDB for MySQL Spark that you want to enable. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, external_hive, and jindo. In this example, oss and external_hive are used.

spark.adb.eni.enabled
Required: Yes
Description: Specifies whether to enable Elastic Network Interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. In this example, set this parameter to true.

spark.adb.eni.vswitchId
Required: Yes
Description: The vSwitch ID of the Lindorm instance that you obtained during preparations.

spark.adb.eni.securityGroupId
Required: Yes
Description: The security group ID of the Lindorm instance that you obtained during preparations.

spark.sql.catalog.lindorm_catalog
Required: Yes
Description: The catalog implementation that Spark SQL uses to access the data source. In this example, this parameter is set to org.apache.spark.sql.hive.V2HiveExternalCatalog. Note: You can specify a custom catalog name in the spark.sql.catalog.<catalog_name> parameter. In this example, lindorm_catalog is used.

spark.adb.eni.adbHostAlias.enabled
Required: Yes
Description: Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values: true and false. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.

HDFS-related parameters
Required: Yes
Description: The hdfs-site configuration information that is obtained in Step 1. Format: SET spark.hadoop.<name>=<value>;. Note: Replace <name> and <value> with the actual values in the configuration information, as shown in the sketch after this table.

location
Required: Yes
Description: The OSS path of the external table. Format: oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/test is used. Note: Replace the path with your actual OSS path.

Click Execute.
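For reference, the following sketch shows how a single hdfs-site property translates into a SET statement, as mentioned in the HDFS-related parameters row above. The property name and value here are hypothetical placeholders, not values from your instance:
-- Hypothetical hdfs-site property:
--   <property>
--     <name>dfs.client.use.datanode.hostname</name>
--     <value>true</value>
--   </property>
-- Corresponding Spark SQL statement:
SET spark.hadoop.dfs.client.use.datanode.hostname=true;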
After you run the Spark job, go to the Spark JAR Development page and click Logs in the Actions column on the Applications tab to view the table data. For more information, see Spark editor.
Access the data of a wide table in Lindorm
Obtain the Virtual Private Cloud (VPC) endpoint of LindormTable for access by using the HBase Java API. For more information, see the "View the endpoints of LindormTable" section of the View endpoints topic.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and a job resource group.
On the SQLConsole tab, enter the following code:
SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
SET spark.sql.catalog.lindorm_table = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
SET spark.sql.catalog.lindorm_table.username=test;
SET spark.sql.catalog.lindorm_table.password=password;
ADD JAR "oss://<bucketname>/hbase_jars/*";
-- Create an external database and an external table in OSS.
CREATE DATABASE IF NOT EXISTS external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
CREATE TABLE IF NOT EXISTS external_oss_db.lindorm_table_2(id string, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/lindorm_table_2';
-- Insert the data of a wide table into the OSS external table.
INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb;
-- Read the data of the OSS external table.
SELECT * FROM external_oss_db.lindorm_table_2;
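To verify the result, you can run the following read-only queries after the job completes. This is a minimal check, assuming the example table names above; the row counts of the source wide table and the OSS external table should match:
-- Compare the row counts of the source and destination tables.
SELECT COUNT(*) FROM lindorm_table.spark_test.tb;
SELECT COUNT(*) FROM external_oss_db.lindorm_table_2;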
The following table describes the parameters.

spark.adb.eni.enabled
Required: Yes
Description: Specifies whether to enable Elastic Network Interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. In this example, set this parameter to true.

spark.adb.eni.vswitchId
Required: Yes
Description: The vSwitch ID of the Lindorm instance that you obtained during preparations.

spark.adb.eni.securityGroupId
Required: Yes
Description: The security group ID of the Lindorm instance that you obtained during preparations.

spark.sql.catalog.lindorm_table
Required: Yes
Description: The catalog implementation that Spark SQL uses to access the data source. In this example, this parameter is set to com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog. Note: You can specify a custom catalog name in the spark.sql.catalog.<catalog_name> parameter. In this example, lindorm_table is used (see the sketch after this table).

spark.sql.catalog.lindorm_table.url
Required: Yes
Description: The VPC endpoint of LindormTable for access by using the HBase Java API that is obtained in Step 1.

spark.sql.catalog.lindorm_table.username
Required: Yes
Description: The username of LindormTable.

spark.sql.catalog.lindorm_table.password
Required: Yes
Description: The password of LindormTable. If you forget your password, you can change the password in the cluster management system of LindormTable.

jar
Required: Yes
Description: The OSS path to which the JAR packages that are required to access LindormTable are uploaded. In this example, oss://testBucketName/hbase_jars/* is used.

location
Required: Yes
Description: The OSS path of the external table. Format: oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/lindorm_table_2 is used. Note: Replace the path with your actual OSS path.

Click Execute.
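As noted in the table, the catalog name is not fixed: the <catalog_name> part of spark.sql.catalog.<catalog_name> determines the name that your queries reference. The following is a minimal sketch that uses a hypothetical catalog name, my_lindorm, with the same example values:
SET spark.sql.catalog.my_lindorm = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
SET spark.sql.catalog.my_lindorm.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
SET spark.sql.catalog.my_lindorm.username=test;
SET spark.sql.catalog.my_lindorm.password=password;
-- Queries reference the same wide table through the custom catalog name.
SELECT * FROM my_lindorm.spark_test.tb;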
After you run the Spark job, go to the Spark JAR Development page and click Logs in the Actions column on the Applications tab to view the table data. For more information, see Spark editor.