AnalyticDB: Use Spark SQL to read Lindorm data

Last Updated: Sep 05, 2024

AnalyticDB for MySQL allows you to use Spark SQL to access Lindorm data. This topic describes how to use Spark SQL to access the data of Hive tables and wide tables in Lindorm.

Prerequisites

  • A single-zone Lindorm instance is created. For more information, see Create an instance.

    Important

    You must activate LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS for the Lindorm instance. For information about how to activate LDPS, see Activate LDPS and modify the configurations.

  • OSS is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS, Create a bucket, and Create a directory.

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • A job resource group is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster.

Preparations

  1. Log on to the Lindorm console and go to the Instance Details page to view the vSwitch ID of the Lindorm instance.

  2. Log on to the Lindorm console and go to the Access Control page. Click the Security Groups tab to view the ID of the security group to which the Lindorm instance belongs. For information about how to add a security group, see Add a security group.

Access the data of a Hive table in Lindorm

  1. Obtain the hdfs-site configuration information of your Hadoop Distributed File System (HDFS) client.

    Note

    To obtain the hdfs-site configuration information, contact Lindorm technical support (DingTalk ID: s0s3eg3).
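
    Each property in the hdfs-site configuration maps to one SET statement of the form SET spark.hadoop.<name>=<value> in the Spark SQL job. The following minimal sketch shows the mapping; the dfs.replication property and its value are hypothetical placeholders, not values from your instance:

    -- A hypothetical hdfs-site property:
    --   <property>
    --     <name>dfs.replication</name>
    --     <value>2</value>
    --   </property>
    -- maps to the following SET statement:
    SET spark.hadoop.dfs.replication=2;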

  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Development > SQL Development.

  4. On the SQLConsole tab, select the Spark engine and a job resource group.

  5. On the SQLConsole tab, enter the following code:

    SET spark.adb.connectors = oss,external_hive;
    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_catalog = org.apache.spark.sql.hive.V2HiveExternalCatalog;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.sql.catalogImplementation = hive;
    -- Configure HDFS-related parameters.
    SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083;
    SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;
    SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
    SET spark.hadoop.dfs.ha.automatic-failover.enabled=true;
    SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071;
    
    -- Create an external database and an external table in OSS.
    CREATE DATABASE external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
    CREATE TABLE external_oss_db.test(id int, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/test';
    
    -- Insert the data of a Hive table into the OSS external table.
    INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test;
    
    -- Read the data of the OSS external table.
    SELECT * FROM external_oss_db.test;

    The following list describes the parameters. All of the parameters are required.

    • spark.adb.connectors: The names of the built-in connectors of AnalyticDB for MySQL Spark that you want to enable. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, external_hive, and jindo. In this example, oss and external_hive are used.

    • spark.adb.eni.enabled: Specifies whether to enable Elastic Network Interface (ENI). If you use external tables to access other external data sources, you must set this parameter to true. Valid values: true and false. In this example, true is used.

    • spark.adb.eni.vswitchId: The vSwitch ID of the Lindorm instance that you obtained during preparations.

    • spark.adb.eni.securityGroupId: The ID of the security group to which the Lindorm instance belongs, which you obtained during preparations.

    • spark.sql.catalog.lindorm_catalog: The catalog implementation that Spark SQL uses to access the data source. Set this parameter to org.apache.spark.sql.hive.V2HiveExternalCatalog. You can specify a custom catalog name in place of lindorm_catalog in the spark.sql.catalog.<catalog_name> parameter, as shown in the sketch after this list.

    • spark.adb.eni.adbHostAlias.enabled: Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values: true and false. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.

    • HDFS-related parameters: The hdfs-site configuration information that you obtained in Step 1, in the format SET spark.hadoop.<name>=<value>. Replace <name> and <value> with the actual values from the configuration information.

    • location: The OSS path of the external table, in the format oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/test is used. Replace the path with your actual OSS path.
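
    The identifier that follows spark.sql.catalog. is the catalog name, and it is also the prefix that you use to reference tables in queries. The following is a minimal sketch, assuming a hypothetical catalog named my_hive instead of lindorm_catalog:

    -- The identifier after spark.sql.catalog. is a catalog name that you choose.
    SET spark.sql.catalog.my_hive = org.apache.spark.sql.hive.V2HiveExternalCatalog;
    -- The same Hive table is then referenced through the my_hive prefix.
    SELECT * FROM my_hive.spark_test.test;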

  6. Click Execute.

  7. After you run the Spark job, go to the Spark JAR Development page and click Logs in the Actions column on the Applications tab to view the table data. For more information, see Spark editor.

Access the data of a wide table in Lindorm

  1. Obtain the Virtual Private Cloud (VPC) endpoint of LindormTable for access by using the HBase Java API. For more information, see the "View the endpoints of LindormTable" section of the View endpoints topic.

  2. Download the package that contains the JAR files required to connect to the Lindorm instance, decompress the package, and then upload the JAR files to OSS. For more information, see the "Upload an object" section of the Get started by using the OSS console topic.

  3. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  4. In the left-side navigation pane, choose Job Development > SQL Development.

  5. On the SQLConsole tab, select the Spark engine and a job resource group.

  6. On the SQLConsole tab, enter the following code:

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_table = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
    SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
    SET spark.sql.catalog.lindorm_table.username=test;
    SET spark.sql.catalog.lindorm_table.password=password;
    ADD JAR "oss://<bucketname>/hbase_jars/*";
    
    -- Create an external database and an external table in OSS.
    CREATE DATABASE IF NOT EXISTS external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
    CREATE TABLE IF NOT EXISTS external_oss_db.lindorm_table_2(id string, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/lindorm_table_2';
    
    -- Insert the data of a wide table into the OSS external table.
    INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb;
    
    -- Read the data of the OSS external table.
    SELECT * FROM external_oss_db.lindorm_table_2;

    The following list describes the parameters. All of the parameters are required.

    • spark.adb.eni.enabled: Specifies whether to enable Elastic Network Interface (ENI). If you use external tables to access other external data sources, you must set this parameter to true. Valid values: true and false. In this example, true is used.

    • spark.adb.eni.vswitchId: The vSwitch ID of the Lindorm instance that you obtained during preparations.

    • spark.adb.eni.securityGroupId: The ID of the security group to which the Lindorm instance belongs, which you obtained during preparations.

    • spark.sql.catalog.lindorm_table: The catalog implementation that Spark SQL uses to access the data source. Set this parameter to com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog. You can specify a custom catalog name in place of lindorm_table in the spark.sql.catalog.<catalog_name> parameter.

    • spark.sql.catalog.lindorm_table.url: The VPC endpoint of LindormTable for access by using the HBase Java API, which you obtained in Step 1.

    • spark.sql.catalog.lindorm_table.username and spark.sql.catalog.lindorm_table.password: The username and password of LindormTable. If you forget your password, you can change the password in the cluster management system of LindormTable.

    • jar: The OSS path of the JAR files that you uploaded in Step 2. In this example, oss://testBucketName/hbase_jars/* is used.

    • location: The OSS path of the external table, in the format oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/lindorm_table_2 is used. Replace the path with your actual OSS path.
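
    Before you run the full INSERT statement, you can preview the wide table through the catalog to confirm that the ENI, endpoint, and credentials are configured correctly. The following is a minimal sketch that assumes the spark_test.tb wide table from the preceding example:

    -- Preview a few rows of the Lindorm wide table through the catalog.
    SELECT * FROM lindorm_table.spark_test.tb LIMIT 10;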

  7. Click Execute.

  8. After you run the Spark job, go to the Spark JAR Development page and click Logs in the Actions column on the Applications tab to view the table data. For more information, see Spark editor.