ApsaraDB for SelectDB supports federated queries and can integrate external data sources such as data lakes, databases, and remote files for convenient and efficient data analysis. This topic describes how to integrate a Hive data source with SelectDB by using a catalog and use the Hive data source for federated analysis.
Prerequisites
The IP addresses of all nodes in your Hive cluster are added to the IP address whitelist of your ApsaraDB for SelectDB instance. For more information, see Configure an IP address whitelist.
If your Hive cluster is built on Hadoop Distributed File System (HDFS), the following ports are open so that data can be transferred between the Hive cluster and SelectDB:
Port specified by the hive.metastore.uris parameter. Default value: 9083.
Port specified by the dfs.namenode.rpc-address parameter. Default value: 8020.
Port specified by the dfs.datanode.address parameter. Default value: 9866.
You have basic knowledge of catalogs and know the operations that you can perform on catalogs. For more information, see Data lakehouse.
Usage notes
Hive 1, Hive 2, and Hive 3 are supported.
Managed tables, external tables, and some Hive views are supported.
Hive, Iceberg, and Hudi metadata stored in a Hive metastore can be identified.
ApsaraDB for SelectDB can only read data in external catalogs.
Sample environment
In this example, the Linux operating system is used, and federated queries are performed in ApsaraDB for SelectDB to query the test_db.test_t table in a Hive cluster that is built on HDFS. You can change the parameter settings in the example based on your operating system and environment. Sample environment parameters:
Business scenario: non-high availability (HA)
Hive cluster type: HDFS
Source database: test_db
Source table: test_t
Prepare the source data
Log on to the Hive cluster whose data you want to query.
Create a database named test_db.
CREATE DATABASE IF NOT EXISTS test_db;
Create a table named test_t in the test_db database.
USE test_db;
CREATE TABLE IF NOT EXISTS test_t ( id INT, name STRING, age INT );
Insert data into the table.
-- Insert data into the table.
INSERT INTO TABLE test_t VALUES
    (1, 'Alice', 25),
    (2, 'Bob', 30),
    (3, 'Charlie', 35),
    (4, 'David', 40),
    (5, 'Eve', 45);
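To confirm that the data was inserted, you can run a quick check in Hive:
-- Optional check: the query should return the five rows that were inserted above.
SELECT * FROM test_t;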
Procedure
Step 1: Connect to an ApsaraDB for SelectDB instance
Connect to a SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance by using a MySQL client.
Step 2: Create a Hive catalog
You can create an external catalog in SelectDB to integrate an external data source. The catalog parameters vary with business scenarios. Select an appropriate syntax and configure the required parameters based on your business scenario.
After you create a catalog, ApsaraDB for SelectDB automatically synchronizes the databases and tables in the catalog. ApsaraDB for SelectDB maps data types of columns in a catalog to ApsaraDB for SelectDB data types. For more information, see the "Column data type mappings" section of this topic.
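If the databases or tables in the Hive data source change after the catalog is created, you can manually re-synchronize the metadata. The following statement is a sketch that assumes your instance supports the Doris-style REFRESH syntax and that the catalog is named hive_catalog:
REFRESH CATALOG hive_catalog; -- Re-synchronize the metadata of the catalog.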
HDFS-based Hive
Syntax
CREATE CATALOG <catalog_name> PROPERTIES (
'type'='<type>',
'hive.metastore.uris' = '<hive.metastore.uris>',
'hadoop.username' = '<hadoop.username>',
'dfs.nameservices'='<dfs.nameservices>',
'dfs.ha.namenodes.your-nameservice'='<dfs.ha.namenodes.your-nameservice>',
'dfs.namenode.rpc-address.your-nameservice.nn1'='<dfs.namenode.rpc-address.your-nameservice.nn1>',
'dfs.namenode.rpc-address.your-nameservice.nn2'='<dfs.namenode.rpc-address.your-nameservice.nn2>',
'dfs.client.failover.proxy.provider.your-nameservice'='<dfs.client.failover.proxy.provider.your-nameservice>'
);
Parameters
Non-HA scenario
Parameter | Required | Description |
catalog_name | Yes | The name of the catalog. |
type | Yes | The type of the catalog. Set the value to hms. |
hive.metastore.uris | Yes | The URI of the Hive metastore. Format: thrift://<IP address of the Hive metastore>:<port number>. |
HA scenario
Parameter | Required | Description |
catalog_name | Yes | The name of the catalog. |
type | Yes | The type of the catalog. Set the value to hms. |
hive.metastore.uris | Yes | The URI of the Hive metastore. Format: thrift://<IP address of the Hive metastore>:<port number>. |
hadoop.username | No | The username used to log on to the HDFS. |
dfs.nameservices | No | The name of the NameService. The value must be the same as that of the dfs.nameservices parameter in the hdfs-site.xml configuration file on which the existing Hive environment depends. |
dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. The value must be the same as that of the corresponding parameter in the hdfs-site.xml configuration file on which the existing Hive environment depends. |
dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The remote procedure call (RPC) address of the NameNode. The number of RPC addresses must be the same as the number of NameNodes. The value must be the same as that of the corresponding parameter in the hdfs-site.xml configuration file on which the existing Hive environment depends. |
dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that the HDFS client uses to connect to the active NameNode. In most cases, set the value to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
Sample command for the non-HA scenario
CREATE CATALOG hive_catalog PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://master-1-1.c-7fa25a1a****.cn-hangzhou.emr.aliyuncs.com:9083'
);
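Sample command for the HA scenario
The following sample is a sketch with hypothetical placeholder values: the nameservice ID your-nameservice, the NameNode IDs nn1 and nn2, and the sample host addresses. Replace them with the values from the hdfs-site.xml configuration file of your own cluster.
CREATE CATALOG hive_catalog_ha PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
    'hadoop.username' = 'hadoop',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);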
OSS-based Hive
Syntax
The syntax for creating an Object Storage Service (OSS)-based Hive catalog is the same as the syntax for creating an HDFS-based Hive catalog. The only difference lies in the required parameters.
Parameters
Parameter | Required | Description |
type | Yes | The type of the catalog. Set the value to hms. |
hive.metastore.uris | Yes | The URI of the Hive metastore. Format: thrift://<IP address of the Hive metastore>:<port number>. |
oss.endpoint | Yes | The endpoint that is used to access OSS. |
oss.access_key | Yes | The AccessKey ID that is used to access OSS. |
oss.secret_key | Yes | The AccessKey secret that is used to access OSS. |
Sample command
CREATE CATALOG hive_catalog PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"oss.endpoint" = "oss-cn-beijing.aliyuncs.com",
"oss.access_key" = "ak",
"oss.secret_key" = "sk"
);
Step 3: View the catalog
You can execute the following statement to check whether the catalog is created:
SHOW CATALOGS; -- Check whether the catalog is created.
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | hive_catalog | hms | | 2024-07-19 17:09:08.058 | 2024-07-19 18:04:37 | |
| 0 | internal | internal | yes | UNRECORDED | NULL | Doris internal catalog |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
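To inspect how an existing catalog was defined, you can view its creation statement. The following statement is a sketch that assumes your instance supports the Doris-style SHOW CREATE CATALOG syntax:
SHOW CREATE CATALOG hive_catalog; -- Return the statement that was used to create the catalog.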
Step 4: View data in Hive databases and tables
View the Hive databases and tables in the external catalog that you created.
Note: After you connect to an ApsaraDB for SelectDB instance, the internal catalog is used by default.
Switch to the external catalog that you want to manage.
SWITCH hive_catalog;
View the data.
After you switch to the external catalog, you can access the data in the external catalog in the same way as you use the internal catalog. Examples:
To view the database list, execute the SHOW DATABASES; statement.
To switch to the specified database, execute the USE test_db; statement.
To view the table list, execute the SHOW TABLES; statement.
To view the data in the specified table, execute the SELECT * FROM test_t; statement.
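Putting the preceding statements together, a typical query session looks like this:
SWITCH hive_catalog;     -- Switch to the external catalog.
SHOW DATABASES;          -- View the database list.
USE test_db;             -- Switch to the source database.
SHOW TABLES;             -- View the table list.
SELECT * FROM test_t;    -- View the data in the test_t table.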
Alternatively, query the Hive databases and tables from the internal catalog by specifying fully qualified names, without switching catalogs.
-- View the data in the test_t table of the test_db database in the hive_catalog catalog.
SELECT * FROM hive_catalog.test_db.test_t;
Related operation: Migrate data
After the Hive data source is integrated, you can execute INSERT INTO statements to migrate historical data from the Hive data source to SelectDB. For more information, see Import data by using INSERT INTO statements.
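For example, assuming that a table named test_t with a schema compatible with the Hive table has already been created in the test_db database of the internal catalog (the internal table and database names are hypothetical), the migration statement looks like this:
-- Assumption: internal.test_db.test_t already exists with a compatible schema.
INSERT INTO internal.test_db.test_t
SELECT * FROM hive_catalog.test_db.test_t;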
Column data type mappings
The following Hive metastore data types are applicable to Hive, Iceberg, and Hudi.
Some Hive metastore and ApsaraDB for SelectDB data types with complex structures can be nested. Nesting examples:
array<type>: array<map<string, int>>
map<KeyType, ValueType>: map<string, array<int>>
struct<col1: Type1, col2: Type2, ...>: struct<col1: array<int>, col2: map<int, date>>
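For example, a Hive table that uses such nested columns (a hypothetical DDL for illustration) maps to SelectDB columns of the corresponding array, map, and struct types in the following table:
-- Hypothetical Hive DDL with nested column types.
CREATE TABLE nested_t (
    id INT,
    tags array<map<string, int>>,
    profile struct<name: string, scores: array<int>>
);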
Hive metastore data type | ApsaraDB for SelectDB data type |
boolean | boolean |
tinyint | tinyint |
smallint | smallint |
int | int |
bigint | bigint |
date | date |
timestamp | datetime |
float | float |
double | double |
char | char |
varchar | varchar |
decimal | decimal |
array<type> | array<type> |
map<KeyType, ValueType> | map<KeyType, ValueType> |
struct<col1: Type1, col2: Type2, ...> | struct<col1: Type1, col2: Type2, ...> |
other | unsupported |