ApsaraDB for SelectDB supports federated queries and can integrate external data sources such as data lakes, databases, and remote files for convenient and efficient data analysis. This topic describes how to integrate a Hive data source with SelectDB by using a catalog and use the Hive data source for federated analysis.
Prerequisites
The IP addresses of all nodes in your Hive cluster are added to the IP address whitelist of your ApsaraDB for SelectDB instance. For more information, see Configure an IP address whitelist.
If your Hive cluster is built on Hadoop Distributed File System (HDFS), the following ports are open so that data can be transferred between the Hive cluster and SelectDB:
Port specified by the hive.metastore.uris parameter. Default value: 9083.
Port specified by the dfs.namenode.rpc-address parameter. Default value: 8020.
Port specified by the dfs.datanode.address parameter. Default value: 9866.
You have basic knowledge of catalogs and know the operations that you can perform on catalogs. For more information, see Data lakehouse.
Usage notes
Hive 1, Hive 2, and Hive 3 are supported.
Managed tables, external tables, and some Hive views are supported.
Hive, Iceberg, and Hudi metadata stored in a Hive metastore can be identified.
ApsaraDB for SelectDB can only read data in external catalogs.
Sample environment
In this example, the Linux operating system is used, and federated queries are performed in ApsaraDB for SelectDB to query the test_db.test_t table in a Hive cluster that is built on HDFS. You can change the parameter settings in the example based on your operating system and environment. Sample environment parameters:
Business scenario: non-high availability (HA)
Hive cluster type: HDFS
Source database: test_db
Source table: test_t
Prepare the source data
Log on to the Hive cluster whose data you want to query.
Create a database named test_db.
CREATE DATABASE IF NOT EXISTS test_db;
Create a table named test_t in the test_db database.
USE test_db;
CREATE TABLE IF NOT EXISTS test_t ( id INT, name STRING, age INT );
Insert data into the table.
-- Insert data into the table.
INSERT INTO TABLE test_t VALUES
    (1, 'Alice', 25),
    (2, 'Bob', 30),
    (3, 'Charlie', 35),
    (4, 'David', 40),
    (5, 'Eve', 45);
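To confirm that the data was inserted, you can run a quick check in Hive:
-- Optional check: the query should return the five rows that were inserted above.
SELECT * FROM test_t;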
Procedure
Step 1: Connect to an ApsaraDB for SelectDB instance
Connect to a SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance by using a MySQL client.
Step 2: Create a Hive catalog
You can create an external catalog in SelectDB to integrate an external data source. The catalog parameters vary with business scenarios. Select an appropriate syntax and configure the required parameters based on your business scenario.
After you create a catalog, ApsaraDB for SelectDB automatically synchronizes the databases and tables in the catalog. ApsaraDB for SelectDB maps data types of columns in a catalog to ApsaraDB for SelectDB data types. For more information, see the "Column data type mappings" section of this topic.
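If the databases or tables in the Hive data source change after the catalog is created, you can manually re-synchronize the metadata. The following statement is a sketch that assumes your instance supports the Doris-style REFRESH syntax and that the catalog is named hive_catalog:
REFRESH CATALOG hive_catalog; -- Re-synchronize the metadata of the catalog.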
HDFS-based Hive
Syntax
CREATE CATALOG <catalog_name> PROPERTIES (
'type'='<type>',
'hive.metastore.uris' = '<hive.metastore.uris>',
'hadoop.username' = '<hadoop.username>',
'dfs.nameservices'='<dfs.nameservices>',
'dfs.ha.namenodes.your-nameservice'='<dfs.ha.namenodes.your-nameservice>',
'dfs.namenode.rpc-address.your-nameservice.nn1'='<dfs.namenode.rpc-address.your-nameservice.nn1>',
'dfs.namenode.rpc-address.your-nameservice.nn2'='<dfs.namenode.rpc-address.your-nameservice.nn2>',
'dfs.client.failover.proxy.provider.your-nameservice'='<dfs.client.failover.proxy.provider.your-nameservice>'
);
Parameters
Non-HA scenario
Parameter | Required | Description |
catalog_name | Yes | The name of the catalog. |
type | Yes | The type of the catalog. Set the value to hms. |
hive.metastore.uris | Yes | The URI of the Hive metastore. Format: thrift://<IP address of the Hive metastore>:<port number>. |
HA scenario
Parameter | Required | Description |
catalog_name | Yes | The name of the catalog. |
type | Yes | The type of the catalog. Set the value to hms. |
hive.metastore.uris | Yes | The URI of the Hive metastore. Format: thrift://<IP address of the Hive metastore>:<port number>. |
hadoop.username | No | The username used to log on to the HDFS. |
dfs.nameservices | No | The name of the NameService. The value must be the same as that of the dfs.nameservices parameter in the hdfs-site.xml configuration file on which the existing Hive environment depends. |
dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. The value must be the same as that of the corresponding parameter in the hdfs-site.xml configuration file on which the existing Hive environment depends. |
dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The remote procedure call (RPC) address of the NameNode. The number of RPC addresses must be the same as the number of NameNodes. The value must be the same as that of the corresponding parameter in the hdfs-site.xml configuration file on which the existing Hive environment depends. |
dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that the HDFS client uses to connect to the active NameNode. In most cases, set the value to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
Sample command for the non-HA scenario
CREATE CATALOG hive_catalog PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://master-1-1.c-7fa25a1a****.cn-hangzhou.emr.aliyuncs.com:9083'
);
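Sample command for the HA scenario
The following sample is a sketch with hypothetical placeholder values: the nameservice ID your-nameservice, the NameNode IDs nn1 and nn2, and the sample host addresses. Replace them with the values from the hdfs-site.xml configuration file of your own cluster.
CREATE CATALOG hive_catalog_ha PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
    'hadoop.username' = 'hadoop',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);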
OSS-based Hive
Syntax
The syntax for creating an Object Storage Service (OSS)-based Hive catalog is the same as the syntax for creating an HDFS-based Hive catalog. The only difference lies in the required parameters.
Parameters
Parameter | Required | Description |
type | Yes | The type of the catalog. Set the value to hms. |
hive.metastore.uris | Yes | The URI of the Hive metastore. Format: thrift://<IP address of the Hive metastore>:<port number>. |
oss.endpoint | Yes | The endpoint that is used to access OSS. |
oss.access_key | Yes | The AccessKey ID that is used to access OSS. |
oss.secret_key | Yes | The AccessKey secret that is used to access OSS. |
Sample command
CREATE CATALOG hive_catalog PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"oss.endpoint" = "oss-cn-beijing.aliyuncs.com",
"oss.access_key" = "ak",
"oss.secret_key" = "sk"
);
Step 3: View the catalog
You can execute the following statement to check whether the catalog is created:
SHOW CATALOGS; -- Check whether the catalog is created.
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | hive_catalog | hms | | 2024-07-19 17:09:08.058 | 2024-07-19 18:04:37 | |
| 0 | internal | internal | yes | UNRECORDED | NULL | Doris internal catalog |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
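To inspect how an existing catalog was defined, you can view its creation statement. The following statement is a sketch that assumes your instance supports the Doris-style SHOW CREATE CATALOG syntax:
SHOW CREATE CATALOG hive_catalog; -- Return the statement that was used to create the catalog.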
Step 4: View data in Hive databases and tables
View the Hive databases and tables in the external catalog that you created.
Note: After you connect to an ApsaraDB for SelectDB instance, the internal catalog is used by default.
Switch to the external catalog that you want to manage.
SWITCH hive_catalog;
View the data.
After you switch to the external catalog, you can access the data in the external catalog in the same way as you use the internal catalog. Examples:
To view the database list, execute the SHOW DATABASES; statement.
To switch to the specified database, execute the USE test_db; statement.
To view the table list, execute the SHOW TABLES; statement.
To view the data in the specified table, execute the SELECT * FROM test_t; statement.
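Putting the preceding statements together, a typical query session looks like this:
SWITCH hive_catalog;     -- Switch to the external catalog.
SHOW DATABASES;          -- View the database list.
USE test_db;             -- Switch to the source database.
SHOW TABLES;             -- View the table list.
SELECT * FROM test_t;    -- View the data in the test_t table.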
Alternatively, query the Hive databases and tables from the internal catalog by specifying fully qualified names, without switching catalogs.
-- View the data in the test_t table of the test_db database in the hive_catalog catalog.
SELECT * FROM hive_catalog.test_db.test_t;
Related operation: Migrate data
After the Hive data source is integrated, you can execute INSERT INTO statements to migrate historical data from the Hive data source to SelectDB. For more information, see Import data by using INSERT INTO statements.
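For example, assuming that a table named test_t with a schema compatible with the Hive table has already been created in the test_db database of the internal catalog (the internal table and database names are hypothetical), the migration statement looks like this:
-- Assumption: internal.test_db.test_t already exists with a compatible schema.
INSERT INTO internal.test_db.test_t
SELECT * FROM hive_catalog.test_db.test_t;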
Column data type mappings
The following Hive metastore data types are applicable to Hive, Iceberg, and Hudi.
Some Hive metastore and ApsaraDB for SelectDB data types with complex structures can be nested. Nesting examples:
array<type>: array<map<string, int>>
map<KeyType, ValueType>: map<string, array<int>>
struct<col1: Type1, col2: Type2, ...>: struct<col1: array<int>, col2: map<int, date>>
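For example, a Hive table that uses such nested columns (a hypothetical DDL for illustration) maps to SelectDB columns of the corresponding array, map, and struct types in the following table:
-- Hypothetical Hive DDL with nested column types.
CREATE TABLE nested_t (
    id INT,
    tags array<map<string, int>>,
    profile struct<name: string, scores: array<int>>
);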
Hive metastore data type | ApsaraDB for SelectDB data type |
boolean | boolean |
tinyint | tinyint |
smallint | smallint |
int | int |
bigint | bigint |
date | date |
timestamp | datetime |
float | float |
double | double |
char | char |
varchar | varchar |
decimal | decimal |
array<type> | array<type> |
map<KeyType, ValueType> | map<KeyType, ValueType> |
struct<col1: Type1, col2: Type2, ...> | struct<col1: Type1, col2: Type2, ...> |
other | unsupported |