Data lakehouse

0.0.201

This topic describes how to use the federated query technology provided by ApsaraDB for SelectDB to perform federated analysis on external data sources such as data lakes, databases, and remote files. This provides you with a simple and fast data analysis experience.

Overview

ApsaraDB for SelectDB provides the multi-catalog feature to connect to external data sources such as data lakes and databases for simple and fast data analysis. Based on the multi-catalog feature, ApsaraDB for SelectDB provides a new three-tiered metadata hierarchy: catalog -> database -> table. A catalog directly maps an external data source. For more information about the external data sources supported by ApsaraDB for SelectDB, see Data lake analysis .

ApsaraDB for SelectDB also provides table-valued functions (TVFs) that you can use to map file data in common remote storage such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed System (HDFS) to tables in ApsaraDB for SelectDB. This facilitates the analysis of file data. For more information, see File analysis.

This topic describes the terms and usage of catalogs. The multi-catalog feature extends the capabilities of ApsaraDB for SelectDB so that ApsaraDB for SelectDB can not only manage internal data but also efficiently integrate and query external data sources. You can use catalogs to efficiently access and analyze data stored in different data warehouses and storage systems from ApsaraDB for SelectDB. This allows you to implement more flexible and comprehensive data management and analysis.

Terms and operations

Internal catalog
Existing databases and tables in ApsaraDB for SelectDB all belong to the internal catalog. The internal catalog is the built-in default catalog in ApsaraDB for SelectDB and cannot be modified or deleted.
External catalog
You can create an external catalog by executing the CREATE CATALOG statement. You can execute the SHOW CATALOGS statement to view all catalogs or the SHOW CREATE CATALOG <catalog_name>; statement to query the creation statement of a catalog.
Switch to a catalog
After you log on to ApsaraDB for SelectDB, the internal catalog is used by default. You can execute the SWITCH statement to switch to a catalog. Examples:
```
SWITCH internal;
SWITCH hive_catalog;
```
After you switch to a catalog, you can execute statements such as SHOW DATABASES or USE <db_name> to view all databases in the catalog or switch to a database in the catalog. ApsaraDB for SelectDB automatically accesses databases and tables in external data sources by using catalogs. You can view and access data in an external catalog in the same way you use the internal catalog. ApsaraDB for SelectDB supports read-only access to data in external catalogs.
Delete a catalog
Databases and tables in an external catalog are read-only. However, you can delete an external catalog by executing the DROP CATALOG <catalog_name>; statement if necessary. The internal catalog cannot be deleted.
Note
This operation only deletes the mapping information about catalogs in ApsaraDB for SelectDB and does not modify or change the content of any external data sources.

Column type mapping

After you create a catalog for an external data source, ApsaraDB for SelectDB automatically synchronizes the databases and tables from the external data source to the catalog. For different types of catalogs and tables, ApsaraDB for SelectDB performs mapping between column types.

For external data types that cannot be mapped to an ApsaraDB for SelectDB column type, such as UNION and INTERVAL, ApsaraDB for SelectDB maps them to UNSUPPORTED. The following sample code shows the results of querying mapped types and the UNSUPPORTED type.

-- Table schema after synchronization:
k1 INT,
k2 INT,
k3 UNSUPPORTED,
k4 INT

-- Query results:
SELECT * FROM testtable;                // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
SELECT * except(k3) FROM testtable;     // Query OK.
SELECT k1, k3 FROM testtable;           // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
SELECT k1, k4 FROM testtable;           // Query OK.

You can view the data types supported by ApsaraDB for SelectDB by executing the SHOW DATA TYPES; statement. For more information about the data types supported by different external data sources, see Data lake analysis.

Permission management

By default, ApsaraDB for SelectDB uses its own permission management feature to control access to databases and tables in an external catalog. Along with the multi-catalog feature, permission management is added at the catalog level. For more information, see Permission management.

Database synchronization management

You can specify the databases to be synchronized by configuring the include_database_list and exclude_database_list parameters of a catalog.

include_database_list: the databases to be synchronized. Only databases specified in this property are synchronized. Separate database names with commas (,). By default, all databases are synchronized. Database names are case-sensitive.
exclude_database_list: the databases that do not need to be synchronized. Separate database names with commas (,). This parameter is empty by default, indicating that all databases are synchronized. Database names are case-sensitive.

Important

The exclude_database_list parameter takes precedence over the include_database_list parameter. If a database is specified in both the include_database_list and exclude_database_list parameters, the database is not synchronized.
If a catalog is used to connect to a Java Database Connectivity (JDBC) data source, you must use the preceding two parameters with the only_specified_database parameter. For more information, see JDBC data source.

Metadata update

By default, metadata changes in external data sources, such as table or column creation and deletion, are not synchronized to ApsaraDB for SelectDB. You can use one of the following methods to refresh metadata.

Manual refresh

Execution the REFRESH statement to manually refresh metadata.

Syntax

REFRESH CATALOG catalog_name;
REFRESH DATABASE [catalog_name.]database_name;
REFRESH TABLE [catalog_name.][database_name.]table_name;

If you manually refresh a catalog, object-related caches are forcibly invalidated, including the partition cache, schema cache, and file cache.

Examples

Refresh a catalog.
```
REFRESH CATALOG hive;
```

Refresh a database.

REFRESH DATABASE ctl.database1;
REFRESH DATABASE database1;

Refresh a table.

REFRESH TABLE ctl.db.table1;
REFRESH TABLE db.table1;
REFRESH TABLE table1;

Scheduled refresh

When you create a catalog, you can enable scheduled refresh by specifying the metadata_refresh_interval_sec parameter in properties. The scheduled refresh interval is measured in seconds. After you specify this parameter, the FE master node periodically refreshes the catalog based on the value of this parameter. Only the following three types of data sources support scheduled refresh:

Hive Metastore Service (HMS)
Elasticsearch
JDBC data source

-- Set the catalog refresh interval to 20s.
CREATE CATALOG es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200",
    "metadata_refresh_interval_sec"="20"
);

Auto refresh

Only specific events of a Hive data source can be automatically refreshed to catalogs. For more information, see the Cache and refresh metadata section of the "Hive data source" topic.

Feedback