This topic describes how to use the federated query technology provided by ApsaraDB for SelectDB to perform federated analysis on external data sources such as data lakes, databases, and remote files. This provides you with a simple and fast data analysis experience.
Overview
ApsaraDB for SelectDB provides the multi-catalog feature to connect to external data sources such as data lakes and databases for simple and fast data analysis. Based on the multi-catalog feature, ApsaraDB for SelectDB provides a new three-tiered metadata hierarchy: catalog -> database -> table
. A catalog directly maps an external data source. For more information about the external data sources supported by ApsaraDB for SelectDB, see Data lake analysis .
ApsaraDB for SelectDB also provides table-valued functions (TVFs) that you can use to map file data in common remote storage such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed System (HDFS) to tables in ApsaraDB for SelectDB. This facilitates the analysis of file data. For more information, see File analysis.
This topic describes the terms and usage of catalogs. The multi-catalog feature extends the capabilities of ApsaraDB for SelectDB so that ApsaraDB for SelectDB can not only manage internal data but also efficiently integrate and query external data sources. You can use catalogs to efficiently access and analyze data stored in different data warehouses and storage systems from ApsaraDB for SelectDB. This allows you to implement more flexible and comprehensive data management and analysis.
Terms and operations
Internal catalog
Existing databases and tables in ApsaraDB for SelectDB all belong to the internal catalog. The internal catalog is the built-in default catalog in ApsaraDB for SelectDB and cannot be modified or deleted.
External catalog
You can create an external catalog by executing the
CREATE CATALOG
statement. You can execute theSHOW CATALOGS
statement to view all catalogs or theSHOW CREATE CATALOG <catalog_name>;
statement to query the creation statement of a catalog.Switch to a catalog
After you log on to ApsaraDB for SelectDB, the internal catalog is used by default. You can execute the
SWITCH
statement to switch to a catalog. Examples:SWITCH internal; SWITCH hive_catalog;
After you switch to a catalog, you can execute statements such as
SHOW DATABASES
orUSE <db_name>
to view all databases in the catalog or switch to a database in the catalog. ApsaraDB for SelectDB automatically accesses databases and tables in external data sources by using catalogs. You can view and access data in an external catalog in the same way you use the internal catalog. ApsaraDB for SelectDB supports read-only access to data in external catalogs.Delete a catalog
Databases and tables in an external catalog are read-only. However, you can delete an external catalog by executing the
DROP CATALOG <catalog_name>;
statement if necessary. The internal catalog cannot be deleted.NoteThis operation only deletes the mapping information about catalogs in ApsaraDB for SelectDB and does not modify or change the content of any external data sources.
Column type mapping
After you create a catalog for an external data source, ApsaraDB for SelectDB automatically synchronizes the databases and tables from the external data source to the catalog. For different types of catalogs and tables, ApsaraDB for SelectDB performs mapping between column types.
For external data types that cannot be mapped to an ApsaraDB for SelectDB column type, such as UNION
and INTERVAL
, ApsaraDB for SelectDB maps them to UNSUPPORTED. The following sample code shows the results of querying mapped types and the UNSUPPORTED type.
-- Table schema after synchronization:
k1 INT,
k2 INT,
k3 UNSUPPORTED,
k4 INT
-- Query results:
SELECT * FROM testtable; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
SELECT * except(k3) FROM testtable; // Query OK.
SELECT k1, k3 FROM testtable; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
SELECT k1, k4 FROM testtable; // Query OK.
You can view the data types supported by ApsaraDB for SelectDB by executing the SHOW DATA TYPES;
statement. For more information about the data types supported by different external data sources, see Data lake analysis.
Permission management
By default, ApsaraDB for SelectDB uses its own permission management feature to control access to databases and tables in an external catalog. Along with the multi-catalog feature, permission management is added at the catalog level. For more information, see Permission management.
Database synchronization management
You can specify the databases to be synchronized by configuring the include_database_list
and exclude_database_list
parameters of a catalog.
include_database_list
: the databases to be synchronized. Only databases specified in this property are synchronized. Separate database names with commas (,). By default, all databases are synchronized. Database names are case-sensitive.exclude_database_list
: the databases that do not need to be synchronized. Separate database names with commas (,). This parameter is empty by default, indicating that all databases are synchronized. Database names are case-sensitive.
The
exclude_database_list
parameter takes precedence over the include_database_list parameter. If a database is specified in both theinclude_database_list
andexclude_database_list
parameters, the database is not synchronized.If a catalog is used to connect to a Java Database Connectivity (JDBC) data source, you must use the preceding two parameters with the
only_specified_database
parameter. For more information, see JDBC data source.
Metadata update
By default, metadata changes in external data sources, such as table or column creation and deletion, are not synchronized to ApsaraDB for SelectDB. You can use one of the following methods to refresh metadata.
Manual refresh
Execution the REFRESH statement to manually refresh metadata.
Syntax
REFRESH CATALOG catalog_name;
REFRESH DATABASE [catalog_name.]database_name;
REFRESH TABLE [catalog_name.][database_name.]table_name;
If you manually refresh a catalog, object-related caches are forcibly invalidated, including the partition cache, schema cache, and file cache.
Examples
Refresh a catalog.
REFRESH CATALOG hive;
Refresh a database.
REFRESH DATABASE ctl.database1; REFRESH DATABASE database1;
Refresh a table.
REFRESH TABLE ctl.db.table1; REFRESH TABLE db.table1; REFRESH TABLE table1;
Scheduled refresh
When you create a catalog, you can enable scheduled refresh by specifying the metadata_refresh_interval_sec
parameter in properties. The scheduled refresh interval is measured in seconds. After you specify this parameter, the FE master node periodically refreshes the catalog based on the value of this parameter. Only the following three types of data sources support scheduled refresh:
Hive Metastore Service (HMS)
Elasticsearch
JDBC data source
-- Set the catalog refresh interval to 20s.
CREATE CATALOG es PROPERTIES (
"type"="es",
"hosts"="http://127.0.0.1:9200",
"metadata_refresh_interval_sec"="20"
);
Auto refresh
Only specific events of a Hive data source can be automatically refreshed to catalogs. For more information, see the Cache and refresh metadata section of the "Hive data source" topic.