This topic describes what a catalog is and how to use a catalog to manage and query internal and external data.
Terms
- Internal data: the data that is stored in StarRocks.
- External data: the data that is stored in external data sources, such as Apache Hive, Apache Iceberg, and Apache Hudi.
Catalog
StarRocks 2.3 and later allow you to use catalogs to access and query data that is stored in various external data sources with ease. StarRocks supports two types of catalogs: internal catalogs and external catalogs.
- Internal catalog: used to manage all internal data in a StarRocks cluster. For example, databases and tables that are created by executing the CREATE DATABASE and CREATE TABLE statements are managed in the internal catalog of the StarRocks cluster. Each StarRocks cluster has only one internal catalog, named default_catalog.
- External catalog: used to manage the access information of external data sources, such as the data source types and the uniform resource identifiers (URIs) of Hive metastores. In StarRocks, you can directly query external data by using an external catalog.
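As a minimal sketch of how the two catalog types appear in a cluster, the following statements list the catalogs and then the databases inside each one. The name hive_catalog is an assumed example of an external catalog that has already been created; default_catalog is the internal catalog.

```sql
-- List all catalogs in the cluster: the internal catalog plus any external catalogs.
SHOW CATALOGS;

-- List the databases managed in the internal catalog.
SHOW DATABASES FROM default_catalog;

-- List the databases exposed by an external catalog (hive_catalog is an assumed name).
SHOW DATABASES FROM hive_catalog;
```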
You can create an external catalog for the following types of data sources:
- Hive data sources: used to query Hive data.
- Iceberg data sources: used to query Iceberg data.
- Hudi data sources: used to query Hudi data.
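For illustration, creating an external catalog for a Hive data source might look like the following sketch. The catalog name hive_catalog and the metastore address are placeholders, and the required properties can differ by data source type and StarRocks version, so treat the property list as an assumption and check the topic for your data source.

```sql
-- Create an external catalog that points to a Hive metastore.
-- hive_catalog and the thrift URI below are placeholders, not real values.
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.uris" = "thrift://<metastore_host>:9083"
);
```

The catalog stores only the access information. The data itself stays in the external data source.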
When you use an external catalog to query data from an external data source, StarRocks uses two components of the external data source:
- Metadata service: used to expose metadata to the frontends (FEs) of a StarRocks cluster so that they can generate a query plan.
- Storage system: used to store data. Data files are stored in different formats in a distributed file system or an object storage system. After the FE distributes the generated query plan to each backend (BE), each BE scans the destination data in the storage system of the external data source in parallel, performs computing, and then returns the query results.
Query data
Query internal data
For more information about how to query data that is stored in StarRocks, see Default catalog.
Query external data
For more information about how to query data that is stored in external data sources, see Data lake analytics.
Query data across catalogs
If you want to query data across catalogs, you can reference the destination data by specifying the destination in the format of catalog_name.db_name or catalog_name.db_name.table_name. Examples:
- With default_catalog.olap_db as the current database, execute the following statement to query data from the hive_table table in the hive_catalog catalog:
  SELECT * FROM hive_catalog.hive_db.hive_table;
- With hive_catalog.hive_db as the current database, execute the following statement to query data from the olap_table table in the default_catalog catalog:
  SELECT * FROM default_catalog.olap_db.olap_table;
- With hive_catalog.hive_db as the current database, execute the following statement to perform a federated query on the hive_table table in the current database and the olap_table table in the default_catalog catalog:
  SELECT * FROM hive_table h JOIN default_catalog.olap_db.olap_table o WHERE h.id = o.id;
- From any other catalog, execute the following statement to perform a federated query on the hive_table table in the hive_catalog catalog and the olap_table table in the default_catalog catalog:
  SELECT * FROM hive_catalog.hive_db.hive_table h JOIN default_catalog.olap_db.olap_table o WHERE h.id = o.id;
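The examples above depend on which catalog and database the session is currently in. As a rough sketch, one way to set that context is the USE statement with a catalog_name.db_name target. The catalog, database, table, and column names are the same illustrative ones used in the examples, and the join is written here with an explicit ON clause.

```sql
-- Switch the session to the hive_db database in the hive_catalog catalog.
USE hive_catalog.hive_db;

-- hive_table resolves against the current database, so it needs no prefix.
-- olap_table lives in another catalog, so it is fully qualified.
SELECT *
FROM hive_table h
JOIN default_catalog.olap_db.olap_table o
  ON h.id = o.id;
```

Fully qualifying every table, as in the last example above, makes a statement independent of the session context.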