This topic describes the data lakehouse-related file cache feature of ApsaraDB for SelectDB. The file cache feature can help you efficiently perform federated analytics on external data sources.
Overview
The file cache feature accelerates queries that read the same data by caching the data files of a recently accessed remote storage system, such as Hadoop Distributed File System (HDFS) or an object storage system. In scenarios where the same data is frequently accessed, the file cache feature can reduce costs by preventing repeated remote access and improve the performance and stability of hot data query and analysis.
How it works
The file cache feature caches the accessed remote data to a local backend (BE) node in ApsaraDB for SelectDB. The original data files are divided into blocks by I/O size. The blocks are stored in the local file cache_path/hash(filepath).substr(0, 3)/hash(filepath)/offset, and the metadata of the blocks is stored on the BE node.
If the same remote data is accessed, ApsaraDB for SelectDB checks whether the data of the file to be accessed exists in the local cache, and determines which data is read from the local blocks and which data is remotely obtained based on the offset and size of the blocks. In addition, ApsaraDB for SelectDB caches the new data that is remotely pulled. If the BE node is restarted, ApsaraDB for SelectDB scans the cache_path directory and restores the block metadata.
If the cache size reaches the upper limit, ApsaraDB for SelectDB clears the blocks that have not been accessed for a while based on the Least Recently Used (LRU) principle.
Use the feature
By default, the file cache feature is disabled. To enable this feature, you must configure related parameters on both a frontend (FE) node and a BE node.
FE configurations
Enable the file cache feature in the current session.
SET enable_file_cache = true;Globally enable the file cache feature.
SET GLOBAL enable_file_cache = true;
The file cache feature takes effect only on queries that are performed on external tables for files, such as queries of data stored in Hive and Hudi. The file cache feature does not take effect on queries that are performed on internal tables or external tables not for files, such as queries of databases that are connected by using Java Database Connectivity (JDBC) and data stored in Elasticsearch.
BE configurations
Add the parameters described in the following table to the conf/be.conf configuration file of a BE node and restart the BE node for the configurations to take effect.
Parameter | Description |
enable_file_cache | Specifies whether to enable the file cache feature. Default value: true. |
file_cache_path | The configuration related to the cache directory in the JSON format. Example: In the configuration, path specifies the cache path, total_size specifies the maximum cache size, and query_limit specifies the maximum cache size that can be used by a single query. |
file_cache_min_file_segment_size | The minimum size of a single block. Unit: bytes. Default value: 1048576. The value must be greater than 4096. |
file_cache_max_file_segment_size | The maximum size of a single block. Unit: bytes. Default value: 4194304. The value must be greater than 4096. |
enable_file_cache_query_limit | Specifies whether to restrict the cache size used by a single query. Default value: false. |
clear_file_cache | Specifies whether to delete the existing cached data during the restart of the BE node. Default value: false. |
View the cache hit details
Run the set enable_profile=true command to enable the session variable enable_profile. Then, you can view the job profile on the Queries tab of the web page for the FE node. The file cache feature has the following metrics:
- FileCache:
- IOHitCacheNum: 552
- IOTotalNum: 835
- ReadFromFileCacheBytes: 19.98 MB
- ReadFromWriteCacheBytes: 0.00
- ReadTotalBytes: 29.52 MB
- WriteInFileCacheBytes: 915.77 MB
- WriteInFileCacheNum: 283 The following table describes the metrics.
Metric | Description |
IOTotalNum | The number of remote accesses. |
IOHitCacheNum | The number of cache hits. |
ReadFromFileCacheBytes | The amount of data that is read from the cache file. |
ReadTotalBytes | The total amount of data that is read. |
SkipCacheBytes | The amount of data that is remotely read again because the cache file fails to be created or the cache file is deleted. |
WriteInFileCacheBytes | The amount of data that is saved to the cache file. |
WriteInFileCacheNum | The number of blocks that are stored. The average block size can be calculated by using the following formula: WriteInFileCacheBytes/WriteInFileCacheBytes. |
The cache is hit for all queries in the following cases:
IOHitCacheNum/IOTotalNum = 1ReadFromFileCacheBytes/ReadTotalBytes = 1