All Products
Search
Document Center

ApsaraDB for SelectDB:File cache

Last Updated:May 31, 2024

This topic describes the data lakehouse-related file cache feature of ApsaraDB for SelectDB. The file cache feature can help you efficiently perform federated analytics on external data sources.

Overview

The file cache feature accelerates queries that read the same data by caching the data files of a recently accessed remote storage system, such as Hadoop Distributed File System (HDFS) or an object storage system. In scenarios where the same data is frequently accessed, the file cache feature can reduce costs by preventing repeated remote access and improve the performance and stability of hot data query and analysis.

How it works

The file cache feature caches the accessed remote data to a local backend (BE) node in ApsaraDB for SelectDB. The original data files are divided into blocks by I/O size. The blocks are stored in the local file cache_path/hash(filepath).substr(0, 3)/hash(filepath)/offset, and the metadata of the blocks is stored on the BE node.

If the same remote data is accessed, ApsaraDB for SelectDB checks whether the data of the file to be accessed exists in the local cache, and determines which data is read from the local blocks and which data is remotely obtained based on the offset and size of the blocks. In addition, ApsaraDB for SelectDB caches the new data that is remotely pulled. If the BE node is restarted, ApsaraDB for SelectDB scans the cache_path directory and restores the block metadata.

If the cache size reaches the upper limit, ApsaraDB for SelectDB clears the blocks that have not been accessed for a while based on the Least Recently Used (LRU) principle.

Use the feature

By default, the file cache feature is disabled. To enable this feature, you must configure related parameters on both a frontend (FE) node and a BE node.

FE configurations

  • Enable the file cache feature in the current session.

    SET enable_file_cache = true;
  • Globally enable the file cache feature.

    SET GLOBAL enable_file_cache = true;

The file cache feature takes effect only on queries that are performed on external tables for files, such as queries of data stored in Hive and Hudi. The file cache feature does not take effect on queries that are performed on internal tables or external tables not for files, such as queries of databases that are connected by using Java Database Connectivity (JDBC) and data stored in Elasticsearch.

BE configurations

Add the parameters described in the following table to the conf/be.conf configuration file of a BE node and restart the BE node for the configurations to take effect.

Parameter

Description

enable_file_cache

Specifies whether to enable the file cache feature. Default value: true.

file_cache_path

The configuration related to the cache directory in the JSON format. Example:

[
    {
        "path":"/path/to/file_cache1",
        "total_size":53687091200,
        "query_limit":"10737418240"
    },
    {
        "path":"/path/to/file_cache2",
        "total_size":53687091200,
        "query_limit":"10737418240"
    },
    {
        "path":"/path/to/file_cache3",
        "total_size":53687091200,
        "query_limit":"10737418240"
    }
]

In the configuration, path specifies the cache path, total_size specifies the maximum cache size, and query_limit specifies the maximum cache size that can be used by a single query.

file_cache_min_file_segment_size

The minimum size of a single block. Unit: bytes. Default value: 1048576. The value must be greater than 4096.

file_cache_max_file_segment_size

The maximum size of a single block. Unit: bytes. Default value: 4194304. The value must be greater than 4096.

enable_file_cache_query_limit

Specifies whether to restrict the cache size used by a single query. Default value: false.

clear_file_cache

Specifies whether to delete the existing cached data during the restart of the BE node. Default value: false.

View the cache hit details

Run the set enable_profile=true command to enable the session variable enable_profile. Then, you can view the job profile on the Queries tab of the web page for the FE node. The file cache feature has the following metrics:

-  FileCache:
  -  IOHitCacheNum:  552
  -  IOTotalNum:  835
  -  ReadFromFileCacheBytes:  19.98  MB
  -  ReadFromWriteCacheBytes:  0.00  
  -  ReadTotalBytes:  29.52  MB
  -  WriteInFileCacheBytes:  915.77  MB
  -  WriteInFileCacheNum:  283 

The following table describes the metrics.

Metric

Description

IOTotalNum

The number of remote accesses.

IOHitCacheNum

The number of cache hits.

ReadFromFileCacheBytes

The amount of data that is read from the cache file.

ReadTotalBytes

The total amount of data that is read.

SkipCacheBytes

The amount of data that is remotely read again because the cache file fails to be created or the cache file is deleted.

WriteInFileCacheBytes

The amount of data that is saved to the cache file.

WriteInFileCacheNum

The number of blocks that are stored. The average block size can be calculated by using the following formula: WriteInFileCacheBytes/WriteInFileCacheBytes.

The cache is hit for all queries in the following cases:

  • IOHitCacheNum/IOTotalNum = 1

  • ReadFromFileCacheBytes/ReadTotalBytes = 1