In a compute-storage decoupled architecture, you can use Fluid data caching to resolve high latency and limited bandwidth issues when you access storage systems in Kubernetes clusters. This improves data processing efficiency. This topic explains how to use Fluid data cache policies to optimize for performance, stability, and read-write consistency.
Scenarios
Caching technologies improve data access performance based on the principle of locality. Scenarios such as AI training, inference service startup, and big data analytics all feature repetitive data access. For example:
In AI training, datasets are read periodically to support model iterations.
When an inference service starts, multiple instances concurrently load the same model file into GPU memory.
In big data analytics, when SparkSQL processes user personas and product information, order data is shared by multiple tasks.
Fluid data caching can be used to improve efficiency in all of the preceding scenarios.
Fluid data cache optimization policies for performance
To improve performance with Fluid data caching, you can configure ECS instance types, cache media, cache system parameters, and management policies based on your performance requirements and budget. You can also adapt the client's data read mode to optimize performance. The following policies describe how to apply these configurations together.
Policy 1: Select ECS instance types for the cache system
Performance estimation
A distributed file cache system can aggregate the storage and bandwidth resources on multiple nodes to provide larger cache capacity and higher available bandwidth for applications. You can estimate the theoretical upper limits for cache capacity, available bandwidth, and maximum application data access bandwidth using the following formulas:
Available cache capacity = Cache capacity per distributed cache Worker pod × Number of distributed cache Worker pod replicas
Available cache bandwidth = Number of distributed cache Worker pod replicas × min{Maximum available bandwidth of the ECS node where the Worker pod is located, I/O throughput of the cache medium used by the Worker pod}. The available cache bandwidth does not represent the actual bandwidth during data access. The actual bandwidth is affected by the available bandwidth of the client ECS node and the access mode, such as sequential or random reads.
Theoretical maximum bandwidth for application pod data access = min{Available bandwidth of the ECS node where the application pod is located, Available cache bandwidth}. When multiple application pods access data concurrently, the available cache bandwidth is shared among these pods.
Example
For example, you can scale out an ACK cluster by adding two ecs.g7nex.8xlarge ECS instances to build a distributed cache cluster. The cluster contains two cache Worker pods. Each pod is configured with 100 GiB of memory for caching, and the two pods run on separate ECS instances. The application pod is deployed on one ecs.gn7i-c8g1.2xlarge ECS instance (8 vCPUs, 30 GiB of memory, and 16 Gbps of bandwidth). The theoretical values for cache capacity, available bandwidth, and maximum bandwidth are as follows:
Available cache capacity = 100 GiB × 2 = 200 GiB
Available cache bandwidth = 2 × min{40 Gbps, memory access I/O throughput} = 2 × 40 Gbps = 80 Gbps. Because the I/O throughput of memory far exceeds the 40 Gbps node bandwidth, the node bandwidth is the limiting factor.
When a cache hit occurs, the maximum available bandwidth for application pod data access = min{80 Gbps, 16 Gbps} = 16 Gbps
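The estimation formulas above can be captured in a short sketch. The helper functions below are hypothetical illustrations, not part of any Fluid API; they reproduce the numbers from this example.

```python
# Sketch of the cache-sizing formulas above (hypothetical helpers, not a Fluid API).

def cache_capacity_gib(per_worker_gib: float, replicas: int) -> float:
    """Available cache capacity = per-Worker capacity x replica count."""
    return per_worker_gib * replicas

def cache_bandwidth_gbps(replicas: int, node_bw_gbps: float, medium_io_gbps: float) -> float:
    """Available cache bandwidth = replicas x min(node bandwidth, cache-medium throughput)."""
    return replicas * min(node_bw_gbps, medium_io_gbps)

def app_max_bandwidth_gbps(app_node_bw_gbps: float, cache_bw_gbps: float) -> float:
    """On a cache hit, the application is bounded by the slower of its own node NIC and the cache."""
    return min(app_node_bw_gbps, cache_bw_gbps)

# Values from the example: 2 x ecs.g7nex.8xlarge Workers (40 Gbps nodes, memory medium),
# application pod on ecs.gn7i-c8g1.2xlarge (16 Gbps).
capacity = cache_capacity_gib(100, 2)                 # 200 GiB
cache_bw = cache_bandwidth_gbps(2, 40, float("inf"))  # memory throughput >> node NIC
app_bw = app_max_bandwidth_gbps(16, cache_bw)         # 16 Gbps
```

Note that `min()` appears twice: once per Worker node (medium vs. NIC) and once on the application side, so upgrading only one side of the path does not raise the end-to-end bound.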
Recommended ECS instance types
The available bandwidth of a distributed cache cluster depends on the maximum available bandwidth of each ECS node in the cluster and the cache media used. To improve the performance of the distributed cache system, you can use high-bandwidth ECS instance types and use memory, local HDDs, or local SSDs as cache media. The following table shows the recommended ECS instance types.
| ECS instance family | ECS instance type | ECS instance configuration |
| --- | --- | --- |
| g7nex, network-enhanced general-purpose instance family | ecs.g7nex.8xlarge | 32 vCPUs, 128 GiB memory, 40 Gbps bandwidth |
| | ecs.g7nex.16xlarge | 64 vCPUs, 256 GiB memory, 80 Gbps bandwidth |
| | ecs.g7nex.32xlarge | 128 vCPUs, 512 GiB memory, 160 Gbps bandwidth |
| i4g, instance family with local SSDs | ecs.i4g.16xlarge | 64 vCPUs, 256 GiB memory, 2 × 1920 GB local SSD storage, 32 Gbps bandwidth |
| | ecs.i4g.32xlarge | 128 vCPUs, 512 GiB memory, 4 × 1920 GB local SSD storage, 64 Gbps bandwidth |
| g7ne, network-enhanced general-purpose instance family | ecs.g7ne.8xlarge | 32 vCPUs, 128 GiB memory, 25 Gbps bandwidth |
| | ecs.g7ne.12xlarge | 48 vCPUs, 192 GiB memory, 40 Gbps bandwidth |
| | ecs.g7ne.24xlarge | 96 vCPUs, 384 GiB memory, 80 Gbps bandwidth |
| g8i, general-purpose instance family | ecs.g8i.24xlarge | 96 vCPUs, 384 GiB memory, 50 Gbps bandwidth |
| | ecs.g8i.16xlarge | 64 vCPUs, 256 GiB memory, 32 Gbps bandwidth |
For more information about ECS instances, see Instance families.
Policy 2: Select cache media
As described in Policy 1, the available bandwidth of a distributed cache cluster depends on both the maximum available bandwidth of each ECS node and the cache media used. In Fluid, you can configure different cache media and cache capacities by setting the spec.tieredstore parameter of the Runtime resource object. To improve performance, use memory or local SSDs as the cache media.
Using an enterprise SSD (ESSD) as a cache medium often cannot meet the high-performance data access needs of data-intensive applications. For example, the maximum throughput of a single PL2 disk is 750 MB/s. This means that if you use only one PL2 disk as the cache medium, the maximum available cache bandwidth that the ECS node can provide is only 750 MB/s, even if you choose an ECS instance type with an available bandwidth greater than 750 MB/s. This configuration wastes the maximum available bandwidth of the ECS node.
Use memory as a cache medium
If you use memory as a cache medium, you can configure the spec.tieredstore of the Runtime resource object as follows:
spec:
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 30Gi # The cache capacity that a single distributed cache Worker replica can provide.
        high: "0.99"
        low: "0.95"

Use local storage as a cache medium
If you use local system disk storage as a cache medium:
spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: emptyDir # Use emptyDir to ensure that the lifecycle of cached data is the same as that of the distributed cache Worker pod. This prevents residual cached data.
        path: /var/lib/fluid/cache
        quota: 100Gi # The cache capacity that a single distributed cache Worker replica can provide.
        high: "0.99"
        low: "0.95"

If you use an additionally mounted SSD data disk as a cache medium:
spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: hostPath
        path: /mnt/disk1 # /mnt/disk1 is the mount path of the local SSD on the host.
        quota: 100Gi # The cache capacity that a single distributed cache Worker replica can provide.
        high: "0.99"
        low: "0.95"

If you use multiple SSD data disks as cache media at the same time:
spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: hostPath
        path: /mnt/disk1,/mnt/disk2 # /mnt/disk1 and /mnt/disk2 are the mount paths of the data disks on the host.
        quota: 100Gi # The cache capacity that a single distributed cache Worker replica can provide. The capacity is evenly distributed across multiple cache paths. For example, /mnt/disk1 and /mnt/disk2 are each allocated 50 GiB of capacity.
        high: "0.99"
        low: "0.95"
Policy 3: Configure scheduling affinity between the data cache and applications
When a data access request hits the cache, the application pod reads data from the cache system. Therefore, if the application pod and the cache system are deployed in different zones, the application needs to access cached data across zones. To reduce the impact of cross-zone network fluctuations on the data access process, you should consider the scheduling affinity between the application pod and the relevant pods of the cache system. Specifically:
Deploy the cache system Worker pods with affinity in the same zone whenever possible.
Deploy the application pods and the cache system Worker pods with affinity in the same zone whenever possible.
Deploying multiple application pods and cache system Worker pods in a single zone reduces the disaster recovery capability of the application and related services. You can balance the performance impact and service availability based on your business Service-Level Agreement (SLA).
In Fluid, you can configure the scheduling affinity of the cache system Worker pods by setting the spec.nodeAffinity of the Dataset resource object, as shown below:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - <ZONE_ID> # The zone where the pod is located, for example, cn-beijing-i.

The preceding configuration deploys all distributed cache system Worker pods on ECS nodes in the <ZONE_ID> zone.
In addition, Fluid can automatically inject the affinity information for the required cache into the application pod. This ensures that the application pod and the cache system Worker pods are deployed with affinity in the same zone whenever possible. For more information, see Optimize scheduling based on data cache affinity.
Policy 4: Optimize parameter configuration for large file full sequential read scenarios
Many data-intensive scenarios involve a data access mode of full sequential reads on large files. Examples include model training based on datasets in TFRecord or Tar format, loading one or more model parameter files when an AI model inference service starts, and reading Parquet file formats for distributed data analytics. For such scenarios, you can use a more aggressive prefetch policy to improve data access performance. For example, you can appropriately increase the prefetch concurrency and prefetch data volume of the cache system.
In Fluid, different distributed cache runtimes require different parameter configurations for their prefetch policies, as described below:
Use JindoRuntime as the cache runtime
If you use JindoRuntime as the cache runtime, you can configure the spec.fuse.properties parameter to customize the behavior and performance of Jindo FUSE.
kind: JindoRuntime
metadata:
  ...
spec:
  fuse:
    properties:
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.buffer.size: "8388608"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647" # About 2 GB.

fs.oss.download.thread.concurrency: The number of concurrent prefetch threads for the Jindo client. Each thread prefetches one buffer.
fs.oss.read.buffer.size: The size of a single buffer.
fs.oss.read.readahead.max.buffer.count: The maximum number of buffers for a single-stream prefetch by the Jindo client.
fs.oss.read.sequence.ambiguity.range: The range that the Jindo client uses to determine whether a process is sequentially reading a file.
Use JuiceFSRuntime as the cache runtime
If you use JuiceFSRuntime as the cache runtime, you can set the parameters for the FUSE and Worker components by configuring the spec.fuse.options and spec.worker.options of the Runtime resource object, respectively.
kind: JuiceFSRuntime
metadata:
  ...
spec:
  fuse:
    options:
      buffer-size: "2048"
      cache-size: "0"
      max-uploads: "150"
  worker:
    options:
      buffer-size: "2048"
      max-downloads: "200"
      max-uploads: "150"
  master:
    resources:
      requests:
        memory: 2Gi
      limits:
        memory: 8Gi
  ...

buffer-size: The read/write buffer size.
max-downloads: The download concurrency for the prefetch process.
fuse cache-size: The local cache capacity available to the JuiceFS FUSE component.
For example, you can set the local cache capacity available to the JuiceFS FUSE component to 0, and set the FUSE component memory requests to a small value, such as 2 GiB. The FUSE component will automatically use the available memory on the node as a near-local cache (Page Cache) in the Linux file system kernel. When a near-local cache miss occurs, it reads data directly from the JuiceFS Worker distributed cache to achieve efficient access. For more information about performance tuning and parameter details, see the official JuiceFS documentation.
Fluid data cache optimization policies for stability
To ensure the stable operation of the Fluid data cache system, you can implement several optimization policies. Avoid mounting directories with many files as mount targets to prevent single-point overload. Use persistent storage, such as enterprise SSDs (ESSDs), to implement metadata persistence and enhance disaster recovery capabilities. Allocate FUSE pod resources to balance supply and demand, and enable the self-healing mechanism for automatic fault recovery. The following sections describe how to implement these key policies.
Policy 1: Avoid mounting directories that contain many files as the underlying data source
The cache system must maintain the metadata of all files in the mounted underlying storage directory and record additional metadata, such as the cache status of each file. If the cache system mounts an underlying storage directory that contains many files, such as the root directory of a large-scale storage system, the cache system must use a large amount of memory resources to store the metadata and more CPU resources to process metadata access requests.
Fluid defines a Dataset as a logically related set of data for a specific application. A Dataset corresponds to the startup of a distributed cache system. A Dataset should provide data access acceleration services for only one or a few related data-intensive tasks. Therefore, you can create multiple Datasets and define different subdirectories of the underlying data source in each Dataset. The specific subdirectory level depends on the data collection that the business application requires. Different business applications can use the cache systems bound to different Datasets to access data. This provides better isolation between business applications and ensures system stability and performance. A sample Dataset configuration is as follows:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  mounts:
    - mountPoint: oss://<BUCKET>/<PATH1>/<SUBPATH>/
      name: sub-bucket

Creating multiple cache systems may lead to additional Operations and Maintenance (O&M) complexity. You can flexibly choose an architectural solution based on your actual business needs. For example:
If the dataset to be cached is stored in a single backend storage system, is small, contains few files, and has strong relevance, you can create a single Fluid Dataset and a distributed cache system to provide services.
If the dataset to be cached is large and contains many files, you can split it into multiple Fluid Datasets and distributed cache systems based on the business logic of the data directory. The application pod can declare one or more Fluid Datasets to be mounted to specified directories.
If the datasets to be cached originate from different storage systems for multiple users and data isolation between users is required, you can create short-lived Fluid Datasets for each user or for a series of data-related jobs for a user. You can also develop a Kubernetes Operator to flexibly manage multiple Fluid Datasets in the cluster.
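As a rough illustration of the metadata overhead this policy addresses, the following sketch estimates master-side memory from the mounted file count. The ~1 KiB-per-file figure is an assumption for illustration only, not a documented Fluid or JindoFS value; the real per-entry cost depends on the cache runtime and path lengths.

```python
# Rough estimate of cache-master metadata memory.
# ASSUMPTION: ~1 KiB of master memory per tracked file (illustrative only).

def estimate_metadata_gib(file_count: int, bytes_per_entry: int = 1024) -> float:
    """Approximate master-side memory (GiB) needed to track file metadata."""
    return file_count * bytes_per_entry / 1024**3

# Mounting a 100-million-file root directory vs. a 1-million-file subdirectory:
root_gib = estimate_metadata_gib(100_000_000)  # ~95.4 GiB
sub_gib = estimate_metadata_gib(1_000_000)     # ~0.95 GiB
```

Even with this crude model, the gap shows why mounting a per-application subdirectory keeps each cache Master's memory footprint manageable, while mounting the storage root can exhaust it.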
Policy 2: Use metadata persistence to improve cache system stability
Many distributed cache systems use a Master-Worker distributed architecture. They rely on the Master pod component to maintain the file metadata of the mounted backend storage directory and record the cache status of each file. When an application accesses data through such a cache system, it first obtains the file metadata from the Master pod component and then retrieves the data from the underlying file storage or the cache Worker component. Therefore, the stability of the cache Master pod component is crucial for the high availability of the cache system.
If you use JindoRuntime as the cache runtime, you can use the following configuration in Fluid to improve the availability of the cache Master pod component:
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: sd-dataset
spec:
  ...
  volumes:
    - name: meta-vol
      persistentVolumeClaim:
        claimName: demo-jindo-master-meta
  master:
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
    volumeMounts:
      - name: meta-vol
        mountPath: /root/jindofs-meta
    properties:
      namespace.meta-dir: "/root/jindofs-meta"

In the preceding YAML example, demo-jindo-master-meta is a pre-created persistent volume claim (PVC). This PVC uses an ESSD as a persistent volume (PV), which means that the metadata maintained by the Master pod component can be stored persistently and can be migrated with the pod. For more information, see Use JindoRuntime to persist the state of the Master pod component.
Policy 3: Configure FUSE pod resources
The cache system client program runs in the FUSE pod. The FUSE pod mounts a FUSE file system on the node. This file system is then mounted to a specified path on the application pod and exposes a POSIX file access interface. This allows the application pod to access data in a remote storage system as if it were accessing local storage without modifying the application code, and to benefit from cache acceleration. The FUSE program maintains the data tunnel between the application and the cache system. Therefore, we recommend that you configure the resources for the FUSE pod.
In Fluid, you can set the resource usage of the FUSE pod by configuring the spec.fuse.resources field of the Runtime resource object. To avoid broken file mount targets caused by FUSE program out-of-memory (OOM) errors, we recommend that you do not set a memory limit for the FUSE pod, or that you set a large memory limit. For example, you can set the memory limit of the FUSE pod to be close to the allocatable memory size of the ECS node.
spec:
  fuse:
    resources:
      requests:
        memory: 8Gi
      # limits:
      #   memory: <ECS_ALLOCATABLE_MEMORY>

Policy 4: Enable the FUSE self-healing feature to improve data access client availability
The FUSE program maintains the data tunnel between the application and the cache system. By default, if the FUSE program crashes due to unexpected behavior, the application container can no longer access data through that FUSE file system mount target, even after the FUSE program restarts and recovers. To restore data access, the application container must be restarted. This restart re-triggers the bind mount logic from the FUSE file system mount target to the application pod. This affects the availability of the application pod.
Fluid provides a FUSE self-healing mechanism. When this mechanism is enabled, the application container does not need to be restarted. Data access to the FUSE mount target within the application container is automatically restored a short time after the FUSE program restarts.
In the short period between the FUSE program crash and its restart, the mount target in the application pod is inaccessible. This means the application must handle these I/O errors to avoid crashing and should include retry logic.
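A minimal sketch of such retry logic follows, assuming the application reads files in Python. The retried error codes (ENOTCONN, EIO) and the backoff values are assumptions for illustration: which errors surface while the FUSE program restarts depends on the cache runtime and kernel version.

```python
# Retry wrapper for reads through a FUSE mount target (illustrative sketch;
# the retried errno values and backoff timing are assumptions, not
# Fluid-specified behavior).
import errno
import time

def read_with_retry(path: str, retries: int = 5, backoff_s: float = 2.0) -> bytes:
    """Retry reads that fail while the FUSE mount target is broken.

    ENOTCONN / EIO typically indicate a disconnected FUSE mount; any
    other error is re-raised immediately.
    """
    for attempt in range(retries):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as e:
            if e.errno not in (errno.ENOTCONN, errno.EIO) or attempt == retries - 1:
                raise
            time.sleep(backoff_s * (attempt + 1))  # linear backoff while FUSE recovers
    raise RuntimeError("unreachable")
```

The key design point is to retry only the errors that a recovering mount target can produce, so genuine failures (for example, a missing file) still fail fast.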
To enable and use the FUSE self-healing feature, see How to enable FUSE auto-recovery capability. The FUSE self-healing feature has many limitations. We recommend that you enable this feature only in specific application scenarios with clear requirements and suitable business needs. Examples include interactive programming development scenarios where you use an online Notebook or other environments to interactively develop and debug applications.
Best practices for cache read/write consistency
While a cache system improves data access efficiency, it also introduces cache consistency issues. Because strong consistency usually results in performance degradation or increased Operations and Maintenance (O&M) costs, you should select a read/write consistency policy that fits your business scenario. The following sections detail how to configure cache read/write consistency in multiple scenarios.
Scenario 1: Data is read-only and backend storage data does not change
Use case: During a single AI model training process, data samples are read from a fixed dataset for iterative model training. The data cache is cleared after the training is complete.
Configuration: This scenario is supported by Fluid by default. You can use the default Fluid Dataset configuration or explicitly set it to read-only.
Example configuration:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  # accessModes: ["ReadOnlyMany"] # ReadOnlyMany is the default value.

Setting accessModes: ["ReadOnlyMany"] ensures that the dataset is mounted in read-only mode. This prevents accidental modification of the dataset during the training process and simplifies data management and cache policies.
Scenario 2: Data is read-only and backend storage data changes periodically
Use case: The cache system resides in the Kubernetes cluster. Business-related data is collected daily and stored in the backend storage system. Data analysis jobs are executed regularly at midnight to analyze and summarize the day's new business data. The summary results are written directly to the backend storage system, bypassing the cache.
Configuration: You can use the default Fluid Dataset configuration or explicitly set it to read-only, and execute data prefetch at regular intervals to synchronize data changes from the backend storage system.
Example configuration:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  # accessModes: ["ReadOnlyMany"] # ReadOnlyMany is the default value.
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-dataset-warmup
spec:
  ...
  policy: Cron
  schedule: "0 0 * * *" # Prefetch data at 00:00 every day.
  loadMetadata: true # Sync data changes from the backend storage system during data prefetch.
  target:
    - path: /path/to/warmup # Specifies the path in the backend storage system that needs to be prefetched.

By setting accessModes: ["ReadOnlyMany"], you ensure that the dataset is used primarily for read operations. This means that business data is written directly to the backend storage system without interfering with the cache layer, which ensures data consistency and integrity. Setting policy: Cron and schedule: "0 0 * * *" ensures that data prefetch operations are automatically executed at midnight every day. The loadMetadata: true setting ensures that metadata is also synchronized during the prefetch process.
Scenario 3: Data is read-only but backend storage data changes based on business events
Use case: A model inference service lets you upload a custom AI model and use it for inference. The uploaded AI model is written directly to the backend storage system without being cached. After the model is successfully uploaded, you can select it to perform inference and view the results.
Configuration: You can use the default Fluid Dataset configuration or explicitly set it to read-only. Set the file metadata timeout for the Runtime Filesystem in Userspace (FUSE) to a small value, and disable the server-side metadata cache of the cache system.
Example configuration:
Use JindoRuntime as the cache runtime:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  mounts:
    - mountPoint: <MOUNTPOINT>
      name: data
      path: /
      options:
        metaPolicy: ALWAYS # Disable the server-side metadata cache.
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo-dataset
spec:
  fuse:
    args:
      - -oauto_cache
      # Set the metadata timeout to 30s. If xxx_timeout=0, strong consistency can be provided, but data read efficiency may be greatly reduced.
      - -oattr_timeout=30
      - -oentry_timeout=30
      - -onegative_timeout=30
      - -ometrics_port=0

By setting metaPolicy: ALWAYS, the server-side metadata cache is disabled. This ensures that each access directly queries the backend storage to obtain the latest metadata, which is suitable for application scenarios that require stronger consistency.
Scenario 4: Read and write requests are in different directories
Use case: In large-scale distributed AI training, a training job reads data samples from a dataset in directory A and writes model checkpoints to directory B after each epoch. Because model checkpoints can be large, caching the writes improves efficiency.
Configuration: Create two Fluid Datasets. Set the access mode of one to read-only and the other to read-write. Mount the two Fluid Datasets to directory A and directory B, respectively, to achieve read/write splitting.
Example configuration:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: train-samples
spec:
  ...
  # accessModes: ["ReadOnlyMany"] # ReadOnlyMany is the default value.
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: model-ckpt
spec:
  ...
  accessModes: ["ReadWriteMany"]

Setting the training dataset train-samples to the read-only access mode accessModes: ["ReadOnlyMany"] (the default setting) ensures that all training jobs can only read data from this directory. This ensures the immutability of the data source and avoids potential consistency issues introduced by write operations during training. Meanwhile, the model checkpoint directory (model-ckpt) is configured with the read-write mode (accessModes: ["ReadWriteMany"]). This allows the training job to safely write model checkpoints after each iteration and improves write efficiency.
An example of an application pod definition is as follows:
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  containers:
    - ...
      volumeMounts:
        - name: train-samples-vol
          mountPath: /data/A
        - name: model-ckpt-vol
          mountPath: /data/B
  volumes:
    - name: train-samples-vol
      persistentVolumeClaim:
        claimName: train-samples
    - name: model-ckpt-vol
      persistentVolumeClaim:
        claimName: model-ckpt

By mounting two different volumes (train-samples-vol and model-ckpt-vol) to specified paths (/data/A and /data/B), this configuration achieves physical isolation between the training dataset and the model checkpoint directory.
Scenario 5: Read and write requests must be in the same directory
Use case: In an interactive programming development scenario, such as online Jupyter Notebook or VS Code development, you have a personal workspace directory. This workspace directory stores files such as datasets and code, and you may frequently add, delete, or modify these files.
Configuration: Set the Dataset access mode to read-write. We recommend using a storage system with full Portable Operating System Interface (POSIX) compatibility as the backend implementation.
Example configuration:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: myworkspace
spec:
  ...
  accessModes: ["ReadWriteMany"]

Setting accessModes: ["ReadWriteMany"] ensures that multiple users or processes can concurrently read from and write to the same workspace directory.