Best practices for the cache optimization policies of Fluid - Container Service for Kubernetes

In Kubernetes clusters that use an architecture where computing is decoupled from storage, you can use the caching capabilities of Fluid to effectively reduce access latency and resolve bandwidth throttling when you access the storage system. This helps improve the data processing efficiency of the clusters. This topic describes how to use the cache optimization policies of Fluid to improve the performance, stability, and read/write consistency of data access.

Use scenarios

Data caching technologies accelerate data access based on the principle of locality. Data caching technologies are suitable for data-intensive applications that feature data locality. Scenarios:

AI model training applications. During the training procedure of an AI model, training datasets are read per epoch. The datasets are used for model iteration and convergence.
AI model inference startups. When you deploy an AI model as an inference service or update an inference service, a large number of inference service instances are concurrently launched. Each instance concurrently reads the same mode file from the storage system and loads the model file to the GPU memory allocated to the instance.
Big data analytics scenarios. Data may be shared by multiple data analysis tasks. For example, when you use SparkSQL to analyze user profiles and product information, order data is used by both analysis tasks.

In the preceding scenarios, you can use the cache optimization policies of Fluid to improve data processing efficiency.

Cache optimization policies for performance improvements

When you use Fluid to build a cache system to accelerate data access, you can configure the following settings based on your business requirements and budget: Elastic Compute Service (ECS) instance types, cache media, cache system parameters, and cache management policies. In addition, you need to consider the access mode of clients, which may impact the performance of data reads. This section describes how to configure cache optimization policies to improve the performance of Fluid. You can combine the policies on demand.

Policy 1: Select proper ECS instance types

Performance estimation

A distributed cache system can aggregate the storage and bandwidth resources on different nodes and provide larger cache capacity and higher bandwidth for upper-layer applications. You can calculate the cache capacity of the cache system, the cache bandwidth, and the maximum bandwidth of a pod based on the following formulas:

Cache capacity of the cache system= Cache capacity of each cache system worker pod × Number of cache system worker pods
Cache bandwidth = Number of cache system worker pods × min{Maximum ECS bandwidth of worker pods, I/O throughput of the cache media used by worker pods} The cache bandwidth is not equal to the actual bandwidth for cache access. The actual bandwidth depends on the bandwidth of the client ECS instance and the access mode, such as sequential read or random read.
Maximum bandwidth for a pod = min{Bandwidth of the hosting ECS instance, Cache bandwidth} When multiple pods concurrently access the cache, the cache bandwidth is shared by the pods.

Example

Assume that two ECS instances of the ecs.g7nex.8xlarge type are added to your Container Service for Kubernetes (ACK) cluster to build a distributed cache system. The cache system consists of two worker pods each of which provides 100 GiB of memory cache. The pods run on different ECS instances. An application pod is deployed on an ECS instance of the ecs.gn7i-c8g1.2xlarge type. The instance provides 8 vCPUs, 30 GiB of memory, and 16 Gbit/s of bandwidth. The cache capacity of the cache system, cache bandwidth, and maximum pod bandwidth can be calculated based on the following formulas:

Cache capacity of the cache system = 100 GiB × 2 = 200 GiB
Cache bandwidth = 2 × min{40 Gbit/s, Memory I/O throughput} = 80 Gbit/s
When a cache hit occurs, the maximum pod bandwidth = min{80 Gbit/s, 16 Gbit/s} = 16 Gbit/s

Recommended ECS instance types

The bandwidth of a distributed cache cluster depends on the maximum bandwidth of ECS instances in the cluster and the cache media used by the cluster. To improve the performance of the distributed cache system, we recommend that you select ECS instances that provide high bandwidth and use memory, local hard disk drives (HDDs), and local SSDs as the cache media. We recommend that you select the following ECS instance types.

Instance family	Instance type	Instance specification
g7nex, network-enhanced general-purpose instance family	ecs.g7nex.8xlarge	32 vCPUs, 128 GiB of memory, and 40 Gbit/s of bandwidth
	ecs.g7nex.16xlarge	64 vCPUs, 256 GiB of memory, and 80 Gbit/s of bandwidth
	ecs.g7nex.32xlarge	128 vCPUs, 512 GiB of memory, and 160 Gbit/s of bandwidth
i4g, instance family with local SSDs	ecs.i4g.16xlarge	64 vCPUs, 256 GiB of memory, 2 local SSDs that are 1,920 GB in size, and 32 Gbit/s of bandwidth
i4g, instance family with local SSDs	ecs.i4g.32xlarge	128 vCPUs, 512 GiB of memory, 4 local SSDs that are 1,920 GB in size, and 64 Gbit/s of bandwidth
g7ne, network-enhanced general-purpose instance family	ecs.g7ne.8xlarge	32 vCPUs, 128 GiB of memory, and 25 Gbit/s of bandwidth
	ecs.g7ne.12xlarge	48 vCPUs, 192 GiB of memory, and 40 Gbit/s of bandwidth
	ecs.g7ne.24xlarge	96 vCPUs, 384 GiB of memory, and 80 Gbit/s of bandwidth
g8i, general-purpose instance family	ecs.g8i.24xlarge	96 vCPUs, 384 GiB of memory, and 50 Gbit/s of bandwidth
g8i, general-purpose instance family	ecs.g8i.16xlarge	64 vCPUs, 256 GiB of memory, and 32 Gbit/s of bandwidth

For more information about ECS instance families, see Overview of instance families.

Policy 2: Select proper cache media

The bandwidth of a distributed cache cluster depends on the maximum bandwidth of ECS instances in the cluster and the cache media used by the cluster. To improve the performance of the distributed cache system, we recommend that you select ECS instances that provide high bandwidth and use memory or local SSDs as the cache media. Fluid allows you to use the spec.tieredstore parameter of the cache runtime to configure different cache media and corresponding cache capacities.

Note

In most cases, Enterprise SSDs (ESSDs) do not meet the requirements of data-intensive applications for data access. For example, a disk of performance level (PL) 2 provides a throughput of up to 750 MB/s. If you attach an ESSD of PL2 to an ECS instance to cache data, the maximum bandwidth of the ECS instance is 750 MB/s, even though the instance type provides bandwidth higher than 750 MB/s.

Use memory to cache data

If you use memory to cache data, configure the spec.tieredstore parameter of the cache runtime based on the following code block:

spec:
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 30Gi # The cache capacity provided by each worker pod of the distributed cache system. 
        high: "0.99"
        low: "0.95"

Use local disks to cache data

If you use a local system disk to cache data, configure the spec.tieredstore parameter of the cache runtime based on the following code block:

spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: emptyDir # Use emptyDir to ensure that the lifecycle of cached data aligns with the worker pods of the distributed cache system. This prevents any residual cached data.
        path: /var/lib/fluid/cache
        quota: 100Gi # The cache capacity provided by each worker pod of the distributed cache system.
        high: "0.99"
        low: "0.95

If you use an attached SSD to cache data, configure the spec.tieredstore parameter of the cache runtime based on the following code block:

spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: hostPath
        path: /mnt/disk1 # /mnt/disk1 is the path of the local SSD on the host. 
        quota: 100Gi # The cache capacity provided by each worker pod of the distributed cache system. 
        high: "0.99"
        low: "0.95"

If you want to use multiple SSDs to cache data, configure the spec.tieredstore parameter based on the following code block:

spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: hostPath
        path: /mnt/disk1,/mnt/disk2 # /mnt/disk1 and /mnt/disk2 are the paths of the data disks on the host. 
        quota: 100Gi # The cache capacity provided by each worker pod of the distributed cache system. The cache capacity is evenly distributed across multiple paths, such as /mnt/disk1 and mnt/disk2 are each allocated 50Gi of capacity. 
        high: "0.99"
        low: "0.95"

Policy 3: Schedule pods based on cache affinity

When a cache hit occurs, application pods read data from the cache system. In this case, if the pod and the cache system are deployed in different zones, the pod needs to access cache across zones. Network jitters may occur during cross-zone access. To avoid network jitters, you can configure affinity rules between the application pod and the cache system pods. Details:

Configure affinity rules to deploy the worker pods of the cache system in the same zone.
Configure affinity rules to deploy application pods in the zone where the worker pods of the cache system are deployed.

Important

Deploying multiple application pods and the worker pods of the cache system in a single zone may downgrade the disaster recovery capability of the application. We recommend that you balance the service performance and service availability based on the service-level agreement (SLA) of your service.

Fluid allows you to use the spec.nodeAffinity parameter of a dataset to specify an affinity rule for the worker pods of the cache system. Sample code:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - <ZONE_ID> # The ID of the zone where you want to deploy the worker pods of the cache system. Example: cn-beijing-i.

The preceding code block can be used to schedule the worker pods of the cache system on ECS instances in the specified zone (zone ID: <ZONE_ID>).

In addition, Fluid automatically injects the cache affinity rule into the application pod specification. Based on the affinity rule, the system attempts to schedule the application pods to the zone where the worker pods of the cache system are deployed. For more information, see Schedule pods based on cache affinity.

Policy 4: Optimize cache system parameters for large file full sequential reads

Full data sequential reads on large files are used in data-intensive scenarios. For example, full data sequential reads on large files are required when you use datasets of the TFRecord or TAR format to train models, load one or more model files during the startup of an inference service, or read data from Parquet files to implement distributed data analytics. In the preceding scenarios, you can enable prefetching to accelerate data access. For example, you can increase the prefetching concurrency and the size of prefetched data.

The parameters that you can use to configure prefetching vary based on the cache runtime that you use in Fluid.

JindoRuntime

When you use JindoRuntime, you can configure the spec.fuse.properties parameter to customize the behavior and performance of Jindo Fuse.

kind: JindoRuntime
metadata:
  ...
spec:
  fuse:
    properties:
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.buffer.size: "8388608"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647" # Approximately 2G.

fs.oss.download.thread.concurrency: The number of concurrent threads created for the Jindo client to prefetch data. The data prefetched by each thread is stored in a buffer.
fs.oss.read.buffer.size: The buffer size.
fs.oss.read.readahead.max.buffer.count: The maximum number of buffers for one prefetching thread.
fs.oss.read.sequence.ambiguity.range: The range based on which the Jindo client determines whether the process performs sequential reads.

JuiceFSRuntime

When you use JuiceFSRuntime, you can use the spec.fuse.options parameter to configure the FUSE component and the spec.worker.options parameter to configure the worker component.

kind: JuiceFSRuntime
metadata:
  ...
spec:
  fuse:
    options:
      buffer-size: "2048"
      cache-size: "0"
      max-uploads: "150"
  worker:
    options:
      buffer-size: "2048"
      max-downloads: "200"
      max-uploads: "150"
  master:
    resources:
      requests:
        memory: 2Gi
      limits:
        memory: 8Gi
  ...

buffer-size: The size of the read/write buffer.
max-downloads: The concurrency of downloads in the prefetching process.
fuse cache-size: The size of the local cache that can be used by the JuiceFS FUSE component.

For example, you can set the size of the local cache that can be used by the JuiceFS FUSE component to 0 and the memory request of the FUSE component (the requests parameter) to a small value, such as 2 GiB. In this case, the FUSE component automatically uses the page cache provided by the Linux kernel on the node. When no page cache hit occurs, the JuiceFS worker distributed cache is used to accelerate data access. For more information about how to optimize the performance and the relevant parameters, see JuiceFS official documentation.

Cache optimization policies for stability improvements

To ensure long-term stability for the Fluid cache system, we recommend that you do not mount directories that store a large number of files to the cache system. This prevents the cache system from being overloaded. You can persist metadata and use persistent storage such as ESSDs to enhance the resilience and fault recovery capabilities of the cache system. To prevent stability issues caused by resource contention, we recommend that you properly allocate resources to the FUSE pod. You can enable the auto repair feature of the FUSE component to allow the cache system to automatically respond to faults. You can combine the preceding policies to improve the stability and reliability of the Fluid cache system.

Policy 1: Do not mount large directories to the underlying data source

The cache system maintains the metadata of all files in the directories mounted to the cache system and records additional information about the files, such as the cache status. If the cache system is mounted with large directories, such as the root directories of large storage systems, the cache system consumes a large amount of memory resources to store the metadata of the files in the mounted directories and consumes additional CPU resources to process access to the metadata.

Fluid defines the dataset concept. A dataset is a collection of logically related data intended for upper-layer applications. A dataset corresponds to a distributed cache system. We recommend that you use a separate dataset to accelerate data access for a data-intensive application or relevant tasks. If you need to run multiple applications, we recommend that you create separate datasets for the applications and mount different subdirectories of the underlying data source to different datasets. A mounted subdirectory contains the data required by the application to which the subdirectory is mounted. This provides improved data isolation when different applications use the cache systems of the mounted datasets to access data. This also ensures the stability and performance of the cache systems. Sample dataset configurations:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  mounts:
  - mountPoint: oss://<BUCKET>/<PATH1>/<SUBPATH>/
    name: sub-bucket

Multiple cache systems may increase the complexity of maintenance. You can select flexible cache system architectures based on your business requirements:

If the datasets that you want to use are small in size, strongly rely on each other, stored in the same storage system, and contain a small number of files, we recommend that you create one Fluid dataset and one distributed cache system.
If the datasets that you want to use are large in size and contain a large number of files, we recommend that you create multiple Fluid datasets and multiple distributed cache systems. You can mount one or multiple Fluid datasets to your application pods.
If the datasets that you want to use belong to different storage systems of different users and data isolation between users is required, we recommend that you create separate short-lived Fluid datasets for each user or the relevant data processing jobs. You can develop Kubernetes operators to manage multiple Fluid datasets.

Policy 2: Persist metadata to improve cache system stability

Most distributed cache systems adopt a master-worker architecture. The master pod is responsible for maintaining the metadata of all files in the directories mounted to the cache system and recording the cache status of the files. When an application uses such a cache system to access a file, the application needs to first obtain the metadata of the file from the master pod and then use the metadata to retrieve the file from the underlying storage or the worker pod of the cache system. Therefore, the stability of the master pod is critical to the high availability of the cache system.

If you use JindoRuntime, you can use the following configurations to improve the availability of the master pod:

apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: sd-dataset
spec:
  ...
  volumes:
    - name: meta-vol
      persistentVolumeClaim:
        claimName: demo-jindo-master-meta
  master:
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
    volumeMounts:
      - name: meta-vol
        mountPath: /root/jindofs-meta
    properties:
      namespace.meta-dir: "/root/jindofs-meta"

The preceding YAML file uses an existing persistent volume claim (PVC) named demo-jindo-master-meta, which is used to mount an ESSD as a volume. This means that the metadata maintained by the master pod is persisted and can be migrated together with the pod. For more information, see Use JindoRuntimes to persist storage for the JindoFS master.

Policy 3: Properly allocate resources to the FUSE pod

The cache system client runs in the FUSE pod, which mounts the FUSE file system to the specified path in the application pod on the host and uses POSIX (Portable Operating System Interface) to expose data. This way, you can accelerate data access from application pods without the need to modify the application code. The access to remote storage is as fast as the access to local storage. The FUSE pod is responsible for maintaining the data tunnel between the application and the cache system. We recommend that you properly allocate resources to the FUSE pod.

Fluid allows you to use the spec.fuse.resources parameter of the cache runtime to allocate resources to the FUSE pod. If out of memory (OOM) errors occur in the FUSE pod, the FUSE file system may be unmounted from the application pod. To prevent this issue, we recommend that you do not specify a memory limit for the FUSE pod or specify a high memory limit for the FUSE pod. For example, you can set the memory limit to a value close to the amount of memory provided by the ECS instance.

spec:
  fuse:
    resources:
      requests:
        memory: 8Gi
    # limits:
    #   memory: <ECS_ALLOCATABLE_MEMORY>

Policy 4: Enable auto repair for the FUSE component to improve client availability

The FUSE pod is responsible for maintaining the data tunnel between the application and the cache system. By default, if the FUSE pod crashes due to unexpected reasons, the application pod can no longer use the FUSE file system to access data, even though the FUSE pod is restarted to run as normal. To resolve this issue, you must restart the application pod to restore data access. When the pod is restarted, the FUSE file system is remounted to the pod. However, this may affect the availability of the application.

Fluid provides the auto repair feature for FUSE. This feature can automatically restore data access to the FUSE file system for the application pod within a specific time period after the FUSE pod is restarted. In addition, you do not need to restart the application pod.

Important

If the FUSE pod crashes, the FUSE file system remains inaccessible before the FUSE pod is restarted. Therefore, you must manually handle I/O errors before the FUSE pod is restarted to prevent the application from crashing. You can retry I/O operations until you succeed.

For more information about how to enable the auto repair feature of FUSE, see How to enable auto repair for FUSE. The auto repair feature of FUSE has certain limits. Therefore, we recommend that you enable this feature only in suitable scenarios, such as interactive program development scenarios where you develop applications by using online notebooks or other interactive programming environments.

Best practices for the read-write consistency of cache

A cache system accelerates data access for your application. However, read-write consistency issues may occur in cache. If you want to ensure a high read-write consistency of cache, the performance of data access will be compromised or the maintenance of the cache system will become complex. We recommend that you select proper policies to improve the read-write consistency of cache based on your business scenarios. This section describes how to configure the read-write consistency of cache in different scenarios.

Scenario 1: Data in the backend storage is read-only and static

Use cases: Each time you train an AI model, the model reads data from a static dataset and then runs iterative training jobs. After the training jobs are completed, the cached data is deleted.

Solutions: Create a Fluid dataset and use the default dataset configurations or explicitly set the dataset to read-only.

Examples:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  # accessModes: ["ReadOnlyMany"] ReadOnlyMany is the default value.

The accessModes: ["ReadOnlyMany"] configuration is used to mount the dataset in read-only mode. This prevents the dataset from being modified during training and simplifies data management and cache optimization policies.

Scenario 2: Data in the backend storage is read-only but periodically changes

Use cases: A cache system is deployed in a Kubernetes cluster and business data is collected to the backend storage on a daily basis. Data analysis tasks are run in the early morning every day to analyze and aggregate incremental business data. The aggregated data is directly written to the backend storage and is not cached.

Solutions: Create a Fluid dataset and use the default dataset configurations or explicitly set the dataset to read-only. Periodically prefetch data from the backend storage and synchronize data changes to the backend storage.

Examples:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  ...
  # accessModes: ["ReadOnlyMany"] ReadOnlyMany is the default value. 
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-dataset-warmup
spec:
  ...
  policy: Cron
  schedule: "0 0 * * *" # Prefetch data at 00:00:00 (UTC+8) each day. 
  loadMetadata: true # Synchronize data changes to the backend storage when you prefetch data from the backend storage. 
  target:
  - path: /path/to/warmup # The path in the backend storage from which you want to prefetch data.

The accessModes: ["ReadOnlyMany"] configuration is used to mount the dataset in read-only mode. You can write data only in the backend storage and cannot write data in the cache. This ensures data consistency and integrity.
The policy: Cron and schedule: "0 0 * * *" configurations are used to automatically run prefetching tasks at 00:00:00 (UTC+8) each day. In addition, the loadMetadata: true configuration ensures that data changes are synchronized to the backend storage when you prefetch data from the backend storage.

Scenario 3: Data in the backend storage is read-only but changes based on business events

Use cases: A model inference service allows you to submit a custom AI model and use the model to run inference tasks. In this case, the custom model is directly written to the backend storage and is not cached. After the model is submitted, you can use the model to run inference tasks and view the inference results.

Solutions: Create a Fluid dataset and use the default dataset configurations or explicitly set the dataset to read-only. Set the timeout period of metadata to a small value for the FUSE pod of the cache runtime. Disable caching for server-side metadata in the cache system.

Examples:

Sample code block for JindoRuntime:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  mounts:
    - mountPoint: <MOUNTPOINT>
      name: data
      path: /
      options:
        metaPolicy: ALWAYS # Specify metaPolicy: ALWAYS to disable caching for server-side metadata. 
 ---
 apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo-dataset
spec:
  fuse:
    args:
      - -oauto_cache
      # Set the timeout period of metadata to 30 seconds. If you specify xxx_timeout=0, strong data consistency is ensured. However, the data access efficiency is greatly downgraded. 
      - -oattr_timeout=30 
      - -oentry_timeout=30
      - -onegative_timeout=30
      - -ometrics_port=0

The metaPolicy: ALWAYS configuration is used to disable caching for server-side metadata. This ensures that the latest metadata is retrieved from the backend storage each time you access data, which is suitable for scenarios where strong data consistency is required.

Scenario 4: Use different directories for data reads and data writes

Use cases: In large-scale distributed AI training scenarios, training jobs read data from datasets in Directory A and write the model checkpoint to Directory B each time an epoch is completed. The checkpoint may be large in size. In this case, you can write data to cache to improve write efficiency.

Solutions: Create a write-only Fluid dataset and a read/write Fluid dataset. Mount the write-only dataset to Directory A and the read/write dataset to Directory B. This way, read/write splitting is achieved.

Examples:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: train-samples
spec:
  ...
  # accessModes: ["ReadOnlyMany"] ReadOnlyMany is the default value. 
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: model-ckpt
spec:
  ...
  accessModes: ["ReadWriteMany"]

Set the train-samples dataset to read-only by specifying accessModes: ["ReadOnlyMany"] in the dataset configurations. This way, training jobs can only read data from Directory A, which eliminates inconsistency issues caused by data writes. Set the model-ckpt dataset as read/write by specifying accessModes: ["ReadWriteMany"] in the dataset configurations. This way, the model checkpoint is written to Directory B each time an epoch is completed, which improves write efficiency.

Sample code block for the application:

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  containers:
  ...
    volumeMounts:
    - name: train-samples-vol
      mountPath: /data/A
    - name: model-ckpt-vol
      mountPath: /data/B
  volumes:
    - name: train-samples-vol
      persistentVolumeClaim:
        claimName: train-samples
    - name: model-ckpt-vol
      persistentVolumeClaim:
        claimName: model-ckpt

The train-samples-vol volume is mounted to the /data/A path and the model-ckpt-vol volume is mounted to the /data/B path. This isolates the dataset from the model checkpoint.

Scenario 5: Use the same directory for data reads and data writes

Use cases: In interactive program development scenarios where you develop applications by using online Jupyter notebooks or Visual Studio Code, you need to use a workspace directory. The workspace directory stores datasets and code. You may need to frequently add, delete, or modify files in the workspace directory.

Solutions: Create a read/write Fluid dataset and use a backend storage that is fully compatible with POSIX.

Examples:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: myworkspace
spec:
  ...
  accessModes: ["ReadWriteMany"]

The accessModes: ["ReadWriteMany"] configuration is used to ensure that multiple users or processes read and write the same workspace directory.