Enable memory caching for a local directory

Updated at: 2025-04-02 07:50

During model inference, model files that are stored in Object Storage Service (OSS) or Apsara File Storage NAS (NAS) are mounted to a local directory. Access to the model files is affected by network bandwidth and latency. To resolve this issue, Elastic Algorithm Service (EAS) provides memory caching for local directories. EAS caches model files from disks to the memory to accelerate data reading and reduce latency. This topic describes how to configure memory caching for a local directory and shows the acceleration performance.

Background information

In most model inference scenarios, an EAS service mounts model files to a local directory from OSS, NAS, or a Docker image. For more information, see Mount storage to services. Operations such as reading models, switching models, and scaling containers may be limited by network bandwidth. In common scenarios, such as Stable Diffusion deployments, inference requests involve frequent switching between base models and Low-Rank Adaptation (LoRA) models. Each switch requires reading models from OSS or NAS, which introduces additional network latency.

To resolve this issue, EAS provides memory caching for local directories. The following figure shows the principles.

image
  • Model files in a local directory can be cached in memory.

  • The cache uses a least recently used (LRU) eviction policy and can be shared among instances. Cached files are presented as a regular file system directory.

  • The service reads cached files directly from memory. No changes to business code are required (see the sketch after this list).

  • The instances of a service share a P2P network. When you scale out a service cluster, the added instances can read cached files from nearby instances over the P2P network to accelerate cluster scale-out.
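
You can observe this behavior from inside a service instance. The following is a minimal sketch, not part of the original configuration: it assumes the example paths used in this topic, shell access to the container, and an illustrative model file name. The first read through the cache directory pulls the file from the source directory; the second read is served from memory.

    # Illustrative file name; replace it with a model that exists in your source directory.
    MODEL=deliberate_v2.safetensors

    # Cold read: CacheFS fetches the file from the source directory (the OSS/NAS mount)
    # and keeps it in the memory cache.
    time cat /code/stable-diffusion-webui/data-fast/$MODEL > /dev/null

    # Warm read: the file is served from the in-memory cache and returns much faster.
    time cat /code/stable-diffusion-webui/data-fast/$MODEL > /dev/null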

Precautions

  • To ensure data consistency, the cache directory is read-only.

  • To add a model file, add it to the source directory. The new file can then be read directly from the cache directory, as shown in the sketch after this list.

  • We recommend that you do not directly modify or delete model files in the source directory. Otherwise, dirty data may be cached.
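
The following shell sketch illustrates these precautions. It is not part of the original procedure; it assumes the example paths used in this topic and an illustrative new model file at /tmp/new-model.safetensors.

    # Add a new model file to the source directory (the OSS/NAS mount).
    cp /tmp/new-model.safetensors /code/stable-diffusion-webui/data-slow/

    # The new file becomes visible in the cache directory and can be read from there.
    ls -lh /code/stable-diffusion-webui/data-fast/new-model.safetensors

    # Writing to the cache directory fails because the directory is read-only.
    touch /code/stable-diffusion-webui/data-fast/should-fail
    # Expected error: "Read-only file system" (or similar)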

Procedure

In this example, Stable Diffusion is used. The following parameters are configured for service deployment:

  • Image startup command: ./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser

  • OSS directory in which model files are stored: oss://path/to/models/

  • Directory in which model files are stored in a container: /code/stable-diffusion-webui/data-slow

The /code/stable-diffusion-webui/data-slow directory is the source directory in which the model files are stored. The model files are then mounted to the cache directory /code/stable-diffusion-webui/data-fast, so the service can read the model files in the source directory directly from the cache directory.

Configure memory caching in the PAI console
  1. Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. Click Deploy Service and select Custom Deployment in the Custom Model Deployment section.

  3. On the Custom Deployment page, configure the following parameters. For information about other parameters, see Deploy a model service in the PAI console.

     Environment Information

     • Model Settings: select the OSS mount mode and configure the following settings.

       • OSS: the OSS directory in which the model files are stored. Example: oss://path/to/models/

       • Mount Path: the source directory in the container. Example: /code/stable-diffusion-webui/data-slow

     • Command: configure the startup command based on the image that you use or the code that you write. For the Stable Diffusion service, you must add the --ckpt-dir parameter to the startup command and set it to the cache directory. Example:

       ./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser --ckpt-dir /code/stable-diffusion-webui/data-fast

     Features

     • Memory Caching: turn on Memory Caching and configure the following parameters:

       • Maximum Memory Usage: the maximum amount of memory that cached files can occupy. Unit: GB. When this limit is exceeded, cached files are evicted based on the LRU policy. Example: 20 GB

       • Source Path: the source directory of the cached files. This can be the container directory to which the OSS or NAS files are mounted, a subdirectory of that directory, or a regular file directory in the container. Example: /code/stable-diffusion-webui/data-slow

       • Mount Path: the cache directory to which the cached files are mounted. The service must read the files from this directory. Example: /code/stable-diffusion-webui/data-fast

  4. After you configure the parameters, click Deploy.

Configure memory caching on an on-premises client

Step 1: Prepare a configuration file

Prepare a configuration file. Sample code:

   {
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ml.gu7i.c16m60.1-gu30"
                }
            ]
        }
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/stable-diffusion-webui:4.2",
            "port": 8000,
            "script": "./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser --ckpt-dir /code/stable-diffusion-webui/data-fast"
        }
    ],
    "metadata": {
        "cpu": 16,
        "enable_webservice": true,
        "gpu": 1,
        "instance": 1,
        "memory": 60000,
        "name": "sdwebui_test"
    },
    "options": {
        "enable_cache": true
    },
    "storage": [
        {
            "cache": {
                "capacity": "20G",
                "path": "/code/stable-diffusion-webui/data-slow"
            },
            "mount_path": "/code/stable-diffusion-webui/data-fast"
        },
        {
            "mount_path": "/code/stable-diffusion-webui/data-slow",
            "oss": {
                "path": "oss://path/to/models/",
                "readOnly": false
            },
            "properties": {
                "resource_type": "model"
            }
        }
    ]
}

The key parameters are described below. For information about other parameters, see Parameters of model services.

  • script: configure the startup command based on the image that you use or the code that you write. For the Stable Diffusion service, you must add the --ckpt-dir parameter to the startup command and set it to the cache directory.

  • cache.capacity: the maximum amount of memory that cached files can occupy. Unit: GB. When this limit is exceeded, cached files are evicted based on the LRU policy.

  • cache.path: the source directory of the cached files. This can be the container directory to which the OSS or NAS files are mounted, a subdirectory of that directory, or a regular file directory in the container.

  • mount_path: the cache directory to which the cached files are mounted. The files in this directory are the same as the files in the source directory. The service must read the files from this directory.

Step 2: Deploy a model service

Deploy a model service and enable memory caching for the model service.

Deploy a model service in the PAI console
  1. Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. On the Deploy Service page, click JSON Deployment in the Custom Model Deployment section. In the editor field, enter the configuration code that you prepared in Step 1.

  3. Click Deploy.

Deploy a model service by using the EASCMD client

  1. Download the EASCMD client and perform identity authentication. For more information, see Download the EASCMD client and complete identity authentication.

  2. Create a JSON file named test.json in the directory in which the EASCMD client is stored by following the instructions in Step 1.

  3. Run the following command in the directory in which the JSON file is stored. In this example, Windows 64 is used.

    eascmdwin64.exe create test.json
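
After the create command returns, you can check the deployment status from the same client. This is a hedged sketch that assumes the EASCMD desc subcommand is available in your client version; the service name sdwebui_test comes from the metadata section of the sample configuration.

    # Assumption: the desc subcommand queries a service by name in the current region.
    eascmdwin64.exe desc sdwebui_test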

Acceleration effect

The following table shows the time required to switch models in the sample Stable Diffusion scenario. Unit: seconds. The actual acceleration effect varies based on your environment.

Model | Model size | Mounted OSS path | Local instance memory hit | Remote instance memory hit
anything-v4.5.safetensors | 7.2G | 89.88 | 3.845 | 15.18
Anything-v5.0-PRT-RE.safetensors | 2.0G | 16.73 | 2.967 | 5.46
cetusMix_Coda2.safetensors | 3.6G | 24.76 | 3.249 | 7.13
chilloutmix_NiPrunedFp32Fix.safetensors | 4.0G | 48.79 | 3.556 | 8.47
CounterfeitV30_v30.safetensors | 4.0G | 64.99 | 3.014 | 7.94
deliberate_v2.safetensors | 2.0G | 16.33 | 2.985 | 5.55
DreamShaper_6_NoVae.safetensors | 5.6G | 71.78 | 3.416 | 10.17
pastelmix-fp32.ckpt | 4.0G | 43.88 | 4.959 | 9.23
revAnimated_v122.safetensors | 4.0G | 69.38 | 3.165 | 3.20

  • If no model files exist in the memory cache, CacheFS automatically reads model files from the source directory. For example, if a file is mounted to an OSS bucket, CacheFS reads the file from the OSS bucket.

  • If the service has multiple instances, the instances in the cluster share their memory caches. When an instance loads a model, it can read the model files directly from the memory of other instances in the cluster. The time required to read a file varies based on the file size.

  • When you scale out a service cluster, the additional instances can automatically read the model files from the memory of other instances in the cluster during initialization. This way, the service is scaled out in a faster and more flexible manner.
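
To reproduce a model-switch measurement, you can time a checkpoint switch through the Stable Diffusion WebUI API of the deployed service. The following is a sketch under assumptions: the endpoint URL and token are placeholders that you copy from the service details page, and the image is assumed to expose the standard /sdapi/v1/options route.

    # Placeholders: copy the real endpoint and token from the EAS service details page.
    ENDPOINT="http://<your-endpoint>/api/predict/sdwebui_test"
    TOKEN="<your-service-token>"

    # Switch the active checkpoint and time the request. With memory caching enabled,
    # repeated switches between cached models should approach the "Local instance
    # memory hit" times in the table above.
    time curl -s -X POST "$ENDPOINT/sdapi/v1/options" \
      -H "Authorization: $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"sd_model_checkpoint": "deliberate_v2.safetensors"}'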
