In most model inference scenarios, a service mounts model files that are stored in Object Storage Service (OSS) buckets or File Storage NAS (NAS) file systems to a local directory. Operations such as reading models, switching models, and scaling containers are then limited by network bandwidth and can incur high latency. To resolve this issue, Elastic Algorithm Service (EAS) provides memory caching for local directories: the system caches model files from disk into memory to accelerate reads and reduce latency. This topic describes how to configure memory caching for a local directory and the acceleration performance that you can expect.
Background information
In most cases, loading a model takes a long time, especially in Stable Diffusion scenarios, in which the system must frequently switch between base models and LoRA models to handle inference requests. Each time a model is switched, the system reads the model files from OSS or NAS again, which significantly increases latency. To resolve this issue, EAS provides memory caching for local directories. The following items describe how the feature works.
In most model inference scenarios, the service mounts model files to a local directory by using OSS, NAS, or a Docker image. For information about how to mount storage to a service, see Mount storage to services (advanced). On top of these mounts, EAS provides memory caching for local directories:

- In most AI-generated content (AIGC) scenarios, instances have sufficient spare memory. You can cache the model files of a local directory in memory by mounting that directory as a cache.
- The cache uses the least recently used (LRU) eviction policy and can be shared among instances. The cached files are exposed as a regular file system directory.
- The service reads the cached files in memory directly through that directory, so you get the acceleration without modifying business code.
- The instances of a service share a P2P network. When you scale out the service, new instances can read cached files from existing instances over the P2P network, which accelerates scale-out.
Precautions
- To ensure data consistency, the cache directory is read-only.
- To add a model file, add it to the source directory. The file can then be read directly from the cache directory.
- We recommend that you do not modify or delete model files directly in the source directory. Otherwise, dirty data may remain in the cache.
Procedure
Configure memory caching for a local directory in the PAI console
Go to the Elastic Algorithm Service (EAS) page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page of the Platform for AI (PAI) console, click Deploy Service. On the Deploy Service page, click Custom Deployment in the Custom Model Deployment section.
On the Create Service page, configure the following parameters. For information about other parameters, see Deploy a model service in the PAI console.

Model Service Information

- Specify Model Settings: Click Specify Model Settings to configure the model source. In this example, Mount OSS Path is used.
  - Select an OSS path, such as `oss://path/to/models/`.
  - Configure a mount path, such as `/data-slow`. The mount path is the directory in the container to which the OSS path is mounted.
- Command to Run: Set the startup parameter `--ckpt-dir` to the cache directory, for example, `--ckpt-dir /data-fast`.

Service Configuration

- Memory Caching: Click Memory Caching and configure the following parameters:
  - Maximum Memory Usage: the maximum amount of memory that the cached files can occupy. Unit: GB. When the cached files exceed this limit, the LRU policy evicts the least recently used files.
  - Source Path: the source directory of the cached files. This can be the container mount path of an OSS or NAS mount, a subdirectory of that mount path, or a regular directory in the container. Example: `/data-slow`.
  - Mount Path: the cache directory to which the cached files are mounted. The files in this directory mirror the files in the source directory, and the service must read model files from this directory to benefit from the cache. Example: `/data-fast`.

In this sample configuration, the OSS path is mounted to the `/data-slow` directory in the container, and the source directory `/data-slow` is mounted to the cache directory `/data-fast` by using the CacheFS component. This way, the service reads the cached files of the source directory `/data-slow` directly from the cache directory `/data-fast`.

After you configure the parameters, click Deploy.
Configure memory caching for a local directory on an on-premises client
Step 1: Prepare a configuration file
Add the cache and mount_path parameters to the storage section of the service configuration file. The following example shows a sample configuration. For information about how to mount storage to a service, see Mount storage to services (advanced).
"storage": [
{
"mount_path": "/data-slow",
"oss": {
"path": "oss://path/to/models/",
"readOnly": false
},
"properties": {
"resource_type": "model"
}
},
{
"cache": {
"capacity": "20G",
"path": "/data-slow"
},
"mount_path": "/data-fast"
}
]
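If your model files are stored on NAS instead of OSS, the same cache parameters apply; only the source mount changes. The following is a minimal sketch that assumes a NAS mount declared with the nfs fields (server and path) described in Mount storage to services (advanced); the server address and paths are placeholders.

```json
"storage": [
    {
        "mount_path": "/data-slow",
        "nfs": {
            "path": "/models",
            "server": "example.cn-hangzhou.nas.aliyuncs.com"
        }
    },
    {
        "cache": {
            "capacity": "20G",
            "path": "/data-slow"
        },
        "mount_path": "/data-fast"
    }
]
```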
Parameter description

| Parameter | Description |
| --- | --- |
| cache.capacity | The maximum amount of memory that the cached files can occupy. When the cached files exceed this limit, the LRU policy evicts the least recently used files. |
| cache.path | The source directory of the cached files. This can be the container mount path of an OSS or NAS mount, a subdirectory of that mount path, or a regular directory in the container. |
| mount_path | The cache directory to which the cached files are mounted. The files in this directory mirror the files in the source directory. The service must read model files from this directory to benefit from the cache. |
Configuration description

The preceding sample configuration mounts the OSS path to the `/data-slow` directory in the container and then mounts the source directory `/data-slow` to the cache directory `/data-fast` by using the CacheFS component. This way, the service reads the cached files of the source directory `/data-slow` directly from the cache directory `/data-fast`.
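The cache source can also be a subdirectory of a mount path rather than the entire mount. The following is a minimal sketch that caches only an assumed models/checkpoints subdirectory of the OSS mount; the subdirectory name and capacity are placeholders.

```json
"storage": [
    {
        "mount_path": "/data-slow",
        "oss": {
            "path": "oss://path/to/models/"
        }
    },
    {
        "cache": {
            "capacity": "10G",
            "path": "/data-slow/models/checkpoints"
        },
        "mount_path": "/data-fast"
    }
]
```

In this sketch, the service reads the cached checkpoint files from /data-fast, while the rest of the OSS mount remains accessible through /data-slow without caching.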
The following example shows how to configure memory caching for a Stable Diffusion model service. Modify the configurations based on your business requirements. For information about how to deploy the Stable Diffusion inference service, see Quickly deploy a Stable Diffusion API service in EAS.
```json
{
    "cloud": {
        "computing": {
            "instance_type": "ml.gu7i.c8m30.1-gu30"
        }
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/stable-diffusion-webui:3.2",
            "port": 8000,
            "script": "./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-install --filebrowser --api --no-read-lora-meta --disable-nsfw-censor --public-cache --ckpt-dir data-fast/models/Stable-diffusion"
        }
    ],
    "metadata": {
        "cpu": 32,
        "enable_webservice": true,
        "gpu": 1,
        "instance": 1,
        "name": "sd_cache"
    },
    "storage": [
        {
            "mount_path": "/code/stable-diffusion-webui/data-oss",
            "oss": {
                "path": "oss://examplebucket/data-zf/"
            },
            "properties": {
                "resource_type": "model"
            }
        },
        {
            "mount_path": "/code/stable-diffusion-webui/data-fast/models",
            "cache": {
                "path": "/code/stable-diffusion-webui/data-oss/models",
                "capacity": "15G"
            }
        }
    ]
}
```
Take note of the following key parameters. For information about other parameters, see Parameters of model services.
- script: Set the startup parameter `--ckpt-dir` to a directory inside the cache. Example: `--ckpt-dir data-fast/models/Stable-diffusion`.
- storage: the mount configuration. In the preceding code, the OSS path is mounted to the `data-oss` directory, and the source directory `data-oss/models` is mounted to the cache directory `data-fast/models` by using the CacheFS component. This way, the service reads the cached files of the source directory `data-oss/models` from the cache directory `data-fast/models`, and the `--ckpt-dir` parameter points to its `Stable-diffusion` subdirectory. Set the OSS path to the path of the OSS bucket in which your model files are stored.
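Because model switching often also involves LoRA models, you may want the WebUI to load LoRA files from the cache as well. The following sketch shows only the containers section and assumes that your WebUI build supports the --lora-dir flag and that the LoRA files are stored under models/Lora in the mounted OSS path; adjust the paths to your layout.

```json
"containers": [
    {
        "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/stable-diffusion-webui:3.2",
        "port": 8000,
        "script": "./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-install --filebrowser --api --no-read-lora-meta --disable-nsfw-censor --public-cache --ckpt-dir data-fast/models/Stable-diffusion --lora-dir data-fast/models/Lora"
    }
]
```

Because the sample cache already mirrors the entire data-oss/models directory to data-fast/models, no additional storage entry is required for the LoRA files.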
Step 2: Deploy a model service
Deploy a model service and enable memory caching for the model service.
Deploy a model service in the console
Go to the Create Service page. For more information, see Deploy a model service in the PAI console.
In the Configuration Editor section, click JSON Deployment. In the editor field, enter the configuration code that you prepared in Step 1.
Click Deploy.
Deploy a model service by using the EASCMD client
Download the EASCMD client and perform identity authentication. For more information, see Download the EASCMD client and complete identity authentication.
Create a JSON file named `test.json` in the directory where the EASCMD client resides, and fill it with the configuration that you prepared in Step 1.

Run the following command in the directory where the JSON file resides. In this example, the 64-bit Windows version of the client is used.

```
eascmdwin64.exe create <test.json>
```
After you deploy the service, you can directly read the model files from the memory cache of the local directory to improve the efficiency of model reading, model switching, and instance scaling. For more information about the acceleration performance, see Acceleration performance.
Acceleration performance
The following table describes the acceleration performance for model switching, using Stable Diffusion as an example. All load times are in seconds. Actual performance varies with your environment.
| Model | Model size | Mount OSS path | Local instance memory hit | Remote instance memory hit |
| --- | --- | --- | --- | --- |
| anything-v4.5.safetensors | 7.2 GB | 89.88 | 3.845 | 15.18 |
| Anything-v5.0-PRT-RE.safetensors | 2.0 GB | 16.73 | 2.967 | 5.46 |
| cetusMix_Coda2.safetensors | 3.6 GB | 24.76 | 3.249 | 7.13 |
| chilloutmix_NiPrunedFp32Fix.safetensors | 4.0 GB | 48.79 | 3.556 | 8.47 |
| CounterfeitV30_v30.safetensors | 4.0 GB | 64.99 | 3.014 | 7.94 |
| deliberate_v2.safetensors | 2.0 GB | 16.33 | 2.985 | 5.55 |
| DreamShaper_6_NoVae.safetensors | 5.6 GB | 71.78 | 3.416 | 10.17 |
| pastelmix-fp32.ckpt | 4.0 GB | 43.88 | 4.959 | 9.23 |
| revAnimated_v122.safetensors | 4.0 GB | 69.38 | 3.165 | 3.20 |
If a requested model file is not in the memory cache, CacheFS automatically reads it from the source directory. For example, if the source is an OSS mount, CacheFS reads the file from the OSS bucket, and the time consumed is roughly the same as reading the file through the OSS mount directly.

If the service has multiple instances, the instances share their memory caches. When an instance loads a model that another instance has already cached, it reads the file directly from that instance's memory, and the time consumed depends on the file size.

When the service is scaled out, the new instances automatically read models from the memory of existing instances during initialization, which makes scale-out faster and more elastic.