During model inference, model files that are stored in Object Storage Service (OSS) or Apsara File Storage NAS (NAS) are mounted to a local directory. Access to the model files is affected by network bandwidth and latency. To resolve this issue, Elastic Algorithm Service (EAS) provides memory caching for local directories. EAS caches model files from disks to the memory to accelerate data reading and reduce latency. This topic describes how to configure memory caching for a local directory and shows the acceleration performance.
Background information
In most model inference scenarios, an EAS service mounts model files to a local directory from OSS, NAS, or a Docker image. For more information, see Mount storage to services. Operations such as reading models, switching models, and scaling containers can be limited by network bandwidth. In common scenarios such as Stable Diffusion, inference requests frequently switch between base models and Low-Rank Adaptation (LoRA) models, and each switch requires reading model files from OSS or NAS, which increases latency.
To resolve this issue, EAS provides memory caching for local directories. Memory caching works in the following way:
You can cache model files in a local directory to memory.
The cache supports the least recently used (LRU) eviction policy and allows cached files to be shared among instances. The cached files are presented as a regular file system directory.
The service reads the cached files directly from memory. No changes to your business code are required.
The instances of a service share a P2P network. When you scale out a service cluster, the added instances can read cached files from nearby instances over the P2P network to accelerate cluster scale-out.
Precautions
To ensure data consistency, the cache directory is read-only.
To add a model file, add it to the source directory. The new file can then be read directly from the cache directory, as shown in the example after these precautions.
We recommend that you do not directly modify or delete model files in the source directory. Otherwise, dirty data may be cached.
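For example, the following commands illustrate this behavior inside a service container. The paths match the sample directories used later in this topic, and new-model.safetensors is a hypothetical file name; this is only a sketch, and the exact behavior depends on your mount configuration.
# Add a new model file to the source directory (the OSS or NAS mount).
cp new-model.safetensors /code/stable-diffusion-webui/data-slow/
# The new file can then be read from the cache directory.
ls /code/stable-diffusion-webui/data-fast/
# The cache directory is read-only, so write operations against it are expected to fail.
rm /code/stable-diffusion-webui/data-fast/new-model.safetensors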
Procedure
In this example, Stable Diffusion is used. The following parameters are configured for service deployment:
Image startup command:
./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser
OSS directory in which model files are stored:
oss://path/to/models/
Directory in which model files are stored in a container:
/code/stable-diffusion-webui/data-slow
The /code/stable-diffusion-webui/data-slow directory is the source directory in which the model files are stored. The model files are then mounted to the cache directory /code/stable-diffusion-webui/data-fast. This way, the service reads the model files that are stored in the source directory /code/stable-diffusion-webui/data-slow directly from the cache directory /code/stable-diffusion-webui/data-fast.
Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Click Deploy Service and select Custom Deployment in the Custom Model Deployment section.
On the Custom Deployment page, configure the parameters described in the following table. For information about other parameters, see Deploy a model service in the PAI console.
Parameter | Description | Example
Environment Information | |
Model Settings | Select the OSS mount mode. | OSS: oss://path/to/models/ Mount Path: /code/stable-diffusion-webui/data-slow
Command | Configure the startup command based on the image that you use or the code that you write. For the Stable Diffusion service, you must add the --ckpt-dir parameter to the startup command and set it to the cache directory. | ./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser --ckpt-dir /code/stable-diffusion-webui/data-fast
Features | |
Memory Caching | Turn on Memory Caching and configure the following parameters: Maximum Memory Usage: the maximum amount of memory that cached files can use. Unit: GB. When this limit is exceeded, cached files are evicted based on the LRU policy. Source Path: the source directory of the files to cache. This can be the directory in the container to which files in OSS or NAS are mounted, a subdirectory of that directory, or a regular file directory in the container. Mount Path: the cache directory to which the cached files are mounted. The service must read model files from this directory. | Maximum Memory Usage: 20 GB Source Path: /code/stable-diffusion-webui/data-slow Mount Path: /code/stable-diffusion-webui/data-fast
After you configure the parameters, click Deploy.
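After the service is deployed, you can check that the models in the cache directory are visible to the WebUI. The following request is only a sketch: it assumes that you call the service through the endpoint and token shown on the service details page, and that the /sdapi/v1/sd-models endpoint of your Stable Diffusion WebUI version is enabled by the --api flag.
curl -H "Authorization: <service_token>" <service_endpoint>/sdapi/v1/sd-models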
You can also deploy the service by using a JSON configuration file, either in the console or by using the EASCMD client.
Step 1: Prepare a configuration file
Prepare a configuration file. Sample code:
{
"cloud": {
"computing": {
"instances": [
{
"type": "ml.gu7i.c16m60.1-gu30"
}
]
}
},
"containers": [
{
"image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/stable-diffusion-webui:4.2",
"port": 8000,
"script": "./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser --ckpt-dir /code/stable-diffusion-webui/data-fast"
}
],
"metadata": {
"cpu": 16,
"enable_webservice": true,
"gpu": 1,
"instance": 1,
"memory": 60000,
"name": "sdwebui_test"
},
"options": {
"enable_cache": true
},
"storage": [
{
"cache": {
"capacity": "20G",
"path": "/code/stable-diffusion-webui/data-slow"
},
"mount_path": "/code/stable-diffusion-webui/data-fast"
},
{
"mount_path": "/code/stable-diffusion-webui/data-slow",
"oss": {
"path": "oss://path/to/models/",
"readOnly": false
},
"properties": {
"resource_type": "model"
}
}
]
}
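In this configuration, the second storage entry mounts oss://path/to/models/ to the source directory /code/stable-diffusion-webui/data-slow, and the first storage entry caches that directory in memory and exposes it at /code/stable-diffusion-webui/data-fast, which is the directory that the --ckpt-dir parameter in script points to.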
The following table describes the parameters. For information about other parameters, see Parameters of model services.
Parameter | | Description
script | | The startup command. Configure it based on the image that you use or the code that you write. For the Stable Diffusion service, you must add the --ckpt-dir parameter to the startup command and set it to the cache directory.
cache | capacity | The maximum amount of memory that cached files can use. Unit: GB. When this limit is exceeded, cached files are evicted based on the LRU policy.
cache | path | The source directory of the files to cache. This can be the directory in the container to which files in OSS or NAS are mounted, a subdirectory of that directory, or a regular file directory in the container.
mount_path | | The cache directory to which the cached files are mounted. The files in this directory are the same as the files in the source directory. The service must read model files from this directory.
Step 2: Deploy a model service
Deploy the model service and enable memory caching by using one of the following methods.
Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. On the Deploy Service page, click JSON Deployment in the Custom Model Deployment section. In the editor field, enter the configuration code that you prepared in Step 1.
Click Deploy.
Alternatively, you can deploy the service by using the EASCMD client. Download the EASCMD client and perform identity authentication. For more information, see Download the EASCMD client and complete identity authentication.
Create a JSON file named test.json in the directory in which the EASCMD client is stored, based on the configuration that you prepared in Step 1. Then, run the following command in that directory. In this example, the 64-bit version of Windows is used.
eascmdwin64.exe create <test.json>
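After the create command returns, you can check the deployment status of the service. The following command is a sketch that assumes the service name sdwebui_test from the sample configuration file and uses the desc subcommand of EASCMD.
eascmdwin64.exe desc sdwebui_test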
Acceleration effect
The following table shows the amount of time required to load each model in the Stable Diffusion model switching scenario. Unit: seconds. The actual acceleration effect varies based on your environment.
Model | Model size | Read from OSS mount | Local instance memory hit | Remote instance memory hit
anything-v4.5.safetensors | 7.2G | 89.88 | 3.845 | 15.18 |
Anything-v5.0-PRT-RE.safetensors | 2.0G | 16.73 | 2.967 | 5.46 |
cetusMix_Coda2.safetensors | 3.6G | 24.76 | 3.249 | 7.13 |
chilloutmix_NiPrunedFp32Fix.safetensors | 4.0G | 48.79 | 3.556 | 8.47 |
CounterfeitV30_v30.safetensors | 4.0G | 64.99 | 3.014 | 7.94 |
deliberate_v2.safetensors | 2.0G | 16.33 | 2.985 | 5.55 |
DreamShaper_6_NoVae.safetensors | 5.6G | 71.78 | 3.416 | 10.17 |
pastelmix-fp32.ckpt | 4.0G | 43.88 | 4.959 | 9.23 |
revAnimated_v122.safetensors | 4.0G | 69.38 | 3.165 | 3.20 |
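For example, loading anything-v4.5.safetensors (7.2 GB) directly from the OSS mount takes about 90 seconds, or roughly 80 MB/s, whereas a local memory hit takes about 3.8 seconds, which is more than 20 times faster.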
If the requested model files are not in the memory cache, CacheFS automatically reads them from the source directory. For example, if the source directory is mounted from an OSS bucket, CacheFS reads the files from the OSS bucket.
If the service has multiple instances, the instances share their memory caches. When an instance loads a model, it can read the model files directly from the memory of other instances in the cluster. The amount of time required to read a file varies based on the file size.
When you scale out a service cluster, the additional instances can automatically read the model files from the memory of other instances in the cluster during initialization. This way, the service is scaled out in a faster and more flexible manner.