By Xining Wang
Model Service Mesh is an architectural pattern used to deploy and manage machine learning model services in a distributed environment. It offers a scalable, high-performance infrastructure for managing, deploying, and scheduling multiple model services, enabling better handling of model deployment, version management, routing, and load balancing of inference requests.
The core idea of Model Service Mesh is to deploy models as scalable services and to manage and route those services through a mesh. This simplifies the deployment, scaling, and version management of model services and reduces day-to-day operational effort. In addition, Model Service Mesh provides essential features such as load balancing, auto-scaling, and fault recovery to ensure high availability and reliability of model services.
Models can be automatically scaled based on the inference request load, and load balancing can be performed efficiently. Model Service Mesh also offers advanced features such as traffic splitting, A/B testing, and canary release for better traffic control and management of model services. These features allow easy switching of traffic among different model versions and rolling back to specific model versions. Moreover, Model Service Mesh supports dynamic routing, enabling requests to be routed to appropriate model services based on their attributes, such as model type, data format, or other metadata.
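The exact configuration depends on the mesh in use, but as a rough illustration, traffic splitting between two model versions in an Istio-compatible mesh such as ASM can be expressed with a weighted VirtualService. Everything in the sketch below (the host, the gateway, and the model-v1/model-v2 Services) is hypothetical and only meant to show the shape of such a rule, not an ASM-provided API:
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-traffic-split          # hypothetical name
  namespace: modelmesh-serving
spec:
  hosts:
    - models.example.com             # hypothetical host exposed by the gateway
  gateways:
    - istio-system/example-gateway   # hypothetical ASM gateway
  http:
    - route:
        - destination:
            host: model-v1           # hypothetical Service for the stable version
          weight: 90
        - destination:
            host: model-v2           # hypothetical Service for the canary version
          weight: 10
EOF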
Alibaba Cloud Service Mesh (ASM) provides such a Model Service Mesh: a scalable, high-performance infrastructure for managing, deploying, and scheduling multiple model services. It simplifies the deployment, management, and scaling of machine learning models while ensuring high availability, resiliency, and the flexibility to meet diverse business needs.
Model Service Mesh, built on KServe ModelMesh, is optimized for large-scale, high-density, and frequently changing model use cases. It intelligently loads and unloads models into and from memory, striking a balance between responsiveness and computing efficiency.
Model Service Mesh provides the following features.
• Cache management
  • Pods are managed as a distributed least recently used (LRU) cache.
  • Copies of models are loaded and unloaded based on usage frequency and current request volumes.
• Intelligent placement and loading
  • Model placement is balanced by both the cache age across the pods and the request load.
  • Queues are used to handle concurrent model loads and minimize impact on runtime traffic.
• Resiliency
  • Failed model loads are automatically retried in different pods.
• Operational simplicity
  • Rolling model updates are handled automatically and seamlessly.
The following example shows how to deploy a model. Please refer to [1] for prerequisites.
Use the following YAML to create the Persistent Volume Claim (PVC) my-models-pvc in the Container Service for Kubernetes (ACK) cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-models-pvc
  namespace: modelmesh-serving
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: alibabacloud-cnfs-nas
  volumeMode: Filesystem
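Apply the manifest to create the PVC. The file name my-models-pvc.yaml below is only an assumed name for the YAML shown above:
kubectl apply -f my-models-pvc.yaml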
Run the following command to verify that the PVC was created and bound:
kubectl get pvc -n modelmesh-serving
Expected output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
my-models-pvc Bound nas-379c32e1-c0ef-43f3-8277-9eb4606b53f8 1Gi RWX alibabacloud-cnfs-nas 2h
To use the new PVC, you must mount it as a volume in a Kubernetes pod and then use that pod to upload the model files to the persistent volume.
Let's deploy a pvc-access pod and ask the Kubernetes controller to claim the persistent volume we requested earlier by specifying the claimName "my-models-pvc":
kubectl apply -n modelmesh-serving -f - <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: "pvc-access"
spec:
  containers:
    - name: main
      image: ubuntu
      command: ["/bin/sh", "-ec", "sleep 10000"]
      volumeMounts:
        - name: "my-pvc"
          mountPath: "/mnt/models"
  volumes:
    - name: "my-pvc"
      persistentVolumeClaim:
        claimName: "my-models-pvc"
EOF
Check the status of our pvc-access pod. It should be running:
kubectl get pods -n modelmesh-serving | grep pvc-access
Expected output:
pvc-access 1/1 Running
Add the AI model to the persistent volume. In this example, the MNIST handwritten digit recognition model trained with scikit-learn is used. A copy of the mnist-svm.joblib model file can be downloaded from the kserve/modelmesh-minio-examples[2] repo.
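If you have not downloaded the model file yet, one way to fetch it, assuming it is still hosted at the path referenced in [2], is:
curl -L -o mnist-svm.joblib https://github.com/kserve/modelmesh-minio-examples/raw/main/sklearn/mnist-svm.joblib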
Run the following command to copy the mnist-svm.joblib model file to the /mnt/models folder in the pvc-access pod:
kubectl -n modelmesh-serving cp mnist-svm.joblib pvc-access:/mnt/models/
Run the following command to verify that the model exists on the persistent volume:
kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/
Expected output:
-rw-r--r-- 1 501 staff 344817 Oct 30 11:23 mnist-svm.joblib
Use the following YAML to deploy a new inference service named sklearn-mnist:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-mnist
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        parameters:
          type: pvc
          name: my-models-pvc
        path: mnist-svm.joblib
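Assuming the InferenceService manifest above is saved as sklearn-mnist.yaml (an assumed file name), apply it with:
kubectl apply -n modelmesh-serving -f sklearn-mnist.yaml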
After a short wait (the exact time depends on how quickly the runtime image is pulled), the new inference service sklearn-mnist should be ready.
Run the following command to check its status:
kubectl get isvc -n modelmesh-serving
Expected output:
NAME URL READY
sklearn-mnist grpc://modelmesh-serving.modelmesh-serving:8033 True
Run the curl command to send an inference request to the sklearn-mnist model. The data array represents the grayscale values of the 64 pixels in the image scan of the digit to be classified.
MODEL_NAME="sklearn-mnist"
ASM_GW_IP="IP address of the ingress gateway"
curl -X POST -k "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}}]}'
The JSON response should look like the following, indicating that the model classified the scanned digit as an 8:
{
  "modelName": "sklearn-mnist__isvc-3c10c62d34",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1",
        "1"
      ],
      "contents": {
        "int64Contents": [
          "8"
        ]
      }
    }
  ]
}
Model Service Mesh (hereinafter referred to as ModelMesh) is optimized for the deployment of large-scale, high-density, and frequently changing model inference services. ModelMesh intelligently loads and unloads models into and from memory to strike a balance between responsiveness and computing efficiency.
By default, ModelMesh is integrated with the following model serving runtimes.
• Triton Inference Server developed by NVIDIA, applicable to frameworks such as TensorFlow, PyTorch, TensorRT, and ONNX.
• MLServer developed by Seldon, a Python-based server, applicable to frameworks such as SKLearn, XGBoost, and LightGBM.
• OpenVINO Model Server developed by Intel, applicable to frameworks such as Intel OpenVINO and ONNX.
• TorchServe developed by PyTorch, applicable to frameworks such as PyTorch, including the eager mode.
If the preceding model servers cannot meet your requirements, for example, because you need to run custom inference logic or your model uses a framework that these servers do not support, you can create a custom serving runtime.
For more information, please refer to [3].
A large language model (LLM) is a neural network language model with billions of parameters. Common LLMs include GPT-3, GPT-4, PaLM, and PaLM 2. The following describes how to use Model Service Mesh to serve LLMs.
For prerequisites, please refer to [4].
Build a custom runtime to serve the Hugging Face LLM with a prompt tuning configuration. In this example, the defaults point to a pre-built custom runtime image and a pre-built prompt tuning configuration.
The peft_model_server.py file in the directory of kfp-tekton/samples/peft-modelmesh-pipeline[5] contains the full code for serving the Hugging Face LLM with a prompt tuning configuration.
The following _load_model function shows how the pretrained LLM and the trained PEFT prompt tuning configuration are loaded. A tokenizer is also defined as part of the model so that raw string inputs from inference requests can be encoded and decoded without requiring users to preprocess their input into tensor bytes.
from typing import List
from mlserver import MLModel, types
from mlserver.codecs import decode_args
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
    async def load(self) -> bool:
        self._load_model()
        self.ready = True
        return self.ready

    @decode_args
    async def predict(self, content: List[str]) -> List[str]:
        return self._predict_outputs(content)

    def _load_model(self):
        model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
        peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
        config = PeftConfig.from_pretrained(peft_model_id)
        self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
        self.model = PeftModel.from_pretrained(self.model, peft_model_id)
        self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
        return

    def _predict_outputs(self, content: List[str]) -> List[str]:
        output_list = []
        for input in content:
            inputs = self.tokenizer(
                f'{self.text_column} : {input} Label : ',
                return_tensors="pt",
            )
            with torch.no_grad():
                inputs = {k: v for k, v in inputs.items()}
                outputs = self.model.generate(
                    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                )
                outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
            output_list.append(outputs[0])
        return output_list
After the model class is implemented, you need to package it and its dependencies, including MLServer, into an image that can be referenced by a ServingRuntime resource. You can refer to the following Dockerfile to build such an image:
# TODO: choose appropriate base image, install Python, MLServer, and
# dependencies of your MLModel implementation
FROM python:3.8-slim-buster
RUN pip install mlserver peft transformers datasets
# ...
# The custom `MLModel` implementation should be on the Python search path
# instead of relying on the working directory of the image. If using a
# single-file module, this can be accomplished with:
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/
# environment variables to be compatible with ModelMesh Serving
# these can also be set in the ServingRuntime, but this is recommended for
# consistency when building and testing
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
MLSERVER_GRPC_PORT=8001 \
MLSERVER_HTTP_PORT=8002 \
MLSERVER_LOAD_MODELS_AT_STARTUP=false \
MLSERVER_MODEL_NAME=peft-model
# With this setting, the implementation field is not required in the model
# settings which eases integration by allowing the built-in adapter to generate
# a basic model settings file
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
CMD mlserver start ${MLSERVER_MODELS_DIR}
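Build and push the image to a registry that your cluster can pull from. The image name below is only a placeholder; the example ServingRuntime in the next step references an image in an Alibaba Cloud Container Registry instance:
# Replace with a registry address your cluster can pull from (placeholder value)
IMAGE=registry.example.com/peft-model-server:latest
docker build -t ${IMAGE} .
docker push ${IMAGE}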
You can create a new ServingRuntime resource with the following content, pointing it to the image you built.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "true"
        - name: MLSERVER_MODEL_NAME
          value: peft-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
        - name: PRETRAINED_MODEL_PATH
          value: "bigscience/bloomz-560m"
        - name: PEFT_MODEL_ID
          value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
        # - name: "TRANSFORMERS_OFFLINE"
        #   value: "1"
        # - name: "HF_DATASETS_OFFLINE"
        #   value: "1"
      resources:
        requests:
          cpu: 500m
          memory: 4Gi
        limits:
          cpu: "5"
          memory: 5Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
Run the following kubectl apply command to deploy the ServingRuntime resource. After creation, you can see the new custom runtime in your ModelMesh deployment.
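For example, assuming the ServingRuntime manifest above is saved as peft-model-server.yaml (an assumed file name):
kubectl apply -n modelmesh-serving -f peft-model-server.yaml
# Verify that the custom runtime is registered
kubectl get servingruntimes -n modelmesh-serving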
To deploy a model by using the newly created runtime, you must create an InferenceService resource to serve the model. This resource is the main interface used by KServe and ModelMesh to manage models. It represents the logical endpoint of the model for serving inferences.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: peft-model-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
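As with the sklearn-mnist example, apply the manifest and wait for the InferenceService to become ready. The file name peft-demo.yaml is an assumed name for the YAML above:
kubectl apply -n modelmesh-serving -f peft-demo.yaml
kubectl get isvc peft-demo -n modelmesh-serving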
In the preceding code block, the InferenceService resource is named peft-demo, and its model format is declared as peft-model, the same format used by the custom runtime created in the previous step. The optional runtime field is also specified, explicitly instructing ModelMesh to use the peft-model-server runtime to deploy this model.
Now you can run the curl command to send an inference request to the LLM service deployed in the previous step.
MODEL_NAME="peft-demo"
ASM_GW_IP="ASM Gateway IP address"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json
The input.json file in the curl command contains the request data:
{
  "inputs": [
    {
      "name": "content",
      "shape": [1],
      "datatype": "BYTES",
      "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
    }
  ]
}
The value of bytes_contents is the Base64-encoded content of the string "Every day is a new beginning, filled with opportunities and hope."
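To build a request for your own prompt, you can produce the Base64 payload with the standard base64 tool, for example:
echo -n "Every day is a new beginning, filled with opportunities and hope." | base64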
The JSON response should look like the following:
{
  "modelName": "peft-demo__isvc-5c5315c302",
  "outputs": [
    {
      "name": "output-0",
      "datatype": "BYTES",
      "shape": [
        "1",
        "1"
      ],
      "parameters": {
        "content_type": {
          "stringParam": "str"
        }
      },
      "contents": {
        "bytesContents": [
          "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
        ]
      }
    }
  ]
}
The following code block shows the Base64-decoded content of bytesContents:
Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
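You can reproduce the decoding locally with the same tool, for example:
echo "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50" | base64 --decode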
This indicates that the inference request was served by the LLM service as expected.
Alibaba Cloud Service Mesh offers a scalable and high-performance infrastructure for managing, deploying, and scheduling multiple model services. It provides a model service mesh solution that enables better management of model deployment, version control, routing, and load balancing of inference requests.
Try Service Mesh now: https://www.alibabacloud.com/product/servicemesh
[1] https://www.alibabacloud.com/help/en/asm/user-guide/multi-model-inference-service-using-model-service-mesh
[2] kserve/modelmesh-minio-examples repository: https://github.com/kserve/modelmesh-minio-examples/blob/main/sklearn/mnist-svm.joblib
[3] https://www.alibabacloud.com/help/en/asm/user-guide/customizing-the-model-runtime-using-the-model-service-mesh
[4] https://www.alibabacloud.com/help/en/asm/user-guide/services-for-the-large-language-model-llm
[5] kfp-tekton/samples/peft-modelmesh-pipeline directory: https://github.com/kubeflow/kfp-tekton