When you need to run multiple machine learning models to perform inference, you can use Model Service Mesh (ModelMesh) to roll out and manage a multi-model inference service. ModelMesh is implemented based on KServe ModelMesh and optimized for high-scale, high-density, and frequently-changing model use cases. ModelMesh intelligently loads and unloads models to and from memory to strike a balance between responsiveness and compute footprint. This simplifies the deployment and O&M of a multi-model inference service and improves inference efficiency and performance.
Prerequisites
A Container Service for Kubernetes (ACK) cluster is added to your Service Mesh (ASM) instance and your ASM instance is of version 1.18.0.134 or later.
An ingress gateway is created for the cluster. For more information, see Create an ingress gateway.
In this example, an ASM ingress gateway is used as the gateway of the cluster. The default gateway name is ingressgateway, port 8008 is enabled, and the HTTP protocol is used.
Features
ModelMesh provides the following features.
Feature | Description
Cache management | Pods are managed as a distributed least recently used (LRU) cache. Copies of frequently used models are loaded onto more pods.
Intelligent placement and loading | Model placement is balanced by both the cache age across the pods and the request load. Queues are used to handle concurrent model loads and minimize their impact on runtime traffic.
Resiliency | Failed model loads are automatically retried in different pods.
Operational simplicity | Rolling model updates are handled automatically and seamlessly.
Step 1: Enable the ModelMesh feature in ASM
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.
On the KServe on ASM page, click Enable KServe on ASM.
Note: KServe relies on CertManager. When KServe is installed, the CertManager component is installed automatically. If you want to use a self-managed CertManager, disable Automatically install the CertManager component in the cluster.
After KServe is enabled, use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to check whether a ServingRuntime resource is available:
kubectl get servingruntimes -n modelmesh-serving
Expected output:
NAME             DISABLED   MODELTYPE     CONTAINERS   AGE
mlserver-1.x                sklearn       mlserver     1m
ovms-1.x                    openvino_ir   ovms         1m
torchserve-0.x              pytorch-mar   torchserve   1m
triton-2.x                  keras         triton       1m
A ServingRuntime resource defines the templates for pods that can serve one or more particular model formats. Pods are automatically provisioned depending on the framework of the deployed model.
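For reference, the following snippet is a minimal sketch of the general shape of a ServingRuntime resource. The runtime name, image tag, and supported format are illustrative assumptions only; the preinstalled runtimes listed below also include additional adapter and tuning settings that are omitted here.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-sklearn-runtime      # hypothetical name, for illustration only
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: sklearn                  # model format that this runtime can serve
      version: "1"
      autoSelect: true
  multiModel: true                   # required for runtimes managed by ModelMesh
  grpcDataEndpoint: port:8001        # endpoint on which the runtime serves inference requests
  containers:
    - name: mlserver                 # pod template for the model server container
      image: seldonio/mlserver:1.3.2 # assumed image tag, adjust as needed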
The following table describes the runtimes and model formats supported by ModelMesh. For more information, see Supported Model Formats. If these model servers cannot meet all of your specific requirements, you can create custom model serving runtimes. For more information, see Use ModelMesh to create a custom model serving runtime.
ServingRuntime | Supported model frameworks
mlserver-1.x | sklearn, xgboost, lightgbm
ovms-1.x | openvino_ir, onnx
torchserve-0.x | pytorch-mar
triton-2.x | tensorflow, pytorch, onnx, tensorrt
Step 2: Configure an ASM environment
Synchronize the modelmesh-serving namespace from the ACK cluster to the ASM instance. For more information, see Synchronize automatic sidecar proxy injection labels from a Kubernetes cluster on the data plane to an ASM instance. After synchronization, confirm that the modelmesh-serving namespace exists.
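For example, assuming your kubeconfig points to the cluster, you can run a quick check similar to the following; this verification command is a suggestion, not part of the original procedure:

kubectl get namespace modelmesh-serving --show-labels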
Create an Istio gateway for the ingress gateway.
Create a grpc-gateway.yaml file that contains the following content:
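The file content is not reproduced here. The following is a minimal sketch, assuming the default ASM ingress gateway (selector istio: ingressgateway) and gateway port 8008 from the prerequisites; the resource name and hosts are assumptions that you can adjust.

# Minimal sketch (assumed values): a Gateway that exposes gRPC traffic on port 8008
# through the default ASM ingress gateway.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: grpc-gateway
  namespace: modelmesh-serving
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 8008
        name: grpc
        protocol: GRPC
      hosts:
        - "*"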
Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to create an Istio gateway:
kubectl apply -f grpc-gateway.yaml
Create a virtual service.
Create a vs-modelmesh-serving-service.yaml file that contains the following content:
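The file content is not reproduced here. The following is a minimal sketch that assumes traffic arriving on gateway port 8008 should be routed to the modelmesh-serving service on its gRPC port 8033 (the port shown in the inference service URL later in this topic); the resource and gateway names are assumptions.

# Minimal sketch (assumed values): a VirtualService that routes gateway traffic on
# port 8008 to the modelmesh-serving service on its gRPC port 8033.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vs-modelmesh-serving-service
  namespace: modelmesh-serving
spec:
  hosts:
    - "*"
  gateways:
    - grpc-gateway
  http:
    - match:
        - port: 8008
      route:
        - destination:
            host: modelmesh-serving
            port:
              number: 8033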
Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to create a virtual service:
kubectl apply -f vs-modelmesh-serving-service.yaml
Configure the Google Remote Procedure Call (gRPC)-JSON transcoder.
Create a grpcjsontranscoder-for-kservepredictv2.yaml file that contains the following content:
apiVersion: istio.alibabacloud.com/v1beta1
kind: ASMGrpcJsonTranscoder
metadata:
  name: grpcjsontranscoder-for-kservepredictv2
  namespace: istio-system
spec:
  builtinProtoDescriptor: kserve_predict_v2
  isGateway: true
  portNumber: 8008
  workloadSelector:
    labels:
      istio: ingressgateway
Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to deploy the gRPC-JSON transcoder:
kubectl apply -f grpcjsontranscoder-for-kservepredictv2.yaml
Create a grpcjsontranscoder-increasebufferlimit.yaml file that contains the following content, and set the per_connection_buffer_limit_bytes parameter to increase the size of the response.
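The file content is not reproduced here. The following is a minimal sketch that assumes per_connection_buffer_limit_bytes is raised on the gateway listener for port 8008; the buffer value shown is an assumption and can be adjusted.

# Minimal sketch (assumed values): an EnvoyFilter that raises the listener's
# per_connection_buffer_limit_bytes so that large transcoded responses are not truncated.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: grpcjsontranscoder-increasebufferlimit
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: LISTENER
      match:
        context: GATEWAY
        listener:
          portNumber: 8008
      patch:
        operation: MERGE
        value:
          per_connection_buffer_limit_bytes: 100000000   # assumed limit, adjust as needed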
Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to deploy an Envoy filter:
kubectl apply -f grpcjsontranscoder-increasebufferlimit.yaml
Step 3: Deploy a sample model
Create a StorageClass. For more information, see Mount a dynamically provisioned NAS volume.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Volumes > StorageClasses.
In the upper-right corner of the StorageClasses page, click Create, configure the required parameters, and then click Create.
Create a persistent volume claim (PVC).
Create a my-models-pvc.yaml file that contains the following content:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-models-pvc
  namespace: modelmesh-serving
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: alibabacloud-cnfs-nas
  volumeMode: Filesystem
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to create a PVC:
kubectl apply -f my-models-pvc.yaml
Run the following command to view the PVC in the modelmesh-serving namespace:
kubectl get pvc -n modelmesh-serving
Expected output:
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
my-models-pvc   Bound    nas-379c32e1-c0ef-43f3-8277-9eb4606b53f8   1Gi        RWX            alibabacloud-cnfs-nas   2h
Create a pod to access the PVC.
To use the new PVC, you must mount it as a volume to a Kubernetes pod, and then use that pod to upload the model files to a persistent volume.
Create a pvc-access.yaml file that contains the following content.
The following YAML file creates a pvc-access pod and instructs the Kubernetes controller to claim the previously requested PVC by specifying "my-models-pvc".

apiVersion: v1
kind: Pod
metadata:
  name: "pvc-access"
spec:
  containers:
    - name: main
      image: ubuntu
      command: ["/bin/sh", "-ec", "sleep 10000"]
      volumeMounts:
        - name: "my-pvc"
          mountPath: "/mnt/models"
  volumes:
    - name: "my-pvc"
      persistentVolumeClaim:
        claimName: "my-models-pvc"
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to create a pod:
kubectl apply -n modelmesh-serving -f pvc-access.yaml
Verify that the pvc-access pod is running.
kubectl get pods -n modelmesh-serving | grep pvc-access
Expected output:
pvc-access 1/1 Running 0 51m
Store the model on the persistent volume.
Add the AI model to the persistent volume. In this example, the MNIST handwritten digit recognition model trained with scikit-learn is used. A copy of the mnist-svm.joblib model file can be downloaded from the kserve/modelmesh-minio-examples repository.
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to copy the mnist-svm.joblib model file to the /mnt/models folder in the pvc-access pod:

kubectl -n modelmesh-serving cp mnist-svm.joblib pvc-access:/mnt/models/
Run the following command to verify that the model exists on the persistent volume:
kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/
Expected output:
-rw-r--r-- 1 501 staff 344817 Oct 30 11:23 mnist-svm.joblib
Deploy an inference service.
Create a sklearn-mnist.yaml file that contains the following content:
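The file content is not reproduced here. The following is a minimal sketch of an InferenceService that serves the sklearn model from the PVC created earlier; the annotation and storage layout follow the common ModelMesh-on-PVC pattern and should be treated as assumptions.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-mnist
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh   # schedule this model onto ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                 # matches the format supported by the mlserver-1.x runtime
      storage:
        parameters:
          type: pvc                   # read the model from a PVC-backed volume
          name: my-models-pvc         # the PVC created in the previous step
        path: mnist-svm.joblib        # path of the model file on the volume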
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to deploy the sklearn-mnist inference service:
kubectl apply -f sklearn-mnist.yaml
Wait for about a minute (the exact duration depends on the image pulling speed), and then run the following command to check whether the sklearn-mnist inference service is deployed:
kubectl get isvc -n modelmesh-serving
Expected output:
NAME            URL                                               READY
sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True
Perform an inference.
Run the curl command to send an inference request to the sklearn-mnist model. The data array indicates the grayscale values of the 64 pixels in the image scan of the digit to be classified.

MODEL_NAME="sklearn-mnist"
ASM_GW_IP="IP address of the ingress gateway"
curl -X POST -k "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}}]}'
The following code block shows the JSON response. It can be inferred that the scanned digit is 8.

{
  "modelName": "sklearn-mnist__isvc-3c10c62d34",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": ["1", "1"],
      "contents": {
        "int64Contents": ["8"]
      }
    }
  ]
}
References
When you deploy multiple models that require different runtime environments, or when you need to improve model inference efficiency or control resource allocation, you can use ModelMesh to create custom model serving runtimes. The fine-tuned configurations of custom model serving runtimes ensure that each model runs in the most appropriate environment. For more information, see Use ModelMesh to create a custom model serving runtime.
When you need to process large amounts of natural language data or want to build complex language understanding systems, you can use a large language model (LLM) as an inference service. For more information, see Use an LLM as an inference service.
When you encounter pod errors, you can troubleshoot them by referring to Pod troubleshooting.