When you deploy multiple models that require different runtime environments, or when you need to improve model inference efficiency or control resource allocation, you can use Model Service Mesh (ModelMesh) to create custom model serving runtimes. The fine-tuned configurations of custom model serving runtimes ensure that each model runs in the most appropriate environment. This can help you improve service quality, reduce costs, and simplify O&M of complex models. This topic describes how to use ModelMesh Serving to customize a model serving runtime.
Prerequisites
A Container Service for Kubernetes (ACK) cluster is added to your Service Mesh (ASM) instance, and the ASM instance is of version 1.18.0.134 or later.
Feature description
By default, ModelMesh is integrated with the following model serving runtimes.
| Model server | Developed by | Applicable frameworks | Benefits |
| --- | --- | --- | --- |
| Triton Inference Server | NVIDIA | TensorFlow, PyTorch, TensorRT, and ONNX | Suitable for high-performance, scalable, and low-latency inference services, and provides tools for management and monitoring. |
| MLServer | Seldon | SKLearn, XGBoost, and LightGBM | Provides a unified API and framework, and supports multiple frameworks and advanced features. |
| OpenVINO Model Server | Intel | OpenVINO and ONNX | Uses the hardware acceleration technology of Intel and supports multiple frameworks. |
| TorchServe | PyTorch | PyTorch (including the eager mode) | A lightweight and scalable model server. |
If the preceding model servers cannot meet your requirements, for example, because your model requires custom inference logic or a framework that these servers do not support, you can create a custom serving runtime.
Step 1: Create a custom serving runtime
A namespace-scoped ServingRuntime or a cluster-scoped ClusterServingRuntime defines the templates for pods that can serve one or more particular model formats. Each ServingRuntime or ClusterServingRuntime defines key information such as the container image of a runtime and a list of the supported model formats. Other configurations for the runtime can be passed by environment variables in the spec field.
The ServingRuntime CustomResourceDefinitions (CRDs) allow for improved flexibility and extensibility, enabling you to customize reusable runtimes without modifying the ModelMesh controller code or other resources in the controller namespace. This means that you can easily build a custom runtime to support your framework.
To create a custom serving runtime, you must build a new container image that supports the desired framework and then create a ServingRuntime resource that uses that image. This is especially easy if the framework of the desired runtime has Python bindings. In this case, you can use the extension point of MLServer to add support for additional frameworks. MLServer provides the serving interface, and ModelMesh Serving integrates MLServer as a ServingRuntime.
To build a Python-based custom serving runtime, perform the following steps:
1. Implement a class that inherits from the MLModel class of MLServer.
You can extend MLServer by adding an implementation of the MLModel class. Two main functions are involved: load() and predict(). Depending on your needs, you can use the load() function to load your model and use the predict() function to make a prediction. You can also view example implementations of the MLModel class in the MLServer documentation.
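The following Python code is a minimal sketch of such a class, not a complete implementation. The class name MyCustomModel and the placeholder loading and prediction logic are illustrative assumptions; replace them with the logic of your own framework.

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomModel(MLModel):
    """Illustrative custom runtime class; adapt it to your own framework."""

    async def load(self) -> bool:
        # Load your model artifacts here, for example from the path that is
        # configured in the model settings, and keep a reference on self.
        self._model = ...  # placeholder: replace with your framework's load call
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the request inputs, run inference with self._model, and wrap
        # the predictions in a V2 inference protocol response.
        outputs = []  # placeholder: build ResponseOutput objects from your predictions
        return InferenceResponse(model_name=self.name, outputs=outputs)
```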
2. Package the model class and dependencies into a container image.
After the model class is implemented, you need to package it and its dependencies, including MLServer, into an image that is supported as a ServingRuntime resource. MLServer provides a helper for building such an image by using the mlserver build command. For more information, see Building a custom image.
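For example, if the model class and its configuration files are in the current directory, a command similar to the following one can be used. The image name my-model-server:0.x is only an example:
mlserver build . -t my-model-server:0.x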
3. Create a new ServingRuntime resource by using that image.
Create a ServingRuntime resource with the following content and point it to the image that you created:
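The following YAML is a sketch adapted from the MLServer-based runtime definitions shipped with ModelMesh Serving. The ports, environment variables, and resource settings shown here are assumptions and may need to be adjusted for your image:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{CUSTOM-RUNTIME-NAME}}
spec:
  supportedModelFormats:
    - name: {{MODEL-FORMAT-NAMES}}
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: {{CUSTOM-IMAGE-NAME}}
      env:
        # Directory from which the runtime loads model files (assumed default
        # for MLServer-based images).
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
  builtInAdapter:
    # Tells ModelMesh how to communicate with the runtime container for model management.
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
```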
| Field | Description |
| --- | --- |
| {{CUSTOM-RUNTIME-NAME}} | The name of the runtime, such as my-model-server-0.x. |
| {{MODEL-FORMAT-NAMES}} | The list of model formats that the runtime supports, such as my-model. When you deploy a model of the my-model format, ModelMesh checks the model format against this list to determine whether this runtime is suitable for the model. |
| {{CUSTOM-IMAGE-NAME}} | The container image that you built in the preceding step. |
Run the following command to create a ServingRuntime resource:
kubectl apply -f ${Name of the YAML file}.yaml
After you create the ServingRuntime resource, you can see the new custom runtime in your ModelMesh deployment.
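For example, you can list the serving runtimes by running the following command, assuming that the resource was created in the modelmesh-serving namespace used by the ModelMesh Serving quickstart:
kubectl get servingruntimes -n modelmesh-serving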
Step 2: Deploy a model
To deploy a model by using the newly created runtime, you must create an InferenceService resource to serve the model. This resource is the main interface used by KServe and ModelMesh to manage models. It represents the logical endpoint of the model for serving inferences.
Create an InferenceService resource to serve the model by using the following content:
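The following YAML is a sketch of such an InferenceService resource. The storage path is only an illustration; replace it with the actual location of your model artifacts:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model-sample
  annotations:
    # Deploy the model through ModelMesh instead of a standalone KServe deployment.
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: my-model
      # Optional: explicitly select the custom runtime created in Step 1.
      runtime: my-model-server-0.x
      storage:
        # Storage key of the local MinIO instance from the ModelMesh Serving quickstart.
        key: localMinIO
        # Illustrative path; point it to where your model artifacts reside.
        path: my-model/model.bin
```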
In the YAML file, the InferenceService resource names the model my-model-sample and declares its model format my-model, which is the same format as the example custom runtime created in the previous step. An optional field runtime is also passed to explicitly tell ModelMesh to use the my-model-server-0.x runtime to deploy this model. The storage field specifies where the model resides. In this case, the local MinIO instance that is deployed by using the quickstart guide of ModelMesh Serving is used.
Run the following command to deploy the InferenceService resource:
kubectl apply -f ${Name of the YAML file}.yaml
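After the InferenceService resource is deployed, you can check whether it becomes ready. For example, you can run the following command, assuming that the resource was created in the current namespace:
kubectl get inferenceservice my-model-sample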