When you deploy multiple models that require different runtime environments, or when you need to improve model inference efficiency or control resource allocation, you can use Model Service Mesh (ModelMesh) to create custom model serving runtimes. The fine-tuned configurations of custom model serving runtimes ensure that each model runs in the most appropriate environment. This can help you improve service quality, reduce costs, and simplify O&M of complex models. This topic describes how to use ModelMesh Serving to customize a model serving runtime.
Prerequisites
A Container Service for Kubernetes (ACK) cluster is added to your Service Mesh (ASM) instance, and the ASM instance is of version 1.18.0.134 or later.
Feature description
By default, ModelMesh is integrated with the following model serving runtimes.
| Model server | Developed by | Applicable frameworks | Benefits |
| --- | --- | --- | --- |
| Triton Inference Server | NVIDIA | TensorFlow, PyTorch, TensorRT, and ONNX | Suitable for high-performance, scalable, and low-latency inference services, and provides tools for management and monitoring. |
| MLServer | Seldon | SKLearn, XGBoost, and LightGBM | Provides a unified API and framework, and supports multiple frameworks and advanced features. |
| OpenVINO Model Server | Intel | OpenVINO and ONNX | Uses the hardware acceleration technology of Intel and supports multiple frameworks. |
| TorchServe | PyTorch | PyTorch (including the eager mode) | A lightweight and scalable model server. |
If the preceding model servers cannot meet your requirements, for example, because your model requires custom inference logic or a framework that these servers do not support, you can create a custom serving runtime.
Step 1: Create a custom serving runtime
A namespace-scoped ServingRuntime or a cluster-scoped ClusterServingRuntime defines the templates for pods that can serve one or more particular model formats. Each ServingRuntime or ClusterServingRuntime defines key information such as the container image of a runtime and a list of the supported model formats. Other configurations for the runtime can be passed by environment variables in the spec field.
The ServingRuntime CustomResourceDefinitions (CRDs) allow for improved flexibility and extensibility, enabling you to customize reusable runtimes without modifying the ModelMesh controller code or other resources in the controller namespace. This means that you can easily build a custom runtime to support your framework.
To create a custom serving runtime, you must build a new container image that supports the desired framework and then create a ServingRuntime resource that uses that image. This is especially easy if the framework of the desired runtime has Python bindings. In this case, you can use the extension point of MLServer to add support for additional frameworks. MLServer provides the serving interface, and ModelMesh Serving integrates MLServer as a ServingRuntime.
To build a Python-based custom serving runtime, perform the following steps:
1. Implement a class that inherits from the MLModel class of MLServer.
You can extend MLServer by adding an implementation of the MLModel class. Two main functions are involved: load() and predict(). Depending on your needs, you can use the load() function to load your model and use the predict() function to make a prediction. You can also view example implementations of the MLModel class in the MLServer documentation.
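The following Python code is a minimal sketch of such a class, not a complete implementation. The class name MyCustomModel and the placeholder loading and prediction logic are illustrative assumptions; replace them with the logic of your own framework.

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomModel(MLModel):
    """Illustrative custom runtime class; adapt it to your own framework."""

    async def load(self) -> bool:
        # Load your model artifacts here, for example from the path that is
        # configured in the model settings, and keep a reference on self.
        self._model = ...  # placeholder: replace with your framework's load call
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the request inputs, run inference with self._model, and wrap
        # the predictions in a V2 inference protocol response.
        outputs = []  # placeholder: build ResponseOutput objects from your predictions
        return InferenceResponse(model_name=self.name, outputs=outputs)
```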
2. Package the model class and dependencies into a container image.
After the model class is implemented, you need to package it and its dependencies, including MLServer, into an image that is supported as a ServingRuntime resource. MLServer provides a helper for building such an image by using the mlserver build command. For more information, see Building a custom image.
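For example, if the model class and its configuration files are in the current directory, a command similar to the following one can be used. The image name my-model-server:0.x is only an example:
mlserver build . -t my-model-server:0.x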
3. Create a new ServingRuntime resource by using that image.
Create a ServingRuntime resource with the following content and point it to the image that you created:
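The following YAML is a sketch adapted from the MLServer-based runtime definitions shipped with ModelMesh Serving. The ports, environment variables, and resource settings shown here are assumptions and may need to be adjusted for your image:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{CUSTOM-RUNTIME-NAME}}
spec:
  supportedModelFormats:
    - name: {{MODEL-FORMAT-NAMES}}
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: {{CUSTOM-IMAGE-NAME}}
      env:
        # Directory from which the runtime loads model files (assumed default
        # for MLServer-based images).
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
  builtInAdapter:
    # Tells ModelMesh how to communicate with the runtime container for model management.
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
```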
| Field | Description |
| --- | --- |
| {{CUSTOM-RUNTIME-NAME}} | The name of the runtime, such as my-model-server-0.x. |
| {{MODEL-FORMAT-NAMES}} | The list of model formats that the runtime supports, such as my-model. When you deploy a model of the my-model format, ModelMesh checks the model format against this list to determine whether this runtime is suitable for the model. |
| {{CUSTOM-IMAGE-NAME}} | The container image that you built in the preceding step. |
Run the following command to create a ServingRuntime resource:
kubectl apply -f ${Name of the YAML file}.yaml
After you create the ServingRuntime resource, you can see the new custom runtime in your ModelMesh deployment.
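For example, you can list the serving runtimes by running the following command, assuming that the resource was created in the modelmesh-serving namespace used by the ModelMesh Serving quickstart:
kubectl get servingruntimes -n modelmesh-serving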
Step 2: Deploy a model
To deploy a model by using the newly created runtime, you must create an InferenceService resource to serve the model. This resource is the main interface used by KServe and ModelMesh to manage models. It represents the logical endpoint of the model for serving inferences.
Create an InferenceService resource to serve the model by using the following content:
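The following YAML is a sketch of such an InferenceService resource. The storage path is only an illustration; replace it with the actual location of your model artifacts:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model-sample
  annotations:
    # Deploy the model through ModelMesh instead of a standalone KServe deployment.
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: my-model
      # Optional: explicitly select the custom runtime created in Step 1.
      runtime: my-model-server-0.x
      storage:
        # Storage key of the local MinIO instance from the ModelMesh Serving quickstart.
        key: localMinIO
        # Illustrative path; point it to where your model artifacts reside.
        path: my-model/model.bin
```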
In the YAML file, the InferenceService resource names the model my-model-sample and declares its model format my-model, which is the same format as the example custom runtime created in the previous step. An optional field runtime is also passed to explicitly tell ModelMesh to use the my-model-server-0.x runtime to deploy this model. The storage field specifies where the model resides. In this case, the local MinIO instance that is deployed by using the quickstart guide of ModelMesh Serving is used.
Run the following command to deploy the InferenceService resource:
kubectl apply -f ${Name of the YAML file}.yaml
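After the InferenceService resource is deployed, you can check whether it becomes ready. For example, you can run the following command, assuming that the resource was created in the current namespace:
kubectl get inferenceservice my-model-sample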