KServe is a Kubernetes-based machine learning model serving framework. It provides Kubernetes CustomResourceDefinitions (CRDs) that you can use to deploy one or more trained models to model serving runtimes such as TFServing, TorchServe, and Triton inference servers. This simplifies and accelerates the processes of deploying, updating, and scaling models. The core component of KServe is the KServe controller. You can install the KServe controller through the console to use features such as auto scaling based on request traffic.
Overview of KServe
KServe is a Kubernetes-based machine learning model serving framework. KServe provides simple Kubernetes CustomResourceDefinitions (CRDs) that allow you to deploy one or more trained models to a model serving runtime, such as TFServing, TorchServe, or Triton inference servers. ModelServer and MLServer are two model serving runtimes that KServe uses to deploy and manage machine learning models, and both provide out-of-the-box model serving. ModelServer is a Python model serving runtime that implements the KServe v1 prediction protocol. MLServer implements the KServe v2 prediction protocol over REST and gRPC. For complex use cases, you can also build custom model servers. In addition, KServe provides basic API primitives that allow you to build custom model serving runtimes with ease, and you can use other tools such as BentoML to build custom model serving images.
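For example, the following manifest is a minimal sketch of an InferenceService that deploys a scikit-learn model to a KServe model serving runtime. The resource name and the storageUri are placeholders for illustration; replace them with your own values.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris              # example name, replace with your own
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn             # tells KServe which serving runtime to select
      # runtime: kserve-mlserver  # optionally pin a specific runtime such as MLServer
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # example model location
```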
After you deploy models by using KServe InferenceServices, you can use the following serverless features provided by KServe. A configuration sketch follows the list.
Scale to zero
Auto scaling based on requests per second (RPS), concurrency, and CPU and GPU metrics
Version management
Traffic management
Security authentication
Out-of-the-box metrics
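As a rough sketch, the predictor section of an InferenceService can be configured as follows to enable scale to zero, request-based auto scaling, and a canary traffic split. The concrete values (a target of 10 requests per second and a 10 percent canary split) are arbitrary examples.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0            # allow the predictor pods to scale to zero when idle
    maxReplicas: 5            # upper bound for auto scaling
    scaleMetric: rps          # scale based on requests per second (alternatives include concurrency and cpu)
    scaleTarget: 10           # target value per replica before scaling out
    canaryTrafficPercent: 10  # route 10% of traffic to the latest revision, 90% to the previous one
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```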
KServe controller
The KServe controller is a key component of KServe. It manages InferenceService custom resources, and creates and deploys Knative Services to automate resource scaling. The KServe controller scales the Deployment of a Knative Service based on the traffic volume. When no requests are sent to the Knative Service, the KServe controller automatically scales the Service pods to zero. Auto scaling uses model serving resources more efficiently and prevents resource waste.
Prerequisites
Knative is deployed in your cluster. For more information, see Deploy Knative.
Deploy KServe
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose .
On the Components tab, find KServe and click Deploy in the Actions column. Complete the deployment as prompted.
If the Status column of the KServe component displays Deployed, the component is deployed.
References
After you deploy the component, you can quickly deploy an inference service based on KServe.