Container Service for Kubernetes: Quickly deploy an inference Service based on KServe

Last Updated: Nov 07, 2024

KServe lets you deploy one or more trained models to model serving runtimes, such as TFServing and TorchServe, by using Kubernetes CustomResourceDefinitions (CRDs). This capability simplifies model deployment and management. You can also deploy models as InferenceServices based on Knative. This approach supports auto scaling based on requests per second (RPS), concurrency, and CPU and GPU metrics, and provides Knative features such as automatically scaling instances down to zero when there is no traffic and simplified multi-version management.

Prerequisites

Knative is deployed in your cluster. For more information, see Deploy Knative.

Step 1: Deploy an inference Service

First, deploy a predictive inference Service that uses a scikit-learn model trained on the Iris dataset. The dataset covers three Iris species: Iris Setosa (index 0), Iris Versicolour (index 1), and Iris Virginica (index 2). You can send inference requests to the Service to predict the species of the Irises in your samples.

Note

The Iris dataset contains 50 samples for each Iris species. Each sample has four features: sepal length, sepal width, petal length, and petal width.

  1. Run the following command to deploy an inference Service named sklearn-iris:

    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    EOF
  2. Run the following command to query the status of the Service:

    kubectl get inferenceservices sklearn-iris

    Expected output:

    NAME           URL                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
    sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s
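
To avoid polling the status manually, you can block until the Service reports a Ready condition. The following is a minimal sketch; the timeout value is only an example:

    # Wait until the InferenceService reports Ready=True, or give up after 5 minutes.
    kubectl wait --for=condition=Ready inferenceservice/sklearn-iris --timeout=300s

As mentioned at the beginning of this topic, an InferenceService that runs on Knative can scale automatically, including down to zero replicas when there is no traffic. The following sketch adds optional scaling settings to the predictor. The minReplicas, maxReplicas, scaleMetric, and scaleTarget fields follow the KServe v1beta1 API and may vary with the KServe version deployed in your cluster, so treat the values as an illustration rather than a required configuration:

    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        minReplicas: 0            # Allow the predictor to scale down to zero when idle.
        maxReplicas: 3            # Upper limit for auto scaling.
        scaleMetric: concurrency  # Scale based on concurrent requests per replica.
        scaleTarget: 10           # Target value for the scaling metric.
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    EOF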

Step 2: Access the Service

The IP address and access method of the Service vary based on the gateway that is used.

ALB

  1. Run the following command to query the address of the ALB gateway:

    kubectl get albconfig knative-internet                         

    Expected output:

    NAME               ALBID                    DNSNAME                                              PORT&PROTOCOL   CERTID   AGE
    knative-internet   alb-hvd8nngl0l*******   alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com                               2
  2. Run the following command to create the ./iris-input.json file that contains the inference request:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the Service:

    INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 120.77.XX.XX...
    * TCP_NODELAY set
    * Connected to alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com (120.77.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < Date: Thu, 13 Jul 2023 01:48:44 GMT
    < Content-Type: application/json
    < Content-Length: 21
    < Connection: keep-alive
    < 
    * Connection #0 to host alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com left intact
    {"predictions":[1,1]}

    The response {"predictions": [1, 1]} indicates that both samples sent to the inference Service are predicted as class index 1. This means that the Irises in both samples are Iris Versicolour.
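
In scripts, you may prefer a quiet request that prints only the predictions instead of the verbose curl output. The following is a minimal sketch that assumes the jq command-line tool is installed and reuses the INGRESS_DOMAIN and SERVICE_HOSTNAME variables defined above:

    # Send the request silently and print only the predicted class indexes.
    curl -s -H "Host: ${SERVICE_HOSTNAME}" \
      "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" \
      -d @./iris-input.json | jq '.predictions'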

MSE

  1. Run the following command to query the address of the MSE gateway:

    kubectl -n knative-serving get ing stats-ingress

    Expected output:

    NAME            CLASS                  HOSTS   ADDRESS                         PORTS   AGE
    stats-ingress   knative-ingressclass   *       192.168.XX.XX,47.107.XX.XX      80      15d

    47.107.XX.XX in the ADDRESS column is the public IP address of the MSE gateway, which is used to access the inference Service. The order of the public and private IP addresses of the MSE gateway is not fixed. In some cases, the public IP address precedes the private IP address, for example, 47.107.XX.XX,192.168.XX.XX. A sketch at the end of this section shows how to select the public IP address regardless of the order.

  2. Run the following command to create the ./iris-input.json file that contains the inference request:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the Service:

    # The order of the public and private IP addresses of the MSE gateway is not fixed. In this example, the public IP address is used to access the inference Service. Use ingress[1] if the public IP address follows the private IP address, and use ingress[0] if the public IP address comes first. Choose the index based on the actual order of the IP addresses.
    INGRESS_HOST=$(kubectl -n knative-serving get ing stats-ingress -o jsonpath='{.status.loadBalancer.ingress[1].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 47.107.XX.XX... # 47.107.XX.XX is the public IP address of the MSE gateway. 
    * TCP_NODELAY set
    * Connected to 47.107.XX.XX (47.107.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < content-length: 21
    < content-type: application/json
    < date: Tue, 11 Jul 2023 09:56:00 GMT
    < server: istio-envoy
    < req-cost-time: 5
    < req-arrive-time: 1689069360639
    < resp-start-time: 1689069360645
    < x-envoy-upstream-service-time: 4
    < 
    * Connection #0 to host 47.107.XX.XX left intact
    {"predictions":[1,1]}

    The response {"predictions": [1, 1]} indicates that both samples sent to the inference Service are predicted as class index 1. This means that the Irises in both samples are Iris Versicolour.
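
If you do not want to check the order of the gateway IP addresses manually, you can filter out the private addresses instead of hard-coding ingress[0] or ingress[1]. The following is a minimal sketch that assumes the MSE gateway exposes exactly one public IPv4 address and that its private addresses fall in the standard private ranges:

    # List all gateway IP addresses, drop private ranges, and keep the first public address.
    INGRESS_HOST=$(kubectl -n knative-serving get ing stats-ingress \
      -o jsonpath='{.status.loadBalancer.ingress[*].ip}' | tr ' ' '\n' \
      | grep -vE '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)' | head -n 1)
    echo "${INGRESS_HOST}"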

Kourier

  1. Run the following command to query the address of the Kourier gateway:

    kubectl -n knative-serving get svc kourier

    Expected output:

    NAME      TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                      AGE
    kourier   LoadBalancer   192.168.XX.XX   121.40.XX.XX  80:31158/TCP,443:32491/TCP   49m

    The EXTERNAL-IP value 121.40.XX.XX is the public IP address of the Kourier gateway. The Service ports are 80 for HTTP and 443 for HTTPS.

  2. Run the following command to create the ./iris-input.json file that contains the inference request:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the Service:

    INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 121.40.XX.XX...
    * TCP_NODELAY set
    * Connected to 121.40.XX.XX (121.40.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < content-length: 21
    < content-type: application/json
    < date: Wed, 12 Jul 2023 08:23:13 GMT
    < server: envoy
    < x-envoy-upstream-service-time: 4
    < 
    * Connection #0 to host 121.40.XX.XX left intact
    {"predictions":[1,1]}

    The response {"predictions": [1, 1]} indicates that both samples sent to the inference Service are predicted as class index 1. This means that the Irises in both samples are Iris Versicolour.
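
Because the InferenceService runs on Knative, it can scale down to zero replicas after it stops receiving traffic, as mentioned at the beginning of this topic. To observe this behavior, you can watch the predictor pods after the requests stop. This sketch assumes that KServe labels the pods with serving.kserve.io/inferenceservice, which may differ across KServe versions:

    # Watch the predictor pods. After the idle window elapses, the pod count drops to zero.
    kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris -w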
