KServe lets you deploy one or more trained models to model serving runtimes, such as TFServing and TorchServe, by using Kubernetes CustomResourceDefinitions (CRDs). This simplifies model deployment and management. You can also deploy models as Knative-based inference services, which can automatically scale based on requests per second (RPS), concurrency, or CPU and GPU metrics. In addition, you benefit from Knative features such as automatically scaling instances down to zero when there is no traffic and simplified multi-version management.
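As an illustration, the following minimal sketch shows how this autoscaling behavior can be configured on a KServe InferenceService. The minReplicas, maxReplicas, scaleMetric, and scaleTarget fields map to the underlying Knative autoscaler settings; the values shown are illustrative, not recommendations:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    minReplicas: 0           # 0 allows the predictor to scale to zero when idle
    maxReplicas: 5           # illustrative upper bound on replicas
    scaleMetric: concurrency # scale on in-flight requests; rps, cpu, and memory are also supported
    scaleTarget: 10          # illustrative target of 10 concurrent requests per replica
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"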
Prerequisites
Knative is deployed in your cluster. For more information, see Deploy Knative.
Step 1: Deploy an inference service
First, deploy a predictive inference service that uses a scikit-learn model trained on the Iris dataset. The dataset covers three Iris species: Iris Setosa (index 0), Iris Versicolour (index 1), and Iris Virginica (index 2). You can send inference requests to the model to predict the species of Iris samples.
The Iris dataset contains 50 samples for each species. Each sample has four features: the length and width of the sepals and the length and width of the petals.
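Each sample in a request is therefore an array of four feature values. Assuming the model follows the standard scikit-learn feature order (sepal length, sepal width, petal length, petal width, in centimeters), a single-sample request body looks like this:
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4]
  ]
}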
Run the following command to deploy an inference service named sklearn-iris:
kubectl apply -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF
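This creates the InferenceService and, behind the scenes, a Knative Service and Revision for the predictor. To see the pods that serve the model, you can use the label that upstream KServe applies to them (label name assumed from upstream KServe conventions):
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris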
Run the following command to query the status of the Service:
kubectl get inferenceservices sklearn-iris
Expected output:
NAME           URL                                                          READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s
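The READY column may stay empty for a short time while the model is downloaded and the predictor starts. In scripts, instead of polling, you can block until the Service reports the Ready condition:
kubectl wait --for=condition=Ready inferenceservice/sklearn-iris --timeout=300s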
Step 2: Access the Service
The IP address and access method of the Service vary based on the gateway that is used.
ALB
Run the following command to query the address of the ALB gateway:
kubectl get albconfig knative-internet
Expected output:
NAME               ALBID                   DNSNAME                                             PORT&PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0l*******   alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com                            2
Run the following command to write the inference request to the ./iris-input.json file:
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
Run the following command to access the Service:
INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json
Expected output:
*   Trying 120.77.XX.XX...
* TCP_NODELAY set
* Connected to alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com (120.77.XX.XX) port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris-predictor-default.default.example.com
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 76 out of 76 bytes
< HTTP/1.1 200 OK
< Date: Thu, 13 Jul 2023 01:48:44 GMT
< Content-Type: application/json
< Content-Length: 21
< Connection: keep-alive
<
* Connection #0 to host alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com left intact
{"predictions":[1,1]}
{"predictions": [1, 1]}
is returned, which indicates that both samples sent to the inference service match index is 1. This means that the Irises in both samples are Iris Versicolour.
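Once the request succeeds, you can drop the verbose flags and extract only the predictions, for example with jq (assuming jq is installed):
curl -s -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json | jq '.predictions'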
MSE
Run the following command to query the address of the MSE gateway:
kubectl -n knative-serving get ing stats-ingress
Expected output:
NAME            CLASS                  HOSTS   ADDRESS                      PORTS   AGE
stats-ingress   knative-ingressclass   *       192.168.XX.XX,47.107.XX.XX   80      15d
In the ADDRESS column, 47.107.XX.XX is the public IP address of the MSE gateway, which is used to access the inference Service. The order in which the public and private IP addresses of the MSE gateway are listed is not fixed. In some cases, the public IP address precedes the private IP address, for example, 47.107.XX.XX,192.168.XX.XX.
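Because the order is not fixed, a script can also select the public address automatically. The following sketch assumes, as in this example, that the private address is in the 192.168.0.0/16 range and filters it out:
# List all gateway addresses, drop the private 192.168.x.x one, and keep the first remaining address.
INGRESS_HOST=$(kubectl -n knative-serving get ing stats-ingress -o jsonpath='{.status.loadBalancer.ingress[*].ip}' | tr ' ' '\n' | grep -v '^192\.168\.' | head -n 1)
echo "${INGRESS_HOST}"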
Run the following command to write the inference request to the ./iris-input.json file:
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
Run the following command to access the Service:
# The order in which the public and private IP addresses of the MSE gateway are listed is not fixed. In this example, the public IP address is used to access the inference Service.
# ingress[1] selects the second listed address, which is the public IP address when the private address comes first; use ingress[0] instead if the public address comes first. Choose the index based on the actual order of the addresses.
INGRESS_HOST=$(kubectl -n knative-serving get ing stats-ingress -o jsonpath='{.status.loadBalancer.ingress[1].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json
Expected output:
*   Trying 47.107.XX.XX... # 47.107.XX.XX is the public IP address of the MSE gateway.
* TCP_NODELAY set
* Connected to 47.107.XX.XX (47.107.XX.XX) port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris-predictor-default.default.example.com
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 76 out of 76 bytes
< HTTP/1.1 200 OK
< content-length: 21
< content-type: application/json
< date: Tue, 11 Jul 2023 09:56:00 GMT
< server: istio-envoy
< req-cost-time: 5
< req-arrive-time: 1689069360639
< resp-start-time: 1689069360645
< x-envoy-upstream-service-time: 4
<
* Connection #0 to host 47.107.XX.XX left intact
{"predictions":[1,1]}
{"predictions": [1, 1]}
is returned, which indicates that both samples sent to the inference Service match index is 1. This means that the Irises in both samples are Iris Versicolour.
Kourier
Run the following command to query the address of the Kourier gateway:
kubectl -n knative-serving get svc kourier
Expected output:
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
kourier   LoadBalancer   192.168.XX.XX   121.40.XX.XX   80:31158/TCP,443:32491/TCP   49m
The EXTERNAL-IP value 121.40.XX.XX is the IP address of the inference Service, and the Service ports are HTTP 80 and HTTPS 443.
Run the following command to write the inference request to the ./iris-input.json file:
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
Run the following command to access the Service:
INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json
Expected output:
*   Trying 121.40.XX.XX...
* TCP_NODELAY set
* Connected to 121.40.XX.XX (121.40.XX.XX) port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris-predictor-default.default.example.com
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 76 out of 76 bytes
< HTTP/1.1 200 OK
< content-length: 21
< content-type: application/json
< date: Wed, 12 Jul 2023 08:23:13 GMT
< server: envoy
< x-envoy-upstream-service-time: 4
<
* Connection #0 to host 121.40.XX.XX left intact
{"predictions":[1,1]}
{"predictions": [1, 1]}
is returned, which indicates that both samples sent to the inference Service match index is 1. This means that the Irises in both samples are Iris Versicolour.
References
For recommended configurations for deploying AI inference services in Knative, see Best practices for deploying AI inference services in Knative.
ACK Knative provides application templates for Stable Diffusion. For more information, see Deploy a Stable Diffusion Service based on Knative.