Container Service for Kubernetes: Best practices for deploying AI inference services in Knative

Last Updated: Jul 02, 2024

Knative offers quick deployment, high elasticity, and cost-efficiency in scenarios where you need to frequently adjust computing resources for AI applications, such as model inference. You can deploy AI models as inference services in Knative, configure auto scaling, and flexibly allocate GPU resources to improve GPU utilization and boost the performance of AI inference.

Accelerated model deployment

To ensure that AI inference services deployed in Knative Serving can be quickly scaled on demand, avoid packaging AI models in container images. When AI models are packaged in container images, the images become extremely large, which slows down container deployment. In addition, AI model versions become bound to container image versions, which makes version management more complex.

To avoid the preceding issues, we recommend that you upload all data related to your AI model to an external storage system, such as Object Storage Service (OSS) or Apsara File Storage NAS (NAS), and create a persistent volume claim (PVC) to mount the storage system to the pods of the Knative Service. You can also use Fluid, a distributed dataset orchestration and acceleration engine, to accelerate model pulling and loading. This allows you to load a large language model within seconds. For more information, see Use JindoFS to accelerate access to OSS and Use EFC to accelerate access to NAS file systems.

Step 1: Define a Dataset

If your AI model is stored in OSS, you can create a Dataset custom resource (CR) to declare the dataset of the model in OSS and a JindoRuntime CR to run caching tasks.

apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: your_ak_id   # Replace with the AccessKey ID used to access OSS. 
  fs.oss.accessKeySecret: your_ak_skrt   # Replace with the AccessKey secret used to access OSS. 
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
  - mountPoint: "oss://{Bucket}/{path-to-model}" # Replace with the OSS bucket and OSS path where the model is stored. 
    name: xxx
    path: "{path-to-model}" # Replace with the path of the model to be loaded. 
    options:
      fs.oss.endpoint: "oss-cn-beijing.aliyuncs.com"  # Replace with the endpoint of the OSS bucket. 
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: access-key
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: access-key
            key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: emptyDir
        path: /mnt/ssd0/cache
        quota: 100Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.jindofsx.data.cache.enable: "true"
    args:
      - -okernel_cache
      - -oro
      - -oattr_timeout=7200
      - -oentry_timeout=7200
      - -ometrics_port=9089
    cleanPolicy: OnDemand

Step 2: Mount a PVC to accelerate access to OSS

After you create the Dataset CR and the JindoRuntime CR, the system automatically creates a PVC with the same name as the Dataset. Mount this PVC to the Knative Service to accelerate access to OSS.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sd-backend
spec:
  template:
    spec:
      containers:
        - image: <YOUR-IMAGE>   # Replace <YOUR-IMAGE> with the name of the image. 
          name: image-name
          ports:
            - containerPort: xxx
              protocol: TCP
          volumeMounts:
            - mountPath: /data/models  # The path in the container from which the model is loaded. Replace with your own mount path. 
              name: data-volume
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: oss-data   # Mount a PVC with the same name as the Dataset.

Step 3: Deploy the model within seconds by using image caches

In addition to the size of the AI model, you must also consider the impact of the container image size on the Knative Service. Container images used to run AI models are usually packaged with dependencies such as CUDA and pytorch-gpu, which significantly increase the image size. When you deploy Knative AI inference services in pods on elastic container instances, we recommend that you use image caches to accelerate image pulling. The caches are created in advance so that the system can pull images within seconds when pods are deployed on elastic container instances. For more information, see Use image caches to accelerate the creation of pods.

apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: imagecache-ai-model
  annotations:
    k8s.aliyun.com/eci-image-cache: "true" # Enable the reuse of image caches. 
spec:
  images:
  - <YOUR-IMAGE>
  imageCacheSize: 25   # The size of the image cache. Unit: GiB. 
  retentionDays: 7   # The retention period of the image cache. Unit: days.

Graceful pod shutdown

To ensure that ongoing requests are not interrupted, pods must be shut down in a specific way after receiving the SIGTERM signal.

  • If an application uses HTTP probing, the pods enter the Not Ready state after they receive the SIGTERM signal so that new requests are no longer forwarded to them.

  • During the pod shutdown process, Knative queue-proxy may still forward requests to the pods. We recommend that you set the timeoutSeconds parameter to 1.2 times the maximum timeout period to ensure that all requests can be processed before the pods are shut down. For example, if the maximum timeout period of requests is 10 seconds, set the timeoutSeconds parameter to 12 seconds.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
      namespace: default
    spec:
      template:
        spec:
          timeoutSeconds: 12

Configure concurrency parameters

You can configure concurrency parameters to greatly improve the performance and response speed of Knative Services. Proper concurrency settings enable your application to handle large numbers of concurrent requests more efficiently, which further improves the service quality and user experience.

  • Hard concurrency limit: The hard concurrency limit is an enforced upper limit. When the concurrency level reaches the hard limit, excess requests are buffered and remain pending until sufficient resources become available.

  • Soft concurrency limit: The soft concurrency limit is not a strict constraint. When the number of requests spikes, the concurrency level may exceed the soft limit.

  • Utilization target: The autoscaling.knative.dev/target-utilization-percentage annotation specifies the percentage of the concurrency target that the autoscaler actually aims for, instead of specifying the concurrency level directly. A value close to 100% allows each pod to be utilized more fully before new pods are created.

For more information, see Configure the soft concurrency limit.
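
The following Service manifest is a minimal sketch that shows where each of these settings is configured. The image name and the concurrency values are placeholders, not recommended values.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "10"                         # Soft concurrency limit per pod.
        autoscaling.knative.dev/target-utilization-percentage: "80"  # Scale up when pods reach 80% of the target.
    spec:
      containerConcurrency: 10    # Hard concurrency limit per pod.
      containers:
      - image: <YOUR-IMAGE>       # Replace <YOUR-IMAGE> with the name of the image.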

Auto scaling

Before you configure auto scaling, set objectives, such as reducing latency, minimizing costs, or handling traffic spikes. Then, estimate the amount of time required to launch pods in the desired scenarios. For example, launching a single pod and launching multiple pods concurrently take different amounts of time. These metrics serve as the basis for configuring auto scaling.

For more information about the parameters, see Enable auto scaling to withstand traffic fluctuations.

Scaling modes

Auto scaling works in two modes: stable mode and panic mode. Each mode evaluates metrics over a time-based window. The stable mode is suitable for regular operations. By default, the panic window is shorter than the stable window, which makes the panic mode suitable for quickly creating pods to handle traffic spikes.

Stable window

To ensure that pods can be scaled as expected, the stable window must be longer than the average pod launch time. We recommend that you set the stable window to twice the average pod launch time.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "40s"

Panic window

The panic window is defined as a percentage of the stable window. The panic window usually handles unexpected traffic spikes.

Important

An inappropriate panic window may result in an excessive number of pods when the pod launch time is long. Proceed with caution.

  • If the stable window is 30 seconds and the panic window is 10%, the system decides whether to enter the panic mode based on statistics collected within 3 seconds. If pods require 30 seconds to launch, the system may continue to create pods while the newly created pods are still launching. Consequently, excess pods are created.

  • Modify other parameters and confirm that scaling activities are performed as expected before you modify the panic window, especially if the new panic window would be equal to or shorter than the pod launch time. If the stable window is twice the average pod launch time, a panic window of 50% marks the boundary between the stable and panic modes.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/panic-window-percentage: "20.0"

The panic mode threshold defines the ratio of inbound traffic to service capacity at which the panic mode starts. Keep the initial value of 200% or higher until your application has run for a period of time. You can then adjust the threshold based on the observed traffic peaks and the latency that you can tolerate.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/panic-threshold-percentage: "150.0"

Scale rates

You can adjust the scale up rate and scale down rate to better control pod scaling. The default scale rates can meet requirements in most scenarios. Before you modify the scale rates, observe the current status, modify and test the new scale rates in a staging environment, and assess the impact. If a scaling issue occurs, check the relevant parameters, such as the stable window and panic window.

Scale up rate

Do not modify the scale up rate unless you need to resolve certain issues. For example, you can modify the scale up rate if the pod launch time is long because the newly created pods need to wait for resources. If scaling activities are frequently triggered, you may also need to modify other parameters, such as the stable window or panic window.

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  max-scale-up-rate: "500.0"

Scale down rate

When you set the scale down rate, take the average pod launch time into account. The default scale down rate meets requirements in most scenarios. However, if you want to reduce costs, you can increase the scale down rate so that pods are deleted more quickly when the traffic volume drops.

The scale down rate is a ratio. For example, a value of 2.0 (N/2) means that the number of pods can be at most halved when the current scaling cycle ends. For Services that run only a small number of pods, you can reduce the scale down rate to maintain a certain number of pods. This improves system stability and avoids frequent scaling.

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  max-scale-down-rate: "4.0"

Scale down delay

You can set a scale down delay so that pods are not deleted immediately after traffic drops. This prevents the autoscaler from frequently recreating pods when new requests arrive and reduces the response latency caused by cold starts.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-down-delay: "2m"   # Scale-down activities are performed with a delay of two minutes.

Scale bounds

Scale bounds are related to the business scale and resource availability.

Lower bound

autoscaling.knative.dev/min-scale controls the minimum number of replicas for each revision. Knative will keep the number of replicas equal to or greater than the lower bound. For Services that are infrequently accessed but must remain available constantly, set the lower bound to 1 to ensure at least one replica. For Services that may experience traffic spikes, set the lower bound to a value greater than 1. If you want the replicas to scale to zero when no request is received, set the lower bound to 0.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"

Upper bound

Set autoscaling.knative.dev/max-scale to a value that does not exceed the resources available to you. The upper bound caps the maximum resource cost.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "3"

Initial scale

Set the initial scale to 1.2 times the current number of pods so that the new revision has sufficient capacity when traffic is shifted to it.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/initial-scale: "3"

GPU sharing

With the GPU sharing feature of Container Service for Kubernetes (ACK), Knative Services can use the GPU memory isolation capability of cGPU to improve GPU resource utilization.

To share GPUs in Knative, you need only to set the aliyun.com/gpu-mem resource limit for the Knative Service. Example:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 1
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/demo-test/test:helloworld-go
        name: user-container
        ports:
        - containerPort: 6666
          name: http1
          protocol: TCP
        resources:
          limits:
            aliyun.com/gpu-mem: "3"

Probes

You can configure liveness probes and readiness probes to monitor the health status and availability of Knative Services. Compared with the Kubernetes probing policy, Knative probes Services more frequently to minimize the cold start time of pods.

  • Liveness probes: used to monitor the health status of containers. If a container is in the Failed state or the service in the container fails to launch, the liveness probe restarts the container.

  • Readiness probes: used to efficiently manage the auto scaling of applications to ensure that only pods in the Ready state can receive traffic. This improves the stability and response speeds of the service.

    • If you use a TCP probe, make sure that the application does not open the listening port until all components are loaded and the container is ready.

    • If you use an HTTP probe, make sure that the probed endpoint reports the Service as ready only after the endpoint can handle requests.

    • Set periodSeconds (the probing interval) to a small value. In most cases, specify a probing interval shorter than the default value of 10 seconds, as shown in the sketch after this list.
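
The following Service manifest is a minimal sketch of a readiness probe that follows these recommendations. The /healthz path is a hypothetical health check endpoint that your application is assumed to expose.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: <YOUR-IMAGE>    # Replace <YOUR-IMAGE> with the name of the image.
        readinessProbe:
          httpGet:
            path: /healthz     # A hypothetical endpoint that reports readiness only after the model is loaded.
          periodSeconds: 1     # Probe more frequently than the default interval of 10 seconds.
          failureThreshold: 3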

For more information about port probing and best practices, see Configure port probing in Knative.

Scale to zero

Knative can scale the pods of a Service to zero when the Service is idle.

If you need to manage multiple models, we recommend that you create a Knative Service for each model to simplify the business logic in each Service.

  • The scale to zero feature of Knative can prevent idle models from incurring resource fees.

  • When a client accesses a model for the first time, Knative starts to scale up from zero.

For more information about scale to zero and best practices, see Reserved instances.

Canary releases

Knative can ensure business continuity during new version releases. When you update the code or modify a Service, Knative continuously runs the old version until the new version is ready.

You can specify the percentage of traffic that is distributed to the new version. The traffic percentage is a key parameter because it determines how quickly Knative shifts traffic from the old version to the new version. You can set the traffic percentage to a value from 10% to 100%. When you no longer need the old version, set the percentage to 100%. The old version is then phased out as soon as the new version is ready.
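
The following Service manifest is a minimal sketch of traffic splitting between an old revision and a new revision. The revision names, the image, and the 80/20 split are placeholders.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      name: helloworld-go-v2        # A hypothetical name for the new revision.
    spec:
      containers:
      - image: <YOUR-NEW-IMAGE>     # Replace <YOUR-NEW-IMAGE> with the image of the new version.
  traffic:
  - revisionName: helloworld-go-v1  # A hypothetical name of the old revision.
    percent: 80                     # Keep 80% of the traffic on the old version.
  - latestRevision: true
    percent: 20                     # Shift 20% of the traffic to the new version.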

Limit the scale-down speed of the old version when you release a new version with only a limited number of additional pods available. For example, if a Service runs 500 pods but only 300 additional pods are available, setting the traffic percentage to a large value may make some Service features unavailable.

For more information about canary releases and best practices, see Perform a canary release based on traffic splitting for a Knative Service.

Colocate ECS instances and elastic container instances

You can configure ResourcePolicies to use Elastic Compute Service (ECS) instances during off-peak hours and use elastic container instances during peak hours.

  • During the scale-up process, Knative pods are preferably scheduled to ECS instances. When the number of pods exceeds the predefined upper limit or ECS instances are out of stock, new pods are scheduled to elastic container instances.

  • During the scale-down process, pods on elastic container instances are preferably deleted.
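
The following ResourcePolicy is a minimal sketch, assuming that the priority-based resource scheduling feature of ACK is available in your cluster and that the pods of the Knative Service carry the serving.knative.dev/service: helloworld-go label. The policy name and the Service name are placeholders.

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: knative-colocation          # A hypothetical name.
  namespace: default
spec:
  selector:
    serving.knative.dev/service: helloworld-go   # Match the pods of the Knative Service.
  strategy: prefer
  units:                            # Units are used in order during scale-up and in reverse order during scale-down.
  - resource: ecs                   # Schedule pods to ECS instances first.
  - resource: eci                   # Schedule the remaining pods to elastic container instances.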

For more information about colocation and best practices, see .