Container Service for Kubernetes: Configure GPU resources for a Knative Service and enable GPU sharing

Last Updated: Nov 05, 2024

When you deploy AI tasks, high-performance computing, or other workloads that require GPU resources in Knative, you can specify GPU-accelerated instance types in the Knative Service to create GPU-accelerated instances. You can also enable the GPU sharing feature to allow multiple pods to share a single GPU, which maximizes GPU utilization.

Prerequisites

Knative has been deployed in your cluster. For more information, see Deploy Knative.

Configure GPU resources

To specify a GPU-accelerated ECS instance type, add the k8s.aliyun.com/eci-use-specs annotation to the spec.template.metadata.annotations section of the Knative Service. To specify the amount of GPU resources that the Service requires, add the nvidia.com/gpu field to the spec.template.spec.containers.resources.limits section.

The following code block is an example:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      labels:
        app: helloworld-go
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn5i-c4g1.xlarge  # Specify a GPU-accelerated ECS instance type that is supported by Knative. 
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
          ports:
          - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: '1'    # Specify the number of GPUs that are required by the container. This field is required. If you do not specify this field, an error is returned when the pod is launched. 
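
After you create the Service, you can check whether the pod runs on a GPU-accelerated instance. The following commands are a minimal sketch, assuming that the preceding manifest is saved as helloworld-go.yaml and that the container image includes the nvidia-smi utility:

# Deploy the Knative Service.
kubectl apply -f helloworld-go.yaml

# List the pods that belong to the Service.
kubectl get pods -l serving.knative.dev/service=helloworld-go

# Confirm that the GPU is visible inside the container.
# Replace <pod-name> with the name of a running pod from the previous command.
kubectl exec <pod-name> -c user-container -- nvidia-smi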

The following GPU-accelerated ECS instance families are supported:

  • gn7i, a GPU-accelerated compute-optimized instance family that uses NVIDIA A10 GPUs. This instance family includes a variety of instance types, such as ecs.gn7i-c8g1.2xlarge.

  • gn7, a GPU-accelerated compute-optimized instance family that uses NVIDIA A100 GPUs. This instance family includes a variety of instance types, such as ecs.gn7-c12g1.3xlarge.

  • gn6v, a GPU-accelerated compute-optimized instance family that uses NVIDIA V100 GPUs. This instance family includes a variety of instance types, such as ecs.gn6v-c8g1.2xlarge.

  • gn6e, a GPU-accelerated compute-optimized instance family that uses NVIDIA V100 GPUs. This instance family includes a variety of instance types, such as ecs.gn6e-c12g1.3xlarge.

  • gn6i, a GPU-accelerated compute-optimized instance family that uses NVIDIA T4 GPUs. This instance family includes a variety of instance types, such as ecs.gn6i-c4g1.xlarge.

  • gn5i, a GPU-accelerated compute-optimized instance family that uses NVIDIA P4 GPUs. This instance family includes a variety of instance types, such as ecs.gn5i-c2g1.large.

  • gn5, a GPU-accelerated compute-optimized instance family that uses NVIDIA P100 GPUs. This instance family includes a variety of instance types, such as ecs.gn5-c4g1.xlarge.

    Note: The gn5 instance family is equipped with local disks that you can mount to elastic container instances. For more information, see Create an elastic container instance that has local disks attached.
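
If a single instance type may be out of stock, you can specify multiple candidate instance types in the annotation, separated by commas; the system attempts to create the instance based on the order in which the types are listed. A minimal sketch that shows only the annotation (the instance types are illustrative examples, not recommendations):

k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge,ecs.gn5i-c4g1.xlarge"  # Candidate GPU-accelerated instance types, tried in order.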

Enable GPU sharing

  1. Enable the GPU sharing feature for nodes. For more information, see examples.

  2. Configure the aliyun.com/gpu-mem field in the spec.template.spec.containers.resources.limits section of the Knative Service to specify the amount of GPU memory that the container requires. Example:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/maxScale: "100"
            autoscaling.knative.dev/minScale: "0"
        spec:
          containerConcurrency: 1
          containers:
          - image: registry-vpc.cn-hangzhou.aliyuncs.com/hz-suoxing-test/test:helloworld-go
            name: user-container
            ports:
            - containerPort: 6666
              name: http1
              protocol: TCP
            resources:
              limits:
                aliyun.com/gpu-mem: "3" # Specify the GPU memory size. Unit: GiB.
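
After the Service is deployed, you can check whether the memory limit takes effect. The following commands are a minimal sketch, assuming that the container image includes the nvidia-smi utility; when GPU memory isolation is enabled for GPU sharing, nvidia-smi inside the container reports only the allocated memory (3 GiB in this example):

# List the pods that belong to the Service, then replace <pod-name> with the name of a running pod.
kubectl get pods -l serving.knative.dev/service=helloworld-go
kubectl exec <pod-name> -c user-container -- nvidia-smi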

References

  • You can deploy AI models as inference services in Knative, configure auto scaling, and flexibly allocate GPU resources to improve GPU utilization and boost AI inference performance. For more information, see Best practices for deploying AI inference services in Knative.

  • For answers to frequently asked questions about GPUs, see FAQ.