GPU-accelerated ECS instance types - Elastic Container Instance

This topic describes how to specify GPU-accelerated Elastic Compute Service (ECS) instance types to create an Elastic Container Instance (ECI) pod.

Supported instance type families

GPU-accelerated ECS instance types provide GPUs and are suitable for scenarios such as deep learning and image processing. Docker images that use GPUs can run directly on a GPU-accelerated elastic container instance because an NVIDIA GPU driver is pre-installed on the instance. The supported driver and CUDA versions vary by instance family, as listed in the following table.

Note

The gn8ia and gn8is instance families in the following table are available only in specific regions outside the Chinese mainland. To use the instance families, contact Alibaba Cloud sales personnel.

Category	GPU-accelerated instance families	Driver and CUDA versions
vGPU-accelerated instance families	sgn7i-vws, vgn7i-vws, vgn6i-vws	NVIDIA 470.161.03 and CUDA 11.4
GPU-accelerated compute-optimized instance families	gn7e, gn7i, gn7s, gn7, gn6v, gn6e, gn6i, gn5i, gn5	NVIDIA 470.82.01 and CUDA 11.4 (default); NVIDIA 525.85.12 and CUDA 12.0; NVIDIA 535.161.08 and CUDA 12.2
GPU-accelerated compute-optimized instance families	gn8ia, gn8is	NVIDIA 535.161.08 and CUDA 12.2

For more information about these instance families, see the instance family topics in the ECS documentation.

Configurations

To specify GPU-accelerated ECS instance types, add the k8s.aliyun.com/eci-use-specs annotation to the metadata of the pod. You must also add the nvidia.com/gpu field to the resources.limits section of the container to specify the number of GPUs allocated to that container.

Important

You must specify the nvidia.com/gpu field when you create a GPU-accelerated pod. If the field is not specified, an error is returned when the pod starts.
By default, multiple containers in the same elastic container instance share its GPUs. Make sure that the number of GPUs allocated to any single container does not exceed the number of GPUs provided by the specified GPU-accelerated ECS instance type.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  labels:
    app: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx-test
      labels:
        app: nginx
        alibabacloud.com/eci: "true" # Schedule the pod to ECI.
      annotations:
        k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge,ecs.gn6i-c8g1.2xlarge" # Up to five comma-separated GPU-accelerated ECS instance types.
    spec:
      containers:
      - name: nginx
        image: registry.cn-shanghai.aliyuncs.com/eci_open/nginx:1.14.2
        resources:
          limits:
            nvidia.com/gpu: "1" # Number of GPUs for the nginx container. The containers share the GPUs.
        ports:
        - containerPort: 80
      - name: busybox
        image: registry.cn-shanghai.aliyuncs.com/eci_open/busybox:1.30
        command: ["sleep"]
        args: ["999999"]
        resources:
          limits:
            nvidia.com/gpu: "1" # Number of GPUs for the busybox container. The containers share the GPUs.
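The same annotation and nvidia.com/gpu limit can also be set on a standalone Pod instead of a Deployment's pod template. The following is a minimal sketch that assumes a single container needing one GPU; the Pod name gpu-pod-test is hypothetical, and the instance type and image are taken from the Deployment example above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-test                  # Hypothetical name
  labels:
    alibabacloud.com/eci: "true"      # Same label as in the Deployment example
  annotations:
    k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"
spec:
  containers:
  - name: nginx
    image: registry.cn-shanghai.aliyuncs.com/eci_open/nginx:1.14.2
    resources:
      limits:
        nvidia.com/gpu: "1"           # Required; must not exceed the GPUs of the instance type
    ports:
    - containerPort: 80
```

On a Pod, the annotations and labels sit directly under metadata rather than under spec.template.metadata.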

By default, a GPU-accelerated elastic container instance installs the default driver and CUDA versions for the specified GPU-accelerated ECS instance type. If different elastic container instances need different driver and CUDA versions, add the k8s.aliyun.com/eci-gpu-driver-version annotation to specify the driver version. Specify a version that is listed for the instance family in the preceding table. For example, if you specify ecs.gn6i-c4g1.xlarge, the instance uses NVIDIA 470.82.01 and CUDA 11.4 by default. After you add the k8s.aliyun.com/eci-gpu-driver-version: tesla=525.85.12 annotation, the instance uses NVIDIA 525.85.12 and CUDA 12.0 instead. The following YAML shows an example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  labels:
    app: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx-test
      labels:
        app: nginx
        alibabacloud.com/eci: "true" # Schedule the pod to ECI.
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn6i-c4g1.xlarge # The instance type must support changing the driver version.
        k8s.aliyun.com/eci-gpu-driver-version: tesla=525.85.12 # Use NVIDIA 525.85.12 (CUDA 12.0) instead of the default 470.82.01 (CUDA 11.4).
    spec:
      containers:
      - name: nginx
        image: registry.cn-shanghai.aliyuncs.com/eci_open/nginx:1.14.2
        resources:
          limits:
            nvidia.com/gpu: "1" # Number of GPUs for the container.
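The annotation can also select the other version listed in the table for the compute-optimized families. The following sketch pins NVIDIA 535.161.08 (CUDA 12.2) on a standalone Pod and runs nvidia-smi once so that the installed versions can be checked. The Pod name gpu-driver-check and the public nvidia/cuda:12.2.0-base-ubuntu22.04 image are assumptions: the instance must be able to pull from Docker Hub, nvidia-smi must be available inside the container, and the image's CUDA version should not exceed the CUDA version of the selected driver.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-driver-check              # Hypothetical name
  labels:
    alibabacloud.com/eci: "true"
  annotations:
    k8s.aliyun.com/eci-use-specs: ecs.gn6i-c4g1.xlarge
    # Pin the driver instead of the default 470.82.01 (CUDA 11.4).
    k8s.aliyun.com/eci-gpu-driver-version: tesla=535.161.08
spec:
  restartPolicy: Never                # Run nvidia-smi once and stop
  containers:
  - name: cuda-check
    # Assumed public image; CUDA 12.2 matches the pinned driver.
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```

After the Pod completes, run kubectl logs gpu-driver-check. The header of the nvidia-smi output reports the driver version and the CUDA version that the driver supports.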