Disable the memory isolation feature of cGPU - Container Service for Kubernetes

This topic uses an example to show how to disable the memory isolation feature of cGPU in a Container Service for Kubernetes (ACK) cluster.

Scenarios

This topic applies to ACK dedicated clusters and ACK Pro clusters in which the memory isolation feature of cGPU is enabled.

Prerequisites

The ack-cgpu component is installed in the cluster. For more information, see "Install ack-cgpu" or "Install and use ack-ai-installer and the GPU inspection tool".

Procedure

Run the following command to query the status of GPU sharing in the cluster:

kubectl inspect cgpu

Expected output:

NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  0/15                   0/15
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/45 (0%)

Note

To query detailed information about GPU sharing, run the kubectl inspect cgpu -d command.

Use the following YAML template to create a Job whose container uses GPU sharing with memory isolation disabled:

apiVersion: batch/v1
kind: Job
metadata:
  name: disable-cgpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: disable-cgpu
    spec:
      containers:
      - name: disable-cgpu
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        env:
        - name: CGPU_DISABLE # Disable the memory isolation feature of cGPU. 
          value: "true"
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            # The pod requests 3 GiB of GPU memory in total. 
            aliyun.com/gpu-mem: 3
        workingDir: /root
      restartPolicy: Never

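Note

aliyun.com/gpu-mem: specifies the amount of GPU memory, in GiB, that the job requests.
CGPU_DISABLE: set this environment variable to true to disable the memory isolation feature of cGPU.

Run the following command to query the result of GPU scheduling performed by cGPU: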

kubectl inspect cgpu

Expected output:

NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  3/15                   3/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/45 (6%)

The newly created container is allocated 3 GiB of GPU memory from the cn-beijing.192.16x.x.xx3 node.
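To check the scheduling result programmatically rather than by eye, the table can be parsed. The following is a minimal Python sketch, not part of the ACK tooling. It assumes the single-GPU column layout shown in the expected output; nodes with several GPUs would print more columns and need a different parser. The names SAMPLE and parse_node_allocations are hypothetical and used only for illustration.

```python
import re

# Sample text copied from the expected output of `kubectl inspect cgpu`.
# In practice the text would be captured from the command itself.
SAMPLE = """\
NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  3/15                   3/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/45 (6%)
"""


def parse_node_allocations(text):
    """Return {node_name: (allocated_gib, total_gib)} for each node row."""
    allocations = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: a node row has exactly four whitespace-separated
        # columns and ends with an "allocated/total" pair.
        if len(fields) == 4 and re.fullmatch(r"\d+/\d+", fields[-1]):
            allocated, total = map(int, fields[-1].split("/"))
            allocations[fields[0]] = (allocated, total)
    return allocations


allocations = parse_node_allocations(SAMPLE)
# Keep only the nodes on which some GPU memory has been allocated.
in_use = {node: mem for node, mem in allocations.items() if mem[0] > 0}
print(in_use)  # {'cn-beijing.192.16x.x.xx3': (3, 15)}
```

The header, separator, and cluster summary lines are skipped because they do not match the four-column pattern.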

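Verify the result

You can use either of the following methods to check whether the memory isolation feature of cGPU is disabled.

Method 1: Run the following command to query the application log: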
```
kubectl logs disable-cgpu-xxxx --tail=1
```
Expected output:
```
2020-08-25 08:14:54.927965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15024 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
```
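The returned log entry shows that 15,024 MiB of GPU memory is available to the containerized application, which indicates that the memory isolation feature of cGPU is disabled. If memory isolation were enabled, the containerized application would detect only 3 GiB of GPU memory.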
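The comparison described above can be expressed as a small check. The following Python sketch is illustrative only. It assumes the log line has the same wording as the sample entry, and it treats "more visible memory than requested" as a heuristic sign that isolation is disabled. The names LOG_LINE, REQUESTED_GIB, visible_memory_mib, and isolation_appears_disabled are hypothetical.

```python
import re

# Log entry copied from the expected output of `kubectl logs`.
LOG_LINE = (
    "2020-08-25 08:14:54.927965: I "
    "tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
    "with 15024 MB memory) -> physical GPU (device: 0, "
    "name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, "
    "compute capability: 7.0)"
)

# Value of aliyun.com/gpu-mem in the Job manifest.
REQUESTED_GIB = 3


def visible_memory_mib(log_line):
    """Return the memory size reported in the log line, or None if absent."""
    match = re.search(r"with (\d+) MB memory", log_line)
    return int(match.group(1)) if match else None


def isolation_appears_disabled(visible_mib, requested_gib):
    """Heuristic: the application sees more memory than the pod requested."""
    return visible_mib > requested_gib * 1024


visible = visible_memory_mib(LOG_LINE)
print(visible, isolation_appears_disabled(visible, REQUESTED_GIB))  # 15024 True
```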

Method 2: Run the following command to execute nvidia-smi in the container and view the amount of GPU memory allocated to the container:

kubectl exec disable-cgpu-xxxx -- nvidia-smi

Expected output:

Tue Aug 25 08:23:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

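The output shows that the GPU memory capacity of the host is 16,130 MiB and that 15,453 MiB is allocated to the container, which indicates that the memory isolation feature of cGPU is disabled. If memory isolation were enabled, only 3 GiB of GPU memory would be allocated to the container.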
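The same reading of the nvidia-smi table can be sketched in Python. This example is illustrative only. It assumes the memory row has the "usedMiB / totalMiB" layout shown in the expected output, and it assumes that with isolation enabled the reported total would be close to the 3 GiB request, as described above. The names MEMORY_ROW, REQUESTED_GIB, and parse_memory_usage are hypothetical.

```python
import re

# Memory row copied from the expected output of nvidia-smi.
MEMORY_ROW = (
    "| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |"
    "      1%      Default |"
)

# Value of aliyun.com/gpu-mem in the Job manifest.
REQUESTED_GIB = 3


def parse_memory_usage(row):
    """Return (used_mib, total_mib) from an nvidia-smi row, or None."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


used_mib, total_mib = parse_memory_usage(MEMORY_ROW)
# Assumption: with isolation disabled, the container sees the capacity of
# the whole GPU, which is far larger than the 3 GiB requested by the pod.
sees_whole_gpu = total_mib > REQUESTED_GIB * 1024
print(used_mib, total_mib, sees_whole_gpu)  # 15453 16130 True
```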