Usage notes for FastGPU SDK for Python - Elastic GPU Service

You can use FastGPU SDK for Python to integrate FastGPU into your AI training or inference scripts, which lets you deploy and manage cloud resources efficiently. This topic describes how to use FastGPU SDK for Python.

Prerequisites

Python 3.6 or later is installed on the client.
Note
You can use Alibaba Cloud Cloud Shell, an Elastic Compute Service (ECS) instance, or an on-premises machine as the client to install FastGPU and build AI computing tasks.

An Alibaba Cloud AccessKey pair is obtained. For more information, see "Create an AccessKey pair".

Prepare the environment

Run the following command to install the FastGPU software package:

pip3 install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/fastgpu/fastgpu-1.1.5-py3-none-any.whl

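Run the following commands to configure the environment variables.
Before you configure the environment variables, obtain the required information, such as the AccessKey pair of your Alibaba Cloud account, the default region, and the default zone, from Cloud Shell, the ECS instance, or the on-premises machine. For more information, see "Regions and zones".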
```
export ALIYUN_ACCESS_KEY_ID=****          # The AccessKey ID.
export ALIYUN_ACCESS_KEY_SECRET=****      # The AccessKey secret.
export ALIYUN_DEFAULT_REGION=cn-hangzhou  # The ID of the region that you want to use.
export ALIYUN_DEFAULT_ZONE=cn-hangzhou-i  # (Optional) The ID of the zone that you want to use.
```
Add the following statement to import the FastGPU module into your Python code:
```
import fastgpu
```
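
Before the module is imported, it can help to confirm that the prerequisites and environment variables described above are in place. The following is a minimal sketch and not part of FastGPU: the helper name `preflight_problems` is hypothetical, and `ALIYUN_DEFAULT_ZONE` is skipped because the export example marks it as optional.

```python
import os
import sys

# Variable names are taken from the export commands in this topic.
REQUIRED_VARS = (
    "ALIYUN_ACCESS_KEY_ID",
    "ALIYUN_ACCESS_KEY_SECRET",
    "ALIYUN_DEFAULT_REGION",
)


def preflight_problems(env=None, version=None):
    """Return a list of problems; an empty list means the client looks ready.

    Hypothetical helper for illustration; FastGPU does not provide it.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 6):
        problems.append("Python 3.6 or later is required")
    for name in REQUIRED_VARS:
        # An unset or empty variable is reported as missing.
        if not env.get(name):
            problems.append("environment variable {} is not set".format(name))
    return problems
```

Calling `preflight_problems()` with no arguments inspects the current interpreter and `os.environ`; the parameters exist only so that the check can be exercised without changing the real environment.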

Create or access instances

The fastgpu.make_job method automatically creates an instance cluster based on the specified rules. If the instance cluster already exists, the existing instance cluster is returned.

job = fastgpu.make_job(
    name: str="",             # (Required) The name of the instance cluster. 
    instance_type: str="",    # (Required) The instance type. 
    num_tasks: int=0,         # The number of instances that you want to create.
    install_script: str="",   # The initialization command.
    image_name: str="",       # The image name.
    image_type: str="",       # The image type.
    disk_size: int=500,       # The size of the data disk.
    spot: bool=False,         # Specifies whether to create preemptible instances.
    confirm_cost: bool=False, # Specifies whether to skip the consumption confirmation step.
    install_cuda: bool=False, # Specifies whether to automatically install a GPU driver.
    mount_nas: bool=False     # Specifies whether to automatically mount an Apsara File Storage NAS (NAS) file system.
)

The following table describes the parameters in the code.

Parameter	Required	Description	Sample setting
name	Yes	The name of the instance cluster. This parameter is empty by default, which indicates that instances are obtained from existing resources.	To use fastgpu_test as the name: `name="fastgpu_test"`
instance_type	Yes	The instance type. You can run the `fastgpu querygpu` command to query GPU-accelerated instance types. For more information, see "GPU-accelerated instance families".	To use an instance type that is configured with one V100 GPU: `instance_type="ecs.gn6v-c8g1.2xlarge"`
num_tasks	No	The number of instances to create. Default value: 1.	To create one instance: `num_tasks=1`
install_script	No	The initialization script of the instance. This parameter is empty by default, which indicates that no command is run.	To start the SSH service after the instance is initialized: `install_script="systemctl start sshd"`
image_name	No	The image name of the instance. This parameter is empty by default, which indicates that Alibaba Cloud Linux 2.1903 is used as the default image. You can run the `fastgpu queryimage` command to query image names.	To use a CentOS image: `image_name="centos_8_5_x64_20G_alibase_202111129.vhd"`
image_type	No	The image type of the instance. You can set this parameter to an operating system distribution, such as `"aliyun"`, `"ubuntu"`, or `"centos"`, or to an operating system version, such as `"ubuntu_18_04"` or `"centos_7_9"`.	To use an Ubuntu 16.04 image: `image_type="ubuntu_16_04"`
disk_size	No	The size of the data disk. Default value: 500. Unit: GB.	To use a 500 GB data disk: `disk_size=500`
spot	No	Specifies whether to create preemptible instances. Default value: False.	To create preemptible instances: `spot=True`
confirm_cost	No	Specifies whether to skip the cost confirmation step. Default value: False, which indicates that the step is not skipped. In this case, you are prompted to confirm the operation when instances are created.	To skip the cost confirmation step: `confirm_cost=True`
install_cuda	No	Specifies whether to automatically install a GPU driver. Default value: False, which indicates that no GPU driver is automatically installed.	To automatically install a GPU driver: `install_cuda=True`
mount_nas	No	Specifies whether to automatically mount an Apsara File Storage NAS (NAS) file system. For more information, see "What is NAS?".	To automatically mount a NAS file system: `mount_nas=True`
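
The parameters in the table can be collected into a dictionary and checked before any instance is created. The sketch below is an illustration only: `build_job_kwargs` and `DOCUMENTED_PARAMS` are hypothetical names that are not part of FastGPU, and the checks simply mirror the Required column and the parameter names listed above.

```python
# Parameter names copied from the table above; this set is maintained by hand.
DOCUMENTED_PARAMS = {
    "name", "instance_type", "num_tasks", "install_script", "image_name",
    "image_type", "disk_size", "spot", "confirm_cost", "install_cuda",
    "mount_nas",
}


def build_job_kwargs(name, instance_type, **options):
    """Validate arguments and return a dict for fastgpu.make_job(**kwargs).

    Hypothetical helper; it does not call FastGPU or create any resources.
    """
    if not name:
        raise ValueError("name is required")
    if not instance_type:
        raise ValueError("instance_type is required")
    # Catch misspelled parameter names before they reach make_job.
    unknown = sorted(set(options) - DOCUMENTED_PARAMS)
    if unknown:
        raise ValueError("undocumented parameters: " + ", ".join(unknown))
    kwargs = {"name": name, "instance_type": instance_type}
    kwargs.update(options)
    return kwargs


kwargs = build_job_kwargs(
    "fastgpu_test",
    "ecs.gn6v-c8g1.2xlarge",
    num_tasks=1,
    image_type="ubuntu_18_04",
    spot=True,
)
# With FastGPU installed and credentials configured, the cluster would then
# be created with: job = fastgpu.make_job(**kwargs)
```

Keeping the validation separate from the call means that a typo such as `spots=True` is reported locally instead of after a request has been sent.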

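Return value: a Job object that represents the instance cluster. To access a specific instance in the cluster, access the corresponding task. A Job object can contain multiple tasks. The following figure shows the relationship between a Job object and its tasks.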

[Figure: relationship between a Job object and its tasks]

job = fastgpu.make_job(...) # Create a Job object.
job.run("ls -l")            # Run the ls -l command for an instance cluster.
job.tasks[0].run("ls -l")   # Run the ls -l command for an instance, such as the instance that corresponds to Task0.
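
The relationship between a Job object and its tasks can also be pictured with a toy model. The classes below are stand-ins written for illustration: `ToyJob` and `ToyTask` are hypothetical and do not reflect how FastGPU is implemented. They only mimic the documented behavior that `job.run` applies a command to every instance, while `job.tasks[i].run` targets a single instance.

```python
class ToyTask:
    """Stand-in for one instance; records commands instead of running them."""

    def __init__(self, index):
        self.index = index
        self.history = []

    def run(self, cmd):
        self.history.append(cmd)


class ToyJob:
    """Stand-in for an instance cluster that consists of several tasks."""

    def __init__(self, num_tasks):
        self.tasks = [ToyTask(i) for i in range(num_tasks)]

    def run(self, cmd):
        # A cluster-level command is applied to every task.
        for task in self.tasks:
            task.run(cmd)


toy_job = ToyJob(num_tasks=2)
toy_job.run("ls -l")              # Reaches both tasks.
toy_job.tasks[0].run("hostname")  # Reaches only the first task.
print([task.history for task in toy_job.tasks])
# Prints [['ls -l', 'hostname'], ['ls -l']]
```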

Sample code: The following sample code creates a Job object named fastgpu_test that contains two tasks. Each task corresponds to an instance. You can access the created instances through the tasks of the Job object.

job = fastgpu.make_job(
    name="fastgpu_test",                   # The name of the instance cluster.
    num_tasks=2,                           # The number of instances. In this example, two instances are created.
    instance_type="ecs.gn6v-c8g1.2xlarge", # The instance type.
    image_type="ubuntu_18_04",             # The image type of the instances. In this example, an Ubuntu 18.04 image is used.
    disk_size=500,                         # The size of the data disk. Unit: GB. In this example, the data disk whose size is 500 GB is used.
    confirm_cost=True,                     # Specifies whether to skip the consumption confirmation step.
    spot=True,                             # Specifies whether to create preemptible instances.
    install_cuda=True,                     # Specifies whether to automatically install a GPU driver.
    mount_nas=True                         # Specifies whether to automatically mount a NAS file system.
)
task1 = job.tasks[0]
task2 = job.tasks[1]

Run commands

This section describes how to run commands on an instance cluster or on a single instance. After a command is run, the output is stored in the specified directory.

# Run a command for an instance cluster.
job.run(cmd,                        # The command that you want to run.
         sudo=False,                # Specifies whether to use the administrator permissions to run the command.
         non_blocking=False,        # Specifies whether to run the command in a non-blocking manner.
         ignore_errors=False,       # Specifies whether to ignore errors. By default, if an error occurs, the system throws an exception.
         max_wait_sec=365*24*3600,  # The maximum timeout period.
         show=False,                # Specifies whether to return the output after the command is run.
         show_realtime=False        # Specifies whether to display the output in real time.
       )

# Run a command for an instance.
job.tasks[i].run(cmd, ...)

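The following table describes the parameters in the code.

Parameter	Description	Sample setting
sudo	Specifies whether to run the command with administrator permissions. Default value: False, which indicates that administrator permissions are not used.	To run the command with administrator permissions: `sudo=True`
non_blocking	Specifies whether to run the command in a non-blocking manner. Default value: False, which indicates that the system waits until the command finishes.	To run the command in a non-blocking manner: `non_blocking=True`
ignore_errors	Specifies whether to ignore errors. Default value: False, which indicates that the program exits if an error occurs. When an error is reported, an exception is thrown.	To ignore errors: `ignore_errors=True`
max_wait_sec	The maximum timeout period. Unit: seconds. Default value: 365*24*3600, which is equal to one year.	To set the maximum timeout period to 1 hour: `max_wait_sec=3600`
show	Specifies whether to return the output after the command is run. Default value: False.	To return the output after the command is run: `show=True`
show_realtime	Specifies whether to display the output in real time. Default value: False.	To display the output in real time: `show_realtime=True`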
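
Because `max_wait_sec` is expressed in seconds, shorter limits are easiest to write as arithmetic. The following sketch builds keyword arguments for `run` from parameters in the table; `run_options` is a hypothetical helper and not a FastGPU function.

```python
def run_options(timeout_hours=None, realtime=False, as_root=False):
    """Return keyword arguments for job.run() or job.tasks[i].run().

    Hypothetical helper; only parameters documented in the table are used.
    """
    options = {"sudo": as_root, "show_realtime": realtime}
    if timeout_hours is not None:
        # max_wait_sec is in seconds: 1 hour = 3600 seconds.
        options["max_wait_sec"] = int(timeout_hours * 3600)
    return options


one_year = 365 * 24 * 3600  # The documented default: 31,536,000 seconds.
options = run_options(timeout_hours=1, realtime=True)
# With a real Job object: job.run("nvidia-smi", **options)
```

Leaving `timeout_hours` unset omits `max_wait_sec` from the dictionary, so the documented default applies.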

Sample code:

# Run the ls command for an instance cluster to query the files and folders in the working directory of each instance.
job.run("ls")
# Run the ls command for an instance to query the files and folders in the working directory of Instance i.
job.tasks[i].run("ls")
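
Because an exception is thrown by default when a command fails, it can be useful to know which instance failed. One possible pattern, sketched below, runs the command task by task and records the outcome for each. `run_on_each_task` is a hypothetical helper that relies only on `job.tasks` and `task.run(cmd)` from this topic. Catching the broad `Exception` class is an assumption, because the exception type that FastGPU raises is not documented here, and the `_DemoJob` and `_DemoTask` classes are stand-ins used only to exercise the helper.

```python
def run_on_each_task(job, cmd, **run_kwargs):
    """Run cmd on every task and return {task_index: error_or_None}.

    Hypothetical helper. An exception raised by task.run() is recorded so
    that the remaining tasks still receive the command.
    """
    results = {}
    for index, task in enumerate(job.tasks):
        try:
            task.run(cmd, **run_kwargs)
            results[index] = None
        except Exception as error:  # Assumption: exception type is undocumented.
            results[index] = error
    return results


class _DemoTask:
    """Stand-in for a task; fails on demand instead of contacting an instance."""

    def __init__(self, fail):
        self.fail = fail

    def run(self, cmd, **kwargs):
        if self.fail:
            raise RuntimeError("command failed: " + cmd)


class _DemoJob:
    """Stand-in for a Job object with one healthy and one failing task."""

    def __init__(self):
        self.tasks = [_DemoTask(fail=False), _DemoTask(fail=True)]


results = run_on_each_task(_DemoJob(), "ls")
failed = [index for index, error in results.items() if error is not None]
```

With a real Job object, the call would be `run_on_each_task(job, "ls")`, and `failed` would list the indexes of the tasks whose command raised an exception.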

Upload or download files

This section describes how to upload files to an instance cluster or to a single instance.

# Upload a file to an instance cluster.
job.upload(local_fn: str, remote_fn: str="", dont_overwrite: bool=False)
# Upload a file to Instance i in the instance cluster.
job.tasks[i].upload(local_fn: str, remote_fn: str="", dont_overwrite: bool=False)

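The following table describes the parameters in the code.

Parameter	Required	Description	Sample setting
local_fn	Yes	The source path of the file.	To specify the source path of the file to upload: `local_fn="/root/test_download.fn"`
remote_fn	No	The destination path of the file. This parameter is empty by default, which indicates that the file is uploaded to the path specified by the local_fn parameter.	To use a path on the instance as the destination path: `remote_fn="/root/test.txt"`
dont_overwrite	No	Specifies whether to retain an existing file. Default value: False, which indicates that an existing file is automatically overwritten.	To retain an existing file: `dont_overwrite=True`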
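
Checking the local file before an upload avoids starting a transfer that cannot succeed, and it makes the effect of the `remote_fn` default explicit. The sketch below works out the destination before any transfer takes place; `plan_upload` is a hypothetical helper that does not call FastGPU.

```python
import os
import tempfile


def plan_upload(local_fn, remote_fn=""):
    """Return (local_path, remote_path), or raise FileNotFoundError.

    Hypothetical helper. An empty remote_fn falls back to local_fn, which
    mirrors the documented default of the upload method.
    """
    if not os.path.isfile(local_fn):
        raise FileNotFoundError(local_fn)
    return local_fn, remote_fn or local_fn


# A temporary file stands in for a real file to upload.
with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    local, remote = plan_upload(handle.name)
    explicit = plan_upload(handle.name, "/root/test.txt")
    # With a real Job object: job.upload(local, remote)
```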

This section describes how to download files from an instance cluster or from a single instance to an on-premises machine.

# Download a file from an instance cluster.
job.download(remote_fn, local_fn: str="")
# Download a file from Instance i in the instance cluster.
job.tasks[i].download(remote_fn, local_fn: str="")

Important

Downloading a file from an instance cluster that contains three or more instances may cause file conflicts. We recommend that you do not download files from such an instance cluster.

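The following table describes the parameters in the code.

Parameter	Required	Description	Sample setting
remote_fn	Yes	The source path of the file.	To use a path on Instance i as the source path: `remote_fn="/root/test.txt"`
local_fn	No	The destination path of the file. This parameter is empty by default, which indicates that the file is downloaded to the path specified by the remote_fn parameter.	To use a path on the on-premises machine as the destination path: `local_fn="/root/test_download.fn"`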

Sample code: The following sample code uploads a file to all instances in an instance cluster and then downloads the file from one instance in the cluster to the on-premises machine.

# Upload a file from the /root/test.txt path to the /root/ path of all instances in an instance cluster.
job.upload("/root/test.txt")
# Download the file from Instance 0 to the current path of your on-premises machine.
job.tasks[0].download("/root/test.txt", "./test.txt")
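
To avoid the file conflicts mentioned in the Important note of this section, files can be downloaded one task at a time, each to its own local path. The sketch below shows one way to do this. `download_from_each_task` and the `task<i>_` file name prefix are hypothetical choices; only `job.tasks[i].download(remote_fn, local_fn)` comes from this topic, and the `_DemoJob` and `_DemoTask` classes are stand-ins that record calls instead of transferring files.

```python
import os
import posixpath


def download_from_each_task(job, remote_fn, dest_dir="."):
    """Download remote_fn from every task into dest_dir under unique names.

    Hypothetical helper. Returns the list of local paths that were requested.
    """
    # The remote path is a Linux path, so posixpath is used to split it.
    base = posixpath.basename(remote_fn)
    local_paths = []
    for index, task in enumerate(job.tasks):
        local_fn = os.path.join(dest_dir, "task{}_{}".format(index, base))
        task.download(remote_fn, local_fn)
        local_paths.append(local_fn)
    return local_paths


class _DemoTask:
    """Stand-in for a task; records download requests."""

    def __init__(self):
        self.calls = []

    def download(self, remote_fn, local_fn=""):
        self.calls.append((remote_fn, local_fn))


class _DemoJob:
    """Stand-in for a Job object with n tasks."""

    def __init__(self, n):
        self.tasks = [_DemoTask() for _ in range(n)]


demo = _DemoJob(3)
paths = download_from_each_task(demo, "/root/test.txt", dest_dir="out")
```

Each task writes to a different local file, so results from three or more instances no longer compete for the same destination.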

Stop instances

This section describes how to stop an instance cluster or a single instance.

# Stop all instances in an instance cluster. 
job.stop(
    keep=False, # Specifies whether to continue the billing for all instances in the instance cluster after they are stopped.
    force=False # Specifies whether to forcefully stop all instances in the instance cluster.
)

# Stop Instance i in an instance cluster.
job.tasks[i].stop(
    keep=False, # Specifies whether to continue the billing for an instance after it is stopped.
    force=False # Specifies whether to forcefully stop an instance.
)

Sample code:

job.stop(force=True, keep=True) # Forcefully stops all instances in the instance cluster and continues the billing for the instances.

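The following table describes the parameters in the code.

Parameter	Description	Sample setting
keep	Specifies whether to continue billing after one or more instances are stopped. Default value: False, which indicates that billing does not continue after the instances are stopped.	To continue billing after the instances are stopped: `keep=True`
force	Specifies whether to forcefully stop one or more instances. Default value: False, which indicates that the instances are not forcefully stopped. In this case, the system may become stuck if a program does not exit.	To forcefully stop the instances: `force=True`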
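
A script that exits with an error before it reaches `job.stop` leaves its instances running. One common Python pattern, sketched below, places the stop call in a `finally` clause so that it runs even when a command raises an exception. `run_then_stop` is a hypothetical helper that uses only `job.run` and `job.stop(keep=..., force=...)` from this topic, and `_DemoJob` is a stand-in that simulates a failing command.

```python
def run_then_stop(job, cmd, keep=False, force=False):
    """Run cmd on the cluster and always stop the instances afterwards."""
    try:
        return job.run(cmd)
    finally:
        # Runs whether job.run() returned normally or raised an exception.
        job.stop(keep=keep, force=force)


class _DemoJob:
    """Stand-in for a Job object; records calls and simulates a failure."""

    def __init__(self):
        self.events = []

    def run(self, cmd):
        self.events.append("run " + cmd)
        raise RuntimeError("simulated failure")

    def stop(self, keep=False, force=False):
        self.events.append("stop keep={} force={}".format(keep, force))


demo = _DemoJob()
try:
    run_then_stop(demo, "python train.py", force=True)
except RuntimeError:
    pass  # The error still propagates after the instances are stopped.
```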

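Release instances

This section describes how to permanently release an instance cluster or a single instance to free the resources that the instances occupy.

Important

After an instance is permanently released, the following rules apply to its associated resources. Some resources are released together with the instance and cannot be restored, such as the instance ID, the static public IP address, the system disk, and data disks that are configured to be released with the instance. Other resources are automatically detached from the instance, such as elastic IP addresses (EIPs) and data disks that are not configured to be released with the instance. Proceed with caution when you release instances.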

job.kill()          # Release all instances in an instance cluster.
job.tasks[i].kill() # Release an instance.

Sample code:

# Forcefully stop and release an instance cluster and all of its instances, including running instances.
job.kill(force=True)
# Release a single instance that is in the stopped state.
job.tasks[i].kill()

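The following table describes the parameters in the code.

Parameter	Description	Sample setting
force	Specifies whether to forcefully stop the instance cluster and all of its instances. Default value: False, which indicates that running instances are not released.	To forcefully stop the instance cluster and all of its instances: `force=True`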
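
Because released instances cannot be restored, a script may ask for explicit confirmation before it calls `kill`. The sketch below shows one possible guard: the release is requested only if the operator types the exact cluster name. `release_if_confirmed` is a hypothetical wrapper and not a FastGPU function; it uses only `job.kill(force=...)` from this topic, and `_DemoJob` is a stand-in that records the call instead of releasing anything.

```python
def release_if_confirmed(job, cluster_name, typed_name, force=False):
    """Call job.kill() only when typed_name matches cluster_name exactly.

    Hypothetical safety wrapper. Returns True if the release was requested.
    """
    if typed_name != cluster_name:
        return False
    job.kill(force=force)
    return True


class _DemoJob:
    """Stand-in for a Job object; records the force flag passed to kill()."""

    def __init__(self):
        self.killed_with = None

    def kill(self, force=False):
        self.killed_with = force


demo = _DemoJob()
skipped = release_if_confirmed(demo, "fastgpu_test", "fastgpu_tset")
released = release_if_confirmed(demo, "fastgpu_test", "fastgpu_test", force=True)
```

In an interactive script, `typed_name` could come from `input()`, so a mistyped name leaves the cluster untouched.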