Usage notes of FastGPU SDK for Python - Elastic GPU Service

You can use FastGPU SDK for Python to integrate FastGPU into your AI training or inference scripts so that you can deploy and manage cloud resources efficiently. This topic describes how to use FastGPU SDK for Python.

Prerequisites

Python 3.6 or later is installed on your client.
Note
You can use Cloud Shell of Alibaba Cloud, an Elastic Compute Service (ECS) instance, or an on-premises machine as a client to install FastGPU and build AI computing tasks.

An Alibaba Cloud AccessKey pair is obtained. For more information, see Create an AccessKey pair.

Prepare an environment

Run the following command to install the FastGPU software package:

pip3 install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/fastgpu/fastgpu-1.1.5-py3-none-any.whl

Run the following commands to configure environment variables.
Before you configure environment variables, you must obtain the required information, such as the AccessKey pair of your Alibaba Cloud account, the default region, and the default zone, from your Cloud Shell, ECS instance, or on-premises machine. For more information, see Regions and Zones.
```
export ALIYUN_ACCESS_KEY_ID=****          # The AccessKey ID.
export ALIYUN_ACCESS_KEY_SECRET=****      # The AccessKey secret.
export ALIYUN_DEFAULT_REGION=cn-hangzhou  # The ID of the region that you want to use.
export ALIYUN_DEFAULT_ZONE=cn-hangzhou-i  # (Optional) The ID of the zone that you want to use.
```
Add the following statement to import the FastGPU module into your Python code:
```
import fastgpu
```
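The following sketch shows one way to check the environment variables from the preceding step before a script imports fastgpu. The function name `missing_fastgpu_env` is hypothetical and is not part of FastGPU. Treating the three variables that are not marked as optional as required is an assumption based on the comments in the export commands.

```python
import os

# Variable names copied from the export commands above.
# Assumption: every variable not marked "(Optional)" is required.
REQUIRED_VARS = (
    "ALIYUN_ACCESS_KEY_ID",
    "ALIYUN_ACCESS_KEY_SECRET",
    "ALIYUN_DEFAULT_REGION",
)


def missing_fastgpu_env(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_fastgpu_env()
    if missing:
        print("Set these variables before importing fastgpu:", ", ".join(missing))
    else:
        print("All required FastGPU environment variables are set.")
```

A script can call `missing_fastgpu_env()` first and exit early if the returned list is not empty.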

Create or access instances

The fastgpu.make_job method automatically creates an instance cluster based on rules. If the instance cluster already exists, the existing instance cluster is returned.

job = fastgpu.make_job(
    name: str="",             # (Required) The name of the instance cluster. 
    instance_type: str="",    # (Required) The instance type. 
    num_tasks: int=0,         # The number of instances that you want to create.
    install_script: str="",   # The initialization command.
    image_name: str="",       # The image name.
    image_type: str="",       # The image type.
    disk_size: int=500,       # The size of the data disk.
    spot: bool=False,         # Specifies whether to create preemptible instances.
    confirm_cost: bool=False, # Specifies whether to skip the consumption confirmation step.
    install_cuda: bool=False, # Specifies whether to automatically install a GPU driver.
    mount_nas: bool=False     # Specifies whether to automatically mount a File Storage NAS (NAS) file system.
)

The following table describes the parameters in the code.

Parameter	Required	Description	Sample configuration
name	Yes	The name of the instance cluster. This parameter is empty by default. The default value indicates that instances are obtained from existing resources.	Sample configuration when you use fastgpu_test as the instance cluster name: `name="fastgpu_test"`
instance_type	Yes	The instance type of the instances. You can run the `fastgpu querygpu` command to query GPU-accelerated instance types. For more information, see Instance families with GPU capabilities.	Sample configuration when you use an instance type that is configured with one V100 GPU: `instance_type="ecs.gn6v-c8g1.2xlarge"`
num_tasks	No	The number of instances that you want to create. Default value: 1.	Sample configuration when you create one instance: `num_tasks=1`
install_script	No	The initialization script of the instances. This parameter is empty by default. The default value indicates that no command is run.	Sample configuration when the SSH service is started after the instances are initialized: `install_script="systemctl start sshd"`
image_name	No	The image name of the instances. This parameter is empty by default. The default value indicates that Alibaba Cloud Linux 2.1903 is used as the default image. You can run the `fastgpu queryimage` command to query image names.	Sample configuration when you use a CentOS image: `image_name="centos_8_5_x64_20G_alibase_202111129.vhd"`
image_type	No	The image type of the instances. You can set this parameter to an operating system distribution, such as `"aliyun"`, `"ubuntu"`, or `"centos"`, or to an operating system version, such as `"ubuntu_18_04"` or `"centos_7_9"`.	Sample configuration when you use an Ubuntu 16.04 image: `image_type="ubuntu_16_04"`
disk_size	No	The size of the data disk. Default value: 500. Unit: GB.	Sample configuration when you use a 500 GB data disk: `disk_size=500`
spot	No	Specifies whether to create preemptible instances. The default value is False.	Sample configuration when the system creates preemptible instances: `spot=True`
confirm_cost	No	Specifies whether to skip the consumption confirmation step. The default value is False. A value of False indicates that the system does not skip the consumption confirmation step. In this case, you are prompted to confirm the operation during instance creation.	Sample configuration when the system skips the consumption confirmation step: `confirm_cost=True`
install_cuda	No	Specifies whether to automatically install a GPU driver. The default value is False. A value of False indicates that the system does not automatically install a GPU driver.	Sample configuration when the system automatically installs a GPU driver: `install_cuda=True`
mount_nas	No	Specifies whether to automatically mount a NAS file system. For more information, see What is NAS?	Sample configuration when the system automatically mounts a NAS file system: `mount_nas=True`

Return value: A Job object that represents an instance cluster is returned. To access a specific instance in the instance cluster, you can access the relevant task. A Job object can contain multiple tasks. The following figure shows the relationship between a Job object and tasks.

[Figure: a Job object and the tasks that it contains]

job = fastgpu.make_job(...) # Create a Job object.
job.run("ls -l")            # Run the ls -l command for an instance cluster.
job.tasks[0].run("ls -l")   # Run the ls -l command for an instance, such as the instance that corresponds to Task0.
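Because each task corresponds to one instance, a script can loop over job.tasks to send a different command to each instance. The following is a minimal sketch of that pattern. The helper name `run_per_task` and the `{index}` and `{count}` placeholders are hypothetical. The sketch assumes that job.tasks behaves like a Python list and relies only on the run method shown above.

```python
def run_per_task(job, command_template, **run_options):
    """Run one formatted command on each task of a Job.

    The template may use {index} (position in job.tasks) and {count}
    (number of tasks). Returns the commands in the order they were sent.
    """
    count = len(job.tasks)  # Assumption: job.tasks supports len() and iteration.
    sent = []
    for index, task in enumerate(job.tasks):
        cmd = command_template.format(index=index, count=count)
        task.run(cmd, **run_options)
        sent.append(cmd)
    return sent
```

For example, `run_per_task(job, "echo instance {index} of {count}")` sends `echo instance 0 of 2` to Task0 and `echo instance 1 of 2` to Task1 in a cluster that contains two instances.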

Sample code: The following sample code creates a Job object named fastgpu_test that contains two tasks. Each task corresponds to an instance. You can access the created instances by accessing the tasks of the Job object.

job = fastgpu.make_job(
    name="fastgpu_test",                   # The name of the instance cluster.
    num_tasks=2,                           # The number of instances. In this example, two instances are created.
    instance_type="ecs.gn6v-c8g1.2xlarge", # The instance type.
    image_type="ubuntu_18_04",             # The image type of the instances. In this example, an Ubuntu 18.04 image is used.
    disk_size=500,                         # The size of the data disk. Unit: GB. In this example, the data disk whose size is 500 GB is used.
    confirm_cost=True,                     # Specifies whether to skip the consumption confirmation step.
    spot=True,                             # Specifies whether to create preemptible instances.
    install_cuda=True,                     # Specifies whether to automatically install a GPU driver.
    mount_nas=True                         # Specifies whether to automatically mount a NAS file system.
)
task1 = job.tasks[0]
task2 = job.tasks[1]
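Because make_job can create billable instances, it may help to validate the arguments locally before the call is made. The following sketch assembles the keyword arguments as a dictionary and rejects misspelled parameter names. The helper `make_job_kwargs` is hypothetical. The parameter names are copied from the preceding parameter table, and the local checks are an assumption about what is useful, not a description of how FastGPU validates input.

```python
# Optional parameter names copied from the parameter table of make_job.
ALLOWED_OPTIONS = {
    "num_tasks", "install_script", "image_name", "image_type", "disk_size",
    "spot", "confirm_cost", "install_cuda", "mount_nas",
}


def make_job_kwargs(name, instance_type, **options):
    """Check arguments locally and return a dict for fastgpu.make_job(**kwargs)."""
    if not name:
        raise ValueError("name is required")
    if not instance_type:
        raise ValueError("instance_type is required")
    unknown = sorted(set(options) - ALLOWED_OPTIONS)
    if unknown:
        raise ValueError("unknown make_job parameters: " + ", ".join(unknown))
    return {"name": name, "instance_type": instance_type, **options}


kwargs = make_job_kwargs(
    "fastgpu_test",
    "ecs.gn6v-c8g1.2xlarge",
    num_tasks=2,
    image_type="ubuntu_18_04",
)
print(kwargs)
# With credentials configured, the dictionary can be passed on:
# import fastgpu
# job = fastgpu.make_job(**kwargs)
```

A misspelled name such as `num_task=2` raises a ValueError on the client before any request is sent.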

Run a command

The following section describes how to run a command for an instance cluster or instance. After the command is run, the output is stored in the specified directory.

# Run a command for an instance cluster.
job.run(
    cmd,                       # The command that you want to run.
    sudo=False,                # Specifies whether to use the administrator permissions to run the command.
    non_blocking=False,        # Specifies whether to run the command in a non-blocking manner.
    ignore_errors=False,       # Specifies whether to ignore errors. By default, if an error occurs, an exception is raised.
    max_wait_sec=365*24*3600,  # The maximum timeout period. Unit: seconds.
    show=False,                # Specifies whether to return the output after the command is run.
    show_realtime=False        # Specifies whether to display the output in real time.
)

# Run a command for an instance.
job.tasks[i].run(cmd, ...)

The following table describes the parameters in the code.

Parameter	Description	Sample configuration
sudo	Specifies whether to use the administrator permissions to run the command. The default value is False. A value of False indicates that the system does not use the administrator permissions to run the command.	Sample configuration when the system uses the administrator permissions to run the command: `sudo=True`
non_blocking	Specifies whether to run the command in a non-blocking manner. The default value is False. A value of False indicates that the system waits until the command is run.	Sample configuration when the system runs the command in a non-blocking manner: `non_blocking=True`
ignore_errors	Specifies whether to ignore errors. The default value is False. A value of False indicates that the system terminates the program if an error occurs. An exception is thrown when an error is reported.	Sample configuration when the system ignores errors: `ignore_errors=True`
max_wait_sec	The maximum timeout period. Unit: seconds. The default value is 365*24*3600, which is equivalent to one year.	Sample configuration when you set the maximum timeout period to 1 hour: `max_wait_sec=3600`
show	Specifies whether to return the output after the command is run. The default value is False.	Sample configuration when the system returns the output after the command is run: `show=True`
show_realtime	Specifies whether to display the output in real time. The default value is False.	Sample configuration when the system displays the output in real time: `show_realtime=True`

Sample code:

# Run the ls command for an instance cluster to query the files and folders in the working directory of each instance.
job.run("ls")
# Run the ls command for an instance to query the files and folders in the working directory of Instance i.
job.tasks[i].run("ls")
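Setup work often consists of several commands that must run in order. The following sketch runs a list of commands on an instance cluster or on a single instance and passes the same options to every call. The helper `run_steps` is hypothetical. It relies only on the run method and the parameters described above, and on the documented default behavior that an error raises an exception unless ignore_errors is set to True.

```python
def run_steps(target, commands, **run_options):
    """Run shell commands in order on a Job or on a single task.

    `target` is any object that provides run(cmd, ...), such as job or
    job.tasks[i]. With the default ignore_errors=False, the first failing
    command raises an exception and the remaining commands are skipped.
    Returns the list of commands that were sent without raising.
    """
    completed = []
    for cmd in commands:
        target.run(cmd, **run_options)
        completed.append(cmd)
    return completed
```

For example, `run_steps(job, ["apt-get update", "apt-get install -y git"], sudo=True, max_wait_sec=3600)` runs both commands with administrator permissions on every instance and waits at most 1 hour for each command.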

Upload or download a file

The following section describes how to upload a file to an instance cluster or instance.

# Upload a file to an instance cluster.
job.upload(local_fn: str, remote_fn: str="", dont_overwrite: bool=False)
# Upload a file to Instance i in the instance cluster.
job.tasks[i].upload(local_fn: str, remote_fn: str="", dont_overwrite: bool=False)

The following table describes the parameters in the code.

Parameter	Required	Description	Sample configuration
local_fn	Yes	The source path of the file.	Sample configuration of the source path from which you upload the file: `local_fn="/root/test_download.fn"`
remote_fn	No	The destination path of the file. This parameter is empty by default. The default value indicates that the file is uploaded to the path specified by the local_fn parameter.	Sample configuration when you use the path of an instance as the destination path: `remote_fn="/root/test.txt"`
dont_overwrite	No	Specifies whether to retain an existing file. The default value is False. A value of False indicates that the system automatically overwrites an existing file.	Sample configuration when the system retains an existing file: `dont_overwrite=True`
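The following sketch wraps the upload method with a check that the source file exists on the client, so that a wrong path fails before any transfer starts. The helper `upload_checked` is hypothetical. Its return value follows the rule in the preceding table that an empty remote_fn means the file is uploaded to the path specified by local_fn.

```python
import os


def upload_checked(target, local_fn, remote_fn="", dont_overwrite=False):
    """Upload local_fn after confirming that it exists on the client.

    `target` is any object that provides upload(...), such as job or
    job.tasks[i]. Returns the destination path on the instance.
    """
    if not os.path.isfile(local_fn):
        raise FileNotFoundError(local_fn)
    target.upload(local_fn, remote_fn=remote_fn, dont_overwrite=dont_overwrite)
    # An empty remote_fn means the destination equals the source path.
    return remote_fn or local_fn
```

For example, `upload_checked(job, "/root/test.txt", dont_overwrite=True)` uploads the file to /root/test.txt on every instance and retains a file that already exists at that path.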

The following section describes how to download a file from an instance cluster or an instance to an on-premises machine.

# Download a file from an instance cluster.
job.download(remote_fn, local_fn: str="")
# Download a file from Instance i in the instance cluster.
job.tasks[i].download(remote_fn, local_fn: str="")

Important

If you download files from an instance cluster that contains more than two instances, file conflicts may occur. We recommend that you do not download files from an instance cluster that contains more than two instances.

The following table describes the parameters in the code.

Parameter	Required	Description	Sample configuration
remote_fn	Yes	The source path of the file.	Sample configuration when you use a path of Instance i as the source path: `remote_fn="/root/test.txt"`
local_fn	No	The destination path of the file. This parameter is empty by default. The default value indicates that the file is downloaded to the path specified by the remote_fn parameter.	Sample configuration when you use an on-premises file path as the destination path: `local_fn="/root/test_download.fn"`

Sample code: The following sample code provides an example on how to upload a file to all instances in an instance cluster and download the file from an instance in the instance cluster to your on-premises machine.

# Upload a file from the /root/test.txt path to the /root/ path of all instances in an instance cluster.
job.upload("/root/test.txt")
# Download the file from Instance 0 to the current path of your on-premises machine.
job.tasks[0].download("/root/test.txt", "./test.txt")
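To collect the same file from several instances without the file conflicts described in the preceding Important note, one option is to download from each task separately and give every copy its own local name. The following is a minimal sketch of that approach. The helper `download_from_each_task` and the `task<index>_` naming scheme are hypothetical. The sketch assumes that job.tasks can be iterated like a Python list.

```python
import os
import posixpath


def download_from_each_task(job, remote_fn, local_dir="."):
    """Download remote_fn from every task into a distinct local file.

    Each local name is prefixed with the task index, for example
    task0_test.txt, so that copies from different instances do not
    overwrite each other. Returns the local paths in task order.
    """
    base = posixpath.basename(remote_fn)  # Remote paths use forward slashes.
    saved = []
    for index, task in enumerate(job.tasks):
        local_fn = os.path.join(local_dir, f"task{index}_{base}")
        task.download(remote_fn, local_fn)
        saved.append(local_fn)
    return saved
```

For example, `download_from_each_task(job, "/root/test.txt", "./results")` requests one file per instance and stores the copies as task0_test.txt, task1_test.txt, and so on. The results directory is assumed to exist.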

Stop an instance

The following section describes how to stop an instance cluster or instance.

# Stop all instances in an instance cluster. 
job.stop(
    keep=False, # Specifies whether to continue the billing for all instances in the instance cluster after they are stopped.
    force=False # Specifies whether to forcefully stop all instances in the instance cluster.
)

# Stop Instance i in an instance cluster.
job.tasks[i].stop(
    keep=False, # Specifies whether to continue the billing for an instance after it is stopped.
    force=False # Specifies whether to forcefully stop an instance.
)

Sample code:

job.stop(force=True, keep=True) # Forcefully stops all instances in the instance cluster and continues the billing for the instances.

The following table describes the parameters in the code.

Parameter	Description	Sample configuration
keep	Specifies whether to continue the billing for one or more instances after they are stopped. The default value is False. A value of False indicates that the system does not continue the billing for one or more instances after they are stopped.	Sample configuration when the system continues the billing for one or more instances after they are stopped: `keep=True`
force	Specifies whether to forcefully stop one or more instances. The default value is False. A value of False indicates that the system does not forcefully stop one or more instances. In this case, the system may be stuck when the program fails to exit.	Sample configuration when the system forcefully stops one or more instances: `force=True`
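If a script fails halfway, the instances keep running unless they are stopped explicitly. The following sketch uses a try/finally block so that the instance cluster is stopped whether or not the command succeeds. The helper `run_then_stop` is hypothetical. It uses only the run and stop methods and the keep and force parameters described above.

```python
def run_then_stop(job, cmd, keep=False, force=False, **run_options):
    """Run cmd on the whole instance cluster and stop the instances afterwards.

    The finally clause calls job.stop even if job.run raises an exception,
    and the exception is then propagated to the caller.
    """
    try:
        job.run(cmd, **run_options)
    finally:
        job.stop(keep=keep, force=force)
```

For example, `run_then_stop(job, "python train.py", show_realtime=True)` runs a training script (train.py is a placeholder name) and stops all instances when the script exits or fails.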

Release instances

The following section describes how to permanently release an instance cluster or instance to release the resources occupied by one or more instances.

Important

When an instance is permanently released, the following rules apply to its associated resources: Some resources such as the instance ID, static public IP address, system disk, and data disks with Release Disk with Instance configured are released and cannot be recovered. Some resources such as the elastic IP address (EIP) and data disks without Release Disk with Instance configured are automatically removed from the instance. Exercise caution when you perform a release operation.

job.kill()          # Release all instances in an instance cluster.
job.tasks[i].kill() # Release an instance.

Sample code:

# Forcefully stop and release an instance cluster and all of its instances, including running instances.
job.kill(force=True)
# Release a single instance that is in the stopped state.
job.tasks[i].kill()

The following table describes the parameters in the code.

Parameter	Description	Sample configuration
force	Specifies whether to forcefully stop an instance cluster and all of its instances. The default value is False. A value of False indicates that the system does not release running instances.	Sample configuration when the system forcefully stops an instance cluster and all of its instances: `force=True`
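For short-lived jobs, the release step can be tied to a with statement so that it is not forgotten. The following sketch wraps the kill method in a context manager. The name `released_on_exit` is hypothetical. Because a release is permanent, this pattern suits only clusters whose disks and data are not needed after the block ends.

```python
from contextlib import contextmanager


@contextmanager
def released_on_exit(job, force=False):
    """Yield job and call job.kill(force=force) when the with block ends.

    The release also happens if the body of the with block raises.
    """
    try:
        yield job
    finally:
        job.kill(force=force)
```

For example, `with released_on_exit(job, force=True) as j:` followed by calls such as `j.run("ls")` releases all instances, including running instances, after the block completes.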