Elastic GPU Service: Command reference

Last Updated: Oct 31, 2024

You can run FastGPU commands to efficiently deploy GPU-accelerated clusters in Alibaba Cloud and manage the lifecycle of resources. You can also run FastGPU commands to install deep learning environments, run code, view operational logs, and release resources in clusters.

Prerequisites

  • Python 3.6 or later is installed on the client.

    Note

    You can use an Elastic Compute Service (ECS) instance, an on-premises machine, or Alibaba Cloud Shell as the client on which you install FastGPU and build AI computing tasks.

  • An Alibaba Cloud AccessKey pair is obtained. For more information, see Create an AccessKey pair.

Environment preparations

  1. Run the following command to install the FastGPU package:

    pip3 install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/fastgpu/fastgpu-1.1.5-py3-none-any.whl
  2. Configure environment variables.

    Obtain the AccessKey pair of your Alibaba Cloud account and decide on a default region and, optionally, a default zone. Then, run the following commands on your ECS instance, on-premises machine, or Alibaba Cloud Shell:

    export ALIYUN_ACCESS_KEY_ID=**** # Enter your AccessKey ID.
    export ALIYUN_ACCESS_KEY_SECRET=**** # Enter your AccessKey secret.
    export ALIYUN_DEFAULT_REGION=cn-hangzhou # Enter the ID of the region in which you want to use FastGPU.
    export ALIYUN_DEFAULT_ZONE=cn-hangzhou-i # Optional. Enter the ID of the zone in which you want to use FastGPU.
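Before you run FastGPU commands, it can help to confirm that the variables above are set. The following is a minimal sketch, not part of FastGPU: the function name missing_fastgpu_env is hypothetical, and the sketch assumes that the first three variables are required because only the zone is marked as optional above.

```shell
# Print the names of the FastGPU environment variables that are unset or empty.
# ALIYUN_DEFAULT_ZONE is marked optional above, so it is not checked.
missing_fastgpu_env() {
    missing=""
    for name in ALIYUN_ACCESS_KEY_ID ALIYUN_ACCESS_KEY_SECRET ALIYUN_DEFAULT_REGION; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Drop the leading space before printing.
    echo "${missing# }"
}

if [ -n "$(missing_fastgpu_env)" ]; then
    echo "Missing environment variables: $(missing_fastgpu_env)" >&2
fi
```

The function only inspects the current shell environment and does not contact Alibaba Cloud, so it cannot tell whether the AccessKey pair itself is valid.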

Commands

In the following table, {instance_name} specifies the target of a command. If you set {instance_name} to an instance name such as task0.my_job, the command is run only on the task0.my_job instance. If you set {instance_name} to a value that is enclosed in braces ({}), such as {my_job}, the command is run on all instances in the GPU-accelerated cluster whose name is suffixed with my_job.

Command

Description

Example

fastgpu [help,-h,--help]

Displays the description of all FastGPU commands.

fastgpu --help

fastgpu -h

fastgpu {command} --help

Displays the description of a specific FastGPU command.

fastgpu ls --help

fastgpu ls

Displays the instances that are created by FastGPU users. The following information is included:

  • instance name: the name of the instance.

  • age(hours): the duration from the point in time when the instance was created to the current point in time. Unit: hours.

  • public_ip: the public IP address of the instance.

  • private_ip: the private IP address of the instance.

  • GPU: the specifications and number of the GPUs.

  • instance_type: the instance type.

Parameter:

-a: displays all instances within your Alibaba Cloud account. If you add -a to the command, the command output also includes the Key-Owner and instance_id fields. The Key-Owner field indicates the key pair, and the instance_id field indicates the ID of the instance.

  • To query the instances that are created by the current Linux account, run the following command:

    fastgpu ls

  • To query the instances that are created by different Linux accounts within your Alibaba Cloud account, run the following command:

    fastgpu ls -a

fastgpu create --config create.cfg
fastgpu create --name {instance_name} --machines {count} --instance_type {ins_type}

Creates an instance or a cluster.

Parameters:

  • -f, -c, or --config: the configuration file that you want to use to create the instance.

  • -n or --name: the name of the instance.

  • --image or --image_name: the name of the image that you want to install on the instance. You can run the queryimage command to query the image.

  • --image_type: the type of the image. If you do not specify image_name, you can specify image_type to query an image. Valid values: Aliyun, Ubuntu, and CentOS.

  • -np or --machines: the number of instances that you want to create.

  • -i, or --instance_type: the instance specifications, such as the vCPU, memory, and GPU model. You can run the querygpu command to query all instance specifications.

  • --system_disk_size: the size of the system disk. Unit: GB.

  • --data_disk_size: the size of the data disk. Unit: GB.

  • --skip_setup: skips the initialization of the instance.

  • -nas, --nas, or --enable-nas: allows a File Storage NAS (NAS) file system to be mounted on the instance. For more information, see What is NAS?

  • --zone_id: the ID of the zone in which you want to deploy the instance. By default, the system selects a zone ID. You can run the querygpu command to query the IDs of available zones.

  • --spot: creates a preemptible instance. For more information, see What are preemptible instances?

  • --confirm_cost: skips the payment confirmation step.

  • --install_script: the command that is run on the instance after the instance is installed.

  • -vpc, --vpc, or --vpc_name: the name of the virtual private cloud (VPC) in which the instance resides.

  • -cuda, --install_cuda, or --cuda_install: automatically installs Compute Unified Device Architecture (CUDA) on the instance.

  • To create an Ubuntu instance that has a specified name and instance type and on which CUDA is automatically installed, run the following command:

    fastgpu create --name fastgpu_vm -np 1 --instance_type ecs.gn6v-c8g1.16xlarge --image_type ubuntu --install_cuda

  • To create an instance by using a configuration file, run the following command:

    fastgpu create -c config.cfg

fastgpu ssh {instance_name}

Uses SSH to connect to and log on to an instance.

Note

Before you can use SSH to connect to an instance, you must add the public IP address of an on-premises machine to the security group of the instance. We recommend that you run the fastgpu addip -a command to add the public IP address to the security group.

To connect to the task0.my_job instance by using SSH, run the following command:

fastgpu ssh task0.my_job
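The note above describes a two-step sequence: add your public IP address to the security group, and then connect. The following sketch chains the two documented commands. The run and connect helpers and the DRY_RUN variable are hypothetical names introduced for illustration. By default, the helper only prints the commands so that you can review them first.

```shell
# Print each command by default. Set DRY_RUN=0 to execute the commands.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Allow access from the local public IP address, then open an SSH session.
connect() {
    run fastgpu addip -a && run fastgpu ssh "$1"
}

connect task0.my_job
```

To run the commands for real, call the function as DRY_RUN=0 connect task0.my_job on a client on which FastGPU is installed.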

fastgpu scp /local/path/to/upload {instance_name}:/remote/path/to/save
fastgpu scp {instance_name}:/remote/path/to/copy /local/path/to/save

Copies a file from an on-premises machine to an instance, or copies a file from an instance to an on-premises machine.

  • To copy a file from an on-premises machine to an instance, run the following command:

    fastgpu scp /root/test.txt task0.my_job:/root/

  • To copy a file from an instance to an on-premises machine, run the following command:

    fastgpu scp task0.my_job:/home/cuda/ ~/cuda/

  • To copy a file from an on-premises machine to the /root directory of all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu scp /root/test.txt {my_job}:/root/

fastgpu querygpu
fastgpu query
fastgpu query -gpu {gpu_type}
fastgpu query -np {number of gpus per node}
fastgpu query -gpu {gpu_type} -np {number of gpus per node}

Queries the GPU-accelerated instance types that are supported by Alibaba Cloud.

Parameters:

  • -gpu: queries the instance types that use a specific GPU model.

    Supported GPU models include V100, P100, A10, T4, P4, and M40.

  • -np: queries the instance types that are configured with a specific number of GPUs. Valid values: 1, 2, 4, and 8.

  • To query all instance types, run one of the following commands:

    fastgpu querygpu

    fastgpu query

  • To query the instance types that use a V100 GPU, run the following command:

    fastgpu query -gpu "V100"

  • To query the instance types that are configured with four GPUs, run the following command:

    fastgpu query -np 4
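The -gpu and -np parameters can also be combined, as the last syntax line of this command shows. The following sketch builds such a query and rejects GPU counts that are not listed above. The function name query_cmd is hypothetical, and the function only prints the command instead of running it.

```shell
# Print a fastgpu query command for an optional GPU model and GPU count.
# Pass an empty string to leave out a filter.
query_cmd() {
    gpu="${1:-}"
    count="${2:-}"
    cmd="fastgpu query"
    if [ -n "$gpu" ]; then
        cmd="$cmd -gpu $gpu"
    fi
    if [ -n "$count" ]; then
        # The GPU counts listed above are 1, 2, 4, and 8.
        case "$count" in
            1|2|4|8) cmd="$cmd -np $count" ;;
            *) echo "unsupported GPU count: $count" >&2; return 1 ;;
        esac
    fi
    echo "$cmd"
}

query_cmd V100 8
```

Copy the printed command into your terminal, or wrap the call in eval if you prefer to run it directly.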

fastgpu queryimage
fastgpu queryimage {os_type}

Queries the instance images that are supported by Alibaba Cloud.

Parameter:

os_type: the OS type that is supported by Alibaba Cloud. Valid values: CentOS, Ubuntu, Debian, SUSE, and Aliyun.

  • To query all images, run the following command:

    fastgpu queryimage

  • To query all versions of CentOS images, run the following command:

    fastgpu queryimage centos

fastgpu describe {instance_name}
fastgpu describe

Queries all properties of an instance. The properties include the GPU, image, memory size, creation time, key pair, status, and the number of vCPUs.

  • To query all properties of all instances, run the following command:

    fastgpu describe

  • To query all properties of the task0.my_job instance, run the following command:

    fastgpu describe task0.my_job

  • To query all properties of all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu describe {my_job}

fastgpu kill {instance_name}
fastgpu kill -y {instance_name}
fastgpu kill {instance_a_name} {instance_b_name} {instance_c_name}
fastgpu kill -f {instance_name}

Releases an instance.

Parameters:

  • -f: forcefully releases the instance.

  • -y: skips the confirmation step.

  • To release the task0.my_job instance that is in the stopped state, run the following command:

    fastgpu kill task0.my_job

  • To forcefully release the task0.my_job instance regardless of the status of the instance, run the following command:

    fastgpu kill -f task0.my_job

  • To forcefully release all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu kill -f {my_job}

fastgpu stop {instance_name}
fastgpu stop {instance_a_name} {instance_b_name} {instance_c_name}
fastgpu stop -f {instance_name}
fastgpu stop -k {instance_name}

Stops an instance. If you want to stop all instances in a cluster at a time, you can set {instance_name} to {Suffix of the cluster name} in the command.

Parameters:

  • -f: forcefully stops the instance.

  • -k: stops the instance but does not stop the billing for the instance.

  • -y: skips the confirmation step.

  • To stop the task0.my_job instance that is in the running state, run the following command:

    fastgpu stop task0.my_job

  • To forcefully stop the task0.my_job instance, run the following command:

    fastgpu stop -f task0.my_job

  • To forcefully stop all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu stop -f {my_job}

fastgpu start {instance_name}

Starts an instance.

Parameter:

-y: skips the confirmation step.

  • To start the task0.my_job instance, run the following command:

    fastgpu start task0.my_job

  • To start the cluster whose name is suffixed with my_job, run the following command:

    fastgpu start {my_job}

fastgpu mount {instance_name}
fastgpu mount {instance_name} {mount_target_domain}

Mounts a NAS file system to the /ncluster directory of an instance.

Parameter:

mount_target_domain: the mount target of the NAS file system. If you do not specify this parameter, the system automatically creates a mount target and mounts a NAS file system on the instance.

  • To automatically create a mount target of the NAS file system and mount the NAS file system on the task0.my_job instance, run the following command:

    fastgpu mount task0.my_job

  • To mount the NAS file system on the task0.my_job instance by using a mount target that you created manually, run the following command:

    fastgpu mount task0.my_job example.cn-hangzhou.nas.aliyuncs.com

  • To mount a NAS file system on all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu mount {my_job}

fastgpu run {instance_name} {cmd}

Runs a shell command on an instance.

Parameter:

cmd: the command that you want to run on the instance.

  • To query the IP address of the task0.my_job instance, run the following command:

    fastgpu run task0.my_job ifconfig

  • To query the IP addresses of all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu run {my_job} ifconfig

fastgpu addip {instance_name}
fastgpu addip {instance_name} {your_public_ip}
fastgpu addip {instance_name} {your_public_ip} {port_range}
fastgpu addip {instance_name} {your_public_ip} {port_range} {description}
fastgpu addip -a {your_public_ip} {port_range} {description}

Adds a public IP address to the security group of an instance so that the instance can be accessed from the public IP address.

Parameters:

  • your_public_ip: the public IP address from which the instance can be accessed.

  • port_range: the port range. Separate the start port and the end port with a forward slash (/).

  • description: the description of the public IP address that you want to add.

  • -a: the default security group.

  • To add the public IP address of an on-premises machine to the default security group and allow access from the public IP address to the instance over port 22, run the following command:

    fastgpu addip -a

  • To add the public IP address of an on-premises machine to the security group of the task0.my_job instance and allow access from the public IP address to the instance over port 22, run the following command:

    fastgpu addip task0.my_job

  • To enable access from an on-premises machine whose IP address is 203.0.113.0 to the task0.my_job instance over ports 2000 to 3000, run one of the following commands:

    fastgpu addip task0.my_job 203.0.113.0 2000/3000

    fastgpu addip task0.my_job 203.0.113.0 2000/3000 "open 2000-3000 port"

  • To enable access from an on-premises machine whose IP address is 203.0.113.0 to all instances in the cluster whose name is suffixed with my_job over ports 2000 to 3000, run the following command:

    fastgpu addip {my_job} 203.0.113.0 2000/3000
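Because port_range uses the start/end format with a forward slash, a quick local check can catch typos before a rule is added to the security group. The following sketch is an assumption-laden helper, not part of FastGPU: the function name valid_port_range is hypothetical, and the 1 to 65535 bounds are the general TCP/UDP port limits rather than a limit stated in this topic.

```shell
# Return 0 if the argument looks like START/END with 1 <= START <= END <= 65535.
valid_port_range() {
    case "$1" in
        */*) ;;
        *) return 1 ;;
    esac
    start="${1%%/*}"
    end="${1#*/}"
    # Reject values that contain anything other than digits.
    case "$start$end" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Reject a missing start or end port.
    if [ -z "$start" ] || [ -z "$end" ]; then
        return 1
    fi
    [ "$start" -ge 1 ] && [ "$end" -le 65535 ] && [ "$start" -le "$end" ]
}

if valid_port_range "2000/3000"; then
    echo "fastgpu addip task0.my_job 203.0.113.0 2000/3000"
fi
```

The sketch prints the addip command only when the range passes the check. It does not verify the IP address.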

fastgpu deleteip {instance_name}
fastgpu deleteip {instance_name} {your_public_ip}
fastgpu deleteip {instance_name} {your_public_ip} {port_range}
fastgpu deleteip -a

Removes an IP address from the security group of an instance.

Parameters:

  • your_public_ip: the public IP address that is added to the security group.

  • port_range: the port range. Separate the start port and the end port with a forward slash (/).

  • -a: removes all public IP addresses that are allowed to access the instance over port 22.

  • To remove the public IP address of an on-premises machine from the security group of the task0.my_job instance, run the following command:

    fastgpu deleteip task0.my_job

  • To remove the IP address 203.0.113.0 of an on-premises machine from the security group of the task0.my_job instance, run the following command:

    fastgpu deleteip task0.my_job 203.0.113.0

  • To disable access from the IP address 203.0.113.0 of an on-premises machine to the task0.my_job instance over ports 2000 to 3000, run the following command:

    fastgpu deleteip task0.my_job 203.0.113.0 2000/3000

  • To remove all IP addresses that are allowed to access the task0.my_job instance over port 22 from the security group of the instance, run the following command:

    fastgpu deleteip -a task0.my_job

  • To remove an on-premises machine whose IP address is 203.0.113.0 from all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu deleteip {my_job} 203.0.113.0

fastgpu queryip
fastgpu queryip -a
fastgpu queryip {instance_name}

Queries the IP addresses that are added to the security group of an instance. By default, the IP addresses that are allowed to access the instance over port 22 are queried.

Parameter:

-a: queries all IP addresses that are allowed to access the instance over any port.

  • To query all IP addresses that are allowed to access an instance over port 22, run the following command:

    fastgpu queryip

  • To query the IP addresses that are allowed to access the task0.my_job instance over port 22, run the following command:

    fastgpu queryip task0.my_job

  • To query all IP addresses that are allowed to access the task0.my_job instance over any port, run the following command:

    fastgpu queryip -a task0.my_job

fastgpu addpub {string of id_rsa.pub}

Adds the public key of an on-premises machine to an instance.

Parameter:

string of id_rsa.pub: the path of the public key file.

To add ~/.ssh/id_rsa.pub to an instance, run the following command:

fastgpu addpub

fastgpu rename {instance_name} {instance_new_name}
fastgpu rename {instance_id} {instance_new_name}

Renames an instance.

Parameters:

  • instance_new_name: the new name of the instance.

  • instance_id: the ID of the instance. You can run the describe command to query the instance ID.

To change the instance name from task0.my_job to task0.my_new_ins, run the following command:

fastgpu rename task0.my_job task0.my_new_ins

fastgpu tmux {instance_name}

Uses SSH to connect to an instance and uses the default tmux process.

To connect to the task0.my_job instance and create a tmux process, run the following command:

fastgpu tmux task0.my_job

fastgpu deletekeypair

Removes the SSH key pair of an on-premises machine.

Note

If the SSH key pair is being used by an instance, you cannot connect to the instance or may fail to query the instance after the SSH key pair is removed. If you want to query the instance, you must run the fastgpu ls -a command.

To remove the SSH key pair from the ~/.fastgpu/ directory, run the following command:

fastgpu deletekeypair

fastgpu createkeypair

Creates an SSH key pair for an on-premises machine. The SSH key pair is used when you create instances and connect to the instances in subsequent operations.

To create an SSH key pair in the ~/.fastgpu/ directory of an on-premises machine, run the following command:

fastgpu createkeypair

fastgpu attachkeypair {instance_name}

Attaches an SSH key pair to an instance.

  • To attach the SSH key pair in the ~/.fastgpu/ directory of an on-premises machine to the task0.my_job instance, run the following command:

    fastgpu attachkeypair task0.my_job

  • To attach the SSH key pair in the ~/.fastgpu/ directory of an on-premises machine to all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu attachkeypair {my_job}

fastgpu detachkeypair {instance_name}

Detaches an SSH key pair from an instance.

Note

After the SSH key pair is detached from the instance, you cannot connect to the instance or query the instance. If you want to connect to or query the instance, we recommend that you run the attachkeypair command to attach the SSH key pair to the instance.

  • To detach an SSH key pair from the task0.my_job instance, run the following command:

    fastgpu detachkeypair task0.my_job

  • To detach an SSH key pair from all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu detachkeypair {my_job}

fastgpu notebooksample {instance_name} {passwd_of_login}

Creates and deploys a sample Jupyter Notebook project on an instance.

  • Default password: AIACC.

  • Project instance: tensorflow-1.14-python36.

Parameter:

passwd_of_login: the password of the Jupyter Notebook server.

To create and deploy a sample Jupyter Notebook project on the task0.my_job instance, run the following command:

fastgpu notebooksample task0.my_job

fastgpu cuda {instance_name} {gpu_driver_version} {cuda_version} {cudnn_version}

Installs an NVIDIA GPU driver, CUDA, and cuDNN on an instance. Default versions:

  • NVIDIA GPU driver: 460.91.03

  • CUDA: 11.2.2

  • cuDNN: 8.1.1

Parameters:

  • gpu_driver_version: the version of the NVIDIA GPU driver that you want to install.

  • cuda_version: the version of CUDA that you want to install.

  • cudnn_version: the version of cuDNN that you want to install.

  • To install CUDA of the default version on the task0.my_job instance, run the following command:

    fastgpu cuda task0.my_job

  • To install the NVIDIA GPU driver 460.91.03, CUDA 11.2.2, and cuDNN 8.1.1 on the task0.my_job instance, run the following command:

    fastgpu cuda task0.my_job 460.91.03 11.2.2 8.1.1

fastgpu conda {instance_name}
fastgpu conda {instance_name} -f {conda_yaml_file}
fastgpu conda {instance_name} -h
fastgpu conda {instance_name} --cuda 10.0 -tf -v 1.15.0

Installs Conda on an instance, and creates a virtual environment in which the specified versions of Python and CUDA are installed.

Parameters:

  • -h: displays help information.

  • -f or --yaml: the YAML file that you want to use to install Conda.

  • -cu or --cuda: the version of CUDA. Valid values: 11.0, 10.2, 10.1, and 10.0.

  • -py, or --python: the version of Python. Valid values: 3.5, 3.6, 3.7, and 3.8.

  • -tf or --tensorflow: uses TensorFlow as the main framework.

  • -pt or --pytorch: uses PyTorch as the main framework.

  • -mx or --mxnet: uses MXNet as the main framework.

  • -v, --vers, or --framework_version: the version of the main framework.

Note

You can specify the parameters of only one of the TensorFlow, PyTorch, and MXNet frameworks in a single command.

  • To install Conda on the task0.my_job instance on which no virtual environment is installed, run the following command:

    fastgpu conda task0.my_job

  • To install Conda on all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu conda {my_job}

  • To install Conda on the task0.my_job instance and create a virtual environment in which Python 3.6, CUDA 11.0, and PyTorch 1.7.0 are installed, run the following command:

    fastgpu conda task0.my_job -py 3.6 -cu 11.0 -pt -v 1.7.0
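The note for this command states that the TensorFlow, PyTorch, and MXNet options are mutually exclusive. The following sketch applies that rule locally before the command is run. The function name conda_cmd is hypothetical. It only assembles and prints the command, and it uses the option names listed above without checking version values.

```shell
# Print a fastgpu conda command, or fail if more than one framework is requested.
conda_cmd() {
    instance="$1"
    shift
    cmd="fastgpu conda $instance"
    frameworks=0
    for arg in "$@"; do
        case "$arg" in
            -tf|--tensorflow|-pt|--pytorch|-mx|--mxnet)
                frameworks=$((frameworks + 1)) ;;
        esac
        cmd="$cmd $arg"
    done
    if [ "$frameworks" -gt 1 ]; then
        echo "error: specify only one of the TensorFlow, PyTorch, and MXNet options" >&2
        return 1
    fi
    echo "$cmd"
}

conda_cmd task0.my_job -py 3.6 -cu 11.0 -pt -v 1.7.0
```

A call such as conda_cmd task0.my_job -tf -pt returns a non-zero status and prints an error instead of a command.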

fastgpu allconda {instance_name}

Creates all Conda environments that are supported on an instance.

Note

This command takes a long period of time to run.

  • To create all Conda environments that are supported on the task0.my_job instance, run the following command:

    fastgpu allconda task0.my_job

  • To create all Conda environments that are supported on all instances in the cluster whose name is suffixed with my_job, run the following command:

    fastgpu allconda {my_job}

fastgpu replaceimage {instance_name} {image_id}

Replaces the image of an instance.

Parameter:

image_id: the name or ID of the new image.

  • To replace the image of the task0.my_job instance by using a CentOS image, run the following command:

    fastgpu replaceimage task0.my_job centos_8_2_x64_20G_alibase_20210712.vhd

  • To replace the images of all instances in the cluster whose name is suffixed with my_job by using a CentOS image, run the following command:

    fastgpu replaceimage {my_job} centos_8_2_x64_20G_alibase_20210712.vhd

Sample configuration file

The following sample shows the content of the create.cfg configuration file that is used in the fastgpu create command. For more information about the parameters, see the description of the fastgpu create command in this topic.

[fastgpu]
name=fastgpu-v100
machines=1
system_disk_size=500
data_disk_size=0
image_name=
image_type=ubuntu_18_04
instance_type=ecs.gn6v-c8g1.2xlarge
spot=False
confirm_cost=False
mount_nas=True
vpc_name=fastgpu-vpc
install_cuda=True

[cmd]
install_script=pwd
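Because fastgpu create provisions billable resources, you may want to read back the key settings of a configuration file before you use it. The following sketch assumes only that the file follows the section and key=value layout of the preceding sample. The function name cfg_get is hypothetical and is not a FastGPU command.

```shell
# Print the value of KEY inside [SECTION] of an INI-style file.
# Usage: cfg_get FILE SECTION KEY
cfg_get() {
    awk -v section="[$2]" -v key="$3" '
        # Track whether the current line belongs to the requested section.
        /^\[/ { in_section = ($0 == section); next }
        in_section && index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            exit
        }
    ' "$1"
}

cfg="${1:-create.cfg}"
if [ -f "$cfg" ]; then
    echo "name:          $(cfg_get "$cfg" fastgpu name)"
    echo "machines:      $(cfg_get "$cfg" fastgpu machines)"
    echo "instance_type: $(cfg_get "$cfg" fastgpu instance_type)"
    echo "spot:          $(cfg_get "$cfg" fastgpu spot)"
fi
```

If the printed values match what you expect, run fastgpu create --config create.cfg as described in the command table.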