FastGPU is a set of fast deployment tools that Alibaba Cloud provides for AI computing. FastGPU provides convenient interfaces and automated tools that you can use to deploy AI training and inference tasks on Alibaba Cloud IaaS resources within a short period of time.
Overview
FastGPU connects your offline AI algorithms to the large pool of online GPU computing resources of Alibaba Cloud. You can use FastGPU to easily build AI computing tasks on Alibaba Cloud IaaS resources. When you use FastGPU to build AI computing tasks, you do not need to manually deploy computing, storage, or network resources at the IaaS layer. FastGPU automatically adapts, deploys, and runs the code for your tasks.
FastGPU provides the following components:
ncluster: the runtime component. This component provides convenient interfaces that you can use to quickly deploy offline AI training and inference scripts on Alibaba Cloud IaaS resources. For more information, see Use FastGPU SDK for Python.
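The following minimal sketch shows how a runtime component of this kind might be used to launch a training script. All names here, including make_task, the instance type, and the script name, are assumptions based on typical ncluster-style interfaces, not the confirmed FastGPU API; see Use FastGPU SDK for Python for the exact signatures.

```python
def launch_training_job():
    """Hypothetical sketch: provision a GPU instance and run a training
    script through the ncluster runtime component. make_task, the
    instance type, and train.py are assumptions, not the confirmed API."""
    # Imported inside the function so this sketch can be loaded for
    # reading even when the SDK is not installed.
    import ncluster

    # Request a task backed by an Alibaba Cloud GPU instance.
    task = ncluster.make_task(
        name="demo-training",
        instance_type="ecs.gn6v-c8g1.2xlarge",  # assumed GPU instance type
    )

    # Copy the local training script to the instance and run it there.
    task.upload("train.py")
    task.run("python train.py")
```

Calling launch_training_job() on a machine that has the SDK and valid Alibaba Cloud credentials would then hand the provisioning and execution work to FastGPU.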
ecluster: the command line component. This component provides command line tools that you can use to manage the running status of Alibaba Cloud AI computing tasks and the lifecycle of clusters. For more information, see Command reference.
Architecture
The following figure shows the architecture of FastGPU.
Underlying layer: the interaction layer, which calls Alibaba Cloud API operations to provision and use Alibaba Cloud resources.
Intermediate layer: the Alibaba Cloud backend layer, which encapsulates the IaaS resources involved in running AI tasks as objects.
Upper layer: the user control layer, which maps AI tasks to Alibaba Cloud instance resources.
To build IaaS-level AI computing tasks on Alibaba Cloud within a short period of time, you need only interact with the user control layer.
Flowchart
For example, if you use FastGPU to complete a training task, the following stages are involved:
Stage 1: You start to use FastGPU.
Upload your training dataset to Object Storage Service (OSS) and create an Elastic Compute Service (ECS) instance as a development host to store the training code.
Stage 2: FastGPU builds and runs the computing task.
From the development host, FastGPU deploys the cluster and creates the resources required for the task. The resources include computing resources such as CPUs and GPUs, storage resources such as cloud disks and File Storage NAS (NAS) file systems, and interactive resources such as Tmux and TensorBoard.
The distributed training task is automatically started. During the training process, you can view the training status in real time by using interactive resources.
The resources are automatically released after the distributed training task is complete.
Stage 3: You perform subsequent operations after the training task is complete.
Store the trained models and log files on the cloud disks of the development host or in OSS so that you can view the task results.
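As a hedged sketch of what stage 3 might look like in code, the following function reattaches to the named task and copies artifacts back for inspection. The method names (make_task for reattachment, download) and the file paths are assumptions, not the confirmed FastGPU API; consult Use FastGPU SDK for Python for the actual interfaces.

```python
def collect_results():
    """Hypothetical sketch: copy the trained model and logs from the
    training instance back to the development host. Method names and
    file paths are assumptions, not the confirmed FastGPU API."""
    # Imported inside the function so the sketch loads without the SDK.
    import ncluster

    # Reattach to the task that ran the training job (assumed behavior:
    # requesting a task with an existing name reuses that instance).
    task = ncluster.make_task(name="demo-training")

    # Download artifacts so the results can be inspected locally or
    # archived to OSS afterwards.
    task.download("checkpoints/model.pt")
    task.download("logs/train.log")
```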