Elastic GPU Service: Overview of Alibaba Cloud heterogeneous computing services

Last Updated: Aug 22, 2024

Alibaba Cloud heterogeneous computing services provide a complete system that integrates software and hardware to help you allocate and scale resources elastically, increase computing power, and control costs for AI business.

Heterogeneous computing

Heterogeneous computing refers to a system that consists of computing units with different instruction sets and architecture types. The Alibaba Cloud heterogeneous computing service family includes Elastic GPU Service. Heterogeneous computing services provide dedicated hardware for the scenarios to which that hardware is best suited. In those scenarios, heterogeneous computing services can provide higher efficiency and cost-effectiveness than regular Elastic Compute Service (ECS) instances. Heterogeneous computing balances performance, costs, and power consumption: dedicated hardware can reduce power consumption while improving performance and cost efficiency.

The rapid development of AI technologies such as deep learning results in increasingly complex but more accurate AI computing models and a significant increase in the demand for computing power and performance. Against this backdrop, an increasing number of AI computing services use heterogeneous computing to accelerate performance. The cloud-based AI accelerators that Alibaba Cloud develops for heterogeneous computing services use a centralized framework to accelerate mainstream AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe, and to optimize the performance of Ethernet and heterogeneous accelerators.

Heterogeneous computing service family

The Alibaba Cloud heterogeneous computing service family consists of services such as Elastic GPU Service and FaaS. Alibaba Cloud also provides DeepGPU, a collection of tools that provide enhanced GPU computing capabilities. DeepGPU includes AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed), AIACC Graph Speeding (AGSpeed), FastGPU, and cGPU.

  • Elastic GPU Service

    Elastic GPU Service provides GPU-accelerated instances, which are computing servers based on GPUs. GPUs have unique advantages over CPUs in mathematical and geometric computations such as floating-point and parallel computing, and can provide up to 100 times the computing power of CPUs for such workloads. GPU-accelerated instances combine the computing power of GPUs and CPUs and provide ready-to-use, scalable GPU computing resources for scenarios such as AI, high-performance computing, and professional graphics processing. For more information, see What is Elastic GPU Service?

  • DeepGPU

    DeepGPU is a collection of GPU enhancement tools that Alibaba Cloud provides for Elastic GPU Service. DeepGPU helps you quickly build enterprise-level services based on infrastructure as a service (IaaS) products. You can use all components in DeepGPU together with Elastic GPU Service free of charge, which helps you use GPU resources on Alibaba Cloud efficiently. DeepGPU includes the following components:

    • AIACC-ACSpeed (ACSpeed): also known as AIACC-Training V2.0. ACSpeed is a communication optimization library released by Alibaba Cloud for the distributed training of AI models. ACSpeed provides solutions based on decoupled modules. For more information, see What is AIACC-ACSpeed?

    • AGSpeed: an optimizing compiler for AI training that is developed by Alibaba Cloud. AGSpeed is used to optimize the computing performance of PyTorch models on Alibaba Cloud GPU-accelerated compute-optimized instances. For more information, see What is AIACC-AGSpeed?

    • FastGPU: a set of fast deployment tools provided by Alibaba Cloud for AI computing. For more information, see What is FastGPU?

    • cGPU: a GPU-sharing container technology developed by Alibaba Cloud. cGPU supports kernel-based isolation of virtual GPU resources to help you quickly deploy containers. You can split a single GPU and assign it to multiple isolated containers. This way, you can ensure business security and save costs by improving GPU resource utilization. For more information, see What is cGPU?
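
The idea of splitting one GPU among several isolated containers, as described for cGPU above, can be illustrated with a toy model. The following sketch is not the cGPU interface and makes no claim about how cGPU is implemented; the class name SharedGPU, its methods, the container names, and the 16 GiB figure are all hypothetical and chosen only for illustration. It models a single GPU whose memory is divided into per-container slices and shows how utilization rises as more containers share the device.

```python
from dataclasses import dataclass, field


@dataclass
class SharedGPU:
    """Toy model (hypothetical, not the cGPU API) of one GPU shared by containers."""
    total_mem_gib: int
    allocations: dict = field(default_factory=dict)  # container name -> GiB reserved

    @property
    def free_mem_gib(self) -> int:
        # Memory not yet reserved by any container.
        return self.total_mem_gib - sum(self.allocations.values())

    def assign(self, container: str, mem_gib: int) -> bool:
        """Reserve a memory slice for a container; refuse if it does not fit."""
        if mem_gib <= 0 or container in self.allocations:
            return False
        if mem_gib > self.free_mem_gib:
            return False
        self.allocations[container] = mem_gib
        return True

    def utilization(self) -> float:
        # Fraction of the GPU's memory that is reserved.
        return sum(self.allocations.values()) / self.total_mem_gib


# Assumed scenario: a 16 GiB GPU and three containers requesting slices.
gpu = SharedGPU(total_mem_gib=16)
results = [
    gpu.assign("inference-a", 4),
    gpu.assign("inference-b", 8),
    gpu.assign("training-c", 8),  # rejected: only 4 GiB remain
]
print(results, gpu.free_mem_gib, gpu.utilization())
# prints: [True, True, False] 4 0.75
```

In this model, a single container that needs 4 GiB would leave 75% of the GPU idle, whereas two containers sharing the device bring reserved memory to 75%. The model only tracks memory bookkeeping; actual isolation of GPU resources between containers is what cGPU provides at the kernel level.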