Deep Learning Containers (DLC) of Platform for AI (PAI) is a cloud-native AI training platform that offers developers and enterprises a flexible, stable, easy-to-use, and high-performance environment for machine learning. It supports mainstream and custom algorithm frameworks as well as large-scale distributed deep learning jobs. DLC enables developers and enterprises to leverage an enhanced training environment, which helps reduce costs and boost training efficiency.
Benefits
Diverse computing resources:
DLC leverages Lingjun AI Computing Service and general computing resources to support a variety of computing options, including Elastic Computing Service (ECS), Elastic Container Instance (ECI), ECS bare metal instance, and Lingjun bare metal instance. This enables hybrid scheduling of heterogeneous computing resources.
Diverse distributed job types:
As a distributed training system, DLC simplifies job submission for more than ten training frameworks, such as Megatron, DeepSpeed, PyTorch, TensorFlow, Slurm, Ray, MPI, and XGBoost, without requiring a separate cluster for each. DLC offers multiple official images, supports custom development environments, and accepts job submissions through the console, SDK, or command line. This provides comprehensive coverage of AI training scenarios and a streamlined integration path for large-scale customers.
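To make the submission model concrete, the sketch below assembles a minimal distributed job description. This is a hypothetical schema written for illustration only; every field name, the builder function, and the image URI are assumptions, not the actual DLC SDK or CLI interface.

```python
# Sketch of a distributed training-job specification. All field names and
# values are illustrative assumptions, not the real DLC SDK or CLI schema.

def build_job_spec(name, framework, image, command, worker_count, gpus_per_worker):
    """Assemble a minimal training-job description (hypothetical schema)."""
    supported = {"PyTorch", "TensorFlow", "DeepSpeed", "MPI", "Ray", "XGBoost"}
    if framework not in supported:
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "name": name,
        "framework": framework,          # one of the supported job types
        "image": image,                  # official or custom image
        "command": command,              # entry point run on each worker
        "workers": worker_count,
        "gpus_per_worker": gpus_per_worker,
    }

spec = build_job_spec(
    name="llm-pretrain-demo",
    framework="PyTorch",
    image="registry.example.com/train:latest",  # placeholder image URI
    command="torchrun train.py",
    worker_count=4,
    gpus_per_worker=8,
)
print(spec["workers"] * spec["gpus_per_worker"])  # total GPUs requested: 32
```

However jobs are actually submitted, the same information is supplied: a framework type, an image, an entry command, and a worker topology.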
High stability:
In scenarios involving foundation model training, PAI-DLC addresses stability issues with its proprietary fault tolerance engine AIMaster, the high-performance Checkpoint framework EasyCKPT, health detection tool SanityCheck, and node self-healing capabilities. These features enable rapid detection, precise sensing, and prompt feedback, effectively minimizing computing power loss and enhancing training stability.
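The core idea behind a high-performance checkpoint framework such as EasyCKPT, resuming from the most recent consistent checkpoint after a failure, can be illustrated with a minimal standard-library sketch. The function names and file layout below are assumptions for illustration; the real framework's API and its in-memory or asynchronous techniques are not shown.

```python
import os
import pickle
import tempfile

CKPT_DIR = tempfile.mkdtemp()  # stand-in for shared checkpoint storage

def save_checkpoint(step, state):
    """Write state atomically: temp file first, then rename."""
    path = os.path.join(CKPT_DIR, f"ckpt-{step}.pkl")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def load_latest_checkpoint():
    """Return the highest-step checkpoint, or None if none exist."""
    ckpts = [f for f in os.listdir(CKPT_DIR) if f.endswith(".pkl")]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda f: int(f.split("-")[1].split(".")[0]))
    with open(os.path.join(CKPT_DIR, latest), "rb") as f:
        return pickle.load(f)

# Simulate a training loop that checkpoints every step, then "fails"
# and resumes from the latest checkpoint instead of from step 0.
for step in range(3):
    save_checkpoint(step, {"loss": 1.0 / (step + 1)})
resumed = load_latest_checkpoint()
print(resumed["step"])  # resumes from step 2
```

The atomic-rename detail matters in practice: a node that dies mid-write must never leave behind a checkpoint that looks valid but is truncated.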
High performance:
The AI training acceleration framework developed by PAI unifies data parallelism, pipeline parallelism, operator splitting, and nested parallel acceleration strategies. It automatically explores parallel strategies, optimizes GPU memory usage across multiple dimensions, and applies topology-aware scheduling over high-speed networks. Its optimized communication library, which features communication thread pools, gradient grouping fusion, mixed-precision communication, and gradient compression, further improves distributed training efficiency. These capabilities make it a strong training engine for foundation model pre-training, continual training, and alignment distributed training scenarios.
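One of the techniques named above, gradient grouping fusion, merges many small gradient tensors into a few large buffers so that fewer, larger communication calls are issued, amortizing per-call latency. The following is a minimal sketch of the bucketing logic in plain Python; real implementations fuse GPU tensors inside the communication library, and the byte threshold here is an arbitrary illustrative value.

```python
def fuse_gradients(grads, bucket_bytes, elem_size=4):
    """Group gradient tensors (plain lists here) into buckets whose total
    size stays under bucket_bytes. Each bucket becomes one communication
    call instead of one call per tensor."""
    buckets, current, current_bytes = [], [], 0
    for g in grads:
        size = len(g) * elem_size            # bytes for this gradient
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)          # close the full bucket
            current, current_bytes = [], 0
        current.append(g)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Eight small gradients (16 bytes each) fused into 64-byte buckets:
grads = [[0.1] * 4 for _ in range(8)]
buckets = fuse_gradients(grads, bucket_bytes=64)
print(len(buckets))  # 8 tensors -> 2 communication calls
```

The same idea underlies gradient bucketing in mainstream data-parallel frameworks: communication cost is dominated by call count for small tensors, so fusing them is a large win at scale.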
Resource types
PAI offers two resource types for submitting training jobs through DLC, based on the scenario and computing resources required:
Lingjun resources: Tailored for foundation model training, this service is ideal for deep learning jobs that demand extensive computing resources. It caters to ultra-large-scale deep learning and integrated AI computing, leveraging integrated software and hardware optimization technology to create a high-performance heterogeneous computing foundation. It provides comprehensive AI engineering capabilities, characterized by high performance, efficiency, and utilization, to meet the diverse needs of large model training, autonomous driving, basic research, and finance.
General computing resources: This type is suitable for standard training, offering flexible support for various scales and types of machine learning jobs.
Lingjun AI Computing Service and general computing resources support the following sources:
Resource quota: Secure Lingjun AI Computing Service or general computing resources in advance on a subscription basis for AI development and training, enabling flexible management and efficient resource utilization.
Public resources: Utilize Lingjun AI Computing Service or general computing resources on demand when submitting training jobs, billed on a pay-as-you-go basis.
Preemptible resources: Lingjun AI Computing Service provides preemptible resources, allowing you to access the necessary AI computing power at reduced costs, thus lowering the resource expenses for job execution.
Scenarios
Data preprocessing
Enables customization of the runtime environment for offline parallel preprocessing of data, substantially simplifying the engineering challenges associated with data preprocessing.
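The shape of offline parallel preprocessing can be sketched with Python's standard library. The per-record transform below is a placeholder assumption; in a real DLC job, each worker would run a transform like this over its own shard of the dataset.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(record):
    """Placeholder per-record transform: normalize case and tokenize."""
    return record.strip().lower().split()

records = ["The QUICK brown Fox", "Jumps Over", "the LAZY dog"]

# Threads sketch the idea compactly; CPU-bound transforms would use
# processes instead (or many DLC workers, each handling a data shard).
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(preprocess, records))
print(processed[0])  # ['the', 'quick', 'brown', 'fox']
```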
Large-scale distributed training
Facilitates offline large-scale distributed deep learning training with multiple open-source frameworks. DLC supports simultaneous training across thousands of nodes, dramatically reducing training time.
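The core of data-parallel distributed training, where each worker computes gradients on its own data shard and the results are averaged before the shared update, can be simulated in plain Python. This is a single-process sketch of the math only; a real job would run an all-reduce across many GPUs, and the toy model (fitting y = w·x by SGD) is an illustrative assumption.

```python
def shard(data, num_workers):
    """Split a dataset into roughly equal contiguous shards."""
    k, r = divmod(len(data), num_workers)
    shards, start = [], 0
    for i in range(num_workers):
        end = start + k + (1 if i < r else 0)
        shards.append(data[start:end])
        start = end
    return shards

def local_gradient(shard_data, w):
    """Gradient of mean squared error for y = w * x on one shard."""
    return sum(2 * x * (w * x - y) for x, y in shard_data) / len(shard_data)

data = [(x, 3.0 * x) for x in range(1, 9)]   # targets follow w* = 3
w = 0.0
for _ in range(200):                          # simulated training loop
    grads = [local_gradient(s, w) for s in shard(data, num_workers=4)]
    w -= 0.01 * sum(grads) / len(grads)       # "all-reduce" then SGD step
print(round(w, 2))  # converges toward 3.0
```

Each "worker" sees only a quarter of the data, yet the averaged gradient drives the shared parameter to the same optimum; scaling the worker count shrinks each shard and the wall-clock time per epoch.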
Offline inference
DLC supports offline inference of models, optimizing the use of idle GPU resources and significantly cutting down on resource wastage.
References
Learn how to submit training jobs by using the console, SDK, or command line, and how to configure key parameters.
Learn how to use DLC through use cases.