This topic describes the end-to-end process of using Deep Learning Containers (DLC).
You can use DLC to run large-scale distributed jobs to train models. Take the following steps:
Before you submit a training job, make sure the following prerequisites are met:
You can choose Lingjun resources for foundation model training jobs, or general computing resources for standard training jobs.
Prepare an image for the training environment. You can use official images or custom images.
Upload the data required for the training job to an Object Storage Service (OSS) bucket or File Storage NAS file system, for example by using the OSS SDK as shown in the sketch after this list. Then, create a dataset based on the storage so that the training job can access the training files and store its output files.
Prepare the code required for the training job. For streamlined use and management across multiple jobs, we recommend that you add the code source as an AI asset on the corresponding page of the console.
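For example, the following is a minimal sketch that uploads a local training file to OSS with the oss2 Python SDK. The endpoint, bucket name, credentials, and paths are placeholders that you would replace with your own.

```python
# Minimal sketch: upload local training data to OSS with the oss2 SDK.
# The endpoint, bucket name, credentials, and paths below are placeholders.
import oss2

# Prefer reading credentials from environment variables or STS tokens
# instead of hard-coding them in your scripts.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-training-bucket")

# Upload a local file to the bucket; the object key becomes the path
# that the dataset later mounts or reads.
bucket.put_object_from_file("datasets/train/train.csv", "./train.csv")
```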
Create a training job.
You can submit training jobs by using the console, SDK, or command line. For more information about how to configure the parameters, see Submit training jobs.
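As a reference, the following is a minimal sketch that submits a single-worker PyTorch job through the DLC Python SDK (alibabacloud_pai_dlc20201203). The endpoint, workspace ID, image, resource specification, and command are placeholders, and field names may differ slightly across SDK versions, so treat this as an illustration rather than a definitive call sequence.

```python
# Minimal sketch: submit a DLC training job with the DLC Python SDK.
# All values below are illustrative placeholders; adapt them to your
# region, workspace, image, and resources.
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203 import models as dlc_models

# Create a client for the region that hosts your workspace.
client = Client(Config(
    access_key_id="<your-access-key-id>",
    access_key_secret="<your-access-key-secret>",
    endpoint="pai-dlc.cn-hangzhou.aliyuncs.com",
))

# Describe a single-worker PyTorch job.
job_spec = dlc_models.JobSpec(
    type="Worker",
    image="<your-training-image>",
    pod_count=1,
    ecs_spec="ecs.gn6v-c8g1.2xlarge",  # placeholder GPU specification
)
request = dlc_models.CreateJobRequest(
    display_name="my-training-job",
    job_type="PyTorchJob",
    workspace_id="<your-workspace-id>",
    job_specs=[job_spec],
    user_command="python /root/code/train.py",
)

response = client.create_job(request)
print("Submitted job:", response.body.job_id)
```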
When you submit a training job, you can configure the following advanced features:
AIMaster: Elastic fault tolerance engine
When you use AIMaster for a DLC job, an AIMaster instance is launched to run concurrently with other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.
Sanity check
Sanity check inspects the resources used by the training job, automatically isolates faulty nodes, and triggers an automated O&M process in the background. This effectively reduces failures in the early stage of a training job and increases the probability that the job succeeds.
Use EasyCkpt to save and resume foundation model training
EasyCkpt allows you to save checkpoints and resume foundation model training without losing training progress.
RDMA: high-performance networks for distributed training
For jobs that use serverless Lingjun resources, you can use high-performance Remote Direct Memory Access (RDMA) networks for distributed training.
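For example, in a PyTorch job the RDMA network is typically consumed through NCCL. The following sketch only sets common NCCL environment variables before initializing the process group; the values are assumptions that depend on the network configuration of your resource group.

```python
# Minimal sketch: enable NCCL over RDMA for a distributed PyTorch job.
# The environment variable values are assumptions that depend on the
# network interfaces available in your resource group.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_IB_DISABLE", "0")   # allow NCCL to use InfiniBand/RoCE
os.environ.setdefault("NCCL_DEBUG", "INFO")     # print NCCL transport selection in the logs

# For PyTorch jobs, the environment of each instance typically provides
# MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, so init_process_group
# can read them with the env:// method.
dist.init_process_group(backend="nccl", init_method="env://")
```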
Use cloud storage for a DLC training job
You can access OSS, NAS, Cloud Parallel File Storage (CPFS), or MaxCompute storage by mounting it or by using code. This allows the training job to read data from and write data to the storage directly.
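For example, when the storage is mounted into the job, the training code can treat it as a local file system. In the following sketch, the mount paths /mnt/data and /mnt/output are assumptions that correspond to whatever mount paths you configure for the job.

```python
# Minimal sketch: use storage mounted into a DLC job as a local file system.
# The mount paths below (/mnt/data and /mnt/output) are assumptions that
# match the mount paths you configure for the job's datasets.
from pathlib import Path

data_dir = Path("/mnt/data")        # mounted OSS/NAS/CPFS dataset
output_dir = Path("/mnt/output")    # mounted path for checkpoints and logs
output_dir.mkdir(parents=True, exist_ok=True)

# Read training files directly from the mounted storage.
train_files = sorted(data_dir.glob("train/*.csv"))
print(f"Found {len(train_files)} training files")

# Write results back; they are persisted to the underlying cloud storage.
(output_dir / "metrics.txt").write_text("loss=0.123\n")
```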
Log forwarding
The log forwarding feature of Simple Log Service allows you to forward the logs of DLC jobs in the current workspace to a specific Logstore for custom analysis.
Use preemptible resources when you create a DLC job with Lingjun AI Computing Service resources.
View and manage training jobs.
After you submit a job, you can view the status of the job. For more information, see View training jobs. You can also manage jobs with operations like stopping, cloning, sharing, generating scripts, and deleting. For more information, see Manage training jobs.
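These management operations are also available programmatically. The following sketch queries a job's status and then stops it with the DLC Python SDK; the job ID is a placeholder, and the method names are based on a recent SDK version, so verify them against the SDK that you use.

```python
# Minimal sketch: query and stop a DLC job with the DLC Python SDK.
# The credentials, endpoint, and job ID are placeholders.
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203 import models as dlc_models

client = Client(Config(
    access_key_id="<your-access-key-id>",
    access_key_secret="<your-access-key-secret>",
    endpoint="pai-dlc.cn-hangzhou.aliyuncs.com",
))

job_id = "<your-job-id>"

# Query the current status of the job (for example Running, Succeeded, or Failed).
job = client.get_job(job_id, dlc_models.GetJobRequest())
print("Job status:", job.body.status)

# Stop a running job that is no longer needed.
client.stop_job(job_id)
```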
Monitor training jobs.
After you submit a job, you can use the following methods to monitor the job:
Use TensorBoard to analyze the training results when a dataset is bound to the job (see the sketch after this list).
View DLC job resource status or set alert rules with CloudMonitor or ARMS. For details, see Training monitoring and alerting.
On the Events tab of the Workspace Details page, create notification rules to track and monitor DLC job statuses. For more information, see Create a notification rule.
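For the TensorBoard analysis mentioned above to have data, the training code must write event files to a path that the TensorBoard instance reads, such as an output dataset mount. The following is a minimal sketch with PyTorch's SummaryWriter, where the log directory is an assumption.

```python
# Minimal sketch: write TensorBoard event files from training code.
# The log directory is an assumption; point it at the output path that
# the TensorBoard instance is configured to read.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/mnt/output/tensorboard")
for step in range(100):
    loss = 1.0 / (step + 1)                  # placeholder metric
    writer.add_scalar("train/loss", loss, step)
writer.close()
```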
Configure periodic scheduling of training jobs.
You can configure periodic scheduling for continuous incremental training and model tuning with updated test data or hyperparameters.
For more DLC use cases, see DLC use cases.