This topic describes how to use the sanity check feature provided by Deep Learning Containers (DLC).
Overview
You may encounter the following issues when you run a DLC job in Platform for AI (PAI):
The job fails due to a resource fault after it loads model checkpoints or performs other initialization operations. You must troubleshoot the issue before you submit the job again, which wastes GPU resources.
Slow nodes degrade model performance while the job runs, but the issue is hard to locate quickly and effectively. The lack of a convenient and reliable benchmark also makes it difficult to test the GPU computing power and communication performance of the instances in a resource group.
To address the preceding issues, DLC provides the sanity check feature, which checks the health status and performance of the computing resources used to run distributed training jobs. You can enable sanity check when you create a DLC job. The system detects the resources related to the training job, automatically isolates faulty nodes, and triggers an automated O&M process in the background. This effectively reduces failures in the early stage of a training job and increases the probability of job success. After the sanity check is complete, the system generates a test report on the computing power and communication performance of the related GPUs. You can use the report to identify potential risks that may degrade training performance and resolve the issues efficiently.
Limits
You can enable sanity check only for DLC jobs that run on intelligent computing LINGJUN resources in the China (Ulanqab) and Singapore regions.
You can enable sanity check only for PyTorch jobs that use at least one GPU.
Enable sanity check
Enable sanity check in the PAI console
When you create a DLC job in the PAI console, enable Sanity Check in the Fault Tolerance and Diagnosis section and configure the related parameters. For more information, see Submit training jobs. After you enable sanity check and submit the training job, the system takes some time to check the health status and availability of the resources and then provides a check report.
The following table describes key parameters.
| Parameter | Description |
| --- | --- |
| Check Time | Specifies when the sanity check runs. |
| Check Item | By default, the GPU GEMM and All-Reduce check items are selected. You can select check items from the following categories: Computing Performance Check, Node Communication Check, Computing and Communication Cross-Check, and Model Simulation Check. For more information about the check items and their recommended scenarios, see Appendix: Check items. |
| Maximum Check Duration (Minutes) | The maximum duration for which a sanity check can run. Default value: 30. If the check runs longer than this duration, the configured timeout action is triggered. |
| Timeout Action | The status that the job enters if the sanity check exceeds the maximum check duration. |
| Other Configurations | This parameter is empty by default. |
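If you submit jobs programmatically rather than in the console, similar settings would appear in your job specification. The following Python snippet is a purely illustrative sketch of such a configuration; the field names (`sanity_check`, `check_items`, and so on) are hypothetical placeholders, not the actual DLC API schema. See the DLC SDK reference for the real fields.

```python
# Hypothetical sketch only: the field names below are illustrative
# placeholders, not the actual DLC API schema.
sanity_check_settings = {
    "enabled": True,
    "check_items": [             # corresponds to "Check Item" in the console
        "GPU_GEMM",              # selected by default
        "ALL_REDUCE",            # selected by default
    ],
    "max_duration_minutes": 30,  # "Maximum Check Duration (Minutes)", default 30
    "timeout_action": "FAIL",    # "Timeout Action": job status applied on timeout
}
```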
View the check results
Sanity check status
The DLC job may be in one of the following statuses during a sanity check:
Checking: The sanity check on computing power is in progress.
Check Failed: The sanity check fails if issues are detected or the check times out.
Check Passed: The job passes the sanity check and enters the Running state.
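If you monitor jobs from a script rather than the console, you can poll the job status until the check phase resolves. The following minimal sketch assumes a hypothetical `get_status` helper that wraps the real DLC SDK query (for example, the GetJob operation); the status strings mirror the console labels above and are assumptions, not documented API constants.

```python
import time

def wait_for_sanity_check(job_id, get_status, interval=30):
    """Poll a DLC job until its sanity check phase resolves.

    `get_status` is a hypothetical callable standing in for the real DLC
    SDK query; the status strings mirror the console labels and are
    assumptions rather than documented API constants.
    """
    while (status := get_status(job_id)) == "Checking":
        time.sleep(interval)          # sanity check still in progress
    return status                     # e.g. "Check Failed" or "Running"
```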
View the results of a sanity check
View the results in the PAI console
On the Events tab of the DLC job details page, click Sanity Check to view the check results.
Configure an event rule
You can create an event notification rule on the Events tab of a PAI workspace. Set the event type to DLC Job and the event to Automatic Fault Tolerance. For more information about the other parameters, see Create a notification rule. If the job fails the sanity check, the system sends a notification.
For more information about configuring a notification, see Workspace notification.
Appendix: Check items
The estimated time is based on two instances and is for reference only. Minimal code sketches that illustrate how some of these checks measure performance follow the table.
| Check item | Sub-item | Description (scenario) | Estimated time |
| --- | --- | --- | --- |
| Computing Performance Check | GPU GEMM | Checks GPU GEMM performance to detect GPUs with abnormal computing performance. | 1 minute |
| Computing Performance Check | GPU Kernel Launch | Checks GPU kernel launch latency to detect nodes with abnormal kernel startup overhead. | 1 minute |
| Node Communication Check | All-Reduce | Checks node communication performance under the corresponding communication mode to identify slow or faulty communication nodes. | 5 minutes for a single check |
| Node Communication Check | All-to-All | Same as All-Reduce. | 5 minutes for a single check |
| Node Communication Check | All-Gather | Same as All-Reduce. | 5 minutes for a single check |
| Node Communication Check | Multi-All-Reduce | Same as All-Reduce. | 5 minutes for a single check |
| Node Communication Check | Network Connectivity | Checks the network connectivity between the frontend and the backend to identify nodes with abnormal communication. | 2 minutes |
| Computing and Communication Cross-Check | MatMul/All-Reduce Overlap | Checks single-node performance when communication kernels and computing kernels overlap, to detect nodes whose performance degrades when computation and communication run concurrently. | 1 minute |
| Model Simulation Check | Mini GPT | Uses a model simulation to verify end-to-end AI system reliability and detect anomalies during simulated training. | 1 minute |
| Model Simulation Check | Megatron GPT | Same as Mini GPT. | 5 minutes |
| Model Simulation Check | ResNet | Same as Mini GPT. | 2 minutes |
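To make the check items concrete, the following minimal PyTorch sketch shows the kind of measurement a GPU GEMM check performs: it times large half-precision matrix multiplications on each GPU and reports the achieved TFLOPS, so a device with degraded computing performance stands out as an outlier. This is an illustration of the technique, not the implementation that DLC uses.

```python
import torch

def gemm_tflops(index, n=8192, iters=20):
    """Time n x n half-precision matmuls on one GPU and return achieved TFLOPS."""
    torch.cuda.set_device(index)
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    for _ in range(3):                       # warm up the GPU and the allocator
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0 / iters  # ms -> s per matmul
    return 2 * n ** 3 / seconds / 1e12                  # one GEMM is 2*n^3 FLOPs

if __name__ == "__main__":
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i}: {gemm_tflops(i):.1f} TFLOPS")
```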
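A kernel launch check can be approximated in the same spirit: issue many tiny kernels and measure the average wall-clock cost per launch. Again, this is a rough sketch of the measurement, not the DLC implementation.

```python
import time
import torch

def launch_latency_us(iters=10000):
    """Estimate per-kernel launch overhead by issuing many tiny kernels."""
    x = torch.zeros(1, device="cuda")
    for _ in range(100):                     # warm up
        x.add_(1)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        x.add_(1)    # 1-element kernel: wall time is dominated by launch cost
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6  # seconds -> microseconds

if __name__ == "__main__":
    print(f"~{launch_latency_us():.1f} us per kernel launch")
```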
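A communication check such as All-Reduce comes down to timing collectives across ranks. The sketch below assumes the NCCL backend and a launch via torchrun, which sets the rendezvous environment variables and LOCAL_RANK; comparing the per-rank times across nodes surfaces slow links or nodes. It illustrates the measurement, not the DLC implementation.

```python
import os
import time
import torch
import torch.distributed as dist

def allreduce_ms(numel=256 * 1024 * 1024, iters=10):
    """Time All-Reduce on a large fp16 tensor; a slow node or link shows up
    as an outlier when the per-rank results are compared."""
    x = torch.ones(numel, dtype=torch.float16, device="cuda")
    dist.all_reduce(x)                       # warm up NCCL communicators
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1000

if __name__ == "__main__":
    dist.init_process_group("nccl")          # rendezvous via torchrun env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"rank {dist.get_rank()}: {allreduce_ms():.1f} ms per All-Reduce")
    dist.destroy_process_group()
```

For example, across two 8-GPU instances you could run the script with `torchrun --nnodes=2 --nproc_per_node=8` plus your rendezvous endpoint, then compare the per-rank times.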