This topic describes how to use the sanity check feature provided by Deep Learning Containers (DLC).
Overview
You may encounter the following issues when you run a DLC job in Platform for AI (PAI):
The job fails after it loads the model checkpoints or performs other initialization operations because of a resource failure. You must troubleshoot the failure before you resubmit the job, and the GPU resources consumed by the failed run are wasted.
The performance of the job degrades during the run because of slow nodes, but the issue is hard to locate quickly. It is also hard to test the GPU computing power and communication performance of instances in the resource group because no convenient and reliable benchmark is available.
To address these issues, DLC provides the sanity check feature, which checks the health status and performance of the computing resources that are used to run distributed training jobs. You can enable sanity check when you create a DLC job. The system checks the resources that are used for the training, automatically isolates faulty nodes, and triggers an automated O&M process in the background. Sanity check reduces failures in the early stage of a training job and increases the likelihood that the job succeeds. After the sanity check is completed, the system generates a test report on the computing power and communication performance of the related GPUs. You can use the report to identify potential risks that may degrade training performance and handle them efficiently.
Limits
You can enable sanity check only for DLC jobs that run on intelligent computing LINGJUN resources in the China (Ulanqab) and Singapore regions.
You can enable sanity check only for PyTorch jobs that use at least one GPU.
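The limits above can be expressed as a small pre-submission check. The following is a minimal sketch, not part of any PAI SDK: the `JobSpec` class, its field names, and the string values used for the resource type and framework are hypothetical and only mirror the two limits stated in this section.

```python
from dataclasses import dataclass
from typing import List

# Regions taken from the limits above. The exact identifiers used by the
# platform are not given in this topic, so these strings are assumptions.
SUPPORTED_REGIONS = {"China (Ulanqab)", "Singapore"}


@dataclass
class JobSpec:
    """Hypothetical description of a DLC job, for illustration only."""
    region: str
    resource_type: str  # for example "Lingjun"
    framework: str      # for example "PyTorch"
    gpu_count: int


def sanity_check_blockers(job: JobSpec) -> List[str]:
    """Return the reasons why sanity check cannot be enabled (empty if eligible)."""
    reasons = []
    if job.resource_type.lower() != "lingjun":
        reasons.append("job does not run on Lingjun resources")
    if job.region not in SUPPORTED_REGIONS:
        reasons.append("region is not China (Ulanqab) or Singapore")
    if job.framework.lower() != "pytorch":
        reasons.append("job is not a PyTorch job")
    if job.gpu_count < 1:
        reasons.append("job uses no GPUs")
    return reasons
```

An empty list means that both limits are met. Returning the reasons instead of a single Boolean makes it easier to see which limit blocks the feature.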
Enable sanity check
Enable sanity check in the PAI console
When you create a DLC job in the PAI console, enable Sanity Check in the Fault Tolerance and Diagnosis section and configure the related parameters. For more information, see Submit training jobs. After you enable sanity check and submit a training job, the system takes some time to check the health status and availability of the resources, and then provides a check report.
The following table describes key parameters.
Parameter | Description |
Check Time | |
Maximum Check Duration (Minutes) | The maximum duration of a sanity check. Default value: 30 minutes. If the sanity check runs longer than this duration, the configured timeout action is triggered. |
Timeout Action | The status that the job enters after a sanity check times out. |
Other Configurations | This parameter is empty by default. |
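The relationship between Maximum Check Duration and Timeout Action can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: the `SanityCheckSettings` class and the `action_on_timeout` function are hypothetical, and the timeout action is kept as an opaque string because its allowed values are not listed here.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_CHECK_MINUTES = 30  # default value stated in the table above


@dataclass
class SanityCheckSettings:
    """Hypothetical container for the console parameters."""
    max_check_minutes: int = DEFAULT_MAX_CHECK_MINUTES
    timeout_action: str = ""        # whatever is selected for Timeout Action
    other_configurations: str = ""  # empty by default


def action_on_timeout(settings: SanityCheckSettings,
                      elapsed_minutes: float) -> Optional[str]:
    """Return the configured action once the check has run longer than the limit.

    Returns None while the check is still within the maximum duration.
    """
    if elapsed_minutes > settings.max_check_minutes:
        return settings.timeout_action
    return None
```

The comparison is strict because the action is described as being triggered when the check runs longer than the configured duration.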
View the check results
Sanity check status
The DLC job may be in one of the following statuses during a sanity check:
Checking: The sanity check on computing power is in progress.
Check Failed: Issues were detected during the sanity check, or the check timed out.
Check Passed: The job passed the sanity check and enters the running status.
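The three statuses above can be modeled as a small state resolution function. This is a minimal sketch for reasoning about the statuses: the enum and function names are hypothetical, and only the status labels come from this topic.

```python
from enum import Enum


class SanityCheckStatus(Enum):
    """Status labels as listed above."""
    CHECKING = "Checking"
    CHECK_FAILED = "Check Failed"
    CHECK_PASSED = "Check Passed"


def resolve_status(finished: bool, issues_detected: bool,
                   timed_out: bool) -> SanityCheckStatus:
    """Map the outcome of a sanity check to one of the three statuses."""
    # A detected issue or a timeout fails the check, finished or not.
    if issues_detected or timed_out:
        return SanityCheckStatus.CHECK_FAILED
    if not finished:
        return SanityCheckStatus.CHECKING
    return SanityCheckStatus.CHECK_PASSED


def job_may_run(status: SanityCheckStatus) -> bool:
    """Only a job that passed the check proceeds to the running status."""
    return status is SanityCheckStatus.CHECK_PASSED
```

For example, `resolve_status(finished=False, issues_detected=False, timed_out=True)` resolves to `SanityCheckStatus.CHECK_FAILED`, which matches the rule that a timeout counts as a failed check.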
View the results of a sanity check
View the results in the PAI console
On the Events tab of the DLC job details page, click Sanity Check to view the check results.
Configure an event rule
You can create an event notification rule on the Events tab of a PAI workspace. Set Event Type to DLC Job and Automatic Fault Tolerance. For more information about other parameters, see Create a notification rule. If the job fails the sanity check, the system sends a notification.
For more information about configuring a notification, see Workspace notification.