Platform for AI: Sanity check

Last Updated: Dec 17, 2024

This topic describes how to use the sanity check feature provided by Deep Learning Containers (DLC).

Overview

You may encounter the following issues when you run a DLC job in Platform for AI (PAI):

  • The job fails because of a resource fault only after it has loaded the model checkpoints or completed other initialization operations. You must troubleshoot the fault before you resubmit the job, and the GPU resources consumed in the meantime are wasted.

  • Slow nodes degrade model performance while the job is running, but the cause is hard to locate quickly. Without a convenient and reliable benchmark, it is also hard to test the GPU computing power and communication performance of the instances in the resource group.

To address these issues, DLC provides the sanity check feature, which checks the health status and performance of the computing resources used to run distributed training jobs. You can enable sanity check when you create a DLC job. The system then checks the resources used for training, automatically isolates faulty nodes, and triggers an automated O&M process in the background. This reduces failures in the early stage of a training job and increases the likelihood that the job succeeds. After the sanity check is complete, the system generates a report on the computing power and communication performance of the GPUs involved. You can use the report to locate potential risks that may degrade training performance and handle them efficiently.

Limits

  • You can enable sanity check only for DLC jobs that run on intelligent computing LINGJUN resources in the China (Ulanqab) and Singapore regions.

  • You can enable sanity check only for PyTorch jobs that use at least one GPU.
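
The limits above amount to a simple pre-submission check. The following is a minimal sketch of that check, not PAI's own validation logic. The `JobSpec` class, its field names, and the region and resource-type strings are hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical display names for the two supported regions (not official region IDs).
SUPPORTED_REGIONS = {"China (Ulanqab)", "Singapore"}


@dataclass
class JobSpec:
    """Hypothetical description of a DLC job, for illustration only."""
    framework: str       # for example "PyTorch"
    resource_type: str   # for example "Lingjun"
    region: str
    gpu_count: int


def sanity_check_blockers(job: JobSpec) -> List[str]:
    """Return the reasons why sanity check cannot be enabled (empty if none)."""
    reasons = []
    if job.resource_type.lower() != "lingjun":
        reasons.append("job does not run on Lingjun resources")
    if job.region not in SUPPORTED_REGIONS:
        reasons.append("region is not China (Ulanqab) or Singapore")
    if job.framework.lower() != "pytorch":
        reasons.append("job is not a PyTorch job")
    if job.gpu_count < 1:
        reasons.append("job uses no GPU")
    return reasons
```

Returning the list of reasons rather than a single boolean makes it easier to tell a user which limit a job violates.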

Enable sanity check

Enable sanity check in the PAI console

When you create a DLC job in the PAI console, you can enable Sanity Check in the Fault Tolerance and Diagnosis section and configure the related parameters. For more information, see Submit training jobs. After you enable sanity check and submit a training job, the system takes some time to check the health status and availability of the resources and then provides a check report.

The following table describes key parameters.

Parameter

Description

Check Time

  • Before Job Runs: After the job obtains the resources, the system checks the health status of the computing power and then runs the job. This is the default setting.

  • After Job Runs: After the system restarts a failed job, the system runs the sanity check first.

    Note

    This option is available only if you enable the Automatic Fault Tolerance feature. For more information, see AIMaster: Elastic fault tolerance engine.

Check Item

By default, the GPU GEMM and All-Reduce checks are selected. You can also select check items under Computing Performance Check, Node Communication Check, Computing and Communication Cross-Check, and Model Simulation Check. For more information about the check items and recommended scenarios, see Appendix: Check items.

Maximum Check Duration (Minutes)

The maximum duration for which a sanity check can run. Default value: 30 minutes. If the sanity check runs longer than this duration, the action specified by Timeout Action is triggered.

Timeout Action

The action that the system takes on the job after a sanity check times out:

  • Stop Job (default): The system stops the job. The status of the job changes to Check Failed.

  • Suspend Job: The system suspends the job. The job remains in the Checking state and waits for manual intervention or system instructions on the next operation.

Other Configurations

This parameter is empty by default.
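
The parameters and defaults in the table above can be summarized as a small configuration object. This is a minimal sketch for illustration: the `SanityCheckConfig` class, its field names, and the `status_after_elapsed` helper are hypothetical and are not part of any PAI SDK. Only the default values and the two timeout actions come from the table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SanityCheckConfig:
    """Hypothetical container mirroring the console parameters and their defaults."""
    check_time: str = "Before Job Runs"
    check_items: List[str] = field(default_factory=lambda: ["GPU GEMM", "All-Reduce"])
    max_check_duration_minutes: int = 30
    timeout_action: str = "Stop Job"      # or "Suspend Job"
    other_configurations: str = ""        # empty by default


def status_after_elapsed(config: SanityCheckConfig, elapsed_minutes: float) -> str:
    """Return the job status for a check that is still running after elapsed_minutes."""
    if elapsed_minutes <= config.max_check_duration_minutes:
        return "Checking"
    if config.timeout_action == "Stop Job":
        # The system stops the job and marks it as failed.
        return "Check Failed"
    # Suspend Job: the job stays in Checking and waits for manual intervention.
    return "Checking"
```

With the defaults, a check that is still running after 31 minutes yields `Check Failed`, whereas the same check with `timeout_action="Suspend Job"` stays in `Checking`.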

View the check results

Sanity check status

The DLC job may be in one of the following statuses during a sanity check:

  • Checking: The sanity check on computing power is in progress.

  • Check Failed: The sanity check fails because issues are detected, or because the check times out and Timeout Action is set to Stop Job.

  • Check Passed: After the job passes the sanity check, the job enters the running status.
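
The statuses above imply a small set of transitions, which can be sketched as a lookup table. This is an illustrative model only: the `CheckStatus` enum and `can_transition` helper are hypothetical names, and the sketch does not model what happens after the job is running (for example, a restart that triggers another check when After Job Runs is selected).

```python
from enum import Enum


class CheckStatus(Enum):
    CHECKING = "Checking"
    CHECK_FAILED = "Check Failed"
    CHECK_PASSED = "Check Passed"
    RUNNING = "Running"


# Transitions implied by the status descriptions; later job statuses are out of scope.
ALLOWED = {
    CheckStatus.CHECKING: {CheckStatus.CHECK_PASSED, CheckStatus.CHECK_FAILED},
    CheckStatus.CHECK_PASSED: {CheckStatus.RUNNING},
    CheckStatus.CHECK_FAILED: set(),
    CheckStatus.RUNNING: set(),
}


def can_transition(current: CheckStatus, target: CheckStatus) -> bool:
    """Return True if the sketch allows moving from current to target."""
    return target in ALLOWED[current]
```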

View the results of a sanity check

View the results in the PAI console

On the Events tab of the DLC job details page, click Sanity Check to view the check results.

Configure an event rule

You can create an event notification rule on the Events tab of a PAI workspace. Set Event Type to DLC Job and Automatic Fault Tolerance. For more information about other parameters, see Create a notification rule. If the job fails the sanity check, the system sends a notification.

Note

For more information about configuring a notification, see Workspace notification.

Appendix: Check items

Note

The estimated times are based on a job that uses two instances and are for reference only.

Check items

Description (scenario)

Estimated time

Computing Performance Check

GPU GEMM

Checks the general matrix multiplication (GEMM) performance of GPUs. It can detect:

  • Faulty GPUs: calculation errors and calculation hangs.

  • Slow nodes: low TFLOPS during computation.

1 minute

GPU Kernel Launch

Checks the launch latency of GPU kernels. It can detect:

  • Faulty nodes: kernel launch errors and kernel launch hangs.

  • Slow nodes: kernel launches that take a long time.

1 minute

Node Communication Check

All-Reduce

All-to-All

All-Gather

Multi-All-Reduce

Checks node communication performance in different communication modes and identifies slow or faulty communication nodes. It can detect:

  • Faulty communication nodes: communication errors and hangs.

  • Slow communication nodes: low communication bandwidth.

5 minutes for a single check

Network Connectivity

Checks the network connectivity between the front and rear ends and identifies nodes with abnormal communication.

2 minutes

Computing and Communication Cross-Check

MatMul/All-Reduce Overlap

Checks the performance of a single node when communication kernels and computing kernels overlap. It can detect:

  • Faulty nodes: errors and hangs during overlapped computing.

  • Slow nodes: overlapped computing that takes a long time.

1 minute

Model Simulation Check

Mini GPT

Runs model training simulations to verify the reliability of the AI system. It can detect:

  • Faulty nodes: abnormal training loss, training hangs, and training errors.

  • Slow nodes: single training steps that take a long time.

1 minute

Megatron GPT

5 minutes

ResNet

2 minutes
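
The estimated times above can be used to judge whether a selection of check items is likely to fit within Maximum Check Duration. The following is a rough sketch under two assumptions that the table does not state: the selected checks run one after another, so their times add up, and "5 minutes for a single check" applies to each of the four collective communication checks. The function names are hypothetical, and the figures remain two-instance estimates.

```python
# Estimated minutes per check item, taken from the appendix table (two instances).
ESTIMATED_MINUTES = {
    "GPU GEMM": 1,
    "GPU Kernel Launch": 1,
    "All-Reduce": 5,          # assumption: 5 minutes for each communication check
    "All-to-All": 5,
    "All-Gather": 5,
    "Multi-All-Reduce": 5,
    "Network Connectivity": 2,
    "MatMul/All-Reduce Overlap": 1,
    "Mini GPT": 1,
    "Megatron GPT": 5,
    "ResNet": 2,
}


def estimate_minutes(items):
    """Sum the estimated time of the selected items (assumes sequential execution)."""
    return sum(ESTIMATED_MINUTES[item] for item in items)


def fits_within(items, max_check_duration_minutes=30):
    """Return True if the estimate does not exceed the maximum check duration."""
    return estimate_minutes(items) <= max_check_duration_minutes
```

Under these assumptions, the default selection (GPU GEMM and All-Reduce) is estimated at 6 minutes, while selecting every item adds up to 33 minutes, which exceeds the default maximum of 30 minutes.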