
Platform for AI: SanityCheck: Computing power health check

Last Updated: Feb 12, 2026

This topic describes how to use the computing power health check feature provided by Deep Learning Containers (DLC).

Feature introduction

In AI training scenarios, the following issues can occur:

  • Resource failures that interrupt jobs and waste GPU resources: A job might fail to start training because of faulty resources, even after time-consuming initialization such as loading model checkpoints. You must then investigate the issue and resubmit the job, which wastes GPU time.

  • No effective way to locate performance issues or run benchmarks: If model training performance degrades while a job is running, slow nodes might be the cause, but quick and effective methods to locate them are often unavailable. You also need a convenient, reliable benchmark program to test the GPU computing power and communication performance of the machines in a resource group.

To address these issues, DLC provides the computing power health check (SanityCheck) feature, which checks the health and performance of the computing resources used by distributed training jobs. You can enable the feature when you create a DLC training job. The health check inspects all resources used for training, automatically isolates faulty nodes, and triggers an automated Operations and Maintenance (O&M) process in the background. This reduces the chance of failures early in training and increases the job success rate. After the check completes, the system provides a report on GPU computing power and communication performance that helps you identify and locate factors that may degrade training performance, improving diagnostic efficiency.

Limitations

Currently, this feature supports only PyTorch training jobs created on Lingjun intelligent computing resources, and the GPU count (number of cards) must be configured at the instance level. Lingjun intelligent computing resources are currently available only to allowlisted users. If you need them, contact your account manager.

Enable health check

Using the console

When you create a DLC job in the PAI console, you can enable the health check feature by configuring the following key parameters. After the job is successfully created, the system checks the health status and availability of the resources and provides the results. This process may take some time.


The key parameter settings are described below:

  • Resource Information configuration:

    • Resource Type: Select Lingjun Intelligent Computing Resources. If this resource type is unavailable, contact your account manager.

    • Source: Select Resource Quota.

    • Resource Quota: Select an existing Lingjun resource quota. For more information about how to create one, see Create a resource quota.

    • Framework: Select PyTorch.

    • Job Resource: Configure the GPU count (number of cards) at the instance level.

  • Fault Tolerance and Diagnosis configuration: Turn on the Health Check switch and configure the following parameters (an illustrative SDK sketch follows this list):

    • Check Time:

      • Before Job Runs (default): After the job acquires resources, the system first performs a pre-check on the training job's compute nodes and then executes your code.

      • After Job Restarts: A health check is performed when the job runs abnormally and is restarted by the AIMaster automatic fault tolerance engine.

        Note: To select this option, you must enable the Automatic Fault Tolerance feature. For more information, see AIMaster: Elastic automatic fault tolerance engine.

    • Check Items: The check items fall into four categories: computing performance checks, node communication checks, computing and communication cross-checks, and model simulation validation. For detailed descriptions and recommended scenarios, see Appendix: Check item descriptions. By default, GPU GEMM (assesses GPU GEMM performance) and All-Reduce (evaluates inter-node communication performance and identifies slow or faulty nodes) are enabled. You can search for and select individual check items, or apply a recommended template with one click by choosing a preset configuration.

    • Maximum Check Duration: The maximum runtime of the health check. Default: 60 minutes. If the check times out, the exception handling policy is triggered.

    • Exception Handling Policy: Specifies how the system handles the job when the health check fails:

      • Stop Job: If a faulty or suspicious node is identified, the job is terminated and marked as Check failed.

      • Blacklist and Rerun: If a faulty or suspicious node is identified, the system automatically blocks the node, restarts the job, and reruns the check until it passes.

    • Maximum Restart Count: Available when the exception handling policy is Blacklist and Rerun. The default is 10. If the number of restarts exceeds this limit, the job fails.

    • Other Configurations: Empty by default. Supports advanced parameter settings.
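If you create DLC jobs through the API instead of the console, the same switches are set in the job's settings. The following is a minimal sketch that assumes the alibabacloud_pai_dlc20201203 Python SDK; the sanity-check field names (enable_sanity_check, sanity_check_args) and the argument strings are assumptions modeled on the DLC CreateJob API, so verify every field against the API reference before relying on it.

    # Hedged sketch: create a PyTorch DLC job with the health check enabled.
    # The sanity-check fields and argument strings are ASSUMPTIONS about the
    # CreateJob schema; confirm them in the DLC API reference.
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSettings,
        JobSpec,
        ResourceConfig,
    )

    client = Client(Config(
        access_key_id='<your-access-key-id>',
        access_key_secret='<your-access-key-secret>',
        endpoint='pai-dlc.cn-hangzhou.aliyuncs.com',
    ))

    request = CreateJobRequest(
        display_name='sanity-check-demo',
        job_type='PyTorchJob',                      # the feature requires PyTorch
        workspace_id='<your-workspace-id>',
        resource_id='<your-lingjun-quota-id>',      # Lingjun resource quota
        user_command='python train.py',
        job_specs=[JobSpec(
            type='Worker',
            pod_count=2,
            image='<your-training-image>',
            # GPU count must be set at the instance level (see Limitations).
            resource_config=ResourceConfig(gpu='8', cpu='64', memory='512Gi'),
        )],
        settings=JobSettings(
            enable_sanity_check=True,               # assumed field name
            sanity_check_args=[                     # assumed field name and flags
                '--sanity-check-timing=CheckBeforeJobStart',
                '--sanity-check-items=GpuGemmCheck,AllReduceCheck',
                '--sanity-check-timeout=60',        # minutes
                '--sanity-check-policy=BlacklistAndRerun',
            ],
        ),
    )

    response = client.create_job(request)
    print('JobId:', response.body.job_id)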

View check results

Health check statuses

A DLC job can have the following statuses during a health check (a status-polling sketch follows the list):

  • Checking: The computing power health check is in progress.

  • Check failed: If an abnormal node is detected or the check times out, the status changes to Check failed.

  • Check passed: After the health check passes, the job enters the Running status.
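Outside the console, you can poll the job status through the API and branch on these states. The sketch below also assumes the alibabacloud_pai_dlc20201203 Python SDK, and the status strings are assumptions mapped from the console labels above; check both the method signature and the actual enum values in the GetJob API reference.

    # Hedged sketch: poll a DLC job until its health check resolves.
    # The status strings are ASSUMPTIONS mapped from the console labels;
    # verify the real enum values in the GetJob API reference.
    import time

    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_pai_dlc20201203.models import GetJobRequest

    client = Client(Config(
        access_key_id='<your-access-key-id>',
        access_key_secret='<your-access-key-secret>',
        endpoint='pai-dlc.cn-hangzhou.aliyuncs.com',
    ))

    job_id = '<your-job-id>'
    while True:
        status = client.get_job(job_id, GetJobRequest()).body.status
        print('job status:', status)
        if status == 'SanityChecking':          # console: Checking
            time.sleep(30)
            continue
        if status == 'SanityCheckFailed':       # console: Check failed
            raise RuntimeError('Health check failed; see the Sanity Check events.')
        break                                   # e.g. Running once the check passes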

View health check results

Using the console

On the DLC job details page, go to the Events tab and click Sanity Check to view the check progress and results.


Click the Restart Records tab to view information such as the number of restarts, restart reasons, and restart results.


Configure notifications

To receive a notification when the computing power health check fails, create a notification rule in the event notification settings of your PAI workspace and set Event Type to DLC Jobs > Automatic Fault Tolerance. For information about the other parameters, see Notifications.

Note

For instructions on creating notification rules in a workspace, see Event notification settings.


Appendix: Check item descriptions

Note

The estimated check duration is based on two machines and is for reference only. The actual duration may vary.

The check items are grouped into four categories. Each entry below gives a description with recommended scenarios and an estimated duration.

Computing performance check

  • GPU GEMM: Detects GPU GEMM performance (see the sketch after this list). Can identify:

    • Faulty GPUs: computation errors or hangs.

    • Slow nodes: low computation TFLOPS.

    Estimated duration: 1 minute.

  • GPU Kernel Launch: Detects GPU kernel launch latency. Can identify:

    • Faulty nodes: kernel launch errors or hangs.

    • Slow nodes: long kernel launch time.

    Estimated duration: 1 minute.
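To make the two computing checks concrete, the sketch below shows the kind of measurement they perform: a GEMM throughput probe and a rough host-side kernel-launch latency probe, written in plain PyTorch. It illustrates the technique only and is not DLC's check implementation; the matrix size, data type, and iteration counts are arbitrary choices.

    # Illustrative micro-benchmarks in plain PyTorch; NOT DLC's implementation.
    import time

    import torch

    def gemm_tflops(n: int = 8192, iters: int = 50) -> float:
        """Measure sustained FP16 GEMM throughput in TFLOPS on one GPU."""
        a = torch.randn(n, n, device='cuda', dtype=torch.float16)
        b = torch.randn(n, n, device='cuda', dtype=torch.float16)
        for _ in range(5):                          # warm-up
            torch.matmul(a, b)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return 2 * n ** 3 * iters / elapsed / 1e12  # 2*n^3 FLOPs per GEMM

    def launch_latency_us(iters: int = 10_000) -> float:
        """Rough host-side cost of enqueueing one tiny CUDA kernel, in microseconds."""
        x = torch.ones(1, device='cuda')
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            x.add_(1.0)                             # each call launches one small kernel
        enqueue_time = time.perf_counter() - start  # measured before the final sync
        torch.cuda.synchronize()
        return enqueue_time / iters * 1e6

    if __name__ == '__main__':
        print(f'GEMM: {gemm_tflops():.1f} TFLOPS')
        print(f'kernel launch: {launch_latency_us():.1f} us')

A faulty GPU typically surfaces here as an error or hang, and a slow node as a TFLOPS figure well below its peers, which matches the failure modes listed above.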

Node communication check

  • All-Reduce, All-to-All, All-Gather, Multi-All-Reduce: Detect node communication performance and identify slow or faulty communication nodes (see the sketch after this list). In the different communication modes, they can identify:

    • Communication fault nodes: communication errors or hangs.

    • Slow communication nodes: low communication bandwidth.

    Estimated duration: 5 minutes per collective communication check.

  • Network Connectivity: Detects the network connectivity of the head or tail nodes and identifies nodes with abnormal communication connectivity. Estimated duration: 2 minutes.
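For a concrete picture of what the collective checks measure, the sketch below times All-Reduce and derives a bus-bandwidth figure with torch.distributed over NCCL. It illustrates the technique only and is not DLC's check code; launch it with torchrun (for example, torchrun --nnodes=2 --nproc_per_node=8 allreduce_bench.py), and note that the message size and iteration count are arbitrary.

    # Illustrative All-Reduce bandwidth probe; NOT DLC's implementation.
    # Run under torchrun so the torch.distributed environment variables are set.
    import os
    import time

    import torch
    import torch.distributed as dist

    def allreduce_busbw_gbps(size_mb: int = 256, iters: int = 20) -> float:
        """Measure all-reduce bus bandwidth (GB/s) across the whole job."""
        dist.init_process_group(backend='nccl')
        torch.cuda.set_device(int(os.environ['LOCAL_RANK']))
        numel = size_mb * 1024 * 1024 // 4           # float32 elements
        x = torch.randn(numel, device='cuda')
        for _ in range(5):                           # warm-up
            dist.all_reduce(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        per_op = (time.perf_counter() - start) / iters
        world = dist.get_world_size()
        # Ring all-reduce moves 2*(N-1)/N bytes per rank per payload byte.
        busbw = 2 * (world - 1) / world * numel * 4 / per_op / 1e9
        if dist.get_rank() == 0:
            print(f'all-reduce bus bandwidth: {busbw:.1f} GB/s')
        dist.destroy_process_group()
        return busbw

    if __name__ == '__main__':
        allreduce_busbw_gbps()

A node whose measured bandwidth trails the rest of the job is a candidate slow communication node; an error or hang during the collective points to a communication fault node.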

Computing and communication cross-check

  • MatMul/All-Reduce Overlap: Detects the performance of a single node when communication and computation kernels overlap. Can identify:

    • Faulty nodes: overlap computation errors or hangs.

    • Slow nodes: long overlap computation time.

    Estimated duration: 1 minute.

Model simulation validation

  • Mini GPT, Megatron GPT, ResNet: Use model simulation to verify AI system reliability. Can identify:

    • Faulty nodes: abnormal training loss, training hangs, or training errors.

    • Slow nodes: long time for a single training step.

    Estimated duration: Mini GPT, 1 minute; Megatron GPT, 5 minutes; ResNet, 2 minutes.