Platform for AI: AIMaster: Elastic fault tolerance engine

Last Updated: Nov 04, 2024

Deep Learning Containers (DLC) provides a fault tolerance monitoring feature that is powered by AIMaster. This topic describes how to enable and configure the feature.

Background information

As the scale of models and data increases, deep learning jobs are frequently run in a distributed manner. As the number of instances that run a job grows, occasional exceptions in the underlying software stack or hardware environment can terminate the job.

To ensure that large-scale distributed deep learning jobs run reliably, DLC provides a fault tolerance monitoring feature that is powered by AIMaster. AIMaster is a job-level component. When you enable AIMaster for a DLC job, an AIMaster instance is launched and runs concurrently with the other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.

Step 1: Configure fault tolerance monitoring parameters

The following table lists all parameters that the fault tolerance monitoring feature supports. You can refer to the Sample of Common Parameters section to plan the fault tolerance monitoring for your job. When you enable the fault tolerance monitoring feature in the subsequent step, you can specify these parameters in the Other Configuration field based on your business requirements.

All parameters

The fault tolerance monitoring feature supports the following parameters.

General configuration

  • Job type: --job-execution-mode

    The type of the DLC job. Valid values:

      • Sync: synchronous job.

      • Async: asynchronous job.

    The fault tolerance policy varies based on the job type. For example, when a retriable error occurs, a synchronous job requires a full restart, whereas an asynchronous job requires restarting only the failed instance.

    Default value: Sync

  • Job restart: --enable-job-restart

    Specifies whether to restart the job when a fault tolerance condition is triggered or a runtime exception is detected. Valid values: True and False.

    Default value: False

  • Job restart: --max-num-of-job-restart

    The maximum number of attempts to restart the job. If this limit is exceeded, the job is reported as failed.

    Default value: 3

Runtime configuration

Note: This configuration takes effect only if all instances of the job run as expected.

  • Hang detection for running jobs: --enable-job-hang-detection

    Specifies whether to enable hang detection while the job is running. This parameter is valid only for synchronous jobs. Valid values:

      • False: disables hang detection.

      • True: enables hang detection. If the standard output (STDOUT) and standard error (STDERR) logs of all instances are not updated within the duration that you specify for the --job-hang-interval parameter, the job is reported as hung and is restarted.

    Default value: False

  • Hang detection for running jobs: --job-hang-interval

    The maximum duration during which the job can be non-responsive. Valid values: positive integers. Unit: seconds. If the job remains non-responsive for a duration longer than this value, the job is reported as hung and is restarted.

    Default value: 1800

  • Hang detection for running jobs: --enable-c4d-hang-detection

    Specifies whether to enable the Calibrating Collective Communication over Converged Ethernet - Diagnosis (C4D) feature, which diagnoses and identifies slow or faulty nodes that cause job hangs during job execution.

    Note: This parameter takes effect only if --enable-job-hang-detection is set to True.

    Default value: False

  • Hang detection for exiting jobs: --enable-job-exit-hang-detection

    Specifies whether to enable hang detection when the job is exiting. This parameter is valid only for synchronous jobs. Valid values:

      • False: disables hang detection.

      • True: enables hang detection. If an instance runs as expected but the job fails to exit within the duration that you specify for the --job-exit-hang-interval parameter, the job is reported as hung and is restarted.

    Default value: False

  • Hang detection for exiting jobs: --job-exit-hang-interval

    The maximum duration during which the job can be non-responsive when the job is exiting. Valid values: positive integers. Unit: seconds. If the job fails to exit within this duration, the job is reported as hung and is restarted.

    Default value: 600

Fault tolerance configuration

Note: This configuration takes effect only if an instance of the job fails to run.

  • Fault tolerance policy: --fault-tolerant-policy

    Valid values:

      • OnFailure: restarts unconditionally. If an asynchronous job encounters an error, the failed instance is restarted. If a synchronous job encounters an error, the entire job is restarted.

      • ExitCodeAndErrorMsg: checks whether the retry condition is met based on the exit code and error log of the failed instance. For more information, see Step 3: Configure enhanced fault tolerance monitoring. If the retry condition is met, the failed instance is restarted for an asynchronous job, and the entire job is restarted for a synchronous job.

      • Never: reports the job as failed without attempting to restart the job.

    Default value: ExitCodeAndErrorMsg

  • Maximum Error Occurrences: --max-num-of-same-error

    The maximum number of times an error can occur on a single instance. If an error occurs more times than this value, the job is reported as failed.

    Default value: 10

  • Maximum Fault Tolerance Rate: --max-tolerated-failure-rate

    The maximum percentage of failed instances. If the percentage of failed instances exceeds this value, the job is reported as failed. For example, a value of 0.3 specifies that the job is reported as failed if more than 30% of its instances fail.

    Default value: -1, which disables this check.

Sample of Common Parameters

This section provides examples of common configurations for different types of DLC jobs.

  • Synchronous training jobs (such as PyTorch jobs)

    If a job instance is unexpectedly terminated and the exit code or error log meets the fault tolerance conditions, such as preemption, the job is restarted.

    --job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
  • Asynchronous training jobs (such as TensorFlow jobs)

    If a retriable error occurs on a worker instance of the job, the worker instance is restarted. If an error occurs on a parameter server or chief instance, the job is not allowed to restart. To restart the job in the preceding scenario, set the --enable-job-restart parameter to True.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure
  • Offline inference jobs (such as ElasticBatch jobs)

    The instances of an offline inference job are independent of each other, similar to the instances of an asynchronous job. If a job instance is unexpectedly terminated, only that instance is restarted.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure
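
If a synchronous training job also needs hang detection and C4D diagnosis, you can combine the runtime flags that are documented in the table above. The following string is an illustrative combination of those flags, not an additional job type:

--job-execution-mode=Sync --fault-tolerant-policy=ExitCodeAndErrorMsg --enable-job-hang-detection=True --job-hang-interval=1800 --enable-c4d-hang-detection=True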

Step 2: Enable fault tolerance monitoring

When you submit a DLC job, you can enable the fault tolerance monitoring feature by using the PAI console or DLC SDK.

Use the PAI console

When you create a DLC job in the PAI console, you can turn on Automatic Fault Tolerance in the Fault Tolerance and Diagnosis section and configure the parameters. For more information, see Submit training jobs. After you configure the parameters, an AIMaster instance is launched when you run the DLC job. The AIMaster instance monitors the job progress and manages fault tolerance when errors occur.

(Figure: Automatic Fault Tolerance settings in the PAI console)

In the panel:

  • You can configure additional parameters in the Other Configuration text box. For more information about the parameters, see Step 1: Configure fault tolerance monitoring parameters.

  • If you enable Hanging Detection, you can also enable the C4D Detection feature. C4D is a tool developed by Alibaba Cloud for diagnosing slow or hung jobs during model training. It has helped locate issues during many model training jobs. For more information about the core features and parameters of C4D, see Use C4D.

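For reference, the Other Configuration text box accepts the same space-separated flag string that is described in Step 1. The following value is illustrative:

--enable-job-hang-detection=True --job-hang-interval=3600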

Use DLC SDK

  • SDK for Go

    Sample code to enable fault tolerance:

    createJobRequest := &client.CreateJobRequest{}
    settings := &client.JobSettings{
        EnableErrorMonitoringInAIMaster: tea.Bool(true),
        ErrorMonitoringArgs: tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
    }
    createJobRequest.SetSettings(settings)

    Parameters:

    • EnableErrorMonitoringInAIMaster: specifies whether to enable the fault tolerance monitoring feature.

    • ErrorMonitoringArgs: additional parameters that you can configure for the fault tolerance monitoring feature.

  • SDK for Python

    Sample code to enable fault tolerance:

    from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings
    
    settings = JobSettings(
        enable_error_monitoring_in_aimaster = True,
        error_monitoring_args = "--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30"
    )
    create_job_req = CreateJobRequest(
        ...
        settings = settings,
    )

    Parameters:

    • enable_error_monitoring_in_aimaster: specifies whether to enable the fault tolerance monitoring feature.

    • error_monitoring_args: additional parameters that you can configure for the fault tolerance monitoring feature.
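
The JobSettings object is attached to the job creation request in the same way as any other job field. The following is a minimal sketch of submitting such a request through the DLC Python client. The endpoint, credential handling, and omitted job fields (such as the workspace and job specs) are assumptions that you must adapt to your own environment.

from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings

# Assumption: AccessKey credentials and the cn-hangzhou endpoint; adjust for your environment.
client = Client(Config(
    access_key_id='<your-access-key-id>',
    access_key_secret='<your-access-key-secret>',
    endpoint='pai-dlc.cn-hangzhou.aliyuncs.com',
))

settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args='--job-execution-mode=Sync --fault-tolerant-policy=ExitCodeAndErrorMsg',
)

# Fill in the remaining required job fields (display name, workspace, job specs, and so on)
# before calling create_job; only the fault tolerance settings are shown here.
create_job_req = CreateJobRequest(settings=settings)
response = client.create_job(create_job_req)
print(response.body.job_id)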

Step 3: Configure enhanced fault tolerance monitoring

You can configure the following features of enhanced fault tolerance monitoring based on your business requirements.

Fault tolerance notification

If you want to receive notifications about fault tolerance events, such as a job restart, you can configure a rule for such notifications by performing the following steps: Go to the Workspace Details page, click the Events tab, and then click Create Event Rule. In the panel that appears, select DLC Jobs > Automatic Fault Tolerance from the Event Type drop-down list. For more information, see Workspace notification.

You can also use AIMaster SDK to configure custom notifications when an error occurs, such as when the loss function returns NaN. Sample code:

Note

To configure custom notifications, you must install the wheel package of AIMaster. For more information, see FAQ.

import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())

...

# Send a custom notification from rank 0 when the loss becomes NaN.
if math.isnan(loss) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The loss function returns NaN")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())

Custom keywords

The fault tolerance monitoring feature provides a built-in module that can automatically detect common retriable errors. If you want to trigger fault tolerance when error logs carry information that matches specific keywords, you can use the following methods to configure custom keywords. After you configure custom keywords, the module scans the tail logs of failed instances for key information that matches your custom keywords.

Note

You must set the --fault-tolerant-policy parameter to ExitCodeAndErrorMsg.

  • Sample configuration for PyTorch jobs:

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.PyTorchConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

    In the preceding code, the custom keywords are configured by using the monitor.set_retryable_errors function.

  • Sample configuration for TensorFlow jobs:

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.TFConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

Fine-grained hang detection

By default, hang detection applies to the entire job runtime. However, you may need different settings at different stages. For example, a large-scale job may need a long time to establish connections between instances during initialization, whereas subsequent execution proceeds much faster. DLC allows you to adjust the hang detection policy based on the current stage of the job by using the following methods:

monitor.reset_config(**jm_config_params)

# Examples:
#     monitor.reset_config(job_hang_interval=10)
#     or
#     config_params = {"job_hang_interval": 10}
#     monitor.reset_config(**config_params)

Sample configuration for a PyTorch job:

import functools

import torch
import torch.distributed as dist
from aimaster import job_monitor as jm

jm_config_params = {
    # Global hang detection policy: the job can be non-responsive for at most 30 minutes.
    "job_hang_interval": 1800
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)

dist.init_process_group('nccl')

...

# Helper decorators built on the AIMaster SDK: decorate your training functions
# to apply a function-scoped hang detection policy.
def reset_hang_detect(hang_seconds):
    # Update the hang detection interval at runtime.
    monitor.reset_config(job_hang_interval=hang_seconds)

def hang_detect(interval):
    # Decorator factory: applies a function-scoped hang detection interval and
    # restores the global interval after the function returns.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)
            try:
                return func(*args, **kwargs)
            finally:
                reset_hang_detect(jm_config_params["job_hang_interval"])
        return wrapper
    return decorator

@hang_detect(180)   # reset hang detection to 3 minutes, only for the scope of train()
def train(epoch):
    ...

@hang_detect(-1)    # temporarily disable hang detection, only for the scope of test()
def test(epoch):
    ...

for epoch in range(100):
    train(epoch)
    test(epoch)
    # Step your learning rate scheduler here if you use one.

Use C4D

Calibrating Collective Communication over Converged Ethernet - Diagnosis (C4D) is a diagnostic tool developed by Alibaba Cloud to address slow or hung jobs in large-scale model training. Before you use C4D, you must install ACCL, a high-performance collective communication library developed by Alibaba Cloud. For more information, see ACCL: high-performance collective communication library developed by Alibaba Cloud. The C4D detection feature is available when DLC jobs use Lingjun AI Computing Service resources.

Overview

C4D gathers status data from all nodes during collective communication to analyze potential communication and non-communication issues. The system architecture is depicted in the figure below:

(Figure: C4D system architecture)

All parameters

After you enable the C4D detection feature, you can configure the following parameters in the Other Configuration text box.

  • --c4d-log-level

    The C4D log level. Valid values:

      • Info

      • Warning

      • Error

    The default value is Warning, which outputs logs at the Warning and Error levels. For routine operations, we recommend that you use the default value. For performance troubleshooting, you can set the parameter to Info.

    Example: --c4d-log-level=Info

  • --c4d-common-envs

    The environment variables for C4D execution. Specify the variables in the k1=v1,k2=v2 format and separate multiple variables with commas. By default, this parameter is left empty. Valid variables:

      • C4D_HANG_TIMEOUT: the duration, in microseconds, that a job can hang before a warning is triggered. Default value: 10000000 (10 seconds).

      • C4D_HANG_TIMES: the number of hang occurrences that are required to log an error and trigger the automatic isolation of a node. This variable works together with C4D_HANG_TIMEOUT. Default value: 18, which corresponds to 3 minutes of hang time before node isolation is triggered.

      • C4D_CONN_BW_CHECK_PERIOD: the bandwidth check interval. Unit: seconds. Default value: 10.

      • C4D_RUNTIME_LOG_LEVEL: the log level of C4D at runtime. Valid values: TRACE, DEBUG, INFO (default), WARNING, ERROR, and FATAL.

      • C4D_ENABLE_STATS_OUTPUT: specifies whether to output C4D-related statistics. Valid values: TRUE and FALSE (default).

    Example: --c4d-common-envs=C4D_HANG_TIMEOUT=1,C4D_HANG_TIMES=2
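
For example, to make hang diagnosis more sensitive while troubleshooting, you can combine these flags with the hang detection parameters from Step 1. The following values are illustrative only:

--enable-job-hang-detection=True --enable-c4d-hang-detection=True --c4d-log-level=Info --c4d-common-envs=C4D_HANG_TIMEOUT=5000000,C4D_HANG_TIMES=36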

For error-level logs, AIMaster automatically blacklists the affected nodes and restarts the job. The following table shows the processing logic for each level.

  • Error

    By default, if the communication layer hangs for more than 3 minutes, the job fails. You can change this threshold by specifying the C4D_HANG_TIMEOUT and C4D_HANG_TIMES variables.

    Action: AIMaster isolates the nodes that are indicated in the logs.

  • Warning

    By default, if the communication layer hangs for more than 10 seconds, performance is affected but the job does not fail. You can change this threshold by specifying the C4D_HANG_TIMEOUT variable. If the non-communication layer hangs for more than 10 seconds, the job may fail.

    Action: The nodes in the logs are not automatically isolated. Manual confirmation is required.

  • Info

    The communication or non-communication layer is slow.

    Action: The logs indicate performance issues and require manual review.

If a DLC job appears slow or hung, click the job name in the DLC job list to go to the job details page. In the instance section, you can review the AIMaster logs for C4D diagnostic results. For detailed analysis of the diagnosis results, see Diagnosis Result Samples.

Diagnosis Result Samples

  • RankCommHang: Indicates a communication layer hang at a node.

  • RankNonCommHang: Indicates a non-communication layer hang, such as a hang in computation.

  • RankCommSlow: Indicates a slow communication layer at a node.

  • RankNonCommSlow: Indicates a slow non-communication layer at a node.

FAQ

How to install AIMaster SDK

Install the AIMaster SDK by using the wheel package that matches your Python version.

# py36
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl

# py38
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl

# py310
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
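
After the installation is complete, you can run a quick check to confirm that the SDK is importable. This assumes that the wheel installs the aimaster package that is used in the examples in this topic:

# Verify the installation
python -c "from aimaster import job_monitor"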