Background information
As the scale of model and data increases, deep learning jobs are frequently run in a distributed manner. When a deep learning job is run by an increased number of instances, occasional exceptions in the underlying software stack or hardware environment can terminate the job.
To ensure that large-scale deep learning jobs are reliably run in a distributed manner, DLC provides the fault tolerance monitoring feature that is empowered by AIMaster. AIMaster is a job-level component. When you use AIMaster for a DLC job, an AIMaster instance is launched to run concurrently with other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.
Step 1: Configure fault tolerance monitoring parameters
The following table lists all parameters that the fault tolerance monitoring feature supports. You can refer to the Sample of Common Parameters section to plan the fault tolerance monitoring for your job. When you enable the fault tolerance monitoring feature in the subsequent step, you can specify these parameters in the Other Configuration field based on your business requirements.
All parameters
The fault tolerance monitoring supports the following parameters.
Category | Configuration | Parameter | Description | Default Value |
Category | Configuration | Parameter | Description | Default Value |
General configuration | Job type | --job-execution-mode | The type of the DLC job. Valid values: Sync: Synchronous job. Async: Asynchronous job.
The fault tolerance policy varies based on the job type. For example: When a retriable error occurs, a synchronous job requires a full restart. When a retriable error occurs, an asynchronous job requires restarting only the instance that failed.
| Sync |
Job restart | --enable-job-restart | Specifies whether to restart the job when a fault tolerance condition is triggered or a runtime exception is detected. Valid values: | False |
--max-num-of-job-restart | The maximum number of attempts to restart the job. If this limit is exceeded, the job is reported as failed. | 3 |
Runtime configuration Note This configuration takes effect only if all instances of the job run as expected. | Hang detection for running jobs | --enable-job-hang-detection | Specifies whether to enable hang detection when the job is running. This parameter is valid only for synchronous jobs. Valid values: False: disables hang detection. True: enables hang detection. If the standard output (STDOUT) and standard error (STDERR) logs of all instances are not updated within the duration that you specify for the --job-hang-interval parameter, the job is reported as hung and is restarted.
| False |
--job-hang-interval | The maximum duration during which the job can be non-responsive. Valid values: positive integers. Unit: seconds. If the job remains non-responsive for a duration longer than this value, the job is reported as hung and is restarted. | 1800 |
--enable-c4d-hang-detection | Specifies whether to activate the Calibrating Collective Communication over Converged ethernet - Diagnosis (C4D) detection feature to swiftly diagnose and identify slow or faulty nodes causing job hangs during job execution. Note This parameter is effective only when --enable-job-hang-detection is enabled. | False |
Hang detection for exiting jobs | --enable-job-exit-hang-detection | Specifies whether to enable hang detection when the job is exiting. This parameter is valid only for synchronous jobs. Valid values: False: disables hang detection. True: enables hang detection. If an instance runs as expected but the corresponding job fails to exit within the duration that you specify for the --job-exit-hang-interval parameter, the job is reported as hung and is restarted.
| False |
--job-exit-hang-interval | The maximum duration during which the job can be non-responsive when the job is exiting. Valid values: positive integers. Unit: seconds. If the job fails to exit within this duration, the job is reported as hung and is restarted. | 600 |
Fault tolerance configuration Note This configuration takes effect only if an instance of the job fails to run. | Fault tolerance policy | --fault-tolerant-policy | Valid values: OnFailure: If an asynchronous job encounters an error, the failed instance is unconditionally restarted. If a synchronous job encounters an error, the job is unconditionally restarted.
ExitCodeAndErrorMsg: checks whether the retry condition is met based on the exit code and error log of the failed instance. For more information, see Step 3: Configure enhanced fault tolerance monitoring. If the retry condition is met and the job is an asynchronous job, the failed instance is restarted. If the retry condition is met and the job is a synchronous job, the job is restarted.
Never: reports the job as failed without an attempt to restart the job.
| ExitCodeAndErrorMsg |
Maximum Error Occurrences | --max-num-of-same-error | The maximum number of times an error can occur on a single instance. If an error occurs for a number of times that exceeds this value, the job is reported as failed. | 10 |
Maximum Fault Tolerance Rate | --max-tolerated-failure-rate | The maximum percentage of failed instances. If the percentage of failed instances exceeds this value, the job is reported as failed. Default value: -1, which indicates that the parameter is disabled. For example, a value of 0.3 indicates that if more than 30% of the job instances fail, the job is reported as failed. | -1 |
Sample of Common Parameters
This section provides examples of common configurations for different types of DLC jobs.
Synchronous training jobs (such as PyTorch jobs)
If a job instance is unexpectedly terminated and the exit code or error log meets the fault tolerance conditions, such as preemption, the job is restarted.
--job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
Asynchronous training jobs (such as TensorFlow jobs)
If a retriable error occurs on a worker instance of the job, the worker instance is restarted. If an error occurs on a parameter server or chief instance, the job is not allowed to restart. To restart the job in the preceding scenario, set the --enable-job-restart parameter to True.
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
Offline inference jobs (such as ElasticBatch jobs)
The instances of an offline inference job are independent from each other, which is similar to an asynchronous job. If a job instance is unexpectedly terminated, only the instance is restarted.
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
Step 2: Enable fault tolerance monitoring
When you submit a DLC job, you can enable the fault tolerance monitoring feature by using the PAI console or DLC SDK.
Use the PAI console
When you create a DLC job in the PAI console, you can turn on Automatic Fault Tolerance in the Fault Tolerance and Diagnosis section and configure the parameters. For more information, see Submit training jobs. After you configure the parameters, an AIMaster instance is launched when you run the DLC job. The AIMaster instance monitors the job progress and manages fault tolerance when errors occur.
In the panel:
You can configure additional parameters in the Other Configuration text box. For more information about the parameters, see Step 1: Configure fault tolerance monitoring parameters.
If you enable Hanging Detection, you can also enable the C4D Detection feature. C4D is a tool developed by Alibaba Cloud for diagnosing slow or hung jobs during model training. It has helped locate issues during many model training jobs. For more information about the core features and parameters of C4D, see Use C4D.
Use DLC SDK
Step 3: Configure enhanced fault tolerance monitoring
You can configure the following features of enhanced fault tolerance monitoring based on your business requirements.
Fault tolerance notification
If you want to receive notifications about fault tolerance events, such as a job restart, you can configure a rule for such notifications by performing the following steps: Go to the Workspace Details page, click the Events tab, and then click Create Event Rule. In the panel that appears, select DLC Jobs > Automatic Fault Tolerance from the Event Type drop-down list. For more information, see Workspace notification.
You can also use AIMaster SDK to configure custom notifications when an error occurs, such as when the loss function returns NaN. Sample code:
Note
To configure custom notifications, you must install the wheel package of AIMaster. For more information, see FAQ.
from aimaster import job_monitor as jm
job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())
...
if loss == Nan and rank == 0:
st = job_monitor_client.send_custom_message(content="The loss function returns NaN")
if not st.ok():
print('failed to send message, error %s' % st.to_string())
Custom keywords
The fault tolerance monitoring feature provides a built-in module that can automatically detect common retriable errors. If you want to trigger fault tolerance when error logs carry information that matches specific keywords, you can use the following methods to configure custom keywords. After you configure custom keywords, the module scans the tail logs of failed instances for key information that matches your custom keywords.
Note
You must set the --fault-tolerant-policy parameter to ExitCodeAndErrorMsg.
Sample configuration for PyTorch jobs:
from aimaster import job_monitor as jm
jm_config_params = {}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
In the preceding code, the custom keywords are configured by using the monitor.set_retryable_errors function.
Sample configuration for TensorFlow jobs:
from aimaster import job_monitor as jm
jm_config_params = {}
jm_config = jm.TFConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
Fine-grained hang detection
By default, hang detection is configured for the entire job runtime. However, you may need different configurations at different runtime stages. For example, when a large-scale job is initializing, it takes a long time to establish the connections between instances, whereas the execution of the job is faster. DLC allows you to configure the hang detection policy based on the current status of the job by using the following methods:
monitor.reset_config(jm_config_params)
Sample configuration for a PyTorch job:
import torch
import torch.distributed as dist
from aimaster import job_monitor as jm
jm_config_params = {
"job_hang_interval": 1800
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
dist.init_process_group('nccl')
...
def reset_hang_detect(hang_seconds):
jm_config_params = {
"job_hang_interval": hang_seconds
}
monitor.reset_config(**jm_config_params)
def hang_detect(interval):
reset_hang_detect(interval)
...
@hang_detect(180)
def train():
...
@hang_detect(-1)
def test():
...
for epoch in range(0, 100):
train(epoch)
test(epoch)
self.scheduler.step()
Use C4D
Calibrating Collective Communication over Converged ethernet - Diagnosis (C4D) is a diagnostic tool by Alibaba Cloud designed to address slow or hung jobs in large-scale model training. Before you use C4D, you must install ACCL, a high-performance collective communication library developed by Alibaba Cloud. For more information, see ACCL: high-performance collective communication library developed by Alibaba Cloud.The C4D detection feature is available when DLC jobs use Lingjun AI Computing Service resources.
Overview
C4D gathers status data from all nodes during collective communication to analyze potential communication and non-communication issues. The system architecture is depicted in the figure below:
All parameters
After you enable the C4D detection feature, you can configure the following parameters in the Other Configurations text box.
Parameter | Description | Example |
Parameter | Description | Example |
--c4d-log-level | The C4D log level. Options include: The default value is Warning, which outputs logs at the Warning and Error levels. For routine operations, we recommend that you use the default value. For performance troubleshooting, you can set the parameter to Info. | --c4d-log-level=Info
|
--c4d-common-envs | The C4D execution environment variables. Set the variables in the format k1=v1,k2=v2 , with commas separating multiple variables. By default, this parameter is left empty. Valid values: C4D_HANG_TIMEOUT: Specifies the time in microseconds, a job can hang before triggering a warning. Default value: 10000000, which is equivalent to 10 seconds. C4D_HANG_TIMES: Specifies the threshold for job hang occurrences necessary to log an error and initiate the automatic isolation of a node. It works together with C4D_HANG_TIMEOUT. Default value: 18, which means 3 minutes of hang time before triggering node isolation. C4D_CONN_BW_CHECK_PERIOD: The bandwidth check interval. Default value: 10. Unit: seconds. C4D_RUNTIME_LOG_LEVEL: The log level for C4D at runtime. Valid values: TRACE DEBUG INFO (default) WARNING ERROR FATAL
C4D_ENABLE_STATS_OUTPUT: Specifies whether to output C4D-related statistics. Valid values:
| --c4d-common-envs=C4D_HANG_TIMEOUT=1,C4D_HANG_TIMES=2
|
For error-level logs, AIMaster automatically blacklists the affected nodes and restarts the job. The following table shows the processing logic for each level.
Log Level | Description | Action |
Log Level | Description | Action |
Error | By default, if communication layer hang time exceeds three minutes, the job fails. You can change the value by specifing the C4D_HANG_TIMEOUT and C4D_HANG_TIMES parameters. | AIMaster isolates the nodes indicated in the logs. |
Warning | By default, if communication layer hang time exceeds 10 seconds, it impacts performance but doesn't cause failure. You can change the value by specifing the C4D_HANG_TIMEOUT parameter. | Nodes in logs are not automatically isolated. Manual confirmation is required. |
Non-communication layer hang time exceeding 10 seconds may cause job failure. | Nodes in logs are not automatically isolated. Manual confirmation is required. |
Info | Slow communication and non-communication layers. | Logs are for performance issues and require manual review. |
If a DLC job appears slow or hung, click the job name in the DLC job list to access the job details page. In the instance section, you can review the AIMaster logs for C4D diagnostic results. For detailed analysis of the diagnosis results, see Diagnosis Result Samples.
Diagnosis Result Samples
RankCommHang: Indicates a communication layer hang at a node.
RankNonCommHang: Indicates a non-communication layer hang, such as in computation.
RankCommSlow: Indicates a slow communication layer issue at a node.
RankNonCommSlow: Indicates a slow non-communication layer issue at a node.
FAQ
How to install AIMaster SDK
Install AIMaster SDK by using the wheel package that matches your Python version.
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl