This topic describes how to configure notifications for node self-healing. This feature ensures that you receive prompt alerts when the underlying machine nodes for Lingjun resources become abnormal. After you receive a notification, you must clear all tasks from the affected node as soon as possible to allow the self-healing process to complete.
Background information
When the system detects an abnormal node, it uses its self-healing capability to automatically switch to a standby machine. This ensures the stability and high availability of your resources. You can enable notifications for the following two scenarios:
Node scheduling disabled: The system identifies an abnormal node and temporarily disables scheduling to that node.
Node self-healing obstructed: The self-healing process is obstructed because tasks are running on the abnormal node. In this case, you must perform the following operations:
For a DSW instance: Manually save the environment and shut down the instance, or configure a policy in the DSW configuration of the scheduling center to automatically restart the instance.
For a DLC job: Manually stop the job.
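The rules above can be summarized as a small lookup. The following is a minimal sketch only: the `Scenario` and `Workload` enums and the `required_operation` helper are hypothetical names introduced for illustration and are not part of any PAI SDK or API.

```python
from enum import Enum


class Scenario(Enum):
    # The two notification scenarios described in this topic.
    SCHEDULING_DISABLED = "Node scheduling disabled"
    SELF_HEALING_OBSTRUCTED = "Node self-healing obstructed"


class Workload(Enum):
    DSW_INSTANCE = "DSW instance"
    DLC_JOB = "DLC job"


def required_operation(scenario: Scenario, workload: Workload) -> str:
    """Return the operation this topic asks for (hypothetical helper)."""
    if scenario is Scenario.SCHEDULING_DISABLED:
        # This topic lists no mandatory operation for this scenario.
        return "No operation listed in this topic."
    if workload is Workload.DSW_INSTANCE:
        return (
            "Save the environment and shut down the instance, "
            "or rely on the automatic restart policy."
        )
    # Remaining case: a DLC job on a node whose self-healing is obstructed.
    return "Manually stop the job."


if __name__ == "__main__":
    print(required_operation(Scenario.SELF_HEALING_OBSTRUCTED, Workload.DLC_JOB))
```

Only the "Node self-healing obstructed" scenario maps to an operation, because that is the only case in which this topic requires you to act.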
Limits
This feature is available only for Lingjun resources.
Enable message notifications
You can receive notifications via internal messages or emails when the system disables node scheduling or when your task is running on an abnormal node. To ensure that you receive these notifications in a timely manner, we recommend that you enable them as follows:
Log on to the PAI console.
In the upper-right corner, click the icon to go to the Message Center.
In the navigation pane on the left, choose .
In the Message Type column, find . Select Internal Message and Email, and confirm that a recipient has been added. You can also click Modify under Account Contact to configure more contacts.

After you complete the configuration, whenever the system detects an abnormal node, it notifies you of the affected node name, the resource quota, and the tasks running on the node.
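If you track these notifications in your own tooling, the three pieces of information can be modeled as a simple record. This is an illustrative sketch only: the `NodeAlert` class, its field names, and the `summarize` helper are assumptions made for this example and do not reflect the actual format of the internal message or email.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeAlert:
    # Hypothetical field names; the real notification layout may differ.
    node_name: str
    resource_quota: str
    tasks: List[str] = field(default_factory=list)


def summarize(alert: NodeAlert) -> str:
    """Build a one-line reminder of what must be cleared from the node."""
    task_list = ", ".join(alert.tasks) if alert.tasks else "none"
    return (
        f"Node {alert.node_name} (quota {alert.resource_quota}): "
        f"tasks to clear: {task_list}"
    )


if __name__ == "__main__":
    # Placeholder values, not real resource names.
    alert = NodeAlert("node-a", "quota-x", ["dsw-1", "dlc-2"])
    print(summarize(alert))
```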
User guide
After you receive a notification that node self-healing is obstructed, follow these steps to clear the DSW instances and DLC jobs from the abnormal node. This ensures that the node replacement can proceed normally.
Migrate DSW instances
Method 1: Manual migration
For a DSW instance on an abnormal node, if your browser is open, a pop-up window appears in the DSW instance. This window reminds you to save the environment and shut down the instance as soon as possible to allow the Lingjun node to self-heal. Follow the prompt: save the environment, and then shut down the instance.
Method 2: Automatic migration
Automatic migration is currently available in the China (Ulanqab) and Singapore regions.
Log on to the PAI console.
In the navigation pane on the left, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
On the right side of the Workspace Details page, choose .
In the DSW section, turn on the Enable Automatic Instance Migration from Abnormal Node switch.
After you enable this feature, the system automatically shuts down and restarts the instance when an underlying machine node for Lingjun resources becomes abnormal. This supports the self-healing process of the underlying node and ensures the high availability of your resources. During the restart, the environment image is saved, but running processes cannot be recovered.
For a DSW instance on an abnormal node, if your browser is open, a pop-up window appears. This window reminds you to save the environment and shut down the instance as soon as possible. It also displays the time remaining before an automatic restart occurs to allow the Lingjun node to self-heal.
Stop DLC jobs
Click the details link in the internal message or email to go to the resource quota page.
Based on the provided node information, click the node to view the list of tasks running on it.

Click the name of the DLC job to open the job details page. Then, in the upper-right corner, choose to stop the DLC job.

Click Clone. Your job reuses the original configuration and is scheduled to a healthy node. For more information, see Clone a training job.
