This topic describes how to set up node self-healing notifications for Lingjun AI Computing Service so that you receive alerts promptly when service nodes experience exceptions. After you receive a notification, clear tasks from the affected node promptly so that the node can self-heal.
Background information
When the system detects a node issue, it automatically fails over to a standby machine to maintain the stability and high availability of your resources. Notifications can be enabled for the following scenarios: node scheduling is prohibited, or your tasks are running on an abnormal node.
Limits
This feature is currently exclusive to Lingjun AI Computing Service resources and is supported in the China (Ulanqab) and Singapore regions.
Enable notifications
You can receive notifications as internal messages or emails when node scheduling is prohibited or when your tasks are running on an abnormal node.
Log on to the PAI console.
In the upper-right corner, click the icon that opens the Message Center.

In the left-side navigation pane, choose .
In the Notification Type column, find . Ensure the contact is added, then select Internal Messages or Email.

If an abnormality is detected, you receive a notification that lists the affected node name, the resource quota, and information about the tasks running on the node.
Procedure
After receiving a notification, follow these steps to clear DSW instances and DLC jobs from the abnormal node:
Migrate DSW instances
Method 1: Manual migration
For DSW instances on abnormal nodes, a browser pop-up will prompt you to save the environment and shut down the instance to support the self-healing of the node.
Method 2: Automatic migration
Log on to the PAI console.
In the left-side navigation pane, select Workspaces. On the Workspaces page, click the name of the desired workspace.
On the Workspace Details page, click the Scheduling Center tab.
In the DSW section, turn on the Enable Automatic Instance Migration from Abnormal Node switch.
Once enabled, the system automatically restarts the instance if the node becomes abnormal, which supports the self-healing process and keeps resources available. The restart saves the environment image, but running processes are not recoverable.
For DSW instances on abnormal nodes, a browser pop-up will prompt you to save the environment and shut down the instance. It also shows the time remaining before an automatic restart.
Stop DLC jobs
Click the link in the internal message or email to go to the Resource Quota page.
Click and view the tasks under the node based on the node information provided by the notification. 
Click the name of the affected DLC job. Then, in the upper right corner, click . 
Click Clone to replicate the job and reschedule it to a healthy node. For more information, see Clone a training job. 
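The procedure above can be summarized as a small triage routine. The following Python sketch is illustrative only: the real notification is an internal message or email, not a structured payload, so the `Notification` and `Task` classes, their field names, and the `plan_cleanup` function are all hypothetical. It only maps each task type named in a notification to the manual action described in this topic.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical data shapes: they mirror the fields this topic says a
# notification contains (node name, resource quota, tasks on the node).
@dataclass
class Task:
    name: str
    kind: str  # "DSW" for a DSW instance, "DLC" for a DLC job


@dataclass
class Notification:
    node_name: str
    resource_quota: str
    tasks: list[Task]


def plan_cleanup(note: Notification, auto_migration: bool = False) -> list[str]:
    """Return one console action per task that must leave the abnormal node."""
    actions = []
    for task in note.tasks:
        if task.kind == "DSW":
            if auto_migration:
                # Automatic migration restarts the instance; running
                # processes are not recoverable, so save work first.
                actions.append(
                    f"{task.name}: save work before the automatic restart"
                )
            else:
                actions.append(
                    f"{task.name}: save the environment and shut down the instance"
                )
        elif task.kind == "DLC":
            actions.append(
                f"{task.name}: stop the job, then clone it onto a healthy node"
            )
        else:
            actions.append(f"{task.name}: unknown task type, check manually")
    return actions


if __name__ == "__main__":
    # Example values are made up for illustration.
    note = Notification(
        node_name="example-node",
        resource_quota="example-quota",
        tasks=[Task("notebook-1", "DSW"), Task("training-1", "DLC")],
    )
    for action in plan_cleanup(note):
        print(action)
```

The routine performs no console operations. It is a checklist generator that you could adapt if you track affected tasks outside the console.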