Configure node self-healing notification

Updated at: 2025-02-25 02:47

This topic describes how to configure node self-healing notifications for Lingjun AI Computing Service so that you are alerted promptly when a node becomes abnormal. After you receive a notification, clear the tasks from the affected node as soon as possible so that the node can self-heal.

Background information

When the system detects a node issue, it automatically fails over to a standby machine to keep your resources stable and highly available. You can enable notifications for the following scenarios:

  • Node scheduling prohibited

  • Node self-healing obstructed: Tasks running on the abnormal node can impede the self-healing process. The following actions are necessary:

    • DSW instance: Manually save the environment and shut down the instance, or enable automatic instance migration in the DSW section of the Scheduling Center tab.

    • DLC task: Manually stop the job.

Limits

This feature is currently available only for Lingjun AI Computing Service resources, in the China (Ulanqab) and Singapore regions.

Enable notifications

You can receive notifications by internal message or email when scheduling is prohibited on a node or when your tasks are running on an abnormal node.

  1. Log on to the PAI console.

  2. In the upper-right corner, click the icon to go to the Message Center.

  3. In the left-side navigation pane, choose Message Settings > Common Settings.

  4. In the Notification Type column, find Product Message > Product operation notifications. Make sure that a contact has been added, and then select Internal Messages or Email.

    If an abnormality is detected, you receive a notification that lists the affected node name, the resource quota, and the tasks running on the node.

Procedure

After receiving a notification, follow these steps to clear DSW instances and DLC jobs from the abnormal node:

Migrate DSW instances

Method 1: Manual migration

For DSW instances on abnormal nodes, a browser pop-up will prompt you to save the environment and shut down the instance to support the self-healing of the node.

Method 2: Automatic migration

  1. Log on to the PAI console.

  2. In the left-side navigation pane, select Workspaces. On the Workspaces page, click the name of the desired workspace.

  3. On the Workspace Details page, click the Scheduling Center tab.

  4. In the DSW section, turn on the Enable Automatic Instance Migration from Abnormal Node switch.

    After you turn on this switch, the system automatically restarts the instance when the node becomes abnormal. This supports the self-healing process and keeps your resources available. The restart saves the environment image, but running processes are not recoverable.

When automatic migration is enabled, the browser pop-up for DSW instances on abnormal nodes still prompts you to save the environment and shut down the instance. It also shows the time remaining before the automatic restart.

Stop DLC jobs

  1. Click the link in the internal message or email to go to the Resource Quota page.

  2. Based on the node information in the notification, find the affected node and view the tasks running on it.

  3. Click the name of the affected DLC job. Then, in the upper-right corner, click More > Stop.

  4. Click Clone to replicate the job and reschedule it to a healthy node. For more information, see Clone a training job.
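
The procedure above can be summarized as a small triage routine. The following Python sketch is illustrative only: it does not call any PAI API, and the names NodeAlert, Task, and plan_actions, as well as the "DSW" and "DLC" labels, are hypothetical. It only models the decision described in this topic, assuming you have already copied the node name, resource quota, and task list from the notification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    kind: str  # "DSW" or "DLC" (hypothetical labels for this sketch)


@dataclass
class NodeAlert:
    # Fields mirror what the notification is described as containing.
    node_name: str
    resource_quota: str
    tasks: List[Task]


def plan_actions(alert: NodeAlert, auto_migration_enabled: bool) -> List[str]:
    """Return one suggested step per task running on the abnormal node."""
    steps = []
    for task in alert.tasks:
        if task.kind == "DSW":
            if auto_migration_enabled:
                # Running processes are not recoverable after the restart.
                steps.append(f"{task.name}: instance restarts automatically; save running work first")
            else:
                steps.append(f"{task.name}: save the environment and shut down the instance")
        elif task.kind == "DLC":
            steps.append(f"{task.name}: stop the job, then clone it to reschedule")
        else:
            steps.append(f"{task.name}: unknown task type, check manually")
    return steps


if __name__ == "__main__":
    # Example values are made up for illustration.
    alert = NodeAlert("node-a", "quota-1", [Task("notebook-1", "DSW"), Task("train-1", "DLC")])
    for step in plan_actions(alert, auto_migration_enabled=False):
        print(step)
```

Keeping the decision in one function makes it easy to see that every task on the node needs an explicit outcome before the node can self-heal.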
