Configure node self-healing notification

This topic describes how to set up node self-healing notifications for Lingjun AI Computing Service so that you are alerted promptly when a node becomes abnormal. After you receive a notification, clear the tasks from the affected node as soon as possible so that the node can self-heal.

Background information

The system automatically initiates a failover to a standby machine upon detecting a node issue, maintaining the stability and high availability of your resources. Notifications can be activated for the following scenarios:

Node scheduling prohibited
Node self-healing obstructed: Tasks running on the abnormal node can impede the self-healing process. The following actions are necessary:
- DSW instance: Manually save the environment and shut down the instance, or configure automatic instance restarts through the DSW policy in the scheduling center.
- DLC task: Manually stop the job.

Limits

This feature is currently exclusive to Lingjun AI Computing Service resources and is supported in the China (Ulanqab) and Singapore regions.

Enable notifications

You can enable internal message or email notifications that are sent when node scheduling is prohibited or when your tasks are running on an abnormal node.

Log on to the PAI console.
In the upper-right corner of the console, open the Message Center.
In the left-side navigation pane, choose Message Settings > Common Settings.
In the Notification Type column, find Product Message > Product operation notifications. Make sure that a contact has been added, then select Internal Messages or Email.
If an abnormality is detected, you receive a notification that lists the affected node name, the resource quota, and the tasks running on the node.
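
If you track these notifications in your own tooling, it can help to record the three pieces of information they carry and to group the affected tasks by type, because DSW instances and DLC jobs are cleared differently. The following is a minimal sketch only: the notification arrives as an internal message or email, not as a structured payload, so the class names, field names, and sample values below are hypothetical and are not part of any PAI API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AffectedTask:
    """One task reported as running on the abnormal node (hypothetical model)."""
    name: str
    kind: str  # "DSW" for a DSW instance, "DLC" for a DLC job


@dataclass
class NodeNotification:
    """The details a notification lists: node name, resource quota, and tasks."""
    node_name: str
    resource_quota: str
    tasks: List[AffectedTask] = field(default_factory=list)


def tasks_by_kind(note: NodeNotification) -> Dict[str, List[str]]:
    """Group task names by task type so each group can be handled in one pass."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for task in note.tasks:
        grouped[task.kind].append(task.name)
    return dict(grouped)


if __name__ == "__main__":
    # Sample values are made up for illustration.
    note = NodeNotification(
        node_name="example-node",
        resource_quota="example-quota",
        tasks=[
            AffectedTask("notebook-a", "DSW"),
            AffectedTask("train-job-1", "DLC"),
            AffectedTask("train-job-2", "DLC"),
        ],
    )
    for kind, names in tasks_by_kind(note).items():
        print(kind, names)
```

Grouping by type keeps the follow-up simple: every name under "DSW" goes through the DSW migration steps, and every name under "DLC" goes through the steps for stopping DLC jobs.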

Procedure

After receiving a notification, follow these steps to clear DSW instances and DLC jobs from the abnormal node:

Migrate DSW instances

Method 1: Manual migration

For DSW instances on abnormal nodes, a browser pop-up will prompt you to save the environment and shut down the instance to support the self-healing of the node.

Method 2: Automatic migration

Log on to the PAI console.
In the left-side navigation pane, select Workspaces. On the Workspaces page, click the name of the desired workspace.
On the Workspace Details page, click the Scheduling Center tab.
In the DSW section, turn on the Enable Automatic Instance Migration from Abnormal Node switch.
Once enabled, the system automatically restarts the instance when a node becomes abnormal, which supports the self-healing process and keeps resources available. The restart saves the environment image, but running processes are not recoverable.

For DSW instances on abnormal nodes, a browser pop-up will prompt you to save the environment and shut down the instance. It also shows the time remaining before an automatic restart.
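
The difference between the two methods can be summarized as a small decision function. This is an illustrative sketch of the behavior described above, not console or SDK code; the function name and the wording of the returned steps are assumptions made for the example.

```python
from typing import List


def dsw_migration_steps(auto_migration_enabled: bool) -> List[str]:
    """Return what happens to a DSW instance on an abnormal node.

    auto_migration_enabled mirrors the "Enable Automatic Instance Migration
    from Abnormal Node" switch in the workspace scheduling center.
    """
    if auto_migration_enabled:
        # Method 2: the system acts on its own after the countdown in the pop-up.
        return [
            "The system restarts the instance automatically.",
            "The restart saves the environment image.",
            "Running processes are not recovered.",
        ]
    # Method 1: the user must act on the browser pop-up.
    return [
        "Save the environment manually.",
        "Shut down the instance manually.",
    ]


if __name__ == "__main__":
    for enabled in (False, True):
        print("auto migration enabled:", enabled)
        for step in dsw_migration_steps(enabled):
            print("  -", step)
```

In both cases, work held only in running processes is at risk, so checkpoint long-running work before the instance is shut down or restarted.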

Stop DLC jobs

Click the link in the internal message or email to go to the Resource Quota page.
Using the node information in the notification, find the affected node and view the tasks running on it.
Click the name of the affected DLC job. Then, in the upper right corner, click More > Stop.
Click Clone to replicate the job and reschedule it to a healthy node. For more information, see Clone a training job.
