Fault isolation
Platform for AI (PAI) divides a region into multiple zones. Each zone is an isolated area that has its own power supply and network.
Zones in the same region are connected over a low-latency internal network. Fault isolation is enabled between zones so that an incident in one zone does not affect operations in another zone.
Elastic fault tolerance
PAI provides AIMaster, which is an elastic fault-tolerant engine that facilitates execution of Deep Learning Container (DLC) jobs. When you use AIMaster for a DLC job, an AIMaster instance is launched to run concurrently with other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.
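The monitoring role described above can be illustrated with a minimal sketch. It is not the AIMaster API: the `Supervisor` class, the status strings, and the `max_restarts` budget are all hypothetical names chosen for illustration. The sketch only shows the general idea of a separate process that watches worker status and decides whether to keep, restart, or give up on a worker.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Supervisor:
    """Hypothetical fault-tolerance monitor (illustration only, not AIMaster)."""

    max_restarts: int = 2  # assumed per-worker restart budget
    restarts: Dict[str, int] = field(default_factory=dict)

    def handle(self, statuses: Dict[str, str]) -> Dict[str, str]:
        """Map each worker to an action based on its reported status."""
        actions: Dict[str, str] = {}
        for worker, status in statuses.items():
            if status != "failed":
                actions[worker] = "keep"
                continue
            used = self.restarts.get(worker, 0)
            if used < self.max_restarts:
                # Restart the failed worker and charge its budget.
                self.restarts[worker] = used + 1
                actions[worker] = "restart"
            else:
                # Budget exhausted: stop retrying and fail the job.
                actions[worker] = "abort_job"
        return actions
```

In this sketch, a worker that keeps failing is restarted until its budget runs out, after which the supervisor reports `abort_job` instead of retrying indefinitely.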
Sanity check
The sanity check feature of DLC checks the health status and performance of the computing resources that run DLC jobs. You can enable the feature when you create a DLC job. When it is enabled, the system automatically examines the resources related to the job, isolates malfunctioning nodes, and triggers an automated O&M process in the background. Detecting faulty resources at an early stage reduces job failures and increases the success rate of a job. After a sanity check is complete, the system provides a test report on the computing power and communication performance of the related GPUs. You can use the report to identify potential risks that may impair training performance, which makes troubleshooting more efficient. For more information, see Sanity check.
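The examine-and-isolate step can be sketched as follows. This is an illustration under stated assumptions, not how DLC implements the feature: the function name `sanity_check`, the measurement format (compute throughput in TFLOPS and communication bandwidth in Gbit/s per node), and the threshold parameters are all hypothetical.

```python
from typing import Dict, List, Tuple


def sanity_check(
    measurements: Dict[str, Tuple[float, float]],
    min_tflops: float,
    min_bandwidth_gbps: float,
) -> Tuple[List[str], List[str]]:
    """Split nodes into (healthy, isolated) using assumed thresholds.

    measurements maps a node name to (compute TFLOPS, bandwidth in Gbit/s).
    """
    healthy: List[str] = []
    isolated: List[str] = []
    # Sort node names so the result is deterministic.
    for node, (tflops, bandwidth) in sorted(measurements.items()):
        if tflops >= min_tflops and bandwidth >= min_bandwidth_gbps:
            healthy.append(node)
        else:
            # A node that misses either threshold is kept out of the job.
            isolated.append(node)
    return healthy, isolated
```

A job scheduler built on this idea would place the job only on the nodes in the first list and hand the second list to a repair workflow.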
Infrastructure monitoring
You can use CloudMonitor to build and reinforce your security defense system. CloudMonitor provides the following feature for PAI:
Inference monitoring for the Elastic Algorithm Service (EAS) module of PAI: For more information, see View ServiceInstance events in CloudMonitor.