ACK uses the ack-node-problem-detector (NPD) component to monitor GPU resource health. When a GPU node encounters an anomaly—such as an XID or SXID error—NPD automatically detects and fences the affected GPU card. This ensures that healthy GPUs continue serving workloads, minimizing business impact while improving cluster reliability and operational efficiency.
Prerequisites
The ack-node-problem-detector (NPD) component is installed and its version is 1.2.24 or later.
When you use ack-nvidia-device-plugin version 0.17.0 or later with NPD version 1.2.24 or later, NPD automatically fences an abnormal GPU card upon detecting an anomaly and automatically lifts the fence when the GPU recovers.
To view or upgrade the ack-nvidia-device-plugin version, see View the NVIDIA Device Plugin version.
ack-node-problem-detector (NPD) is a cluster node anomaly monitoring component enhanced by ACK based on the open source project node-problem-detector. It includes a comprehensive set of GPU-specific check items to improve anomaly detection in GPU-accelerated environments. When an anomaly is detected, NPD generates a corresponding Kubernetes Event or Kubernetes Node Condition, depending on the anomaly type.
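You can view the Node Conditions and Events that NPD reports with standard kubectl commands, for example (replace <node-name> with the name of a GPU node):

    # Node conditions (including GPU-related conditions set by NPD) and recent events for the node
    kubectl describe node <node-name>

    # Only the events whose source object is the node
    kubectl get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>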
Notes
When a GPU anomaly is detected, the ack-node-problem-detector component creates an NVIDIA GPU quarantined file according to the default fencing policy. The ack-nvidia-device-plugin component then fences the affected GPU card based on this file. This prevents new workloads from being scheduled to the faulty GPU, avoiding task failures. Healthy GPUs remain available for scheduling. However, if fencing leaves insufficient GPUs on the node—for example, only seven cards remain for an eight-GPU task—the task cannot be scheduled, potentially leaving GPU resources idle. Automatic fencing is not automatic repair. The node instance continues to incur charges even after its GPU is fenced. You must still manually repair the node. Configure GPU anomaly alerts to enable prompt response.
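The quarantined file is created on the node itself, and ack-nvidia-device-plugin stops reporting the GPU cards listed in it. For example, you can inspect the file on an affected node (the path below is the default path mentioned in the FAQ at the end of this topic):

    # GPU cards fenced by NPD (the file is created when an anomaly is detected)
    cat /etc/nvidia-device-plugin/unhealthyDevices.json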
You can disable automatic GPU card fencing as needed. For instructions, see How do I disable the automatic GPU card fencing feature of NPD?. Certain versions of the NVIDIA Device Plugin also provide a native automatic GPU fencing feature, which is disabled in a different way. For details, see How do I disable the native GPU fencing feature of NVIDIA Device Plugin?.
The NVIDIA GPU driver writes XID and SXID errors to /var/log/messages or /var/log/syslog using the NVRM event mechanism. NPD tracks whether each XID and SXID has been processed. If you restart a node after an XID or SXID occurs, NPD will not generate an Event or Node Condition for that error—even if the underlying issue persists, such as XID 79, which indicates the GPU device must be replaced. NPD treats the XID as resolved after the restart.
NPD detects NVIDIA XID and SXID errors by scanning the /var/log/messages or /var/log/syslog file on the node. If dmesg logs are redirected to another file, NPD cannot detect these errors.
Starting with NPD version 1.2.29, the GPU anomaly detection plugin is deployed separately as a DaemonSet named ack-accel-health-monitor.
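As noted above, NPD detects these errors by scanning /var/log/messages or /var/log/syslog. To check manually whether any XID or SXID entries have been logged on a node, you can search the same files (the exact log prefixes may vary by driver version):

    # XID and SXID entries written to the system log by the NVIDIA driver stack
    grep -iE "NVRM: Xid|SXid" /var/log/messages /var/log/syslog 2>/dev/null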
In some cases, a GPU anomaly on a node may prevent GPU containers from starting. This can also block the GPU anomaly detection container from launching, halting the detection process.
The NPD GPU detection plugin pod requires elevated privileges, such as privileged=true, to inspect GPU devices and components. The following table lists the required permissions.
Cluster RBAC permissions | Container permissions |
Node: get, Node/Status: update, Events: create | privileged: true; read-only mounts of the host /dev/kmsg, /usr/lib, /etc, /usr/lib64, and /proc paths |
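Expressed as a Kubernetes container spec, the container permissions above correspond roughly to the following excerpt (illustrative only; the volume names and surrounding fields are assumptions, not the actual component manifest):

    securityContext:
      privileged: true
    volumeMounts:                # each entry is backed by a hostPath volume on the node
      - { name: kmsg, mountPath: /dev/kmsg, readOnly: true }
      - { name: usr-lib, mountPath: /usr/lib, readOnly: true }
      - { name: etc, mountPath: /etc, readOnly: true }
      - { name: usr-lib64, mountPath: /usr/lib64, readOnly: true }
      - { name: proc, mountPath: /proc, readOnly: true }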
Check items and repair suggestions
After identifying a GPU anomaly, refer to NVIDIA XID Errors for repair guidance. You can also review O&M events for the node instance in the console of the corresponding cloud product—such as ECS or Lingjun—based on the instance type. Alternatively, use a self-diagnosis tool to identify hardware anomalies on the node.
A Repair suggestion of None means no hardware intervention is required. Review your application configuration instead.
Check item name | Does it generate a Node Condition? | Is an event generated? | Description | Are GPU cards fenced by default? | Repair suggestion |
NvidiaXID13Error | No | Yes | | No | None |
NvidiaXID31Error | No | Yes | | No | None |
NvidiaXID43Error | No | Yes | | No | None |
NvidiaXID44Error | Yes | Yes | | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaXID45Error | No | Yes | | No | None |
NvidiaXID48Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID61Error | Yes | Yes | | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaXID62Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID63Error | No | Yes | | No | None |
NvidiaXID64Error | No | Yes | | No | None |
NvidiaXID69Error | Yes | Yes | | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaXID74Error | Yes | Yes | | Yes | Hardware repair. |
NvidiaXID79Error | Yes | Yes | | Yes | Hardware repair. |
NvidiaXID94Error | No | Yes | | No | None |
NvidiaXID95Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID109Error | Yes | Yes | | Yes (NPD <= 1.2.28) | None |
NvidiaXID119Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID120Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID140Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID[code]Error | No | Yes (generates only three events) | Other XIDs not listed in this table. | No | |
NvidiaSXID[code]Error | No | Yes (generates only three events) | | No | None |
NvidiaEccModeNotEnabled | Yes | Yes (generates events continuously until the issue is fixed) | ECC Mode is not enabled on the node. | No | Enable ECC Mode and restart the node. |
NvidiaPendingRetiredPages | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Restart the node. |
NvidiaRemappingRowsFailed | Yes | Yes (generates events continuously until the issue is fixed) | The GPU has failed row remapping. | Yes | Hardware repair. |
NvidiaRemappingRowsRequireReset | Yes | Yes (generates events continuously until the issue is fixed) | The GPU encountered an uncorrectable, uncontained error that requires a GPU reset for recovery. Reset the GPU as soon as possible to restore operation. | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaDeviceLost | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Hardware repair. |
NvidiaInfoRomCorrupted | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Hardware repair. |
NvidiaPowerCableErr | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Hardware repair. |
NvidiaPersistencedOffline | Yes | Yes | The Nvidia Persistenced service is not running. | No | Restart the nvidia-persistenced service. |
NvidiaFabricManagerOffline | Yes | Yes | The Nvidia Fabric Manager service is not running. | No | Restart the Fabric Manager service. |
NvidiaTemperatureHigh | Yes | Yes | The GPU temperature exceeds 100 degrees Celsius. | No | None |
NvidiaNVLinkStateErr | Yes | Yes | The Nvidia NVLink state is down. | No | Restart the machine. |
Other related events
In exclusive GPU scenarios, NPD automatically fences GPU cards based on the anomaly check items. After fencing, new GPU application pods are not assigned to the affected card. To verify the fencing effect, check the number of nvidia.com/gpu resources reported on the Kubernetes Node. After the GPU card recovers, ACK automatically lifts the fence.
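For example, you can check the allocatable GPU count that a node reports; a fenced card lowers this number, and it returns to the full count after the fence is lifted (replace <node-name> with the affected node):

    kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'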
Cause | Event content | Description |
GPU fencing | Yes | The GPU card is fenced due to detected anomalies. |
GPU card fencing deactivation | Yes | The GPU card has recovered from the anomaly, and the fencing is deactivated. |
FAQ
How do I disable the automatic GPU card fencing feature of NPD?
Background
When a GPU on a node becomes abnormal, ACK automatically fences it through NPD to prevent tasks from being scheduled to it. However, automatic fencing does not perform automatic repair. The node instance continues to incur charges even after its GPU is fenced. You must still manually restart or repair the node. Configure GPU anomaly alerts to enable prompt handling.
After fencing, if the remaining GPUs on the node are insufficient for a task—for example, only seven cards remain for an eight-GPU task—the task cannot be scheduled. This may leave GPU resources idle.
After the GPU status returns to normal, the fence is automatically lifted.
To disable automatic fencing so that an abnormal GPU continues reporting its resources and remains schedulable, follow the solution below.
Solution
Starting with ack-node-problem-detector version 1.2.30, you can control automatic GPU fencing using the generateNvidiaGpuIsolationFile configuration item in Component Management.
Disable the automatic GPU fencing feature of NPD.
(Recommended) Method 1: Modify the component configuration in Component Management.
On the Clusters page, click the name of the destination cluster. In the left navigation pane, click Add-ons.
On the Logs and Monitoring tab, locate the ack-node-problem-detector component and perform the appropriate action based on its version.
Versions 1.2.24 to 1.2.29: Check for available upgrades. If version 1.2.30 or later is available, click Upgrade.
Version 1.2.30 is in grayscale release. If you do not see version 1.2.30 or later, submit a ticket to request access.
Versions 1.2.30 and later: Click Configuration.
On the component upgrade or configuration page, set generateNvidiaGpuIsolationFile (Generate NVIDIA GPU quarantined file) to false, and then click OK.
Note: If you previously used Method 2 to temporarily disable automatic GPU fencing, this setting is retained during NPD upgrades. To re-enable automatic GPU card fencing, set generateNvidiaGpuIsolationFile to true.
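If the component parameters are presented as YAML, the relevant entry would look similar to the following; the surrounding structure is an assumption, and only the parameter name and value come from this topic:

    generateNvidiaGpuIsolationFile: false    # do not generate the NVIDIA GPU quarantined file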
Method 2: Manually modify the configuration using YAML.
Note: The following method is a temporary workaround. The configuration is lost if you upgrade NPD to a version earlier than 1.2.30. You must reconfigure it after the upgrade. We recommend upgrading to version 1.2.30 or later to make this configuration persistent.
Edit the NPD component YAML.
kubectl edit ds -n kube-system ack-node-problem-detector-daemonset
Set the EnabledIsolateGPU configuration to false.
Before: --EnabledIsolateGPU=true
After: --EnabledIsolateGPU=false
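After the edit, the flag appears among the container arguments in the DaemonSet spec, roughly as follows (the container name and neighboring fields are illustrative assumptions):

    containers:
      - name: node-problem-detector        # the actual container name may differ
        args:
          - --EnabledIsolateGPU=false      # automatic GPU fencing disabled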
Deactivate existing automatic GPU card fencing.
To deactivate existing fencing on a GPU card, log on to the node where the XID error occurred and delete the /etc/nvidia-device-plugin/unhealthyDevices.json file. To prevent the card from being fenced again, disable the automatic fencing feature as described in the previous step.
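For example, after logging on to the affected node:

    # Remove the quarantined file so that the fenced GPU card is reported and schedulable again
    sudo rm /etc/nvidia-device-plugin/unhealthyDevices.json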