ACK uses the ack-node-problem-detector (NPD) component to monitor GPU resource health. When a GPU node encounters an anomaly—such as an XID or SXID error—NPD automatically detects and fences the affected GPU card. This ensures that healthy GPUs continue serving workloads, minimizing business impact while improving cluster reliability and operational efficiency.
Prerequisites
The ack-node-problem-detector (NPD) component is installed and its version is 1.2.24 or later.
When you use ack-nvidia-device-plugin version 0.17.0 or later with NPD version 1.2.24 or later, NPD automatically fences an abnormal GPU card upon detecting an anomaly and automatically lifts the fence when the GPU recovers.
To view or upgrade the ack-nvidia-device-plugin version, see View the NVIDIA Device Plugin version.
ack-node-problem-detector (NPD) is a cluster node anomaly monitoring component enhanced by ACK based on the open source project node-problem-detector. It includes a comprehensive set of GPU-specific check items to improve anomaly detection in GPU-accelerated environments. When an anomaly is detected, NPD generates a corresponding Kubernetes Event or Kubernetes Node Condition, depending on the anomaly type.
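You can view the Node Conditions and Events that NPD reports with standard kubectl commands, for example (replace <node-name> with the name of a GPU node):

    # Node conditions (including GPU-related conditions set by NPD) and recent events for the node
    kubectl describe node <node-name>

    # Only the events whose source object is the node
    kubectl get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>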
Notes
When a GPU anomaly is detected, the ack-node-problem-detector component creates an NVIDIA GPU quarantined file according to the default fencing policy. The ack-nvidia-device-plugin component then fences the affected GPU card based on this file. This prevents new workloads from being scheduled to the faulty GPU, avoiding task failures. Healthy GPUs remain available for scheduling. However, if fencing leaves insufficient GPUs on the node—for example, only seven cards remain for an eight-GPU task—the task cannot be scheduled, potentially leaving GPU resources idle. Automatic fencing is not automatic repair. The node instance continues to incur charges even after its GPU is fenced. You must still manually repair the node. Configure GPU anomaly alerts to enable prompt response.
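The quarantined file is created on the node itself, and ack-nvidia-device-plugin stops reporting the GPU cards listed in it. For example, you can inspect the file on an affected node (the path below is the default path mentioned in the FAQ at the end of this topic):

    # GPU cards fenced by NPD (the file is created when an anomaly is detected)
    cat /etc/nvidia-device-plugin/unhealthyDevices.json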
You can disable automatic GPU card fencing as needed. For instructions, see How do I disable the automatic GPU card fencing feature of NPD?. Certain versions of the NVIDIA Device Plugin also provide a native automatic GPU fencing feature, which is disabled in a different way. For details, see How do I disable the native GPU fencing feature of NVIDIA Device Plugin?.
The NVIDIA GPU driver writes XID and SXID errors to /var/log/messages or /var/log/syslog using the NVRM event mechanism. NPD tracks whether each XID and SXID has been processed. If you restart a node after an XID or SXID occurs, NPD will not generate an Event or Node Condition for that error—even if the underlying issue persists, such as XID 79, which indicates the GPU device must be replaced. NPD treats the XID as resolved after the restart.
NPD detects NVIDIA XID and SXID errors by scanning the /var/log/messages or /var/log/syslog file on the node. If dmesg logs are redirected to another file, NPD cannot detect these errors.
Starting with NPD version 1.2.29, the GPU anomaly detection plugin is deployed separately as a DaemonSet named ack-accel-health-monitor.
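As noted above, NPD detects these errors by scanning /var/log/messages or /var/log/syslog. To check manually whether any XID or SXID entries have been logged on a node, you can search the same files (the exact log prefixes may vary by driver version):

    # XID and SXID entries written to the system log by the NVIDIA driver stack
    grep -iE "NVRM: Xid|SXid" /var/log/messages /var/log/syslog 2>/dev/null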
In some cases, a GPU anomaly on a node may prevent GPU containers from starting. This can also block the GPU anomaly detection container from launching, halting the detection process.
The NPD GPU detection plugin pod requires elevated privileges, such as privileged=true, to inspect GPU devices and components. The following table lists the required permissions.
Cluster RBAC permissions | Container permissions |
Node: get, Node/Status: update, Events: create | privileged: true; read-only mounts of the host /dev/kmsg, /usr/lib, /etc, /usr/lib64, and /proc paths |
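Expressed as a Kubernetes container spec, the container permissions above correspond roughly to the following excerpt (illustrative only; the volume names and surrounding fields are assumptions, not the actual component manifest):

    securityContext:
      privileged: true
    volumeMounts:                # each entry is backed by a hostPath volume on the node
      - { name: kmsg, mountPath: /dev/kmsg, readOnly: true }
      - { name: usr-lib, mountPath: /usr/lib, readOnly: true }
      - { name: etc, mountPath: /etc, readOnly: true }
      - { name: usr-lib64, mountPath: /usr/lib64, readOnly: true }
      - { name: proc, mountPath: /proc, readOnly: true }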
Check items and repair suggestions
After identifying a GPU anomaly, refer to NVIDIA XID Errors for repair guidance. You can also review O&M events for the node instance in the console of the corresponding cloud product—such as ECS or Lingjun—based on the instance type. Alternatively, use a self-diagnosis tool to identify hardware anomalies on the node.
A Repair suggestion of None means no hardware intervention is required. Review your application configuration instead.
Check item name | Does it generate a Node Condition? | Is an event generated? | Description | Are GPU cards fenced by default? | Repair suggestion |
NvidiaXID13Error | No | Yes | | No | None |
NvidiaXID31Error | No | Yes | | No | None |
NvidiaXID43Error | No | Yes | | No | None |
NvidiaXID44Error | Yes | Yes | | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaXID45Error | No | Yes | | No | None |
NvidiaXID48Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID61Error | Yes | Yes | | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaXID62Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID63Error | No | Yes | | No | None |
NvidiaXID64Error | No | Yes | | No | None |
NvidiaXID69Error | Yes | Yes | | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaXID74Error | Yes | Yes | | Yes | Hardware repair. |
NvidiaXID79Error | Yes | Yes | | Yes | Hardware repair. |
NvidiaXID94Error | No | Yes | | No | None |
NvidiaXID95Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID109Error | Yes | Yes | | Yes (NPD <= 1.2.28) | None |
NvidiaXID119Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID120Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID140Error | Yes | Yes | | Yes | Restart the node. |
NvidiaXID[code]Error | No | Yes (generates only three events) | Other XIDs not listed in this table. | No | |
NvidiaSXID[code]Error | No | Yes (generates only three events) | | No | None |
NvidiaEccModeNotEnabled | Yes | Yes (generates events continuously until the issue is fixed) | ECC Mode is not enabled on the node. | No | Enable ECC Mode and restart the node. |
NvidiaPendingRetiredPages | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Restart the node. |
NvidiaRemappingRowsFailed | Yes | Yes (generates events continuously until the issue is fixed) | The GPU has failed row remapping. | Yes | Hardware repair. |
NvidiaRemappingRowsRequireReset | Yes | Yes (generates events continuously until the issue is fixed) | The GPU encountered an uncorrectable, uncontained error that requires a GPU reset for recovery. Reset the GPU as soon as possible to restore operation. | Yes (NPD <= 1.2.28) | Restart the node. |
NvidiaDeviceLost | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Hardware repair. |
NvidiaInfoRomCorrupted | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Hardware repair. |
NvidiaPowerCableErr | Yes | Yes (generates events continuously until the issue is fixed) | | Yes | Hardware repair. |
NvidiaPersistencedOffline | Yes | Yes | The Nvidia Persistenced service is not running. | No | Restart the nvidia-persistenced service. |
NvidiaFabricManagerOffline | Yes | Yes | The Nvidia Fabric Manager service is not running. | No | Restart the Fabric Manager service. |
NvidiaTemperatureHigh | Yes | Yes | The GPU temperature exceeds 100 degrees Celsius. | No | None |
NvidiaNVLinkStateErr | Yes | Yes | The Nvidia NVLink state is down. | No | Restart the machine. |
Other related events
In exclusive GPU scenarios, NPD automatically fences GPU cards based on the anomaly check items. After fencing, new GPU application pods are not assigned to the affected card. To verify the fencing effect, check the number of nvidia.com/gpu resources reported on the Kubernetes Node. After the GPU card recovers, ACK automatically lifts the fence.
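For example, you can check the allocatable GPU count that a node reports; a fenced card lowers this number, and it returns to the full count after the fence is lifted (replace <node-name> with the affected node):

    kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'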
Cause | Event content | Description |
GPU fencing | Yes | The GPU card is fenced due to detected anomalies. |
GPU card fencing deactivation | Yes | The GPU card has recovered from the anomaly, and the fencing is deactivated. |
FAQ
How do I disable the automatic GPU card fencing feature of NPD?
Background
When a GPU on a node becomes abnormal, ACK automatically fences it through NPD to prevent tasks from being scheduled to it. However, automatic fencing does not perform automatic repair. The node instance continues to incur charges even after its GPU is fenced. You must still manually restart or repair the node. Configure GPU anomaly alerts to enable prompt handling.
After fencing, if the remaining GPUs on the node are insufficient for a task—for example, only seven cards remain for an eight-GPU task—the task cannot be scheduled. This may leave GPU resources idle.
After the GPU status returns to normal, the fence is automatically lifted.
To disable automatic fencing so that an abnormal GPU continues reporting its resources and remains schedulable, follow the solution below.
Solution
Starting with ack-node-problem-detector version 1.2.30, you can control automatic GPU fencing using the generateNvidiaGpuIsolationFile configuration item in Component Management.
Disable the automatic GPU fencing feature of NPD.
(Recommended) Method 1: Modify the component configuration in Component Management.
On the Clusters page, click the name of the destination cluster. In the left navigation pane, click Add-ons.
On the Logs and Monitoring tab, locate the ack-node-problem-detector component and perform the appropriate action based on its version.
Versions 1.2.24 to 1.2.29: Check for available upgrades. If version 1.2.30 or later is available, click Upgrade.
Version 1.2.30 is in grayscale release. If you do not see version 1.2.30 or later, submit a ticket to request access.
Versions 1.2.30 and later: Click Configuration.
On the component upgrade or configuration page, set generateNvidiaGpuIsolationFile (Generate NVIDIA GPU quarantined file) to false, and then click OK.
Note: If you previously used Method 2 to temporarily disable automatic GPU fencing, this setting is retained during NPD upgrades. To re-enable automatic GPU card fencing, set generateNvidiaGpuIsolationFile to true.
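If the component parameters are presented as YAML, the relevant entry would look similar to the following; the surrounding structure is an assumption, and only the parameter name and value come from this topic:

    generateNvidiaGpuIsolationFile: false    # do not generate the NVIDIA GPU quarantined file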
Method 2: Manually modify the configuration using YAML.
Note: The following method is a temporary workaround. The configuration is lost if you upgrade NPD to a version earlier than 1.2.30. You must reconfigure it after the upgrade. We recommend upgrading to version 1.2.30 or later to make this configuration persistent.
Edit the NPD component YAML.
kubectl edit ds -n kube-system ack-node-problem-detector-daemonset
Set the EnabledIsolateGPU configuration to false.
Before: --EnabledIsolateGPU=true
After: --EnabledIsolateGPU=false
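After the edit, the flag appears among the container arguments in the DaemonSet spec, roughly as follows (the container name and neighboring fields are illustrative assumptions):

    containers:
      - name: node-problem-detector        # the actual container name may differ
        args:
          - --EnabledIsolateGPU=false      # automatic GPU fencing disabled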
Deactivate existing automatic GPU card fencing.
To deactivate existing fencing on a GPU card, log on to the node where the XID error occurred and delete the /etc/nvidia-device-plugin/unhealthyDevices.json file. To prevent the card from being fenced again, disable the automatic fencing feature as described in the previous step.
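For example, after logging on to the affected node:

    # Remove the quarantined file so that the fenced GPU card is reported and schedulable again
    sudo rm /etc/nvidia-device-plugin/unhealthyDevices.json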