
Container Service for Kubernetes:GPU anomaly detection and automatic fencing

Last Updated: Feb 12, 2026

ACK uses the ack-node-problem-detector (NPD) component to monitor GPU resource health. When a GPU node encounters an anomaly—such as an XID or SXID error—NPD automatically detects and fences the affected GPU card. This ensures that healthy GPUs continue serving workloads, minimizing business impact while improving cluster reliability and operational efficiency.

Prerequisites

  • The ack-node-problem-detector (NPD) component is installed and its version is 1.2.24 or later.

  • When you use ack-nvidia-device-plugin version 0.17.0 or later with NPD version 1.2.24 or later, NPD automatically fences an abnormal GPU card upon detecting an anomaly and automatically lifts the fence when the GPU recovers.

    To view or upgrade the ack-nvidia-device-plugin version, see View the NVIDIA Device Plugin version.
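
You can check the installed NPD version from the command line. The following is a minimal sketch that assumes the default DaemonSet name used later in this topic; the NVIDIA Device Plugin deployment layout varies by cluster, so use the link above to check its version.

    # Print the container image of the NPD DaemonSet; the image tag is the component version.
    kubectl get daemonset ack-node-problem-detector-daemonset -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[*].image}'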

ack-node-problem-detector (NPD) is a cluster node anomaly monitoring component enhanced by ACK based on the open source project node-problem-detector. It includes a comprehensive set of GPU-specific check items to improve anomaly detection in GPU-accelerated environments. When an anomaly is detected, NPD generates a corresponding Kubernetes Event or Kubernetes Node Condition, depending on the anomaly type.
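
To see what NPD reports for a node, inspect the node's Conditions and the Events that reference the Node object. A minimal sketch, with the node name as a placeholder:

    # Node Conditions written by NPD appear in the Conditions section of the node description.
    kubectl describe node <node-name>

    # Events generated by NPD reference the Node object; list them across all namespaces.
    kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>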

Notes

  • When a GPU anomaly is detected, the ack-node-problem-detector component creates an NVIDIA GPU quarantined file according to the default fencing policy. The ack-nvidia-device-plugin component then fences the affected GPU card based on this file. This prevents new workloads from being scheduled to the faulty GPU, avoiding task failures. Healthy GPUs remain available for scheduling. However, if fencing leaves insufficient GPUs on the node—for example, only seven cards remain for an eight-GPU task—the task cannot be scheduled, potentially leaving GPU resources idle. Automatic fencing is not automatic repair. The node instance continues to incur charges even after its GPU is fenced. You must still manually repair the node. Configure GPU anomaly alerts to enable prompt response.

    You can disable automatic GPU card fencing as needed. For instructions, see How do I disable the automatic GPU card fencing feature of NPD?. Some versions of the NVIDIA Device Plugin also provide a native GPU fencing feature, which is disabled in a different way. For details, see How do I disable the native GPU fencing feature of NVIDIA Device Plugin?.
  • The NVIDIA GPU driver writes XID and SXID errors to /var/log/messages or /var/log/syslog using the NVRM event mechanism. NPD tracks whether each XID and SXID has been processed. If you restart a node after an XID or SXID occurs, NPD will not generate an Event or Node Condition for that error—even if the underlying issue persists, such as XID 79, which indicates the GPU device must be replaced. NPD treats the XID as resolved after the restart.

  • NPD detects NVIDIA XID and SXID errors by scanning the /var/log/messages or /var/log/syslog file on the node. If dmesg logs are redirected to another file, NPD cannot detect these errors. You can inspect these logs and the quarantined file directly on the node, as shown in the sketch after this list.

  • Starting with NPD version 1.2.29, the GPU anomaly detection plugin is deployed separately as a DaemonSet named ack-accel-health-monitor.

  • In some cases, a GPU anomaly on a node may prevent GPU containers from starting. This can also block the GPU anomaly detection container from launching, halting the detection process.

  • The NPD GPU detection plugin pod requires high privileges, such as privileged=true, to inspect GPU devices and components. It requires the following permissions.

    Cluster RBAC permissions:

    • Node: get

    • Node/Status: update

    • Events: create

    Container permissions:

    • privileged: true

    • Read-only mount of the host /dev/kmsg

    • Read-only mount of the host /usr/lib

    • Read-only mount of the host /etc

    • Read-only mount of the host /usr/lib64

    • Read-only mount of the host /proc
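
You can verify detection and fencing directly on a GPU node. The following is a minimal sketch; the log file depends on the operating system, and the quarantined file path is the one used by the device plugin in this topic.

    # Check whether the driver has logged XID or SXID errors (use the file that exists on your OS).
    grep -i xid /var/log/messages /var/log/syslog 2>/dev/null | tail -n 20

    # Check whether NPD has written the NVIDIA GPU quarantined file for the device plugin.
    cat /etc/nvidia-device-plugin/unhealthyDevices.json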

Check items and repair suggestions

After identifying a GPU anomaly, refer to NVIDIA XID Errors for repair guidance. You can also review O&M events for the node instance in the console of the corresponding cloud product—such as ECS or Lingjun—based on the instance type. Alternatively, use a self-diagnosis tool to identify hardware anomalies on the node.

A Repair suggestion of None means no hardware intervention is required. Review your application configuration instead.
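
Several check items below suggest restarting the node. A typical sequence, sketched here as general Kubernetes practice rather than an ACK-specific procedure, is to cordon and drain the node so that workloads are rescheduled before the restart:

    # Stop scheduling new pods to the node, then evict the existing pods.
    kubectl cordon <node-name>
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

    # Restart the node (for example, in the ECS console), then allow scheduling again.
    kubectl uncordon <node-name>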

Check item name

Is a Node Condition generated?

Is an Event generated?

Description

Are GPU cards fenced by default?

Repair suggestion

NvidiaXID13Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID13Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 13 error has occurred.

  • Graphics Engine Exception.

  • This error is usually caused by an array-index out of bounds or an instruction error. Hardware failure is rare.

No

None

NvidiaXID31Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID31Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 31 error has occurred.

  • GPU memory page fault.

  • This error is typically caused by illegal address access from an application. Driver or hardware issues are rare.

No

None

NvidiaXID43Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID43Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 43 error has occurred.

  • GPU stopped processing.

  • This event occurs when your application encounters a software-induced anomaly and must terminate. The GPU remains healthy.

  • In most cases, this does not indicate a driver problem but rather an application-level error.

No

None

NvidiaXID44Error

Yes

  • Type: NvidiaXID44Error

  • Reason: NodeHasNvidiaXID44Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 44 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID44Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 44 error has occurred.

  • Graphics Engine fault during context switch.

  • A graphics engine fault occurred during a context switch.

Yes (NPD <= 1.2.28)
No (NPD >= 1.2.30)

Restart the node.

NvidiaXID45Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID45Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 45 error has occurred.

  • Preemptive cleanup due to previous errors. This is most likely to occur when multiple CUDA applications are running and a double bit error (DBE) is hit.

  • This event occurs when your application is aborted and the kernel driver terminates the GPU application running on the GPU.

  • Actions that can abort an application and trigger this event include Control-C, a GPU reset, and sigkill.

  • In many cases, this does not indicate an error but results from an action performed by you or the system.

No

None

NvidiaXID48Error

Yes

  • Type: NvidiaXID48Error

  • Reason: NodeHasNvidiaXID48Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 48 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID48Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 48 error has occurred.

  • Double Bit ECC Error (DBE).

  • This event occurs when the GPU detects an uncorrectable error. The error is also reported to the application. Reset the GPU or restart the node to clear it.

Yes

Restart the node.

NvidiaXID61Error

Yes

  • Type: NvidiaXID61Error

  • Reason: NodeHasNvidiaXID61Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 61 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID61Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 61 error has occurred.

  • Internal micro-controller breakpoint/warning (newer drivers).

  • Internal micro-controller breakpoint/warning (newer drivers).

Yes (NPD <= 1.2.28)
No (NPD >= 1.2.30)

Restart the node.

NvidiaXID62Error

Yes

  • Type: NvidiaXID62Error

  • Reason: NodeHasNvidiaXID62Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 62 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID62Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 62 error has occurred.

  • Internal micro-controller halt (newer drivers).

  • Internal micro-controller halt (newer drivers).

Yes

Restart the node.

NvidiaXID63Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID63Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 63 error has occurred.

  • ECC page retirement or row remapping recording event.

  • When an application encounters a GPU memory hardware error, the NVIDIA self-correction mechanism retires or remaps the faulty memory region. The retirement or remapping information must be recorded in the infoROM for the change to persist.

  • Volta architecture: The ECC page retirement event is successfully recorded to the infoROM.

  • Ampere architecture: The row remapping event is successfully recorded to the infoROM.

No

None

NvidiaXID64Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID64Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 64 error has occurred.

  • ECC page retirement or row remapper recording failure.

  • The trigger scenario is similar to XID 63. However, XID 63 indicates successful recording to the infoROM, whereas XID 64 indicates a recording failure.

No

None

NvidiaXID69Error

Yes

  • Type: NvidiaXID69Error

  • Reason: NodeHasNvidiaXID69Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 69 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID69Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 69 error has occurred.

  • Graphics Engine class error.

  • Graphics engine class error.

Yes (NPD <= 1.2.28)
No (NPD >= 1.2.30)

Restart the node.

NvidiaXID74Error

Yes

  • Type: NvidiaXID74Error

  • Reason: NodeHasNvidiaXID74Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 74 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID74Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 74 error has occurred.

  • Fatal NVLink Error.

  • An XID generated by an NVLink hardware error.

Yes

Hardware repair.

NvidiaXID79Error

Yes

  • Type: NvidiaXID79Error

  • Reason: NodeHasNvidiaXID79Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 79 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID79Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 79 error has occurred.

  • GPU has fallen off the bus.

  • The GPU hardware has fallen off the bus and is no longer detectable.

Yes

Hardware repair.

NvidiaXID94Error

No

Yes

  • Type: Warning

  • Reason: NvidiaXID94Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 94 error has occurred.

  • Contained ECC error.

  • When an application encounters an uncorrectable GPU memory ECC error, the NVIDIA error suppression (contained) mechanism attempts to isolate the error within that application. This prevents the error from affecting all applications on the GPU. A successful containment triggers an XID 94 event, which affects only the application that encountered the error.

No

None

NvidiaXID95Error

Yes

  • Type: NvidiaXID95Error

  • Reason: NodeHasNvidiaXID95Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 95 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID95Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 95 error has occurred.

  • Uncontained ECC error.

  • XID 95 indicates containment failure. All applications running on the GPU are affected. Reset the GPU before restarting the applications.

Yes

Restart the node.

NvidiaXID109Error

Yes

  • Type: NvidiaXID109Error

  • Reason: NodeHasNvidiaXID109Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 109 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID109Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 109 error has occurred.

  • Context Switch Timeout Error.

  • Context switch timeout error.

Yes (NPD <= 1.2.28)
No (NPD >= 1.2.30)

None

NvidiaXID119Error

Yes

  • Type: NvidiaXID119Error

  • Reason: NodeHasNvidiaXID119Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 119 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID119Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 119 error has occurred.

  • GSP RPC Timeout.

  • A timeout occurred while waiting for the GSP core to respond to an RPC message.

Yes

Restart the node.

NvidiaXID120Error

Yes

  • Type: NvidiaXID120Error

  • Reason: NodeHasNvidiaXID120Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 120 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID120Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 120 error has occurred.

  • GSP Error.

  • An error occurred in the code running on the GPU's GSP core.

Yes

Restart the node.

NvidiaXID140Error

Yes

  • Type: NvidiaXID140Error

  • Reason: NodeHasNvidiaXID140Error

  • Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 140 error has occurred.

Yes

  • Type: Warning

  • Reason: NvidiaXID140Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 140 error has occurred.

  • Unrecovered ECC Error.

  • This event occurs when the GPU driver detects uncorrectable errors in GPU memory that affect its ability to mark pages for dynamic page offlining or row remapping. Reset the GPU.

Yes

Restart the node.

NvidiaXID[code]Error

No

Yes (at most three events are generated)

  • Type: Warning

  • Reason: NvidiaXID[code]Error

  • Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid [code] error has occurred.

Other XIDs not listed in this table.

No

Submit a ticket.

NvidiaSXID[code]Error

No

Yes (at most three events are generated)

  • Type: Warning

  • Reason: NvidiaSXID[code]Error

  • Message: TS=xxx;NVSwitchIds=xxx;MSG=An nvidia sxid [code] error has occurred.

  • SXID errors fall into three categories:

    • Correctable: The error has been corrected. System behavior is unaffected. No additional recovery is needed.

    • Fatal: The error is fatal to the device. System behavior is affected. Recovery requires resetting the device or restarting the system.

    • Non-fatal: The error is not fatal to the device. System behavior is affected. Resetting the device or restarting the system may not be necessary.

No

None

NvidiaEccModeNotEnabled

Yes

  • Type: NvidiaEccModeNotEnabled

  • Reason: EccModeNotEnabled

  • Message: GpuIds=xxx;EccModeCurrent=xxx;EccModePending=xxx;MSG=The ECC mode of the GPU is not enabled.

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaEccModeNotEnabled

  • Message: GpuIds=xxx;EccModeCurrent=xxx;EccModePending=xxx;MSG=The ECC mode of the GPU is not enabled.

ECC Mode is not enabled on the node.

No

Enable ECC Mode and restart the node.

NvidiaPendingRetiredPages

Yes

  • Type: NvidiaPendingRetiredPages

  • Reason: NodeHasNvidiaPendingRetiredPages

  • Message: GpuIds=xxx;VolatileTotalUncorrected=xxx;AggregateTotalUncorrected=xxx;MSG=There are retired pages in a pending state on the GPU.

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaPendingRetiredPages

  • Message: GpuIds=xxx;VolatileTotalUncorrected=xxx;AggregateTotalUncorrected=xxx;MSG=There are retired pages in a pending state on the GPU.

  • The GPU has retired pages in a pending state.

  • Reset the GPU for these retired pages to take effect.

Yes

Restart the node.

NvidiaRemappingRowsFailed

Yes

  • Type: NvidiaRemappedRowsFailed

  • Reason: GPUMemoryRemappingRowsFailed

  • Message: GpuIds=xxx;RemappedDueToUncorrectableErrors=xxx;MSG=The GPU has encountered an error with row mapping.

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaRemappedRowsFailed

  • Message: GpuIds=xxx;RemappedDueToUncorrectableErrors=xxx;MSG=The GPU has encountered an error with row mapping.

The GPU has failed row remapping.

Yes

Hardware repair.

NvidiaRemappingRowsRequireReset

Yes

  • Type: NvidiaRemappingRowsRequireReset

  • Reason: UncontainedEccError

  • Message: GpuIds=xxx;MSG=Remapping rows requires GPU reset.

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaRemappingRowsRequireReset

  • Message: GpuIds=xxx;MSG=Remapping rows requires GPU reset.

The GPU encountered an uncorrectable, uncontained error that requires a GPU reset for recovery. Reset the GPU as soon as possible to restore operation.

Yes (NPD <= 1.2.28)
No (NPD >= 1.2.30)

Restart the node.

NvidiaDeviceLost

Yes

  • Type: NvidiaDeviceLost

  • Reason: NodeHasNvidiaDeviceLost

  • Message: GpuIds=xxx;MSG=The GPU has fallen off the bus or has otherwise become inaccessible

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaDeviceLost

  • Message: GpuIds=xxx;MSG=The GPU has fallen off the bus or has otherwise become inaccessible.

  • The GPU has fallen off the bus or has otherwise become inaccessible.

  • The GPU has fallen off the bus or has otherwise become inaccessible.

Yes

Hardware repair.

NvidiaInfoRomCorrupted

Yes

  • Type: NvidiaInfoRomCorrupted

  • Reason: NodeHasNvidiaInfoRomCorrupted

  • Message: GpuIds=xxx;MSG=GPU infoROM is corrupted

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaInfoRomCorrupted

  • Message: GpuIds=xxx;MSG=GPU infoROM is corrupted.

  • infoROM is corrupted.

  • The infoROM is corrupted.

Yes

Hardware repair.

NvidiaPowerCableErr

Yes

  • Type: NvidiaPowerCableErr

  • Reason: NodeHasNvidiaPowerCableErr

  • Message: GpuIds=xxx;MSG=A device's external power cables are not properly attached

Yes (generates events continuously until the issue is fixed)

  • Type: Warning

  • Reason: NvidiaPowerCableErr

  • Message: GpuIds=xxx;MSG=A device's external power cables are not properly attached.

  • A device's external power cables are not properly attached.

  • A device's external power cables are not properly attached.

Yes

Hardware repair.

NvidiaPersistencedOffline

Yes

  • Type: NvidiaPersistencedOffline

  • Reason: NodeHasNvidiaPersistencedOffline

  • Message: TS=xxx;GpuIds=xxx;Nvidia Persistenced service is not running.

Yes

  • Type: Warning

  • Reason: NvidiaPersistencedOffline

  • Message: TS=xxx;GpuIds=xxx;Nvidia Persistenced service is not running.

The Nvidia Persistenced service is not running.

No

Restart the nvidia-persistenced service.

NvidiaFabricManagerOffline

Yes

  • Type: NvidiaFabricManagerOffline

  • Reason: NodeHasNvidiaFabricManagerOffline

  • Message: TS=xxx;GpuIds=xxx;Nvidia Fabric Manager service is not running.

Yes

  • Type: Warning

  • Reason: NvidiaFabricManagerOffline

  • Message: TS=xxx;GpuIds=xxx;Nvidia Fabric Manager service is not running.

The Nvidia Fabric Manager service is not running.

No

Restart the Fabric Manager service.

NvidiaTemperatureHigh

Yes

  • Type: NvidiaTemperatureHigh

  • Reason: NodeHasNvidiaTemperatureHigh

  • Message: TS=xxx;GpuIds=xxx;Nvidia gpu temperature exceeds threshold

Yes

  • Type: Warning

  • Reason: NvidiaTemperatureHigh

  • Message: TS=xxx;GpuIds=xxx;Nvidia gpu temperature exceeds threshold

The GPU temperature exceeds 100 degrees Celsius.

No

None

Other related events

In exclusive GPU scenarios, NPD automatically fences GPU cards based on the anomaly check items. After fencing, new GPU application pods are not assigned to the affected card. To verify the fencing effect, check the number of nvidia.com/gpu resources reported on the Kubernetes Node. After the GPU card recovers, ACK automatically lifts the fence.
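
A quick way to confirm the fencing effect is to check the GPU count that the node reports as allocatable; fenced cards are excluded from this number. A minimal sketch, with the node name as a placeholder:

    # Number of nvidia.com/gpu resources the node currently reports.
    kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'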

Cause

Event content

Description

GPU fencing

Yes

  • Type: Warning

  • Reason: NvidiaDeviceIsolated

  • Message: GpuIds=xxx;MSG=nvidia device has been isolated due to detected issues.

The GPU card is fenced due to detected anomalies.

GPU card fencing deactivation

Yes

  • Type: Normal

  • Reason: NvidiaDeviceRecovered

  • Message: GpuIds=xxx;MSG=nvidia device has recovered from the fault.

The GPU card has recovered from the anomaly, and the fencing is deactivated.
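
To find these fencing and recovery events in the cluster, you can filter Events by the reasons listed above. A minimal sketch:

    # List GPU fencing and recovery events across all namespaces.
    kubectl get events -A --field-selector reason=NvidiaDeviceIsolated
    kubectl get events -A --field-selector reason=NvidiaDeviceRecovered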

FAQ

How do I disable the automatic GPU card fencing feature of NPD?

Background

When a GPU on a node becomes abnormal, ACK automatically fences it through NPD to prevent tasks from being scheduled to it. However, automatic fencing does not perform automatic repair. The node instance continues to incur charges even after its GPU is fenced. You must still manually restart or repair the node. Configure GPU anomaly alerts to enable prompt handling.

  • After fencing, if the remaining GPUs on the node are insufficient for a task—for example, only seven cards remain for an eight-GPU task—the task cannot be scheduled. This may leave GPU resources idle.

  • After the GPU status returns to normal, the fence is automatically lifted.

  • To disable automatic fencing so that an abnormal GPU continues reporting its resources and remains schedulable, follow the solution below.

Solution

Note

Starting with ack-node-problem-detector version 1.2.30, you can control automatic GPU fencing using the generateNvidiaGpuIsolationFile configuration item in Component Management.

  1. Disable the automatic GPU fencing feature of NPD.

    • (Recommended) Method 1: Modify the component configuration in Component Management.

      1. On the Clusters page, click the name of the destination cluster. In the left navigation pane, click Add-ons.

      2. On the Logs and Monitoring tab, locate the ack-node-problem-detector component and perform the appropriate action based on its version.

        • Versions 1.2.24 to 1.2.29: Check for available upgrades. If version 1.2.30 or later is available, click Upgrade.

          Version 1.2.30 is being released in phases (canary release). If you do not see version 1.2.30 or later, submit a ticket to request access.
        • Versions 1.2.30 and later: Click Configuration.

      3. On the component upgrade or configuration page, set generateNvidiaGpuIsolationFile (Generate NVIDIA GPU quarantined file) to false, and then click OK.

        Note

        If you previously used Method 2 to temporarily disable automatic GPU fencing, this setting is retained during NPD upgrades. To re-enable automatic GPU card fencing, set generateNvidiaGpuIsolationFile to true.

    • Method 2: Manually modify the configuration using YAML.

      Note

      The following method is a temporary workaround. The configuration is lost if you upgrade NPD to a version earlier than 1.2.30. You must reconfigure it after the upgrade. We recommend upgrading to version 1.2.30 or later to make this configuration persistent.

      1. Edit the NPD component YAML.

        kubectl edit ds -n kube-system ack-node-problem-detector-daemonset
      2. Set the EnabledIsolateGPU configuration to false.

        Before:

         --EnabledIsolateGPU=true

        After:

        --EnabledIsolateGPU=false
  2. Deactivate existing automatic GPU card fencing.

    To deactivate existing fencing on a GPU card, log on to the node where the XID error occurred and delete the /etc/nvidia-device-plugin/unhealthyDevices.json file, as shown in the sketch below. To prevent the card from being fenced again, disable the automatic fencing feature as described in the previous step.
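
    A minimal sketch of this step, run on the affected node (the kubectl check can be run from any machine with cluster access):

        # Remove the quarantined file so that the device plugin reports the card again.
        sudo rm -f /etc/nvidia-device-plugin/unhealthyDevices.json

        # Confirm that the node reports the expected number of GPUs.
        kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'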