Elastic GPU Service: Diagnose a GPU by using the self-service troubleshooting feature in the console

Last Updated: Nov 24, 2025

GPU-accelerated instances may encounter faults or security vulnerabilities, such as GPU malfunctions or driver anomalies. The Elastic Compute Service (ECS) console provides a self-service troubleshooting feature that lets you run health checks on GPU devices. Use it to determine whether the GPU or driver of your GPU-accelerated instance is abnormal, and to identify and resolve potential problems early.

Procedure

Note

Before you perform operations, make sure that your GPU-accelerated instance is in the Running state.

  1. Go to the Self-service Troubleshooting page in the ECS console. At the top of the page, select the region where the GPU-accelerated instance is located.

  2. On the Troubleshooting page, configure the issue type, diagnostic item, instance ID, and troubleshooting cycle. Then, click Start.

    Note

    After you click Start, the system automatically creates a diagnostic task. The system runs only one diagnostic task on an instance at a time. After a diagnostic task is complete, wait at least 5 minutes before you start another diagnostic task on the instance.

    Configure the following items.

    • Issue type: Select Instance Device Check to check whether the instance devices, such as the GPU, run as expected.

    • Diagnostic item: Select GPU Health Check to check the status of the instance devices, such as the status of the GPU and driver.

    • Instance ID: Select the ID of the GPU-accelerated instance that you want to check.

    • Troubleshooting cycle: Specify a time period as needed. By default, the system troubleshoots issues within the most recent 12 hours.

  3. After the instance is diagnosed, view the diagnostic report.

    A diagnostic report includes the following items.

    • Diagnostic result:

        • If all diagnostic items are normal, the system displays "No exceptions are detected on the instance."

        • If abnormal diagnostic items exist, the system displays "*** exceptions are detected on the instance.", where *** is replaced with the actual number of exceptions. The system also provides solutions that you can reference to resolve the exceptions.

    • Diagnostic item details: In the scenario described in this topic, the system displays information only about the GPU device and driver status check. Each result has one of the following severity levels: serious, warning, or passed.

    • Basic diagnostic information: The system displays the basic diagnostic information, including the Resource ID, Report ID, and Start At parameters.

  4. (Optional) On the Troubleshooting page, click View History to view the historical diagnostic details of the instance on the Check History page.

    Note

    On the Instance Health Diagnostics tab of the Check History page, you can click the filter icon to the right of the Status column to filter reports by status.
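
The timing rules in step 2 and the result messages in step 3 can be modeled in a few lines of Python, for example if you track diagnostic runs in your own tooling. The following is a minimal sketch and does not call any Alibaba Cloud API. The function names (`can_start_task`, `default_cycle`, `summarize`) are hypothetical. It assumes that only one task runs on an instance at a time, that the 5-minute wait is measured from task completion, and that both serious and warning items count as exceptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Values taken from the procedure above.
COOLDOWN = timedelta(minutes=5)      # wait after a diagnostic task completes
DEFAULT_CYCLE = timedelta(hours=12)  # default troubleshooting cycle


def can_start_task(now: datetime,
                   last_completed: Optional[datetime],
                   task_running: bool = False) -> bool:
    """Return True if a new diagnostic task may be started on the instance."""
    if task_running:
        # Assumption: a second task cannot start while one is in progress.
        return False
    if last_completed is None:
        # No earlier task, so no waiting period applies.
        return True
    return now - last_completed >= COOLDOWN


def default_cycle(now: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the default 12-hour troubleshooting cycle."""
    return now - DEFAULT_CYCLE, now


def summarize(severities: List[str]) -> str:
    """Build the diagnostic result message from per-item severity levels."""
    # Assumption: every level other than "passed" counts as an exception.
    count = sum(1 for level in severities if level != "passed")
    if count == 0:
        return "No exceptions are detected on the instance."
    return f"{count} exceptions are detected on the instance."
```

For example, `can_start_task(now, last_completed)` returns False until 5 minutes have passed since `last_completed`, and `summarize(["serious", "passed"])` produces the message for one exception.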