Elastic Compute Service: What do I do if an Oops exception occurs when I hot-unplug a virtio device from an ECS instance that runs a recent kernel version?

Last Updated:Jun 18, 2024

This topic describes how to resolve the Oops exceptions that occur when you hot-unplug virtio devices from Elastic Compute Service (ECS) instances that run specific recent kernel versions.

Problem description

When you hot-unplug virtio devices such as disks and network interface controllers (NICs) from ECS instances that run specific recent kernel versions, an Oops exception occurs. The result depends on the value of the kernel.panic_on_oops parameter:

  • Kernel panics occur on ECS instances whose kernel.panic_on_oops parameter is set to 1.

  • The kernel on ECS instances whose kernel.panic_on_oops parameter is set to 0 becomes unresponsive.

Note

kernel.panic_on_oops is a Linux kernel parameter that controls how the kernel behaves when it encounters an Oops exception. An Oops is an error that the kernel reports when a fault occurs in kernel code, such as an access to an invalid memory address.

  • If a kernel panic occurs, the system immediately stops all ongoing tasks, saves specific debugging information, and then restarts or shuts down to resolve issues and reduce potential impacts.

  • If the kernel.panic_on_oops parameter is set to 0, the kernel attempts to continue operation after the Oops exception but may become unresponsive. To prevent data loss or other negative impacts, we recommend that you do not set the kernel.panic_on_oops parameter to 0 in production environments.
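
To see which of the two behaviors applies to an instance, you can read the current value of the parameter. The following C sketch is for illustration only. The function names parse_panic_on_oops and read_panic_on_oops are hypothetical, and the sketch assumes that the parameter is exposed at /proc/sys/kernel/panic_on_oops, as it is on standard Linux systems.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text read from /proc/sys/kernel/panic_on_oops.
 * Returns 0 or 1 on success, or -1 if the text is not one of those values. */
int parse_panic_on_oops(const char *text)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);

    if (end == text)              /* no digits were consumed */
        return -1;
    if (value != 0 && value != 1) /* only 0 and 1 are expected here */
        return -1;
    return (int)value;
}

/* Read the setting from the given path.
 * Returns 0 or 1, or -1 if the file cannot be opened, read, or parsed. */
int read_panic_on_oops(const char *path)
{
    char buf[16];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_panic_on_oops(buf);
}
```

For example, read_panic_on_oops("/proc/sys/kernel/panic_on_oops") returns 1 on an instance that is configured to panic when an Oops exception occurs.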

Cause

The Linux upstream community added support for the admin virtqueue of virtio devices. For more information, see Commit.

In the commit:

  • The is_avq function pointer is added to the virtio_pci_device definition to determine whether the admin virtqueue exists.

  • The is_avq function pointer is assigned a value in the virtio_pci_modern_probe function, which initializes modern virtio devices.

  • When you hot-unplug a virtio device, the code checks whether the current queue is the admin virtqueue.


However, the commit overlooked legacy virtio devices. The is_avq function pointer is assigned a value only when a modern virtio device is initialized, so for a legacy virtio device, the is_avq function pointer in the virtio_pci_device structure remains a null pointer. When you hot-unplug such a device, the code invokes if (vp_dev->is_avq(vdev, vq->index)) without checking the pointer, and the instruction pointer register (RIP) of the CPU points to the null address. The CPU then attempts to execute code at an invalid memory address, which raises a null pointer exception and triggers the Oops exception.
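
The failure mode and the guard against it can be modeled in user space. The following C sketch is a simplified illustration, not the kernel source: the struct vp_dev_model type and all function names in it are hypothetical stand-ins for virtio_pci_device and the probe functions. The queue_is_admin function shows the kind of null check that prevents the call through an unassigned function pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's virtio_pci_device. */
struct vp_dev_model {
    /* Assigned only by the modern probe path; stays NULL for legacy devices. */
    bool (*is_avq)(const struct vp_dev_model *dev, unsigned int index);
    unsigned int admin_vq_index;
};

/* For modern devices: reports whether the queue index is the admin virtqueue. */
static bool modern_is_avq(const struct vp_dev_model *dev, unsigned int index)
{
    return index == dev->admin_vq_index;
}

/* Models the modern probe path, which assigns the is_avq function pointer. */
void model_modern_probe(struct vp_dev_model *dev, unsigned int admin_vq_index)
{
    dev->is_avq = modern_is_avq;
    dev->admin_vq_index = admin_vq_index;
}

/* Models the legacy probe path, which never assigns is_avq. */
void model_legacy_probe(struct vp_dev_model *dev)
{
    dev->is_avq = NULL;
    dev->admin_vq_index = 0;
}

/* Guarded check: calls is_avq only if it was assigned.
 * Calling dev->is_avq(dev, index) directly on a legacy device would jump
 * to address 0, which is the crash described above. */
bool queue_is_admin(const struct vp_dev_model *dev, unsigned int index)
{
    return dev->is_avq != NULL && dev->is_avq(dev, index);
}
```

In this model, a device set up by model_legacy_probe is never reported as having an admin virtqueue, and the queue removal path can continue safely instead of dereferencing a null function pointer.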

Scope of impacts

  • Linux upstream community

    The Linux upstream community has resolved the issue. For more information, see Commit. The commit provides a patch that checks whether the is_avq function pointer is null before the pointer is invoked.

  • Operating systems

  • Virtio devices

    Legacy virtio devices that are used on ECS instances and are hot-unplugged.
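
As a rough aid for telling the two device generations apart, the PCI IDs of a device can be compared against the ranges defined in the virtio specification: vendor ID 0x1af4, device IDs 0x1000 to 0x103f for transitional devices (which expose the legacy interface), and 0x1040 to 0x107f for modern-only devices. These ranges come from the virtio specification, not from this topic, and the names in the following C sketch are hypothetical. A transitional ID does not prove that the guest driver uses the legacy interface, so treat the result only as a hint.

```c
#include <stdint.h>

enum virtio_pci_id_class {
    VIRTIO_ID_NOT_VIRTIO,    /* vendor or device ID is outside the virtio ranges */
    VIRTIO_ID_TRANSITIONAL,  /* 0x1000..0x103f: device exposes the legacy interface */
    VIRTIO_ID_MODERN_ONLY    /* 0x1040..0x107f: device exposes only the modern interface */
};

/* Classify a PCI vendor/device ID pair by the virtio specification ranges. */
enum virtio_pci_id_class classify_virtio_pci_id(uint16_t vendor, uint16_t device)
{
    if (vendor != 0x1af4)
        return VIRTIO_ID_NOT_VIRTIO;
    if (device >= 0x1000 && device <= 0x103f)
        return VIRTIO_ID_TRANSITIONAL;
    if (device >= 0x1040 && device <= 0x107f)
        return VIRTIO_ID_MODERN_ONLY;
    return VIRTIO_ID_NOT_VIRTIO;
}
```

On Linux, the two IDs of a PCI device can be read from the vendor and device files in the /sys/bus/pci/devices/ directory of that device.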

Solutions

  • Solution 1: Change the affected ECS instances to an instance family that uses modern virtio devices. For more information, see Change instance types.

    We recommend that you use the following instance families. Modern virtio devices are not affected by the issue described in this topic when the devices are used on instances of the following instance families:

    ecs.c8i, ecs.g8i, ecs.r8i, ecs.c8ae, ecs.g8ae, ecs.r8ae, ecs.c8a, ecs.g8a, and ecs.r8a. For more information, see Overview of instance families.

  • Solution 2

    1. Upgrade to the latest kernel software packages and verify that they include the "virtio-pci: Check if is_avq is NULL" patch, which checks whether the is_avq function pointer is null before the pointer is invoked.

    2. (Optional) If the latest kernel software packages do not include the preceding patch, apply the patch to the kernel.

Appendix: Terms

The following table describes the terms that are used in this topic.

Term

Description

virtio device

Virtio is a standardized framework that allows virtual machines to efficiently communicate with virtual hardware on hosts. Virtio devices are hardware devices, such as disks and NICs, that are emulated in virtualized environments. Virtio devices are categorized into legacy virtio devices and modern virtio devices. Legacy virtio devices and modern virtio devices use different configuration interfaces.

admin virtqueue

A special virtio queue that is used to manage and operate devices, such as obtaining device status and configuring devices. The admin virtqueue is not supported by all virtio devices.

virtio_pci_device

The data structure that the kernel uses to represent a virtio Peripheral Component Interconnect (PCI) device. This data structure contains pointers to various functions, such as the is_avq function pointer, which is added to determine whether a specific virtio queue is the admin virtqueue.

is_avq

The function that is used to determine whether a specific virtio queue is the admin virtqueue.

virtio_pci_modern_probe

The function that is used to detect and initialize modern virtio PCI devices. After a device is detected, this function is invoked to configure the device, including reading the configuration space, checking device features, and allocating required resources.

RIP

The instruction pointer register in x86-64 CPUs, which stores the address of the next instruction to be executed. If a program calls a function through a null pointer, the RIP points to the null address, and the CPU raises an exception when it attempts to fetch an instruction from that address.