This topic describes how to resolve the Oops exceptions that occur when you hot-unplug virtio devices from Elastic Compute Service (ECS) instances that run specific recent kernel versions.
Problem description
When you hot-unplug virtio devices such as disks and network interface controllers (NICs) from ECS instances that run specific recent kernel versions, the following Oops exceptions occur:
Kernel panics occur on ECS instances whose kernel.panic_on_oops parameter is set to 1.
The kernel on ECS instances whose kernel.panic_on_oops parameter is set to 0 becomes unresponsive.
kernel.panic_on_oops is a Linux kernel parameter that controls how the kernel behaves when it encounters an Oops exception. An Oops is an error that the kernel reports when an exception occurs, such as an access to an invalid memory address.
If a kernel panic occurs, the system immediately stops all ongoing tasks, saves specific debugging information, and then restarts or shuts down to resolve issues and reduce potential impacts.
If the kernel.panic_on_oops parameter is set to 0, the kernel attempts to continue operation after the Oops exception, which can leave the kernel unresponsive. To prevent data loss or other negative impacts, we recommend that you do not set the kernel.panic_on_oops parameter to 0 in production environments.
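To check which of the two behaviors applies to an instance, you can read the current value of the parameter. The following C functions are a minimal sketch that assumes the parameter is exposed at /proc/sys/kernel/panic_on_oops, as it is on standard Linux systems. The function names are hypothetical and are used only for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text read from /proc/sys/kernel/panic_on_oops.
 * Returns 0 or 1 on success, or -1 if the text is not a valid setting. */
int parse_panic_on_oops(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL)
        return -1;
    value = strtol(text, &end, 10);
    if (end == text)                 /* no digits were consumed */
        return -1;
    if (value != 0 && value != 1)    /* only 0 and 1 are meaningful here */
        return -1;
    return (int)value;
}

/* Read the setting from the given path, which is normally
 * /proc/sys/kernel/panic_on_oops. Returns -1 if the file cannot be read. */
int read_panic_on_oops(const char *path)
{
    char buf[16];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof buf, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_panic_on_oops(buf);
}
```

A call such as read_panic_on_oops("/proc/sys/kernel/panic_on_oops") returns 1 if the kernel panics on an Oops exception and 0 if the kernel attempts to continue.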
Cause
The Linux upstream community added support for the admin virtqueue of virtio devices. For more information, see Commit.
In the commit:
The is_avq function pointer is added to the virtio_pci_device definition to determine whether the admin virtqueue exists.
A value is assigned to the is_avq function pointer in the virtio_pci_modern_probe function, which is used to initialize modern virtio devices.
When you hot-unplug a virtio device, the code checks whether the current queue is the admin virtqueue.
The Linux upstream community overlooked the case of legacy virtio devices. The is_avq function pointer is assigned only when a modern virtio device is initialized, so the is_avq function pointer in the virtio_pci_device structure of a legacy virtio device is a null pointer. When you hot-unplug a legacy virtio device, the code invokes if (vp_dev->is_avq(vdev, vq->index)), and the instruction pointer register (RIP) of the CPU points to a null address. A null pointer exception is thrown, which indicates that the system attempted to execute an instruction at an invalid memory address. As a result, the Oops exception described in this topic occurs.
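The following user-space C sketch models the situation described above. It is not kernel source code. The structure and function names (vp_dev_model, model_modern_probe, model_legacy_probe, and so on) are hypothetical stand-ins for virtio_pci_device and the probe paths, and the sketch only assumes that the real structure starts with a null is_avq pointer until the modern probe path assigns a value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the kernel's virtio_pci_device. */
struct vp_dev_model {
    /* Assigned only by the modern probe path; NULL for legacy devices. */
    bool (*is_avq)(const struct vp_dev_model *dev, unsigned int index);
    unsigned int admin_vq_index;
};

/* Reports whether the queue at the given index is the admin virtqueue. */
static bool model_is_avq(const struct vp_dev_model *dev, unsigned int index)
{
    return index == dev->admin_vq_index;
}

/* Models virtio_pci_modern_probe: the function pointer receives a value. */
void model_modern_probe(struct vp_dev_model *dev, unsigned int admin_index)
{
    dev->is_avq = model_is_avq;
    dev->admin_vq_index = admin_index;
}

/* Models the legacy probe path: is_avq is never assigned. The explicit
 * NULL here stands in for a structure that was zeroed at allocation. */
void model_legacy_probe(struct vp_dev_model *dev)
{
    dev->is_avq = NULL;
    dev->admin_vq_index = 0;
}

/* Returns true if an unguarded call through dev->is_avq would jump to
 * address 0, which is the condition that triggers the Oops exception. */
bool unguarded_call_would_fault(const struct vp_dev_model *dev)
{
    return dev->is_avq == NULL;
}
```

In this model, a device that is initialized by model_legacy_probe makes unguarded_call_would_fault return true, which corresponds to the hot-unplug path that calls through the null pointer.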
Scope of impacts
Linux upstream community
The Linux upstream community has already resolved the issue. For more information, see Commit. The commit provides a patch that checks whether the is_avq function pointer is null before the function pointer is invoked.
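The idea behind such a null check can be sketched in user-space C as follows. This is an illustration of the guard pattern only, not the text of the upstream patch. The names vq_owner_model, example_is_avq, and queue_is_admin are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device structure that carries is_avq. */
struct vq_owner_model {
    bool (*is_avq)(unsigned int index);   /* NULL for legacy devices */
};

/* Hypothetical callback for a modern device whose admin queue is queue 3. */
bool example_is_avq(unsigned int index)
{
    return index == 3;
}

/* Guarded check: test the pointer first, then call through it.
 * The && operator short-circuits, so a NULL pointer is never called. */
bool queue_is_admin(const struct vq_owner_model *dev, unsigned int index)
{
    return dev->is_avq != NULL && dev->is_avq(index);
}
```

With this guard, a legacy device whose is_avq pointer is null is treated as having no admin virtqueue, and the call that caused the exception is skipped.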
Operating systems
Ubuntu 24
Operating systems that have kernel versions close to 6.8, provide the admin virtqueue capabilities (virtio-pci: Introduce admin virtqueue), and do not have the virtio-pci: Check if is_avq is NULL patch installed to resolve the is_avq function pointer issue.
Note: You can run the uname -r command to view the kernel version.
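If you want to perform the same check from a program, the following C sketch obtains the release string through the POSIX uname() call and extracts the major and minor version numbers. The function names are hypothetical. The version number alone does not show whether the patch is installed, so treat the result only as a first filter.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Extract the major and minor numbers from a kernel release string,
 * such as the string that uname -r prints.
 * Returns 0 on success or -1 if the string cannot be parsed. */
int parse_kernel_release(const char *release, int *major, int *minor)
{
    if (release == NULL || major == NULL || minor == NULL)
        return -1;
    return sscanf(release, "%d.%d", major, minor) == 2 ? 0 : -1;
}

/* Fill major and minor for the running kernel by using uname(2).
 * Returns 0 on success or -1 on failure. */
int running_kernel_version(int *major, int *minor)
{
    struct utsname info;

    if (uname(&info) != 0)
        return -1;
    return parse_kernel_release(info.release, major, minor);
}
```

For example, a hypothetical release string such as 6.8.0-31-generic is parsed as major version 6 and minor version 8.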
Virtio devices
Legacy virtio devices that are used on ECS instances and are hot-unplugged.
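One hint about whether a virtio PCI device can be driven through the legacy interface is its PCI device ID. The following C sketch assumes the ID ranges defined in the virtio specification: vendor ID 0x1AF4, device IDs 0x1000 to 0x103F for transitional devices, and device IDs 0x1040 to 0x107F for modern-only devices. The enum and function names are hypothetical. A transitional ID does not prove that the legacy interface is in use, because a transitional device can also be driven through the modern interface.

```c
/* Hypothetical classification of a PCI device by vendor and device ID. */
enum virtio_id_class {
    VIRTIO_ID_NOT_VIRTIO,     /* not in the virtio PCI ID range */
    VIRTIO_ID_TRANSITIONAL,   /* 0x1000-0x103F: may use the legacy interface */
    VIRTIO_ID_MODERN          /* 0x1040-0x107F: modern interface only */
};

/* Classify a device by the values found in its PCI configuration space,
 * for example in the vendor and device files under /sys/bus/pci/devices. */
enum virtio_id_class classify_virtio_pci_id(unsigned int vendor,
                                            unsigned int device)
{
    if (vendor != 0x1AF4)
        return VIRTIO_ID_NOT_VIRTIO;
    if (device >= 0x1000 && device <= 0x103F)
        return VIRTIO_ID_TRANSITIONAL;
    if (device >= 0x1040 && device <= 0x107F)
        return VIRTIO_ID_MODERN;
    return VIRTIO_ID_NOT_VIRTIO;
}
```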
Solutions
Solution 1: Change the instance types of the affected ECS instances to instance types in instance families that use modern virtio devices. For more information, see Change instance types.
We recommend that you use the following instance families. Modern virtio devices are not affected by the issue described in this topic when the devices are used on instances of the following instance families:
ecs.c8i, ecs.g8i, ecs.r8i, ecs.c8ae, ecs.g8ae, ecs.r8ae, ecs.c8a, ecs.g8a, and ecs.r8a. For more information, see Overview of instance families.
Solution 2
Upgrade to the latest kernel software packages and verify that the packages include the virtio-pci: Check if is_avq is NULL patch, which checks whether the is_avq function pointer is null.
(Optional) If the preceding patch is not included in the latest kernel software packages, install the patch.
Appendix: Terms
The following table describes the terms that are used in this topic.
Term | Description |
virtio device | Virtio is a standardized framework that allows virtual machines to efficiently communicate with virtual hardware on hosts. Virtio devices are hardware devices, such as disks and NICs, that are emulated in virtualized environments. Virtio devices are categorized into legacy virtio devices and modern virtio devices. Legacy virtio devices and modern virtio devices use different configuration interfaces. |
admin virtqueue | A special virtio queue that is used to manage and operate devices, such as obtaining device status and configuring devices. The admin virtqueue is not supported by all virtio devices. |
virtio_pci_device | The data structure that is used in the kernel to indicate a virtio Peripheral Component Interconnect (PCI) device. This data structure contains pointers to various functions, such as the is_avq function pointer, which is added to determine whether a specific virtio queue is the admin virtqueue. |
is_avq | The function that is used to determine whether a specific virtio queue is the admin virtqueue. |
virtio_pci_modern_probe | The function that is used to detect and initialize modern virtio PCI devices. After a device is detected, this function is invoked to configure the device, including reading the configuration space, checking device features, and allocating required resources. |
RIP | A register in x86 CPUs that stores the address of the next instruction to be executed. If a program encounters an exception, such as when the program attempts to execute an instruction at the address that a null pointer refers to, the RIP points to the address of the instruction that caused the exception. |