By System O&M SIG
In day-to-day business operation, the systems that support the business are exposed to various kinds of interference that cause jitter and, in turn, degrade user experience. One source of this interference is disabled interrupts. When interrupts stay disabled for too long, business processes may not be scheduled in time, which delays the sending and receiving of data. This problem has existed in the Linux kernel for a long time, so the kernel and other projects in the industry have introduced a number of methods for detecting long interrupt-disabled periods. Here are several typical methods:
The kernel adds a TRACE_IRQ_OFF trace point to every path that enables or disables interrupts, so that all interrupt enable and disable operations in the system can be monitored.
Advantage: Detection is accurate and comprehensive, and it can report both how long interrupts were disabled and the stack at which they were disabled.
Disadvantage: This method depends on the CONFIG_IRQSOFF_TRACER kernel configuration option, which many distributions do not enable. It also places a large number of probes on hot paths, which has a significant impact on performance.
The kernel registers a PMU hardware event and uses it to periodically check whether the watchdog hrtimer interrupt, which the kernel also registers, has updated a timestamp on time.
Advantage: Detection is periodic, so the performance impact is small. It is suitable for catching bugs in code that disables interrupts or in interrupt handlers.
Disadvantage: The monitoring granularity is too coarse to provide fine-grained system observation metrics. The method also does not work in some virtualized environments that lack PMU hardware events.
Trace points with timestamps are inserted at the entry and exit of all interrupt handlers.
Advantage: It provides raw data on the duration of every interrupt handler.
Disadvantage: Only interrupt handling time can be observed; other paths that disable interrupts are not monitored. There is no threshold-based trigger, so the data must be analyzed manually after the event.
Many tools for detecting disabled interrupts are also available on GitHub in the form of .ko kernel modules. These tools use an hrtimer and check whether it expires later than expected.
Advantage: Periodic sampling has low overhead and provides millisecond-level monitoring granularity.
Disadvantage: These are third-party modules, so their stability cannot be guaranteed. Because the hrtimer is an ordinary interrupt, it can fire only after interrupts are re-enabled, so accuracy cannot be guaranteed. In addition, by the time the delayed handler runs, information about the context in which interrupts were disabled has been lost.
Each of the above detection solutions has its own strengths, as well as weaknesses and scenarios it handles poorly. Let's see how SysAK builds a safe, reliable, and flexible tool for detecting disabled interrupts with low performance overhead.
The tools above differ in technology, implementation, and detection logic, so each has its own advantages and disadvantages in different scenarios. In a production environment, the stability of a monitoring tool and its overhead are the most important factors. Placing probes on kernel hot paths and shipping a kernel module are therefore not ideal choices. The former may have a severe impact on performance in some scenarios. The latter demands very safe kernel module code, because a small mistake can easily bring the system down, which is too risky for large-scale deployment in production.
Are there technical means to solve these two problems? Of course. For the safety problem of kernel modules, extended Berkeley Packet Filter (eBPF) is widely regarded as the best solution available on Linux. It is safe and comes with many easy-to-use libraries that improve the programming experience. The performance problem of hot-path probes can be solved by periodic sampling and checking, because the target scenario is one in which interrupts stay disabled for a very long time. Perf event sampling is the preferred mechanism.
Based on the preceding analysis of the background and application scenarios, the technology selection is largely complete, and the general principle of the detection tool takes shape:
If the system supports perf hardware (HW) sampling events, eBPF is first used to start a kernel timer. The timer generates periodic interrupts, and its handler updates a flag each time it runs. A perf HW event then periodically checks whether the flag was updated on time, which shows whether the clock interrupt arrived on time or was delayed.
If the system does not support perf HW events, eBPF is still used to start a kernel timer. In this case, the timer interrupt handler itself compares the actual expiration time with the expected one to determine whether the interrupt was delayed.
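The two checks described above can be modeled in a few lines of Python. This is a simplified sketch of the logic only, not SysAK's eBPF code: the names CpuState, on_timer_tick, hw_event_check, and sw_timer_check are hypothetical, and timestamps are passed in explicitly so that the example runs without a kernel.

```python
from dataclasses import dataclass
from typing import Optional

NSEC_PER_MSEC = 1_000_000


@dataclass
class CpuState:
    """Per-CPU state (hypothetical). last_tick_ns is the value the timer handler updates."""
    last_tick_ns: int = 0


def on_timer_tick(state: CpuState, now_ns: int) -> None:
    """Timer interrupt handler: record that the timer fired at now_ns."""
    state.last_tick_ns = now_ns


def hw_event_check(state: CpuState, now_ns: int,
                   period_ns: int, threshold_ns: int) -> Optional[int]:
    """Mode 1, run from the perf HW event.

    Returns how long the next timer tick is overdue (in ns) when that
    exceeds the threshold, otherwise None.
    """
    overdue_ns = now_ns - state.last_tick_ns - period_ns
    return overdue_ns if overdue_ns > threshold_ns else None


def sw_timer_check(expected_ns: int, actual_ns: int,
                   threshold_ns: int) -> Optional[int]:
    """Mode 2, run from the timer handler itself.

    Returns how late the timer fired (in ns) when that exceeds the
    threshold, otherwise None.
    """
    late_ns = actual_ns - expected_ns
    return late_ns if late_ns > threshold_ns else None
```

In mode 2 the check runs inside the timer handler, so a delay can be reported only once the handler finally runs, that is, after interrupts have been re-enabled.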
To implement this design, the following problems need to be solved:
1) How does eBPF install a timer?
2) After a perf event fires, how is the checking logic hooked into the event's callback?
3) How does the tool choose between the detection mechanisms depending on whether the system supports perf HW events?
The design analysis above requires eBPF to start a kernel timer that samples interrupts. However, eBPF does not support creating timers in kernel versions earlier than Linux 5.15. Fortunately, the perf SW event PERF_COUNT_SW_CPU_CLOCK is implemented in the Linux kernel on top of high-resolution timers. By taking advantage of this and combining it with eBPF, we get the eBPF timer we need.
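As a rough illustration, the event settings for such a timer might look like the sketch below. The constant values come from the kernel header linux/perf_event.h. The TimerEventAttr type and make_timer_attr helper are hypothetical and cover only the fields relevant here, and the interval is an example value, not the one SysAK uses. The cpu-clock event counts nanoseconds, so the sample period is simply the timer interval in nanoseconds.

```python
from dataclasses import dataclass

# Values from the kernel uapi header linux/perf_event.h.
PERF_TYPE_SOFTWARE = 1
PERF_COUNT_SW_CPU_CLOCK = 0

NSEC_PER_MSEC = 1_000_000


@dataclass(frozen=True)
class TimerEventAttr:
    """Subset of perf_event_attr fields that define the timer (hypothetical type)."""
    type: int
    config: int
    sample_period: int  # cpu-clock counts nanoseconds, so this is the interval in ns


def make_timer_attr(interval_ms: int) -> TimerEventAttr:
    """Describe a cpu-clock sampling event that overflows every interval_ms."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return TimerEventAttr(
        type=PERF_TYPE_SOFTWARE,
        config=PERF_COUNT_SW_CPU_CLOCK,
        sample_period=interval_ms * NSEC_PER_MSEC,
    )
```

In a real tool these values would be written into a struct perf_event_attr and passed to perf_event_open, typically once per CPU.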
Perf events provide the perf_event_open system call and an ioctl interface for creating perf HW and perf SW sampling events from user mode, such as the PERF_COUNT_SW_CPU_CLOCK event mentioned above and the hardware event PERF_COUNT_HW_CPU_CYCLES (see man perf_event_open for details). However, a traditional user-mode perf program cannot run custom logic in the kernel when these events fire. To execute a callback that performs the interrupt check, the only option was to call perf_event_create_kernel_counter from kernel code or a kernel module and register the callback as the overflow_handler of the perf event.
The emergence of eBPF changed this. Perf events provide a dedicated ioctl command through which eBPF can do its work in the kernel. Note that eBPF code still executes in the kernel, but it is loaded and controlled from user mode, so a user-mode program can easily hook its own callback logic to a perf event through ioctl. At the code level, it uses:
ioctl(PERF_EVENT_IOC_SET_BPF)
to register the eBPF program as the overflow_handler callback of the perf event.
The return value of the perf_event_open system call is used to determine whether the system supports perf HW events. Two eBPF programs are defined: if perf HW events are supported, the program for HW-event-based detection is attached; otherwise, the program for SW-event-based detection is attached.
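The selection step can be sketched as follows. This is an illustrative model rather than SysAK's code: choose_detector and the open_hw_event callable are hypothetical, and the set of errno values treated as "HW events not supported" is an assumption based on the perf_event_open man page. Any other error, such as a permission problem, is passed on to the caller instead of silently falling back.

```python
import errno
from typing import Callable, Tuple

# Assumption: errno values that indicate an unsupported HW event.
UNSUPPORTED_ERRNOS = {errno.ENOENT, errno.EOPNOTSUPP, errno.ENODEV}


def choose_detector(open_hw_event: Callable[[], int]) -> Tuple[str, int]:
    """Try to open the perf HW event and pick the detection mode.

    open_hw_event stands in for a wrapper around perf_event_open that
    returns a file descriptor or raises OSError.
    Returns ("hw", fd) on success, or ("sw", -1) when HW events are
    not supported on this system.
    """
    try:
        fd = open_hw_event()
    except OSError as exc:
        if exc.errno in UNSUPPORTED_ERRNOS:
            return "sw", -1
        raise
    return "hw", fd
```

The caller would then attach the HW-event program or the SW-event program according to the returned mode.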
This section provides a detailed breakdown of the tool implementation. The following figure shows a simple schematic diagram of tool operation:
The whole process can be summarized as follows:
After SysAK is installed, run the following command:
sysak irqoff [--help] [-t THRESH(ms)] [-f LOGFILE] [duration(s)]
-t: the threshold for the interrupt-disabled time, in ms
-f: the file in which irqoff results are recorded
duration: how long the tool runs, in seconds. If it is not specified, the tool runs indefinitely.
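For scripted use, the command line can be assembled from the options listed above. The helper below is a hypothetical convenience and not part of SysAK. It only builds the argument list, and the values shown in the usage note are examples.

```python
from typing import List, Optional


def irqoff_argv(thresh_ms: Optional[int] = None,
                logfile: Optional[str] = None,
                duration_s: Optional[int] = None) -> List[str]:
    """Build the argument list for `sysak irqoff` from optional settings."""
    argv = ["sysak", "irqoff"]
    if thresh_ms is not None:
        argv += ["-t", str(thresh_ms)]   # threshold in ms
    if logfile is not None:
        argv += ["-f", logfile]          # result log file
    if duration_s is not None:
        argv.append(str(duration_s))     # run time in seconds
    return argv
```

For example, irqoff_argv(10, "/tmp/irqoff.log", 60) describes a 60-second run with a 10 ms threshold that logs to /tmp/irqoff.log. The resulting list can be passed to subprocess.run on a machine where SysAK is installed.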
To construct a scenario in which interrupts stay disabled for a long time, a kernel module is used to create a worker. The following shows the results captured by irqoff.
TIME(irqoff) CPU COMM TID LAT(us)
2022-05-05_11:45:19 3 kworker/3:0 379531 1000539
<0xffffffffc04e2072> owner_func
<0xffffffff890b1c5b> process_one_work
<0xffffffff890b1eb9> worker_thread
<0xffffffff890b7818> kthread
<0xffffffff89a001ff> ret_from_fork
The results consist of several parts:
The first line is the log header. It has five columns. From left to right, they are the timestamp (with the module name in parentheses), the CPU on which interrupts were disabled for a long time, the command name of the thread running at that time, the ID of that thread, and the total time for which interrupts were disabled, in microseconds.
The second line contains the values that correspond to the log header.
The third and subsequent lines are the stack captured at the site where interrupts were disabled, which makes it easy to analyze the source code further.
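For post-processing, a log in this layout can be turned into structured records. The parser below is a minimal sketch that assumes the header and each record are on their own lines, as in the sample above. IrqoffRecord and parse_irqoff are hypothetical names, not part of SysAK. Because a command name may contain spaces, the record line is split from both ends.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# A stack frame looks like "<0xffffffff890b1c5b> process_one_work".
FRAME_RE = re.compile(r"<(0x[0-9a-fA-F]+)>\s+(\S+)")

SAMPLE = """\
TIME(irqoff) CPU COMM TID LAT(us)
2022-05-05_11:45:19 3 kworker/3:0 379531 1000539
<0xffffffffc04e2072> owner_func
<0xffffffff890b1c5b> process_one_work
<0xffffffff890b1eb9> worker_thread
<0xffffffff890b7818> kthread
<0xffffffff89a001ff> ret_from_fork
"""


@dataclass
class IrqoffRecord:
    time: str
    cpu: int
    comm: str
    tid: int
    lat_us: int
    stack: List[Tuple[int, str]] = field(default_factory=list)


def parse_irqoff(text: str) -> List[IrqoffRecord]:
    """Parse irqoff output into records, each with its stack frames."""
    records: List[IrqoffRecord] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("TIME("):
            continue  # skip blank lines and the header
        frame = FRAME_RE.match(line)
        if frame:
            if records:  # attach the frame to the most recent record
                records[-1].stack.append((int(frame.group(1), 16), frame.group(2)))
            continue
        # Record line: TIME CPU COMM TID LAT(us); COMM may contain spaces.
        time, cpu, rest = line.split(None, 2)
        comm, tid, lat = rest.rsplit(None, 2)
        records.append(IrqoffRecord(time, int(cpu), comm, int(tid), int(lat)))
    return records
```

Calling parse_irqoff(SAMPLE) yields one record for CPU 3 with five stack frames, which can then be filtered by lat_us or grouped by the top frame.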