By Wang Xiaoguang, a core member of "High Performance Storage Technology SIG".
The value of io_uring has been proven in traditional storage I/O scenarios. However, io_uring supports not only storage I/O but also network I/O, and many developers in the io_uring community have tried to apply it to network applications.
In the io_uring community, the debate on whether io_uring or epoll is better continues. Some developers claim that io_uring delivers better performance than epoll, others claim the two are roughly equal, and some claim io_uring is not even as good as epoll. We will not elaborate on these points in this article, but you are welcome to read more of the discussion at these links:
https://github.com/axboe/liburing/issues/189
https://github.com/frevib/io_uring-echo-server/issues/8
The discussions have been ongoing since August 2020, and they are long and intense. It is indeed difficult to reach a consensus on which one is better in the field of network programming.
At present, many businesses want to apply io_uring in network scenarios but doubt whether it can outperform epoll. To clarify this, the OpenAnolis Community High Performance Storage SIG attempts a quantitative analysis of the performance difference between the io_uring and epoll programming frameworks by measuring where the time goes in each.
The echo server model is again used for performance evaluation, with a single-threaded server. For a fair comparison, io_uring does not use its internal io-wq mechanism (a kernel-side thread pool maintained by io_uring that can execute I/O requests submitted by users). epoll uses send(2) and recv(2) to read and write data, while io_uring uses IORING_OP_SEND and IORING_OP_RECV.
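The benchmark programs themselves are not included in this article. The following is a minimal sketch of the two inner loops being compared, assuming liburing and an already-connected socket; error handling, the io_uring event-notification path, and the multi-shot details are omitted, and epoll_echo_once / uring_echo_once are illustrative names rather than the actual benchmark code.

/* Sketch only: one echo round-trip in each model. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <liburing.h>

#define BUF_SZ 16

/* epoll path: one recv(2) plus one send(2) per ready event, i.e. batch = 1.
 * Each call pays a user/kernel switch (s) plus the kernel-side logic (w). */
static void epoll_echo_once(int epfd, char *buf)
{
    struct epoll_event ev;

    epoll_wait(epfd, &ev, 1, -1);                 /* event notification */
    ssize_t n = recv(ev.data.fd, buf, BUF_SZ, 0); /* s + w */
    if (n > 0)
        send(ev.data.fd, buf, n, 0);              /* s + w */
}

/* io_uring path: IORING_OP_RECV / IORING_OP_SEND submitted through
 * io_uring_enter(2); for a socket that is already readable, the request
 * completes during submission without io-wq offload. */
static void uring_echo_once(struct io_uring *ring, int client_fd, char *buf)
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, client_fd, buf, BUF_SZ, 0);
    io_uring_submit(ring);                        /* io_uring_enter(2): s + t */
    io_uring_wait_cqe(ring, &cqe);
    int n = cqe->res;
    io_uring_cqe_seen(ring, cqe);

    if (n > 0) {
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_send(sqe, client_fd, buf, n, 0);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);
    }
}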
Combined with the echo server model, we find that four factors affect the relative performance of io_uring and epoll:
1. The context-switch overhead of a system call from user state to kernel state, denoted s below.
2. The kernel-side logic overhead of sending and receiving data, denoted w below.
3. The overhead of the io_uring framework itself, which is included in t below.
4. The number of requests that io_uring can submit with a single system call, i.e. the batch size, denoted n below.
In this article, only the overhead of the read and write operations of io_uring and epoll is evaluated; the event notification mechanisms of the two are not measured, because perf analysis shows that the read and write requests account for the majority of the cost. The system call user-to-kernel context-switch overhead (factor 1) can be measured with a dedicated program, while factors 2, 3, and 4 can be obtained by measuring the execution time of the related kernel functions with bpftrace.
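The "dedicated program" is not included in the article. The following is a minimal sketch of one way such a measurement could be done, assuming getppid(2) does almost no kernel-side work, so the average loop cost approximates the pure user/kernel switch overhead s.

/* Rough measurement of system call user<->kernel switch cost (sketch). */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long iters = 10 * 1000 * 1000;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < iters; i++)
        getppid();                /* near no-op syscall */
    clock_gettime(CLOCK_MONOTONIC, &b);

    long ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
    printf("avg syscall cost: %ld ns\n", ns / iters);
    return 0;
}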
From the perspective of the user state, the overhead of send(2) or recv(2) consists of two parts: 1. the context-switch overhead of the system call from user state to kernel state, denoted s; 2. the overhead of the system call's own working logic, denoted w. For the latter, send(2) and recv(2) correspond to the kernel functions __sys_sendto() and __sys_recvfrom() respectively, which are what we measure.
In the epoll scenario, the batch size of a system call is 1: each send or receive is one system call. Therefore, the average cost of a send or receive request in the epoll model is (s + w).
The io_uring_enter(2) system call can both submit SQEs and reap CQEs, and the two operations are mixed in one system call, so it is difficult to accurately separate the time spent submitting requests from the time spent reaping completions. We therefore simply track io_submit_sqes(), which is called by io_uring_enter(2), to measure the overhead of IORING_OP_SEND and IORING_OP_RECV. io_submit_sqes() includes both the kernel-side logic of send(2) and recv(2) and the overhead of the io_uring framework; this total is recorded as t.
At the same time, the multi-shot mode of io_uring is used so that the submitted IORING_OP_SEND and IORING_OP_RECV requests can complete inside io_submit_sqes() without going through io_uring's task-work mechanism.
In the io_uring scenario, system calls can be batched: a single io_uring_enter(2) can submit n requests. Therefore, the average cost of a send or receive request in the io_uring model is (s + t) / n.
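To illustrate the batching term n, here is a minimal sketch of submitting several requests with one io_uring_enter(2), assuming liburing; submit_recv_batch and the fds/bufs arrays are illustrative and not part of the actual benchmark.

/* Batched submission: prepare n SQEs, then one io_uring_submit() issues a
 * single io_uring_enter(2), so the per-request syscall switch cost is s / n. */
#include <liburing.h>

#define BUF_SZ 16

static void submit_recv_batch(struct io_uring *ring, int *fds, char **bufs, int n)
{
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fds[i], bufs[i], BUF_SZ, 0);
        sqe->user_data = i;   /* identify the connection on completion */
    }
    io_uring_submit(ring);    /* one io_uring_enter(2) for all n SQEs */
}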
The tests are run on an Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz to measure single-connection echo server performance.
CPU vulnerability mitigations affect the user-to-kernel context-switch cost of system calls. In the test environment, when the mitigations are enabled, the context-switch overhead of a system call is about 700ns; when they are disabled, it is about 230ns.
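The article does not state how the mitigations were toggled. On recent kernels this is typically controlled with the mitigations= kernel boot parameter (for example mitigations=off), and the current status can be inspected under /sys/devices/system/cpu/vulnerabilities/.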
Use the following bpftrace script to measure __sys_sendto() and __sys_recvfrom() separately:
// Average kernel-side cost of send(2)/recv(2) in the epoll echo server.
BEGIN
{
    @start = 0;
    @send_time = 0;
    @send_count = 0;
}

kprobe:__sys_sendto
/comm == "epoll_echo_serv"/
{
    @start = nsecs;
}

kprobe:__sys_recvfrom
/comm == "epoll_echo_serv"/
{
    @start = nsecs;
}

kretprobe:__sys_sendto
/comm == "epoll_echo_serv"/
{
    if (@start > 0) {
        @delay = nsecs - @start;
        @send_time = @delay + @send_time;
        @send_count = @send_count + 1;
    }
}

kretprobe:__sys_recvfrom
/comm == "epoll_echo_serv"/
{
    if (@start > 0) {
        @delay = nsecs - @start;
        @send_time = @delay + @send_time;
        @send_count = @send_count + 1;
    }
}

// Print the average duration (in ns) every 5 seconds.
interval:s:5
{
    printf("time: %llu\n", @send_time / @send_count);
    @send_time = 0;
    @send_count = 0;
}
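To use the script, save it to a file (for example epoll_lat.bt, an illustrative name) and run bpftrace epoll_lat.bt while the benchmark is driving load. Note that the /comm == "epoll_echo_serv"/ predicate only matches a server whose process name is exactly that string (the kernel truncates comm to 15 characters), so adjust it to the name of your own binary.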
In a single-connection, 16-byte packet scenario, the tps of the epoll version of echo_server is around 1000. The following is the average kernel-side logic overhead of recv(2) and send(2):
time: 1489, time: 1492, time: 1484, time: 1491, time: 1499, time: 1505, time: 1512, time: 1528, time: 1493, time: 1509, time: 1495, time: 1499, time: 1544
The data shows that the average overhead of the kernel side of send(2) and recv(2) is about 1500ns. Therefore,
1) With CPU vulnerability mitigations enabled, the average overhead of send(2) and recv(2) is s = 700ns, w = 1500ns, so (s + w) = 2200ns.
2) With CPU vulnerability mitigations disabled, the average overhead of send(2) and recv(2) is s = 230ns, w = 1500ns, so (s + w) = 1730ns.
Use the following bpftrace script to measure the io_submit_sqes() overhead:
// Average per-request cost of io_submit_sqes() in the io_uring echo server.
BEGIN
{
    @start = 0;
    @send_time = 0;
    @send_count = 0;
}

kprobe:io_submit_sqes
/comm == "io_uring_echo_s"/
{
    @start = nsecs;
    // arg1 is the number of SQEs submitted in this call, so the count
    // accumulates requests rather than calls.
    @send_count = @send_count + arg1;
}

kretprobe:io_submit_sqes
/comm == "io_uring_echo_s"/
{
    if (@start > 0) {
        @delay = nsecs - @start;
        @send_time = @delay + @send_time;
    }
}

// Print the average duration per request (in ns) every 5 seconds.
interval:s:5
{
    printf("time: %llu\n", @send_time / @send_count);
    @send_time = 0;
    @send_count = 0;
}
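As with the epoll script, run it with bpftrace while the io_uring echo server is under load. On the kernel version used here, the second argument (arg1) of io_submit_sqes() is the number of SQEs submitted in that call (the argument position may differ on other kernel versions), so @send_count accumulates requests rather than calls and the printed time is a per-request average even when batching is used.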
The following data come from running the same test as in the epoll case:
time: 1892, time: 1901, time: 1901, time: 1882, time: 1890, time: 1936, time: 1960, time: 1907, time: 1896, time: 1897, time: 1911, time: 1897, time: 1891, time: 1893, time: 1918, time: 1895, time: 1885
The data shows that the average kernel-side overhead of io_submit_sqes() is about 1900ns. Note: the batch is n = 1, and this overhead includes both the kernel-side logic for sending and receiving requests and the io_uring framework overhead.
1) With CPU vulnerability mitigations enabled, the average per-request overhead of io_uring_enter(2) observed from the user state is t = 1900ns, n = 1, s = 700ns, so (t + s) / n = 2600ns.
2) With CPU vulnerability mitigations disabled, t = 1900ns, n = 1, s = 230ns, so (t + s) / n = 2130ns.
Note: only io_submit_sqes() is traced, and io_uring_enter(2) does more than call io_submit_sqes(), so the true overhead of io_uring_enter(2) is somewhat greater than (t + s) / n.
From the preceding data, it is clear that CPU vulnerability mitigations affect system call performance, especially for small packets. The two cases are discussed separately below.
With mitigations disabled: epoll costs s + w per request and io_uring costs (t + s) / n. Since t is greater than w, io_uring does not match epoll here even if the batch size is increased.
With mitigations enabled: epoll costs s + w and io_uring costs (t + s) / n. Because s is large, io_uring is not as good as epoll when the batch size is small; but when the batch is large, the system call context-switch overhead is amortized, and io_uring outperforms epoll. In an actual test with 1000 connections, the throughput of io_uring is about 10% higher than epoll's, which matches our model.
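As a rough worked example (assuming the measured t of about 1900ns scales roughly linearly with the batch, i.e. about 1900ns of kernel-side plus framework work per request): with mitigations enabled and a batch of n = 8, io_uring costs roughly 1900 + 700 / 8 ≈ 1990ns per request versus 2200ns for epoll, about a 10% difference, which is consistent with the measured gain.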
From the quantitative analysis, it can be seen that whether io_uring or epoll is superior is determined by the four variables defined in the evaluation model:
epoll: s + w
io_uring: (t + s) / n
If one variable dominates, the outcome changes. For example, if the system call context-switch overhead s is large and the io_uring batch size n is also large, io_uring performs better than epoll in that scenario. Conversely, if the kernel-side overhead w is large, the performance of io_uring and epoll will be close.
Therefore, whether io_uring or epoll is superior depends on the application scenario. The recommended best practice is to reduce the real network model of the business to the echo server model, run our measurement scripts, and use the results to evaluate both in the real environment and guide application development. The measurements also point out directions for performance optimization: reducing the overhead of any one variable improves performance, for example by further optimizing the io_uring framework overhead.
The High Performance Storage Technology SIG is committed to improving the performance of the storage stack, building a standard high-performance storage software stack, and promoting the collaborative development of software and hardware.
Welcome to learn more and join SIG at https://openanolis.cn/sig/high-perf-storage