The authors are two core developers of the OpenAnolis Kernel-SIG and experts in the Linux kernel scheduler.
Benchmarking has become something of a joke in the mobile phone industry. In the field of operating systems, however, it remains one of the most important evaluation methods. For example, the Linux kernel community often judges the value of an optimization patch by benchmarking it, and media outlets such as Phoronix focus specifically on Linux benchmarks. One more point: a high benchmark score is not a gimmick but a reflection of solid engineering, built on a deep understanding of the kernel. This article stems from a routine performance analysis. While evaluating Tuned, an automated performance tuning tool, we found that in the server scenario it makes a few small changes to parameters of the Linux kernel scheduler, and these changes improve Hackbench performance. Isn't that interesting? Let's dive in!
The content of this article is listed below:
Most threads/processes in Linux are scheduled by the Completely Fair Scheduler (CFS), one of the core components of Linux. (In Linux, there are only subtle differences between threads and processes, so both are referred to as processes later in this article.) At the core of CFS is a red-black tree that keeps the runnable processes in the system ordered by their running time, which serves as the basis for selecting the next process to run. In addition, CFS supports priorities, group scheduling (based on the well-known cgroup), bandwidth throttling, and other functions that meet various advanced requirements.
Hackbench is a stress-testing tool for the Linux kernel scheduler. It creates a specified number of pairs of scheduling entities (threads or processes), lets them pass data through sockets or pipes, and measures the time the whole run takes.
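As a concrete illustration, typical invocations look like the following (a sketch assuming the hackbench binary from the rt-tests package; exact flags can vary across builds):

```shell
# 10 groups of sender/receiver pairs, process mode, sockets (the default):
hackbench -g 10 --process

# Thread mode over pipes, each sender looping 1000 times:
hackbench -g 10 --threads --pipe -l 1000
```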
This article focuses on the following two parameters, which are important factors affecting Hackbench performance. A system administrator can set them with the sysctl command.
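For reference, the two parameters can be inspected and changed with sysctl as sketched below (setting them requires root; note that on newer kernels these knobs have moved under /sys/kernel/debug/sched/):

```shell
# Read the current values:
sysctl kernel.sched_min_granularity_ns
sysctl kernel.sched_wakeup_granularity_ns

# Set new values (in nanoseconds; requires root):
sysctl -w kernel.sched_min_granularity_ns=10000000    # 10ms
sysctl -w kernel.sched_wakeup_granularity_ns=15000000 # 15ms
```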
The duration of the CFS scheduling period can be influenced by modifying kernel.sched_min_granularity_ns. For example, set kernel.sched_min_granularity_ns = m: when there are many runnable processes in the system, the larger m is, the longer the CFS scheduling period becomes.
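The relationship between the number of runnable tasks, sched_min_granularity_ns, and the scheduling period can be sketched in shell, mirroring the kernel's __sched_period() logic (the constants below are illustrative, not read from a live system):

```shell
#!/bin/sh
# Sketch of the kernel's __sched_period() logic (all values in ns).
# The two constants stand in for the kernel.sched_latency_ns and
# kernel.sched_min_granularity_ns tunables.
SCHED_LATENCY_NS=24000000         # illustrative sched_latency_ns
SCHED_MIN_GRANULARITY_NS=3000000  # illustrative sched_min_granularity_ns

sched_period() {
    nr_running=$1
    # How many tasks fit into one latency period at minimum granularity:
    nr_latency=$((SCHED_LATENCY_NS / SCHED_MIN_GRANULARITY_NS))
    if [ "$nr_running" -gt "$nr_latency" ]; then
        # Many runnable tasks: the period stretches so that each task
        # still gets at least sched_min_granularity_ns.
        echo $((nr_running * SCHED_MIN_GRANULARITY_NS))
    else
        echo "$SCHED_LATENCY_NS"
    fi
}

sched_period 4    # few tasks  -> 24000000 (the latency period)
sched_period 100  # many tasks -> 300000000 (100 * min granularity)
```

This is why a larger m lengthens the scheduling period only when the system is loaded: with few runnable tasks, the period is pinned to the latency target.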
As shown in Figure 1, each process runs on the CPU for a different length of time. sched_min_granularity_ns guarantees a minimum running time for each process (assuming equal priority): the larger sched_min_granularity_ns is, the longer each process can run at a time.
kernel.sched_wakeup_granularity_ns ensures that a newly woken process does not preempt the running process too frequently: the larger it is, the lower the preemption frequency.
As shown in Figure 2, process-{1,2,3} have just been woken up. The running time of process-3 is greater than that of curr (the process currently running on the CPU), so it fails to preempt. The running time of process-2 is less than curr's, but the difference between them is smaller than sched_wakeup_granularity_ns, so it also fails to preempt. Only process-1 can preempt curr. Therefore, the smaller sched_wakeup_granularity_ns is, the faster a process can respond after being woken up (the shorter its waiting time).
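The preemption decision just described can be sketched as follows (a simplification of the kernel's wakeup_preempt_entity() check; the vruntime values are illustrative nanosecond figures):

```shell
#!/bin/sh
# Simplified CFS wakeup-preemption check (all values in ns).
# The woken task preempts only when its vruntime is smaller than
# curr's by MORE than sched_wakeup_granularity_ns.
SCHED_WAKEUP_GRANULARITY_NS=4000000  # illustrative value

can_preempt() {
    curr_vruntime=$1
    wakee_vruntime=$2
    diff=$((curr_vruntime - wakee_vruntime))
    if [ "$diff" -gt "$SCHED_WAKEUP_GRANULARITY_NS" ]; then
        echo preempt
    else
        echo no-preempt
    fi
}

can_preempt 100000000 90000000   # diff 10ms > 4ms  -> preempt   (process-1)
can_preempt 100000000 98000000   # diff 2ms  < 4ms  -> no-preempt (process-2)
can_preempt 100000000 110000000  # wakee ran longer -> no-preempt (process-3)
```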
Hackbench has two working modes, process mode and thread mode; the main difference is whether it creates processes or threads as the test entities. The following takes the thread mode as an example:
From the preceding Hackbench model analysis, we can see that threads/processes within the same group are mainly I/O-intensive, while threads/processes across different groups are mainly CPU-intensive.
Figure 3: Hackbench Working Mode
Voluntary Context Switching:
Therefore, there are many voluntary context switches in the system, but involuntary context switches also occur; the latter are affected by the parameters we introduce below.
In the hackbench-socket test, Tuned modified the CFS parameters sched_min_granularity_ns and sched_wakeup_granularity_ns, resulting in a performance difference. The details are as follows:
| Tuned State | sched_min_granularity_ns | sched_wakeup_granularity_ns | Performance |
|---|---|---|---|
| Tuned off | 2.25ms | 3ms | Poor |
| Tuned on | 10ms | 15ms | Good |
Next, we adjust these two scheduling parameters for further analysis.
Note: For brief expression, m refers to kernel.sched_min_granularity_ns and w refers to kernel.sched_wakeup_granularity_ns.
To explore how the two parameters influence the scheduler, we fix one parameter at a time, investigate how the other affects performance, and use kernel knowledge to explain the behavior we observe.
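A parameter sweep of this kind can be sketched as a shell loop (a hypothetical reproduction script: it requires root, a kernel that still exposes these sysctls, and hackbench installed; the group count and sweep values are illustrative):

```shell
#!/bin/sh
# Hypothetical sweep: fix w, vary m, and time hackbench each round.
# On newer kernels these knobs live under /sys/kernel/debug/sched/ instead.
sysctl -w kernel.sched_wakeup_granularity_ns=15000000   # fix w = 15ms

for m_ms in 1 4 17 30; do
    sysctl -w kernel.sched_min_granularity_ns=$((m_ms * 1000000))
    echo "m = ${m_ms}ms:"
    hackbench -g 10 --process
done
```

Swapping the roles of the two sysctl calls gives the complementary sweep (fix m, vary w).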
Figure 4: Fix w and adjust m
In the preceding figure, we fixed the parameter w and divided the range of the parameter m into three regions according to the changing trend: region A (1ms~4ms), region B (4ms~17ms), and region C (17ms~30ms). In region A, all four curves fall rapidly; in region B, all four curves fluctuate considerably; and in region C, the curves level off.
From the background in the second section, we know that m affects how long a process runs at a time, which means it affects the process's involuntary context switching.
Figure 5: Fix m and adjust w
In the preceding figure, we fixed the parameter m; the range of w is likewise divided into three regions:
The following is a heat-map overview of the experimental data, showing the constraint relationship between m and w. Note that the three regions here differ from those in Figures 4 and 5.
Figure 6: Overview