By Cloud Kernel Developers of OpenAnolis.
Plugsched is an SDK for live updating the Linux kernel scheduler. It can dynamically replace the scheduler subsystem without rebooting the system or restarting applications, with downtime measured in milliseconds. Plugsched helps developers dynamically add, delete, and modify kernel scheduling features in production, which makes it possible to customize the scheduler for specific scenarios. Because updates are applied live, they can also be rolled back.
Live updating the scheduler with Plugsched offers stronger modification capability without modifying the kernel source code, and it supports older kernel versions that are already running in production. If reserved fields are added to the key data structures of the kernel scheduler in advance, the data structures themselves can also be modified, which further improves the modification capability.
What is the background of Plugsched, and what problems does it aim to solve? There are four points:
Plugsched extracts the scheduler from the kernel and builds it as a kernel module, which then replaces the scheduler built into the running kernel. With this module, new features and optimizations can be developed in a more agile way and applied to production environments without interrupting business workloads.
Figure 1: Plugsched Business without Interruption
Plugsched has the following six advantages:
Compared with kpatch and livepatch, which are live-update techniques at function granularity, Plugsched has stronger modification capability and a wider live-update scope that covers an entire subsystem. Plugsched can handle bug fixes, performance optimizations, and the addition, deletion, or modification of features. Thanks to this capability, it can be applied to the following scenarios:
Group Identity is a scheduling feature used by Alibaba Cloud in hybrid deployment scenarios. Built on the CFS scheduler, it adds a red-black tree to store low-priority tasks and assigns a default priority to each cgroup, which users can reconfigure. When high-priority tasks exist in the queue, low-priority tasks stop running. We used Plugsched to live update the scheduler of an older ANCK 4.19 kernel that did not include Group Identity, porting the feature to the generated scheduler module. The port touched seven files and modified more than 2,500 lines.
After installing this scheduler module, we created two CPU cgroups (A and B), bound them to the same CPU, set them to the highest and lowest priority respectively, and started one busy-loop task in each. In theory, while a task in cgroup A is running, the task in cgroup B should not run. Checking CPU utilization with top confirmed this: only one busy-loop task was running, at 100% utilization, which shows that the Group Identity feature in the module had taken effect. After the module was dynamically removed, the two busy-loop tasks each used 50% of the CPU, which shows that the module was no longer in effect.
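The expected outcome of this experiment can be reproduced with a toy model. The sketch below is not the Group Identity implementation: it replaces the red-black tree with plain queues, and the function name `simulate`, the task names, and the "high"/"low" labels are assumptions made only for illustration. It shows why one busy loop gets the whole CPU when strict priority is active, and why the two tasks split the CPU evenly when it is not.

```python
from collections import deque


def simulate(ticks, tasks, group_identity):
    """Return the fraction of CPU ticks each task received.

    tasks: dict mapping a task name to "high" or "low" (hypothetical labels).
    group_identity: when True, low-priority tasks run only if no
    high-priority task is runnable; when False, plain round-robin.
    """
    high = deque(name for name, prio in tasks.items() if prio == "high")
    low = deque(name for name, prio in tasks.items() if prio == "low")
    everyone = deque(tasks)
    ran = {name: 0 for name in tasks}

    for _ in range(ticks):
        if group_identity:
            # Strict priority: the low-priority queue is used only
            # when the high-priority queue is empty.
            queue = high if high else low
        else:
            queue = everyone
        task = queue.popleft()  # pick the next task
        ran[task] += 1          # it runs for one tick
        queue.append(task)      # a busy loop is always runnable again

    return {name: count / ticks for name, count in ran.items()}


tasks = {"busy_loop_A": "high", "busy_loop_B": "low"}
print("module loaded:  ", simulate(1000, tasks, group_identity=True))
print("module removed: ", simulate(1000, tasks, group_identity=False))
```

With the strict-priority branch enabled, the low-priority busy loop never runs, which matches the single 100% task seen in top. With it disabled, both tasks alternate and each receives half of the ticks.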
An Alibaba Cloud customer was running an old kernel version that suffered from excessive CPU utilization caused by a flawed load statistics algorithm in the kernel scheduler. The bug fixes had been merged into the kernel mainline, but a new kernel version containing them had not yet been released, and the business side did not intend to upgrade the kernel anyway: a large number of workloads were deployed in the cluster, so the cost of a kernel upgrade was high.
In addition, the customer's kernel developers had written targeted optimizations for their hybrid deployment scenarios (Group Identity scheduling features) and wanted to merge them into the kernel mainline. Alibaba Cloud kernel developers found that these optimizations caused performance regressions in other scenarios and were not generic, so they could not be merged into the mainline.
As a result, the customer's kernel developers used Plugsched to port all of the fixes and optimizations into a scheduler module and deployed it at scale. This case illustrates two advantages of Plugsched: decoupling from the kernel release cycle and support for customized schedulers.
Plugsched currently supports Anolis OS 7.9 ANCK by default; other operating systems require adjustments to the boundary configurations. To reduce the complexity of setting up a working environment, we provide container images and Dockerfiles, so developers do not need to build a development environment themselves. For this walkthrough, we purchased an Alibaba Cloud ECS instance (64 CPUs, 128 GB of memory) and installed Anolis OS 7.9 ANCK. The following steps live update its kernel scheduler.
1. Log in to the cloud server and install the necessary basic software packages.
# yum install anolis-repos -y
# yum install podman kernel-debuginfo-$(uname -r) kernel-devel-$(uname -r) --enablerepo=Plus-debuginfo --enablerepo=Plus -y
2. Create a temporary working directory and download the source code of the kernel.
# mkdir /tmp/work
# uname -r
4.19.91-25.2.an7.x86_64
# cd /tmp/work
# wget https://mirrors.openanolis.cn/anolis/7.9/Plus/source/Packages/kernel-4.19.91-25.2.an7.src.rpm
3. Start the container and spawn a shell.
# podman run -itd --name=plugsched -v /tmp/work:/tmp/work -v /usr/src/kernels:/usr/src/kernels -v /usr/lib/debug/lib/modules:/usr/lib/debug/lib/modules docker.io/plugsched/plugsched-sdk
# podman exec -it plugsched bash
# cd /tmp/work
4. Extract kernel source code.
# plugsched-cli extract_src kernel-4.19.91-25.2.an7.src.rpm ./kernel
5. Boundary analysis and extraction.
# plugsched-cli init 4.19.91-25.2.an7.x86_64 ./kernel ./scheduler
6. The extracted scheduler code is in ./scheduler/kernel/sched/mod. Add a new sched_feature and package it into an RPM.
diff --git a/scheduler/kernel/sched/mod/core.c b/scheduler/kernel/sched/mod/core.c
index 9f16b72..21262fd 100644
--- a/scheduler/kernel/sched/mod/core.c
+++ b/scheduler/kernel/sched/mod/core.c
@@ -3234,6 +3234,9 @@ static void __sched notrace __schedule(bool preempt)
 	struct rq *rq;
 	int cpu;
 
+	if (sched_feat(PLUGSCHED_TEST))
+		printk_once("I am the new scheduler: __schedule\n");
+
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
diff --git a/scheduler/kernel/sched/mod/features.h b/scheduler/kernel/sched/mod/features.h
index 4c40fac..8d1eafd 100644
--- a/scheduler/kernel/sched/mod/features.h
+++ b/scheduler/kernel/sched/mod/features.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+SCHED_FEAT(PLUGSCHED_TEST, false)
+
 /*
  * Only give sleepers 50% of their service deficit. This allows
  * them to run sooner, but does not allow tons of sleepers to
# plugsched-cli build /tmp/work/scheduler
7. Copy the scheduler RPM to the host, exit the container, and view the current sched_features.
# cp /usr/local/lib/plugsched/rpmbuild/RPMS/x86_64/scheduler-xxx-4.19.91-25.2.an7.yyy.x86_64.rpm /tmp/work
# exit
exit
# cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS NO_WA_STATIC_WEIGHT UTIL_EST ID_IDLE_AVG ID_RESCUE_EXPELLEE NO_ID_EXPELLEE_NEVER_HOT NO_ID_LOOSE_EXPEL ID_LAST_HIGHCLASS_STAY
8. Install the scheduler RPM. The new feature is now listed, but it is disabled.
# rpm -ivh /tmp/work/scheduler-xxx-4.19.91-25.2.an7.yyy.x86_64.rpm
# lsmod | grep scheduler
scheduler 503808 1
# dmesg | tail -n 10
[ 2186.213916] cni-podman0: port 1(vethfe1a04fa) entered forwarding state
[ 6092.916180] Hi, scheduler mod is installing!
[ 6092.923037] scheduler: total initialization time is 6855921 ns
[ 6092.923038] scheduler module is loading
[ 6092.924136] scheduler load: current cpu number is 64
[ 6092.924137] scheduler load: current thread number is 667
[ 6092.924138] scheduler load: stop machine time is 249471 ns
[ 6092.924138] scheduler load: stop handler time is 160616 ns
[ 6092.924138] scheduler load: stack check time is 85916 ns
[ 6092.924139] scheduler load: all the time is 1097321 ns
# cat /sys/kernel/debug/sched_features
NO_PLUGSCHED_TEST GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS NO_WA_STATIC_WEIGHT UTIL_EST ID_IDLE_AVG ID_RESCUE_EXPELLEE NO_ID_EXPELLEE_NEVER_HOT NO_ID_LOOSE_EXPEL ID_LAST_HIGHCLASS_STAY
9. Enable the new feature. The message "I am the new scheduler: __schedule" then appears in dmesg.
# echo PLUGSCHED_TEST > /sys/kernel/debug/sched_features
# dmesg | tail -n 5
[ 6092.924138] scheduler load: stop machine time is 249471 ns
[ 6092.924138] scheduler load: stop handler time is 160616 ns
[ 6092.924138] scheduler load: stack check time is 85916 ns
[ 6092.924139] scheduler load: all the time is 1097321 ns
[ 6512.539300] I am the new scheduler: __schedule
10. Remove the scheduler RPM. The new feature is removed along with it.
# rpm -e scheduler-xxx
# dmesg | tail -n 8
[ 6717.794923] scheduler module is unloading
[ 6717.809110] scheduler unload: current cpu number is 64
[ 6717.809111] scheduler unload: current thread number is 670
[ 6717.809112] scheduler unload: stop machine time is 321757 ns
[ 6717.809112] scheduler unload: stop handler time is 142844 ns
[ 6717.809113] scheduler unload: stack check time is 74938 ns
[ 6717.809113] scheduler unload: all the time is 14185493 ns
[ 6717.810189] Bye, scheduler mod has be removed!
#
# cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS NO_WA_STATIC_WEIGHT UTIL_EST ID_IDLE_AVG ID_RESCUE_EXPELLEE NO_ID_EXPELLEE_NEVER_HOT NO_ID_LOOSE_EXPEL ID_LAST_HIGHCLASS_STAY
Note: Do not unload the scheduler module directly with the "rmmod" command. Use the standard "rpm" or "yum" commands to remove the scheduler package.
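The timing lines that the module printed during load and unload in the steps above are reported in nanoseconds. As a convenience, the following sketch converts them to milliseconds. The regular expression is derived only from the log lines shown in this walkthrough, and `parse_timings` is a hypothetical helper, not part of plugsched-cli.

```python
import re

# Matches lines such as:
# [ 6092.924138] scheduler load: stop machine time is 249471 ns
LINE = re.compile(r"scheduler (load|unload): (.+?) is (\d+) ns")


def parse_timings(dmesg_text):
    """Return {(phase, metric): nanoseconds} for the timing lines found."""
    timings = {}
    for line in dmesg_text.splitlines():
        match = LINE.search(line)
        if match:
            phase, metric, ns = match.groups()
            timings[(phase, metric)] = int(ns)
    return timings


# Lines copied from the dmesg output in steps 8 and 10.
sample = """\
[ 6092.924138] scheduler load: stop machine time is 249471 ns
[ 6092.924139] scheduler load: all the time is 1097321 ns
[ 6717.809112] scheduler unload: stop machine time is 321757 ns
[ 6717.809113] scheduler unload: all the time is 14185493 ns
"""

for (phase, metric), ns in parse_timings(sample).items():
    print(f"{phase:6} {metric:18} {ns / 1e6:.3f} ms")
```

In this run, the logged stop machine times of 249471 ns and 321757 ns are roughly 0.25 ms and 0.32 ms, and the total load and unload times are roughly 1.1 ms and 14.2 ms.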
Now we know what Plugsched is and how it has been applied, but how does it work?
The scheduler subsystem is built into the kernel rather than being an independent module, and it is tightly coupled to other parts of the kernel. Plugsched applies the idea of modularization: it provides a boundary analyzer that determines the boundary of the scheduler subsystem and extracts the scheduler from the kernel code into a separate directory. Developers can modify the extracted scheduler code, compile it into a new scheduler module, and dynamically replace the old scheduler in the running system. To generate an independent module, boundary analysis and code extraction must handle both functions and data.
For functions, the scheduler module exports a set of key functions, called interface functions, through which execution enters the module. By replacing these functions in the kernel, the kernel bypasses the original execution logic and enters the new scheduler module, which completes the function update. Functions compiled into the scheduler module are either interface functions or insiders. All other functions are called outsiders.
For data, the scheduler's important data, such as runqueue state and sched class state, can be automatically reinitialized using the state rebuild technique. This data is private data, while all other data is shared data. For flexibility, Plugsched also allows users to define private data manually: the module keeps its own definitions of such data, but the data must then be initialized.
Plugsched classifies struct fields that are accessed only by the scheduler as inner fields, and all others as non-inner fields. The scheduler module may change the semantics of inner fields but must not change the semantics of non-inner fields. If all fields of a data structure are inner fields, the module may even change the size of the whole structure. Last but most important, we recommend using the reserved fields of data structures rather than modifying existing ones.
Plugsched consists of two main parts, which form the core of the design: boundary analysis and code extraction for the scheduler subsystem, and live updating of the scheduler module. The architecture is shown below:
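The relationship between interface functions, insiders, and outsiders can be sketched in user space. The example below is only an analogy: Plugsched replaces compiled kernel functions, whereas this sketch swaps entries in a Python dictionary that stands in for the kernel. Every name in it (`kernel_symbols`, `SchedulerModule`, `builtin_schedule`, and so on) is hypothetical.

```python
# Toy "kernel": a symbol table mapping function names to implementations.
# All names here are hypothetical and exist only for this illustration.

def builtin_schedule():
    return "built-in scheduler"


def builtin_printk():
    return "printk"


kernel_symbols = {"schedule": builtin_schedule, "printk": builtin_printk}


class SchedulerModule:
    def __init__(self):
        # Functions compiled into the module.
        self.functions = {
            "schedule": self.schedule,              # interface function
            "pick_next_task": self.pick_next_task,  # insider
        }
        # Entry points through which the kernel enters the module.
        self.interfaces = {"schedule"}
        self._saved = {}

    def schedule(self):
        return "module scheduler picked " + self.pick_next_task()

    def pick_next_task(self):
        return "task-1"

    def classify(self, name):
        if name in self.interfaces:
            return "interface"
        if name in self.functions:
            return "insider"
        return "outsider"  # not compiled into the module

    def load(self, symbols):
        # Redirect every interface function to the module's version,
        # remembering the originals so the change can be rolled back.
        for name in self.interfaces:
            self._saved[name] = symbols[name]
            symbols[name] = self.functions[name]

    def unload(self, symbols):
        symbols.update(self._saved)
        self._saved.clear()


module = SchedulerModule()
print(kernel_symbols["schedule"]())   # original logic
module.load(kernel_symbols)
print(kernel_symbols["schedule"]())   # enters the module
module.unload(kernel_symbols)
print(kernel_symbols["schedule"]())   # rolled back
```

Only the interface function is redirected. The insider is reached solely through the module's own code, and the outsider (`printk` here) keeps its original implementation throughout.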
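To illustrate why reserved fields are the safer choice, the following sketch models a structure layout with Python's ctypes. Both structures are hypothetical and are not real kernel definitions. The new version gives a meaning to a slot that was reserved in advance, so the total size and the offsets of all other fields stay the same, and code outside the module that still uses the old layout is unaffected.

```python
import ctypes


class EntityV1(ctypes.Structure):
    """Hypothetical scheduler structure as compiled into the kernel."""
    _fields_ = [
        ("vruntime", ctypes.c_uint64),
        ("weight", ctypes.c_uint32),
        ("on_rq", ctypes.c_uint32),
        ("reserved1", ctypes.c_uint64),  # reserved in advance, unused
        ("reserved2", ctypes.c_uint64),  # reserved in advance, unused
    ]


class EntityV2(ctypes.Structure):
    """The same structure as seen by a new scheduler module."""
    _fields_ = [
        ("vruntime", ctypes.c_uint64),
        ("weight", ctypes.c_uint32),
        ("on_rq", ctypes.c_uint32),
        ("identity", ctypes.c_uint64),   # reuses the slot of reserved1
        ("reserved2", ctypes.c_uint64),
    ]


def same_layout(old, new, shared_fields):
    """True if the size and the offsets of the shared fields are unchanged."""
    if ctypes.sizeof(old) != ctypes.sizeof(new):
        return False
    return all(
        getattr(old, name).offset == getattr(new, name).offset
        for name in shared_fields
    )


shared = ["vruntime", "weight", "on_rq", "reserved2"]
print(same_layout(EntityV1, EntityV2, shared))

# Memory written with the old layout is still readable with the new one.
old = EntityV1(vruntime=42, weight=1024, on_rq=1)
new = EntityV2.from_buffer_copy(bytes(old))
print(new.vruntime, new.weight, new.on_rq, new.identity)
```

Inserting a new field in the middle of the structure instead would shift the offsets of the fields that follow it, which is the kind of change that is only acceptable when every field is an inner field.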
Figure 2: Architecture of Plugsched
The first part is boundary analysis and code extraction for the scheduler module. The scheduler itself is not a module, so its boundary must be determined before it can be modularized. Based on the boundary configuration (which includes the source code files, the interface functions, and so on), the boundary analyzer extracts the scheduler code from the kernel source into a specified directory that serves as the code base. Developers can then modify the code and customize the scheduler. Finally, the scheduler module is compiled and packaged into an RPM that can be installed on the system. After installation, the module replaces the original scheduler built into the kernel. The installation goes through the following key steps.
Overall, Plugsched frees the scheduler from the kernel. Developers can build specialized schedulers without being limited to the kernel's generic scheduler. Kernel maintenance becomes easier, because kernel developers only need to focus on developing and iterating the generic scheduler, while customized schedulers can be released as RPM packages. The kernel scheduler code also becomes cleaner, since it is no longer cluttered with scenario-specific optimizations.
In the future, Plugsched will support newer kernel versions and other platforms, improve ease of use, and provide more application cases. Welcome to Plugsched!