Learn about the Latest Version of SysAK

This article is an excerpt from a speech about SysAK from the 2022 Apsara Conference, explaining the core technical structure and usage scenarios of the latest version.

By System O&M SIG

Recently, at the Apsara Conference 2022 OpenAnolis Forum-eBPF & Linux Stability Session, Yi Zhang (the maintainer of Anolis OS Operation and Maintenance (O&M) SIG) gave a keynote speech on the SysAK System Operation and Maintenance Tool Set. The following article highlights the main points of that speech.

Hello everyone. During Apsara Conference 2021, we open-sourced the system O&M tool set SysAK in the OpenAnolis community and provided a variety of diagnostic functions. As one of the main projects of the System O&M SIG (Special Interest Group), SysAK has upgraded its technical architecture over the past year to cover more application scenarios. Today, I will share the core technical structure and usage scenarios of the latest version. Due to time limits, I will focus on the components of the monitoring mode and show how the enhanced features of Anolis OS and SysAK can be used to monitor hard-to-diagnose problems and system health.

1. SysAK Framework Introduction

SysAK stands for System Analyse Kit. Drawing on O&M experience across millions of servers, it provides a comprehensive system O&M tool set that covers common O&M scenarios such as daily system monitoring, online problem diagnosis, and system fault repair. It covers three areas:

System Monitoring: It provides more refined resource monitoring for various system resources, such as CPU, memory, network, file IO, and kernel management structure. This helps you implement fine-grained O&M scheduling and efficiently use resources.
System Diagnosis: It provides targeted tools for typical problems such as load exceptions, network jitter, memory leaks, IO glitches, performance bottlenecks, and application exceptions. The tools are designed to require as little specialist knowledge as possible, which makes them easier to use and interpret.
System Intervention: It provides system intervention capabilities mainly for fault injection, system recovery, and fault isolation.

The SysAK framework includes two major modes: monitoring mode and diagnostic mode. In monitoring mode, the system resource bottleneck metrics cover CPU, memory, network, and IO bottlenecks. By monitoring these bottlenecks, you can find out which resources an application depends on while it runs, and then schedule and allocate resources to applications based on that dependency and other data.
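As an illustration of what a CPU bottleneck metric can look like, the sketch below derives CPU utilization from two samples of the aggregate "cpu" line in /proc/stat. This is standard Linux accounting, not SysAK's own implementation; the function names parse_cpu_line and cpu_utilization are hypothetical, and counting iowait as idle time is an assumption of this sketch.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) jiffies."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Use user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user/nice, so it is skipped.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # assumption: iowait counts as idle
    total = sum(values)
    return total - idle, total


def cpu_utilization(before, after):
    """Return the busy fraction (0.0 to 1.0) between two 'cpu' line samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return (busy1 - busy0) / elapsed


# Two made-up samples taken some time apart.
sample_before = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu  180 0 70 900 50 0 0 0 0 0"
utilization = cpu_utilization(sample_before, sample_after)
```

On a live system, the two samples would be the first line of /proc/stat read at the start and end of a sampling interval.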

In addition to the four major hardware resources, the system software itself can be a bottleneck. For example, the Linux kernel may hit concurrency bottlenecks when accessing files, handles, caches, and other shared resources. SysAK has also done a lot of work to address these bottlenecks.

Interference is a common factor while applications are running, and it can cause jitter or interruption. In line with the cloud-native trend, SysAK also implements container resource visualization.

Diagnostic Mode: It detects problems in a timely manner, diagnoses their root causes, and is started only when needed. The following three services are supported based on user O&M scenarios:

System Load Analysis: High load is a typical problem in system O&M. This service analyzes the root cause of the load problem without interfering with the affected processes.
One-Click System Health Diagnosis: For example, you can analyze system resources to check whether the configurations are reasonable.
Automatic IO Problem Diagnosis: For example, when IO is saturated, it helps determine whether the cause is a bottleneck in the application or in the underlying storage that the business depends on.

In addition to these user scenarios, we provide deeper diagnostic data for advanced technical support personnel, such as long-running system functions, interrupt statistics, the scheduling and memory modules, latency jitter, and memory leaks. We also provide dedicated diagnostic functions based on the characteristics of each subsystem.

Through loose coupling, dependency management, and multi-architecture, multi-version build support, SysAK lets the developers of these tools target mainstream architectures and operating system versions without additional work.

2. SysAK Monitoring Scenario Application

The mservice of SysAK provides three major capabilities: resource monitoring, exception alert, and root cause analysis. The exception alert function compares metrics against dedicated thresholds, raises alerts, and triggers automatic analysis.

SysAK can use enhanced metrics to monitor the use of container resources, mainly relying on the enhanced features of the Anolis OS kernel and the expansion of SysAK itself.

Computing Resource: It includes the container load and the number of running and blocked tasks.

Memory Resource: Memory usage frequently runs into bottlenecks, so the enhanced monitoring focuses on latency. Memory reclaim latency, from both global memory reclaim and container memory reclaim, affects the services running in containers. Therefore, we collect statistics on the distribution of reclaim latency and on how many times reclaim occurs. Based on the results, we determine whether the container workload has hit a bottleneck.
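One common way to summarize a latency distribution is a power-of-two histogram, the same shape many eBPF tools print. The sketch below is only an illustration of that idea and is not taken from SysAK; the function names and the microsecond unit are assumptions.

```python
from collections import Counter


def log2_bucket(latency_us):
    """Return the lower bound of the power-of-two bucket holding latency_us."""
    if latency_us < 1:
        return 0
    return 1 << (int(latency_us).bit_length() - 1)


def latency_histogram(samples_us):
    """Count samples per power-of-two bucket, ordered by bucket lower bound."""
    counts = Counter(log2_bucket(s) for s in samples_us)
    return dict(sorted(counts.items()))


# Made-up reclaim latencies in microseconds.
histogram = latency_histogram([1, 3, 4, 7, 8, 1000, 1500])
```

A distribution whose weight shifts toward the larger buckets over time is the kind of signal that would suggest a container is stalling in memory reclaim.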

IO Resource: It includes the wait time, number of queues, and average number of bytes in container read and write.
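Per-request IO figures such as average wait time and average bytes are usually derived from the difference between two snapshots of cumulative counters. The sketch below shows that arithmetic under stated assumptions: the IoSnapshot fields and the io_rates function are hypothetical and do not describe SysAK's actual data layout.

```python
from dataclasses import dataclass


@dataclass
class IoSnapshot:
    """Hypothetical cumulative IO counters for one container."""
    ios: int         # completed read and write requests
    nbytes: int      # bytes transferred
    wait_ms: float   # total time requests spent waiting


def io_rates(prev, curr):
    """Return average wait and average size per request between two snapshots."""
    d_ios = curr.ios - prev.ios
    if d_ios <= 0:
        return {"avg_wait_ms": 0.0, "avg_bytes": 0.0}
    return {
        "avg_wait_ms": (curr.wait_ms - prev.wait_ms) / d_ios,
        "avg_bytes": (curr.nbytes - prev.nbytes) / d_ios,
    }


# Made-up counters sampled at the start and end of an interval.
rates = io_rates(IoSnapshot(100, 409600, 200.0), IoSnapshot(150, 614400, 450.0))
```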

Jitter is an occasional problem in daily O&M, and because it occurs only occasionally, it is difficult to capture the data that reveals its root cause. Collecting too much data affects overall system performance, while collecting too little is not enough to analyze the root cause. The causes of service jitter fall into the following three categories:

Delay in Process/Thread Scheduling: For example, a congested run queue, long queuing time, preemption by high-priority applications, or improper scheduling policy settings.
Untimely Response to Interrupt and softIRQ: Business processes depend on interrupt and softIRQ handling, for example when sending and receiving network packets or performing IO reads and writes. Analyzing how long interrupts take therefore shows whether they are being handled in time.
Too Long Kernel Mode Execution: This includes bottlenecks inside the system and contention for other resources in the kernel.

The three causes above cover the root causes of 70%-80% of jitter. Therefore, detecting them resolves most jitter problems.
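The three categories can be turned into a simple attribution step: measure the time spent in each and report the ones that exceed a limit. The sketch below is a minimal illustration, not SysAK's detection logic; the metric keys, the limit values, and the classify_jitter function are all hypothetical.

```python
# Hypothetical metric keys mapped to the three jitter causes from the text.
CAUSES = {
    "sched_delay_ms": "delayed process/thread scheduling",
    "irq_ms": "untimely interrupt or softIRQ response",
    "kernel_ms": "long kernel-mode execution",
}


def classify_jitter(sample, limits):
    """Return causes whose measured time exceeds its limit, worst overshoot first."""
    over = []
    for key, label in CAUSES.items():
        ratio = sample.get(key, 0.0) / limits[key]
        if ratio > 1.0:
            over.append((ratio, label))
    over.sort(reverse=True)
    return [label for _, label in over]


# Made-up limits and one made-up observation, all in milliseconds.
limits = {"sched_delay_ms": 10, "irq_ms": 5, "kernel_ms": 20}
observed = {"sched_delay_ms": 30, "irq_ms": 1, "kernel_ms": 25}
suspects = classify_jitter(observed, limits)
```

Ranking by overshoot ratio rather than by raw time keeps causes with very different normal ranges comparable.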

SysAK is also enhanced for system health alerts.

For example, an application may not jitter but suddenly slow down, and in the long run the system may become unavailable, for example through downtime. Downtime has a great impact, and in most cases it is not recoverable. An early warning can be issued beforehand by various means. For example, an algorithm can check the metrics that influence downtime to determine whether downtime is likely, so that system health can be predicted in advance. The main metrics include scheduling latency, kernel lock contention latency, and memory reclaim latency.

Combined with past experience, we set the current exception reference threshold at 50%.
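The talk does not say what the 50% is measured against. The sketch below assumes one possible reading, namely that a metric is flagged when its latency takes up more than 50% of the observation window. That interpretation, the metric names, and the health_alerts function are assumptions made for illustration only.

```python
REFERENCE_THRESHOLD = 0.5  # the 50% reference threshold; its meaning here is assumed


def health_alerts(latency_ms, window_ms, threshold=REFERENCE_THRESHOLD):
    """Return the metrics whose latency share of the window exceeds the threshold."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return sorted(
        name for name, ms in latency_ms.items() if ms / window_ms > threshold
    )


# Made-up latencies accumulated over a 1000 ms window.
window = {"sched_latency": 600, "lock_contention": 100, "mem_reclaim": 500}
alerts = health_alerts(window, window_ms=1000)
```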

SysAK is mainly used for standalone diagnosis and monitoring. In addition to viewing data directly on the machine with SysAK mservice commands, you can have SysAK serve the data over an HTTP port, as shown in the figure above. A graphical display based on the data is also supported.
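To make the idea of serving monitoring data over HTTP concrete, here is a minimal sketch built on Python's standard http.server module. It does not reproduce SysAK's actual service: the /metrics path, port 8000, the metric names, and the placeholder collector are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Placeholder collector; a real one would read live system data."""
    return {"cpu_util": 0.5, "mem_reclaim_latency_ms": 3.2}


def render_metrics(metrics):
    """Serialize metrics as UTF-8 JSON with keys in a stable order."""
    return json.dumps(metrics, sort_keys=True).encode("utf-8")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical endpoint path; anything else returns 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_metrics())
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve metrics on the loopback interface until interrupted."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling serve() and then requesting http://127.0.0.1:8000/metrics would return the JSON document, which a dashboard could poll to draw graphs.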

3. Future Development

In the future, in addition to improving the usage scenarios of the tool itself, we will continue to enhance other capabilities of SysAK. Currently, SysAK can only perform diagnosis at the system level. In the future, we will consider performing diagnostics at the application level to provide more data for application diagnosis.

In addition, SysAK is open source in the OpenAnolis community. We hope more developers will join us in building O&M capabilities. We also hope SysAK will keep developing as the technical data collection layer for O&M platforms, so we will focus on platform plug-ins. Currently, it is used as a component of SysOM and CloudMonitor. In the future, it will be used as a plug-in extension of Prometheus to cover more scenarios.