Application Observation Practice of SysOM Profiling in the Cloud Environment

By System O&M SIG

Background Information

ECS users in the cloud environment usually set up monitoring for routine system metrics or business metrics. These metrics can reveal that a system or application is behaving abnormally, but they do not show what the system or application was doing to cause the exception. Common examples include:

The system CPU usage occasionally surges, but we do not know which application causes it.
A packet has reached the host, but we do not know why the application is slow to receive it.

When we are at a loss, we tend to suspect the system itself. However, once the system has been ruled out, it often turns out that an application is consuming system resources excessively. Some of these applications are the business workloads themselves; others are hidden among the thousands of tasks listed by ps -ef and are hard to spot. Therefore, we want to observe the running behavior of the system and its applications through profiling to help users solve such problems.

Implementation Scheme

Profiling can be considered a dynamic way to observe the execution logic of a program. The program can be as large as an operating system or an entire infrastructure (interested readers can refer to [1]) or as small as a pod or a simple application. If you add a time dimension to this method and profile continuously, you can track down the occasional resource anomalies mentioned above, and other problems, without having to wait for them to reappear.

How to Do Profiling

Different programming languages have their own profiling tools, such as pprof for Go and jstack for Java. Here we want to observe applications regardless of language, so we use eBPF to obtain program stack information. This stack information covers an application's execution in both user mode and kernel mode. The advantage of eBPF is that it lets us control the profiling process: the sampling frequency can be tuned, the probes run safely, and system resource usage stays small.

As shown in the following figure, we periodically obtain the execution stacks of applications through eBPF and Performance Monitoring Unit (PMU) events and use a BPF map to count the stacks of each application. With the help of Coolbpf, which we open-sourced earlier (see "Coolbpf Is Open-Source! The Development Efficiency of the BPF Program Increases a Hundredfold"), we have adapted the profiler to different kernel versions. The supported versions are listed under Kernel Version Supported by the Profiling Feature.
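The counting step can be illustrated with a minimal sketch. It is not SysOM's implementation: the sample data, the function names aggregate and folded, and the use of a Python Counter in place of a BPF map are all assumptions made for illustration. Each periodic sample contributes one count to the entry keyed by process name and call stack.

```python
from collections import Counter

# Hypothetical samples: (process name, call stack from outermost to innermost frame).
# In a real profiler these would come from a periodic, PMU-driven eBPF probe.
samples = [
    ("nginx", ("main", "ngx_worker_process_cycle", "ngx_epoll_process_events")),
    ("nginx", ("main", "ngx_worker_process_cycle", "ngx_epoll_process_events")),
    ("nginx", ("main", "ngx_worker_process_cycle", "ngx_http_process_request")),
    ("java", ("Thread.run", "Worker.handle")),
]

def aggregate(samples):
    """Count identical (process, stack) pairs, much as a hash map keyed by stack would."""
    counts = Counter()
    for comm, stack in samples:
        counts[(comm, stack)] += 1
    return counts

def folded(counts):
    """Render the counts as 'process;frame;frame count' lines, one per distinct stack."""
    lines = []
    for (comm, stack), n in sorted(counts.items()):
        lines.append(f"{comm};{';'.join(stack)} {n}")
    return lines

stack_counts = aggregate(samples)
for line in folded(stack_counts):
    print(line)
```

The semicolon-separated text form is convenient because it can be compared across time slices or passed to a flame graph renderer.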

Behavioral Logic of Profiling Applications

The runtime status of a program can be summarized as executing and not executing, namely on cpu and off cpu.

on cpu: We want to see the execution logic when the program occupies the CPU and which task or section of code of the task consumes resources on the CPU.
off cpu: We want to see whether the application gives up the CPU voluntarily and why it is not occupying the CPU (for example, because it is waiting for pending IO). This helps locate problems such as packet sending and receiving delays that arise because the application is not running.
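As a rough illustration of the off cpu view, the sketch below pairs each moment a task leaves the CPU with the moment it is scheduled back and sums the gaps per reason. The event format, the reason labels, and the function name off_cpu_time are hypothetical; a real profiler would derive this data from scheduler events in the kernel.

```python
from collections import defaultdict

# Hypothetical scheduler events: (timestamp in ms, pid, event, reason).
# "out" = the task leaves the CPU, "in" = the task is scheduled back on.
events = [
    (0, 101, "out", "io_wait"),
    (40, 101, "in", None),
    (50, 101, "out", "lock_wait"),
    (55, 101, "in", None),
    (10, 202, "out", "io_wait"),
    (15, 202, "in", None),
]

def off_cpu_time(events):
    """Sum off-CPU milliseconds per (pid, reason) by pairing each 'out' with the next 'in'."""
    pending = {}                 # pid -> (timestamp, reason) of the last 'out'
    totals = defaultdict(int)
    for ts, pid, kind, reason in sorted(events, key=lambda e: e[0]):
        if kind == "out":
            pending[pid] = (ts, reason)
        elif kind == "in" and pid in pending:
            start, why = pending.pop(pid)
            totals[(pid, why)] += ts - start
    return dict(totals)

off_cpu = off_cpu_time(events)
print(off_cpu)
```

Grouping by reason is what lets the off cpu view answer why a task was not running, rather than only how long it was away.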

For common problems such as network jitter, we make profiling observations at two stages of packet reception:

Hardware interrupt and software interrupt handling of the incoming packet
The system call through which the user-mode application process fetches the packet

How to Do Profiling Continuously

We adopt a client/server (C/S) architecture as a whole. For day-to-day troubleshooting, we only need to deploy agents, which are responsible for profiling, and view the data on the server side. The profiling data is also sliced by time: the agent periodically reads the data from the map and clears the samples of the previous cycle, so that data playback shows the profiling results for exactly the time period being viewed. Considering users' requirements for data security in the cloud environment, data can also be uploaded through SLS.
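The slicing step can be sketched as follows. This is a simplified model under stated assumptions, not the agent's actual code: the class name SliceCollector and its methods are hypothetical, and an in-memory Counter stands in for the BPF map. The point is that each drain returns only the samples of the finished cycle and leaves the map empty for the next one.

```python
from collections import Counter

class SliceCollector:
    """Accumulates stack counts and hands them out one time slice at a time."""

    def __init__(self):
        self._counts = Counter()   # stands in for the BPF map

    def record(self, comm, stack):
        self._counts[(comm, stack)] += 1

    def drain(self, slice_start, slice_end):
        """Return the samples of the finished slice and clear the map for the next one."""
        snapshot = {
            "start": slice_start,
            "end": slice_end,
            "stacks": dict(self._counts),
        }
        self._counts.clear()
        return snapshot

collector = SliceCollector()
collector.record("nginx", "main;recv")
collector.record("nginx", "main;recv")
first = collector.drain(0, 60)        # samples taken during seconds 0 to 60
collector.record("java", "run;gc")
second = collector.drain(60, 120)     # samples taken during seconds 60 to 120
print(first)
print(second)
```

Because the map is cleared on every drain, a sample can never be attributed to two slices, which is what keeps playback for a given period accurate.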

Instructions

You can use this feature in SysOM in two ways:

If you want to observe the system continuously, you can use the profiling function in monitoring mode. The corresponding path is: Monitoring Center → Machine IP → General dashboard → sysom-profiling.

If you want to obtain some conclusive information about profiling, you can use the diagnostic mode. The corresponding path is: Diagnostic Center → Scheduling Diagnostic Center → Application Profile Analysis.

This mode reports the CPU usage percentage of the top 10 applications together with their hotspot stacks, and it also reports the share of each hotspot stack within its application. Finally, it analyzes the hotspot stack to determine whether the hotspot lies in the application itself or in the OS.

The panel described above covers on cpu. Panels for the other profiling features are still being adapted. The full set of profiling functions is available by running raptor from the unified monitoring directory of SysAK. In addition to selecting the function, you can set the running mode.
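A minimal sketch of such a report, assuming the input is a list of per-stack sample counts. The data, the function name top_apps, and the convention that kernel frames carry a "_[k]" suffix are assumptions of this example, not a description of how SysOM stores or classifies stacks.

```python
from collections import defaultdict

# Hypothetical input: (process, folded stack, sample count). Kernel frames carry
# a "_[k]" suffix here; that tagging is an assumption of this sketch.
stacks = [
    ("nginx", "main;ngx_worker;recv;tcp_recvmsg_[k]", 60),
    ("nginx", "main;ngx_worker;ngx_http_parse", 20),
    ("java", "run;handle", 15),
    ("sshd", "main;select", 5),
]

def top_apps(stacks, n=10):
    """Return up to n rows: (app, CPU %, hottest stack, stack % within app, location)."""
    total = sum(count for _, _, count in stacks)
    per_app = defaultdict(list)
    for comm, stack, count in stacks:
        per_app[comm].append((count, stack))
    report = []
    for comm, entries in per_app.items():
        app_total = sum(count for count, _ in entries)
        hot_count, hot_stack = max(entries)
        leaf = hot_stack.split(";")[-1]          # innermost frame of the hottest stack
        where = "OS kernel" if leaf.endswith("_[k]") else "application"
        report.append((comm, 100.0 * app_total / total, hot_stack,
                       100.0 * hot_count / app_total, where))
    report.sort(key=lambda row: row[1], reverse=True)
    return report[:n]

report = top_apps(stacks)
for row in report:
    print(row)
```

Classifying by the innermost frame is only one possible rule; it is used here because the innermost frame is where the sampled CPU time was actually spent.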

Running Mode

Mode	Description
General Mode	Profiling is triggered regularly. Profiling is performed every five minutes by default.
Trigger Mode	Specify the metric threshold. This mode is triggered when metric exceptions occur.
Filter Mode	Specify the application or CPU. This mode is suitable for scenarios where abnormal applications are specified or problems only occur on specific CPUs that applications are bound to.
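The three modes can be thought of as three different conditions for taking a profile. The sketch below is illustrative only: ProfilingConfig, should_profile, and every field name and threshold value are hypothetical and do not correspond to raptor's actual options. Only the five-minute default interval comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class ProfilingConfig:
    mode: str = "general"              # "general", "trigger", or "filter"
    interval_s: int = 300              # general mode: every five minutes by default
    cpu_threshold: float = 90.0        # trigger mode: hypothetical threshold in percent
    apps: Optional[Set[str]] = None    # filter mode: process names to profile
    cpus: Optional[Set[int]] = None    # filter mode: CPU ids to profile

def should_profile(cfg, now_s, last_run_s, cpu_util, comm, cpu):
    """Decide whether to profile now, according to the configured running mode."""
    if cfg.mode == "general":
        return now_s - last_run_s >= cfg.interval_s
    if cfg.mode == "trigger":
        return cpu_util >= cfg.cpu_threshold
    if cfg.mode == "filter":
        app_ok = cfg.apps is None or comm in cfg.apps
        cpu_ok = cfg.cpus is None or cpu in cfg.cpus
        return app_ok and cpu_ok
    raise ValueError(f"unknown mode: {cfg.mode}")

general = ProfilingConfig()
trigger = ProfilingConfig(mode="trigger", cpu_threshold=80.0)
by_cpu = ProfilingConfig(mode="filter", cpus={3})

print(should_profile(general, now_s=600, last_run_s=300, cpu_util=10.0, comm="nginx", cpu=0))
print(should_profile(trigger, now_s=600, last_run_s=300, cpu_util=95.0, comm="nginx", cpu=0))
print(should_profile(by_cpu, now_s=600, last_run_s=300, cpu_util=10.0, comm="runc", cpu=3))
```

Filtering by CPU fits cases where the anomaly shows up only on the specific cores that an application is bound to.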

Kernel Version Supported by the Profiling Feature

CentOS 7.6 or later, Alinux2/3, Anolis, and ARM architecture

Related Cases

1. A user's CPU usage occasionally spikes while other machines running the same service stay normal. The user suspects a problem with ECS resources.

The ECS monitoring graph is listed below:

Because the jitter is intermittent, it is difficult to capture the scene by conventional means. After the system had been profiled for one day, we found that Nginx was the application occupying the most CPU on the system during the jitter and that it was mainly busy receiving and sending packets. The problem was solved after the user optimized the distribution of business traffic requests.

2. The sys metric rises inexplicably although the business load on the user's system has not increased.

The user monitoring graph is listed below:

Profiling the system showed that a cachestat script had enabled ftrace. Earlier developers had deployed the script to locate a problem and never stopped it. After the script was stopped, the system returned to normal. Because ftrace had not been enabled through the sysfs directory, the ftrace settings under sysfs were unchanged, so checking sysfs alone did not reveal the cause.

3. The CPU metric of a user's machine is abnormal, ssh login fails, and the whole machine crashes.

The user monitoring graph is listed below:

Based on experience, problems like these (ssh login failure, abnormal CPU metrics, and memory pressure) are generally suspected to be caused by the system performing memory reclaim. However, it is usually impossible to capture first-hand evidence from the scene, so the suspicion is not convincing. With profiling deployed for one day, we captured the scene as it happened. At 13:57, CPU usage rose abruptly to an extremely high level. The system behavior showed that the kernel was occupying the CPU for memory reclaim, and the user was advised to optimize the application's memory usage. This can be regarded as a classic issue in the cloud environment.

4. Pings to a user's machine occasionally show second-level delay jitter.

Pings to the host occasionally show second-level delay jitter, and the sys usage of individual CPUs occasionally reaches 100%.

Because the jitter is short and confined to individual CPUs, we profiled the execution on one specific CPU. At the moment the network jitter occurred, runc was reading /proc/cpuinfo, and there were hotspots in the cross-core "smp_call_function_single" call, which is consistent with the occasional high sys usage. The eventual fix was for containers to cache and back up the cpuinfo information, which reduces access to /proc/cpuinfo and relieves the pressure on the system. Access to /proc/cpuinfo has also been partially optimized in newer kernel versions. Please see [4] for the relevant patch.
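The caching idea can be sketched as follows. This is an illustration of the general technique, not the change that was made in the container runtime: the class name CachedFile, the one-hour lifetime, and the use of a temporary file in place of /proc/cpuinfo are all assumptions of this example.

```python
import os
import tempfile
import time

class CachedFile:
    """Serve a file's contents from memory and re-read it only after ttl_s seconds."""

    def __init__(self, path, ttl_s=3600.0, clock=time.monotonic):
        self.path = path
        self.ttl_s = ttl_s
        self.clock = clock
        self._data = None
        self._loaded_at = None
        self.reads = 0           # number of real file reads, kept for illustration

    def get(self):
        now = self.clock()
        if self._data is None or now - self._loaded_at >= self.ttl_s:
            with open(self.path) as f:
                self._data = f.read()
            self._loaded_at = now
            self.reads += 1
        return self._data

# Demonstrated on a temporary file; a caller could point this at /proc/cpuinfo instead.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("processor : 0\n")
demo = CachedFile(tmp.name)
for _ in range(1000):
    demo.get()
print(demo.reads)
os.unlink(tmp.name)
```

With a cache like this, repeated lookups are served from memory, so only the first call within the lifetime touches the underlying file.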

Summary

SysOM is committed to building an automated O&M platform that integrates host management, configuration deployment, monitoring and alerting, exception diagnosis, security audit, and other functions. This article introduced SysOM profiling and presented only some of the cases. The related functional modules have been verified and are being open-sourced.

Please see our Gitee open-source O&M repository for more O&M technologies:

SysOM: https://gitee.com/anolis/sysom

SysAK: https://gitee.com/anolis/sysak

Coolbpf: https://gitee.com/anolis/coolbpf

References

[1] Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers

[2] Observability Engineering

[3] http://www.brendangregg.com/perf.html

[4] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2414427.html
