Case Studies for System Performance Monitoring Tool ssar

By Guangcheng (Wen Maoquan) and Wuya (Li Jingxuan), core members of the OpenAnolis Tracking Diagnosis Technology SIG.

Introduction

ssar is a system performance monitoring tool developed by Alibaba and contributed to OpenAnolis. The tool has a small overhead and has been deployed on a large scale across multiple business units in the company for more than one year.

ssar tools have recently been open-source in leading operating system open source community and innovation platform – OpenAnolis. All are welcome to visit OpenAnolis to download the source code. The compilation and installation steps are provided together with the source code, and the specific "User Reference Manual" is also available.

The ssar tool solves many practical production problems in daily use. For a quick understanding of the use of ssar, here we introduce typical usage through 10 specific user scenarios.

Scenario 1: Process-level Factors when Locating CPU and Memory Fluctuations

In daily operation and maintenance, the CPU and memory of the whole machine often increase, which causes system instability. It is necessary to be able to review historical data to understand process-level CPU and memory profiles during the fluctuation periods.

ssar allows detailed view on the trend of CPU and memory changes of the whole machine. User, system, and anonpages are the indicator names used by the kernel to refer to user space CPU usage, kernel space CPU usage, and anonymous page memory.

$ ssar -i 1 -o user,system,anonpages
collect_datetime       user/s   system/s  anonpages
2021-09-08T18:36:00    787.38     145.30    6590885 
2021-09-08T18:37:00    299.93     230.25   12808418 
2021-09-08T18:38:00    116.58      79.43   14119096 
2021-09-08T18:39:00    401.88     101.75   10580258 
2021-09-08T18:40:00    332.68     279.62    8759586

Use ssar to view process-level CPU impact factors. The procs subcommand displays a list of the top 3 processes in kernel state CPU usage during the 2-minute duration from 2021-09-08T18:38:00 to 2021-09-08T18:40:00. Pucpu, pscpu, and pcpu indicate, respectively, the user space CPU utilization, kernel space CPU utilization, and total CPU utilization for this process.

$ ssar procs -f 2021-09-08T18:40:00 -r 2 -o pid,ppid,pcpu,pucpu,pscpu,cmd -k pscpu -l 3
    pid    ppid    pcpu   pucpu   pscpu cmd             
   1537    1536    95.3    65.0    30.3 java
  29349       1     0.3     0.1     0.2 sresar 
     11       2     0.1     0.0     0.1 migration/0

Use ssar to view the process-level memory impact factors. The procs subcommand displays a list of the 3 processes with surging memory requests during the 2-minute duration from 2021-09-08T18:36:00 to 2021-09-08T18:38:00. rss_dlt indicates the delta-increment of rss memory.

$ ssar procs -f 2021-09-08T18:38:00 -r 2 -o pid,ppid,rss,rss_dlt,nlwp,cmd -k rss_dlt -l 3
  pid     ppid        rss    rss_dlt nlwp cmd              
197779       1      14624       9472    1 syslog-ng       
185017       1     136328       5400    1 systemd-journal 
27495    77722        360        360    1 sleep

Scenario 2: Tracking CPU and Memory Fluctuations for a Single Process

Through the preceding method, the process that affects the fluctuation of the whole machine index is discovered, and the CPU and memory changes of a specific process can be further tracked by proc subcommands. The left angle bracket (<) indicates that the sresar process has not started at the 2021-08-09T11:39:00 time, and the right angle bracket (>) indicates that the sresar process has ended at the 2021-08-09T11:47:00 time.

$ ssar proc -p $(pidof sresar)  -i1 -o collect_datetime,rss,min_flt,cmd
collect_datetime          rss     min_flt cmd     start_datetime
2021-08-09T11:39:00          <          < <       <
2021-08-09T11:40:00       1524        631 sresar  2021-08-09T11:39:18
2021-08-09T11:41:00       1708       1231 sresar  2021-08-09T11:39:18
2021-08-09T11:42:00       3552       1748 sresar  2021-08-09T11:39:18
2021-08-09T11:43:00       3552       1748 sresar  2021-08-09T11:39:18
2021-08-09T11:44:00       3552       1749 sresar  2021-08-09T11:39:18
2021-08-09T11:45:00       3552       1749 sresar  2021-08-09T11:39:18
2021-08-09T11:46:00       3552       1749 sresar  2021-08-09T11:39:18
2021-08-09T11:47:00          >          > >       >

$ ssar proc -p $(pidof sresar) -i 1 --mem
$ ssar proc -p $(pidof sresar) -i 1 –cpu

Scenario 3: Recording System Load Information on 5-second granuality

ssar of load5s sub-command not only improves the accuracy of load to 5s but also creatively adds load5s indicators. load5s is more precise than load1, load5, and load15 indicators and can better reflect the system profile at the time of load pressure. The legacy load1 index still has a certain lag after the pressure on system load has subsided. In comparison, the load5s index provided by ssar refectly the machine load pressure more trufully.

$ ssar load5s
collect_datetime       threads    load1   runq  load5s
2021-08-09T14:17:35       1047     0.28      2      1
2021-08-09T14:17:40       1058     0.25      1      0
2021-08-09T14:17:47       3047   113.46   1453   1414
2021-08-09T14:17:53       3053   264.62   2002   2001
2021-08-09T14:17:59       3053   403.74   2002   2002
2021-08-09T14:18:05       1049   371.41      1      0
2021-08-09T14:18:10       1055   341.67      1      0
2021-08-09T14:18:15       1048   314.31      1      0

$ ssar load5s -f 2021-07-15T19:00:00 -r 300 -z

Scenario 4: Recording the Thread D Status of Abnormal Stack

If the load5s value is above a certain threshold of CPU cores, the collection of load details will be triggered, in which the kernel call stack is sampled and recorded for D-status threads. Use load2p to display detailed D status call stack information.

$ ssar load2p -c 2021-09-08T17:25:00 --stackinfo
# The system problem can be identified through call stack information.
page_fault
  do_page_fault
    __do_page_fault
      handle_mm_fault
        __do_fault
          filemap_fault
            __lock_page_or_retry
              wait_on_page_bit_killable
                sleep_on_page_killable

Scenario 5: Diagnose the High-level Memory Shortage of the Whole Machine

Problems related to insufficient system memory constitute a category of most common problems in daily operation and maintenance. Sometimes we will see logs such as “page allocction failure: order:4" from the system logs. This is a manifestation of insufficient memory available. With memory fragmentation, the kernel space is unable to obtain high-level memory.

To identify such problems, it is necessary to diagnose memory profile of the whole machine to understand the memory insufficiency. The anonpages index is an anonymous page. The memfree index is the free memory of the whole machine. The pgsteal_kswapd index is the amount of asynchronous memory reclaimed per second, and the pgsteal_direct is the amount of direct memory reclaimed per second.

$ ssar -i 1 -o anonpages,memfree,pgsteal_kswapd,pgsteal_direct
collect_datetime     anonpages    memfree pgsteal_kswapd/s pgsteal_direct/s 
2021-09-09T10:40:00   11151058    1716004             0.00             0.00 
2021-09-09T10:41:00   12008117    1615129             0.00             0.00 
2021-09-09T10:42:00   13993906    1404292             0.00             0.00 
2021-09-09T10:43:00   15869526    1233360             0.00             0.00 
2021-09-09T10:44:00   17135683    1083402         43821.88             0.00 
2021-09-09T10:45:00   19488155     829303         45080.95             0.00 
2021-09-09T10:46:00   20787968     712429         10452.02           455.90 
2021-09-09T10:47:00   21725142     614698             0.00          2313.68

It can be seen that with the growth of the anonpages memory of the system, the memfree free memory decreases. After reaching a certain threshold, asynchronous memory reclamation pgsteal_kswapd starts and the reduction of free memory causes direct memory reclamation pgsteal_direct.

Under the premise of the serious shortage of free memory, there will be kernel high-level memory shortage. From the data, it can be seen that the order4 memory inventory at the 2021-09-09T10:46:00 time is 0.

$ ssar -i 1 --node0
collect_datetime        order0     order1     order2     order3     order4     order5
2021-09-09T10:40:00     139116     429294     226754      51287      37276      44181
2021-09-09T10:41:00      20070      89672      37399      51721      37277      44181
2021-09-09T10:42:00      31154      71323      36612      49952      37277      44181
2021-09-09T10:43:00      73309      93733      36843      49249      36256      44157
2021-09-09T10:44:00      76885      73331      36409      48408      35477      42181
2021-09-09T10:45:00      46915     140381      40744      47815      34313      44171
2021-09-09T10:46:00      28346      16054       6420        141          0          0
2021-09-09T10:47:00      34694      29549      16750       3042       1481        319

Scenario 6: Positioning the Cgroup Level of Memory Bumps

When there is insufficient memory at the cgroup level, there is also a memory bump problem, which can cause serious system problems.

$ tsar2 --io -i 1 -I nvme0n1p1 -s rs,rarqsz,util
Time           nvme0n1p1 nvme0n1p1 nvme0n1p1
Time                  rs    rarqsz      util
09/09/21-14:50     66.1K     10.25    100.00
09/09/21-14:51     71.1K     13.25    100.00
09/09/21-14:52     63.8k     10.00    100.00
09/09/21-14:53     57.0k     14.50    100.00
09/09/21-14:54     64.0k     14.00    100.00

With the superior performance of nvme disks, sometimes the disk util can surge to 100% and the corresponding disk read IO can be observed to be in the tens of thousands, while a single read IO size can be just over 10 Kb in size. One of the most common causes of this problem is cgroup-level memory bumps. Through the kernel index pgmajfault, it can be seen that the number of main missing page interrupts per second is high while the IO is full, which is the direct cause of the system IO full.

$ ssar -i 1 -o pgmajfault
collect_datetime    pgmajfault/s
2021-09-09T14:50:00     43491.31 
2021-09-09T14:51:00     44432.01 
2021-09-09T14:52:00     45464.78 
2021-09-09T14:53:00     43991.22 
2021-09-09T14:54:00     45922.34

$ ssar procs -f 2021-09-09T14:51:00 -r 1 -o pid,rss,maj_flt,maj_dlt,cmd -k maj_dlt -l 5
    pid        rss    maj_flt    maj_dlt cmd             
  58532    8240736  126575484    1305199 java 
  46873    8237292  173405708    1257924 java 
  51357     577796      13201         63 python 
 253136     566352      11687         46 syslog-ng
  46025    1229520        478         42 bash

Further, through the ssar procs subcommand, you can locate the 2 java processes pid 58532 and 46873 that have a large number of main and missing page interrupts. The reason for a large number of main and missing page interrupts in java processes is that the memory rss process is close to the memory.limit_in_bytes value set by the cgroup. Therefore, it results in insufficient memory space for the clean state code segments in the cgroup group to load all segments. Therefore, code segments can only be discarded and read from the disk during program execution.

Scenario 7: Network TCP Retransmission of Extended Information

At the host network level, routine O&M also suffers from problems such as network packet loss, retransmission, and disorder. Several sets of more in-depth network metrics are also provided here for diagnosing network problems.

$ tsar2 --retran
Time           --retran--- --retran--- --retran-- --retran-- -retran- ----retran----
Time           RetransSegs FastRetrans LossProbes RTORetrans RTORatio LostRetransmit
08/09/21-16:45       49.44       34.41      16.02       6.41    12.97           0.26
08/09/21-16:50       44.07       29.03      16.84       8.21    18.63           0.13
08/09/21-16:55       37.58       18.72      10.91      11.44    30.44           0.02

$ tsar2 --tcpofo
Time           -tcpofo-- ---tcpofo--- ---tcpofo--- -tcpofo- tcpofo- -tcpofo-
Time           OfoPruned DSACKOfoSent DSACKOfoRecv OFOQueue OFODrop OFOMerge
08/09/21-16:45      0.00         0.00         0.00   251.86    0.00     0.00
08/09/21-16:50      0.00         0.00         0.00   182.35    0.00     0.00
08/09/21-16:55      0.00         0.00         0.00   115.70    0.00     0.00

In terms of tcp retransmission, we provide many more detailed indicators than the tcp retransmission rate. Modern Linux systems have a variety of retransmission methods in addition to 200ms timeout retransmission to accelerate network jitter or recovery under abnormal conditions. Here, we classify the TCP retransmission in detail, including fast retransmission, tail packet retransmission, and overtime retransmission. Different retransmissions have different impacts on the service. We may observe that fast retransmission or tail packet retransmission more often, but overtime retransmission is not common, which does not affect the service. However, once there is a large number of timeout retransmissions, it often means real network exceptions.

--tcpofo allows us to see some disorder situations that we currently receive. Disorder and retransmission are often associated. Generally, when we receive disorder messages, we see OFOQueue increase. However, once we see OFODrop, the received disorder messages are lost. At this time, it is likely that there are too many disorder messages in our queue, or our application process fails to collect packets from the receive queue of tcp in time.

The preceding data can be combined with ssar historical data to find some retransmission or disorder anomalies, which can help us locate the causes of some tcp problems.

$ tsar2 --tcpdrop
Time           ----tcpdrop----- --tcpdrop-- ----tcpdrop---- ----tcpdrop---- --tcpdrop--
Time           LockDroppedIcmps ListenDrops ListenOverflows PrequeueDropped BacklogDrop
08/09/21-16:45             0.00        0.28            0.28            0.00        0.00
08/09/21-16:50             0.00        1.97            1.97            0.00        0.00
08/09/21-16:55             0.00       21.96           21.96            0.00        0.00

$ tsar2 --tcperr
Time           ---tcperr--- ---tcperr--- ---tcperr--- ---tcperr---- --tcperr---
Time           RenoFailures SackFailures LossFailures AbortOnMemory AbortFailed
08/09/21-16:45         0.00         0.00         0.00          0.00        0.00
08/09/21-16:50         0.00         0.00         0.00          0.00        0.00
08/09/21-16:55         0.00         0.00         0.00          0.00        0.00

Some typical packet losses and exceptions of TCP can be observed through --tcpdrop/--tcperr. For example, we can see that there are more ListenDrops and ListenOverflows at 08/09/21-16:55, which indicates that there are more TCP SYN connection requests, resulting in the TCP accept queue being full. This indicates that the new connection requests arriving at the server are faster than the server accept() consuming new connection requests. A large number of connection requests may have occurred, or the server has not called the accept() for a long time to consume the SYN requests in the accept queue.

Scenario 8: Custom Collection of Any System Standard

Driver developers are likely to want to record some performance data in their driver modules. ssar tool provides a simple and complete collection scheme for this scenario. Developers only need to expose the performance metrics in kernel modules to the user space by using pseudo files in /proc/ or /sys/. The data can be a cumulative value or an instantaneous one. For example, there are 2 values in mydev pseudo file. The indicator1 on the left is the cumulative value, and the indicator2 on the right is the instantaneous value.

$ cat /proc/mydev
104377580      7252

The ssar provides a simple way to configure and collect data.

$ cat /etc/ssar/sys.conf 
[file]
default=[
    {src_path='/proc/mydev', turn=true, cfile='mydev'},
]

$ sudo systemctl restart sresar               # Restart the collection process configuration file

After modifying the configurations, you can run the following command to obtain historical data.

$ ssar -i 1 -f 2021-09-09T17:20:00 -r 60 -o 'metric=d:cfile=mydev:line=1:column=1:alias=indicator1;metric=c:cfile=mydev:line=1:column=2:alias=indicator2'
collect_datetime     indicator1/s  indicator2
2021-09-09T17:30:00        884.40        7192
2021-09-09T17:31:00        652.15        7078
2021-09-09T17:32:00        799.73        7676
2021-09-09T17:33:00        708.37        7953
2021-09-09T17:34:00        596.47        7738

The command can be changed to obtain real-time data as well:

$ ssar -i 1s -f +10 -o 'metric=d|cfile=mydev|line=1|column=1|alias=indicator1;metric=c|cfile=mydev|line=1|column=2|alias=indicator2'
collect_datetime     indicator1/s  indicator2
2021-09-09T17:40:00        668.50        7004
2021-09-09T17:40:01        480.38        6115
2021-09-09T17:40:02        651.55        6496
2021-09-09T17:40:03        789.15        6433
2021-09-09T17:40:04        867.93        7508

ssar tool does not require installation, simply copy ssar program to obtain real-time data.

$ ./ssar -i 1s -f +10 -o 
'metric=d|src_path=/proc/mydev|line=1|column=1|alias=indicator1;metric=c|src_path=/proc/mydev|line=1|column=2|alias=indicator2'

If multiple rows or columns have similar meanings, you can refer to the following 2 usages.

$ ssar -o 'metric=d:cfile=stat:line=2:column=2-
11:alias=cpu0_{column};' # cpu0 Data in columns 2 to 11 
$ ssar -o 'metric=d|cfile=stat|line=2-
17|column=5|alias=idle_{line};'   # cpu0 to 15 of idle

ssar supports the case when the indicator is not an integer, but in the form of real number or strings, for example:

$ ssar -o 'metric=c:cfile=loadavg:line=1:column=1:dtype=float:alias=load1'
# load1real data
$ ssar -o 'metric=c|cfile=loadavg|line=1|column=4|dtype=string|alias=rp'
# rp string information

If you wish to collect mydev, ssar also provides two configuration parameters load5s_flag and proc_flag to disable load and process-level metrics collection. This saves unnecessary overhead in disk storage.

Scenario 9: Diagnosis of Specific Uneven CPU Interruption

Excessive consumption of softirq resources for a specific CPU affects user-space resource usage on this CPU. Therefore, when interruptions are unevenly distributed, the execution of user-space processes will be affected. For diagnosis, one would first need to understand whether softirq resource consumption is balanced across CPUs.

$ tsar2 --cpu -I cpu,cpu0,cpu1,cpu2,cpu3,cpu4,cpu5,cpu6,cpu7 -s sirq -i 1
Time           --cpu-- -cpu0-- -cpu1-- -cpu2-- -cpu3-- -cpu4-- -cpu5-- -cpu6-- -cpu7--
Time              sirq    sirq    sirq    sirq    sirq    sirq    sirq    sirq    sirq
09/09/21-21:15    1.44    0.01    0.01    0.01    2.28    0.02   15.69    2.72    5.57
09/09/21-21:16    1.98    0.01    0.01    0.01    2.56    0.03   18.40    2.56    8.93
09/09/21-21:17    2.61    0.01    0.01    0.01    2.31    0.09   17.28    2.21    3.59
09/09/21-21:18    1.84    0.01    0.01    0.01    1.97    0.03   12.16    2.04    8.75
09/09/21-21:19    1.32    0.01    0.01    0.01    1.66    0.03   16.52    1.69    6.77

It can be seen from the command output that the softirq is not evenly distributed among various CPUs, and the cpu5 softirq is more than 10%. For multi-core CPU, you can use the tsar2 cputop subcommand to sort the CPU according to its softirq footprint.

Tsar2 irqtop can quickly locate the high softirq resource ratio of cpu5 because as a result of 158 interruption named virtio-output.1, and shows the average number of interruptions per second.

$ tsar2 irqtop -i 1 -C 5   
Time           --N1--- N1- N1- --------N1-------- --N2--- N2- N2- --------N2------
Time             count irq CPU name                 count irq CPU name             
09/09/21-22:58  981.80 158 5   virtio-output.1     304.02 154 5   nvme0q0, nvme0q1
09/09/21-22:59  765.32 158 5   virtio-output.1     234.97 154 5   nvme0q0, nvme0q1
09/09/21-23:00  763.63 154 5   nvme0q0, nvme0q1    393.88 158 5   virtio-output.1
09/09/21-23:01  531.87 158 5   virtio-output.1     225.03 154 5   nvme0q0, nvme0q1
09/09/21-23:02  557.23 158 5   virtio-output.1     150.25 154 5   nvme0q0, nvme0q1

By default, collection of historical data is not enabled for CPU-level interruptions. To enable it, one should change the interrupts configuration to the following.

{src_path='interrupts', turn=true, gzip=true},

If you need to obtain real-time data, append the -l option to the preceding command.

$ tsar2 --cpu -I cpu,cpu0,cpu1,cpu2,cpu3,cpu4,cpu5,cpu6,cpu7 -s sirq -i 1 -l
$ tsar2 irqtop -i 1 -C 5 -l

Scenario 10: Supporting Flexible Data Output and Analysis

Not only does ssar provide rich data metrics, it also produces output that is easy to use and parse. All data output of ssar supports json format, which facilitates parsing with high-level languages such as python. This can be done with a --api option in the command.

$ ssar -r 5 -o user,system,memfree,anonpages --api
[{"anonpages":"280824","collect_datetime":"2021-09-10T12:00:00","memfree":"181753024","system":"-","user":"-"},{"anonpages":"280812","collect_datetime":"2021-09-10T12:05:00","memfree":"181752704","system":"2.35","user":"1.22"}]

For shell and other script parsing, ssar also offers convenient options. For example, the -H option can be used to hide various header information.

$ ssar -r 5 -o user,system,memfree,anonpages -H
2021-09-10T12:00:00          -          -  81754836     280296 
2021-09-10T12:05:00       1.29       2.57  81756712     280228

Supplying the -P option will hide symbols such as "-", "*", and "<" that appear in the data item. This allows the upper-layer caller-script to reduce handling of abnormal data.

$ ssar -r 5 -o user,system,memfree,anonpages -H -P
2021-09-10T12:10:00       1.31       2.58  81747740     281676

code base address of the ssar tool: https://gitee.com/anolis/ssar.git

About the OpenAnolis SIG

SIG is open and strives to make deliverables part of community distribution. It is governed by core members in the group and can communicate through mailing lists with members in the group. At present, there are more than 20 SIG in the OpenAnolis, including trace diagnosis technology SIG, commercial secret software stack, high-performance storage technology SIG, Java language and virtual machine SIG, Cloud Kernel, OceanBase SIG.

You are welcome to join the tracking diagnosis technology SIG: https://openanolis.cn/sig/tracing

Community

Case Studies for System Performance Monitoring Tool ssar

Introduction

Scenario 1: Process-level Factors when Locating CPU and Memory Fluctuations

Scenario 2: Tracking CPU and Memory Fluctuations for a Single Process

Scenario 3: Recording System Load Information on 5-second granuality

Scenario 4: Recording the Thread D Status of Abnormal Stack

Scenario 5: Diagnose the High-level Memory Shortage of the Whole Machine

Scenario 6: Positioning the Cgroup Level of Memory Bumps

Scenario 7: Network TCP Retransmission of Extended Information

Scenario 8: Custom Collection of Any System Standard

Scenario 9: Diagnosis of Specific Uneven CPU Interruption

Scenario 10: Supporting Flexible Data Output and Analysis

About the OpenAnolis SIG

Read previous post:

Read next post:

OpenAnolis

You may also like

Comments

OpenAnolis

Related Products

Application Real-Time Monitoring Service

Elastic High Performance Computing Solution

Elastic High Performance Computing

Remote Rendering Solution