Five Classic Problems in Process Analysis

By Yahai

"The Third Usage Pattern" of Tracing Analysis

When it comes to tracing analysis, we will naturally think of using a call chain to troubleshoot exceptions in a single request or using pre-aggregation process statistical metrics for service monitoring and alerting. Tracing analysis also has a third use. It can delimit problems faster than the call chain. It can implement custom diagnosis more flexibly compared with the pre-aggregation monitoring chart. Thus, it can be used for post-aggregation analysis based on process detailed data (or link analysis for short).

Link analysis is based on the stored full-process detail data, providing a free combination and filtering of conditions and aggregation dimensions for real-time analysis. It can meet the custom diagnosis requirements of different scenarios. For example, we might want to see the time series distribution of slow calls that take more than three seconds, the distribution of error requests on different machines, or the traffic changes of VIP customers. This article will introduce how to quickly locate five classic online problems through process analysis. This will help you understand the usage and value of process analysis.

Five Classic Problems in Process Analysis

The usage of link analysis based on post-aggregation is very flexible. This article lists only the five most typical scenarios, but you are welcome to share your opinions on other scenarios.

Uneven Traffic: Configuration Errors of Load Balancing Cause a Large Number of Requests Sent to a Small Number of Machines, Causing "Hot Spots" to Affect Service Availability. How Can We Tackle This Issue?

The problem of hot spot breakdown caused by uneven traffic can easily cause service unavailability. There are too many such cases in the production environment. These cases are caused by configuration errors of load balancing, registry exceptions causing services of restart nodes to be unable to launch, or abnormal DHT hash factors.

The biggest risk of uneven traffic is whether the hot spot can be found in time. It is reflected in slower service response or error reporting. Traditional monitoring cannot directly reflect the hot spot issue. Therefore, most of us do not consider this factor the first time, wasting valuable emergency treatment time and causing the failure impact to spread continuously.

Through link analysis, process data is recorded by the IP group to help quickly find out which machines call requests are distributed on, especially to find out the traffic distribution changes before and after the problem occurs. If a large number of requests are suddenly concentrated on one or a small number of machines, it is likely to be a hot spot issue caused by uneven traffic. With the change event when the problem occurred, we can quickly locate the error change that caused the fault and roll back in time.

Standalone Failure: Damage of Network Interface Controller, CPU Oversold, Full Disk, and Other Single-Machine Failures May Cause Failure or Timeout of Some Requests. How Can We Troubleshoot It?

Standalone failures occur all the time, especially in the core cluster due to a large number of nodes. From the perspective of statistical probability, it is almost an inevitable event. A standalone failure does not cause a large area of service unavailability but causes a small number of user request failures or timeouts. It continuously affects the user experience and brings question-and-answer costs. Therefore, such problems need to be handled in a timely manner.

Standalone failure can be divided into two types: host failure and container failure. (It can be divided into Node and Pod in a Kubernetes environment.) For example, CPU oversold and hardware failure are at the host level and affect all containers, but failures (such as full disk and memory overflow) only affect a single container. Therefore, when troubleshooting standalone failure, we can analyze it from the dimensions of the host IP and container IP separately.

When faced with such problems, we can filter out abnormal or timeout requests through link analysis. Then, we perform aggregation analysis according to the host IP or container IP to quickly determine whether there is a standalone failure. If abnormal requests are concentrated on a single machine, we can try to replace the machine for quick recovery. Alternatively, we can check various system parameters of the machine, such as whether the disk space is full or the CPU steal time is too high. If abnormal requests are distributed on multiple machines, the possibility of standalone failure can be ruled out. You can focus on analyzing whether downstream dependent services or program logic are abnormal.

Slow Interface Governance: How Can We Quickly Tease out the Slow Interface List and Solve the Performance Bottleneck before the Launch of New Applications or Promotion Activities?

A systematic performance tuning is usually required when a new application is launched or when making preparations for promotion activities. The first step is to analyze the performance bottlenecks in the current system and tease out the slow interface list and its frequency.

At this time, the calls with time consumption greater than a certain threshold can be filtered out through link analysis and grouped for recording according to the interface name. This way, the list and rules of slow interfaces can be quickly located. Then, we can govern the slow interfaces with the highest frequency one by one.

After finding the slow interfaces, we can locate the root causes of slow calls by combining data from the relevant call chain, method stack, and thread pool. Common causes include the following categories:

The connection pool of the database or microservice is too small, and a large number of requests are in the get connection state. You can increase the maximum number of threads in the connection pool to solve this.
N +1 Issue: For example, if an external request makes hundreds of internal database calls, fragmented requests can be merged to reduce network transmission time.
If the data in a single request is too large, the network transmission and deserialization time will be too long, and FGC will likely occur. You can change a full query to a paged query to avoid requesting too much data at a time.
Log Framework "Hot Lock": You can change the synchronous output of logs to asynchronous output.

Business Traffic Statistics: How Can We Analyze Traffic Changes and Service Quality for Reinsurance Customers and Channels?

In the production environment, services are usually standardized, while businesses need to be classified and graded. For the same order service, we need to classify it by category, channel, user, and other dimensions to achieve refined operations. For example, for offline retail channels, the stability of every order and POS machine may trigger public opinion. Service-level Agreement (SLA) requirements for offline channels are much higher than for online channels. How do we accurately monitor the traffic status and service quality of offline retail processes in a common e-commerce service system?

Here, we can use custom attributes filtering and statistics of link analysis to achieve low costs. For example, we put a label {"attributes.channel": "offline"} on the portal service for offline orders. Then, we can add labels according to different stores, user groups, and product categories separately. Finally, we can quickly analyze the traffic trend and service quality of each type of business scenario by setting attributes.channel=offline to do filtering and using Group By statement to group different business tags and calculate call times.

Gray Release Monitoring: 500 Machines Are Launched in Ten Batches. How Can We Quickly Determine Whether There Is Any Abnormality after the First Batch of Gray Release?

The capability of gray release, monitoring, and rollback is an important criterion for ensuring online stability. Among them, batch gray change is the key means to reducing online risk and controlling the explosion radius. Once we find the service status of the gray batch abnormal, we should roll back in time instead of continuing to release. However, many failures in the production environment are caused by the lack of effective gray monitoring.

For example, when the microservice registry is abnormal, the republished machine cannot be launched for service registration. Due to the lack of gray monitoring, all the restarted machines of the first several batches fail to register, resulting in all traffic being centrally routed to the last batch of machines. However, the overall traffic and duration of the application monitoring do not change significantly until the last batch of machines fails to register. The entire application enters a completely unavailable state, causing serious online failures.

In the case above, if you label the version of different machine traffic {"attributes.version": "v1.0.x"} and use link analysis to group and record attributes.version, you can clearly distinguish the traffic changes and service quality of different versions before and after the release. There will not be gray batch exceptions masked by global monitoring.

Limits

Although link analysis is very flexible and can meet the custom diagnostic requirements of different scenarios, it also has several usage constraints:

The cost of analysis based on process detail data is high. The premise of link analysis is to report and store process detail data as completely as possible. If the sampling rate is relatively low, causing the detailed data to be incomplete, the effect of link analysis will be significantly affected. Edge data nodes can be deployed in user clusters for temporary data caching and processing to reduce full storage costs, reducing cross-network reporting overhead. We can also adopt separate storage of hot and cold data on the server, hot storage for full link analysis, and cold storage for slow and invalid diagnosis.
Post-aggregation analysis has high-performance overhead and low concurrency and is inapplicable to alerting. Link analysis performs full data scanning and statistics in real-time. Its query performance overhead is much greater than the pre-aggregation statistical metrics. Therefore, it is not suitable for highly concurrent alert queries. We need to push the post-aggregation analysis statement to the client for custom metric statistics in combination with the custom metric feature. This is to support alert and dashboard customization.
Combine custom tag tracking to maximize the value of link analysis. Link analysis is different from standard pre-aggregation metrics of application monitoring. Users need to manually add tracking points and mark tags for tags in many custom scenarios. This can most effectively distinguish different business scenarios and implement an accurate analysis.

Link Analysis Facilitates APM

Link data contains rich values. Traditional call chains and service views are just two classic usages in fixed mode. Link analysis based on post-aggregation can fully unleash the flexibility of diagnosis, meeting the requirements of custom diagnosis in any scenario and dimension. Combined with custom indicator generation rules, it can significantly improve the precision of monitoring and alert and facilitate your Application Performance Management (APM). We welcome everyone to explore, experience, and share!

Tracing Analysis Product Page

Community

Five Classic Problems in Process Analysis

"The Third Usage Pattern" of Tracing Analysis

Five Classic Problems in Process Analysis

Uneven Traffic: Configuration Errors of Load Balancing Cause a Large Number of Requests Sent to a Small Number of Machines, Causing "Hot Spots" to Affect Service Availability. How Can We Tackle This Issue?

Standalone Failure: Damage of Network Interface Controller, CPU Oversold, Full Disk, and Other Single-Machine Failures May Cause Failure or Timeout of Some Requests. How Can We Troubleshoot It?

Slow Interface Governance: How Can We Quickly Tease out the Slow Interface List and Solve the Performance Bottleneck before the Launch of New Applications or Promotion Activities?

Business Traffic Statistics: How Can We Analyze Traffic Changes and Service Quality for Reinsurance Customers and Channels?

Gray Release Monitoring: 500 Machines Are Launched in Ten Batches. How Can We Quickly Determine Whether There Is Any Abnormality after the First Batch of Gray Release?

Limits

Link Analysis Facilitates APM

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Microservices Engine (MSE)

ACK One

Big Data Consulting for Data Technology Solution

Cloud Migration Solution