Locate the cause of high response latency by using access logs - Alibaba Cloud Service Mesh

After sidecar proxies are injected into workloads, they intercept and route traffic according to the configured policies, which adds a small amount of processing overhead to each request. When nodes have adequate resources, this overhead has a negligible effect on concurrent request processing. When response latency exceeds expectations, use the timing fields in the Envoy access logs to isolate the source.

This guide walks through a two-step diagnostic process:

1. Compare `duration` values across components to identify which one introduces the delay.
2. Examine detailed timing fields to determine whether slow network transmission or slow upstream processing is the root cause.

Access log timing fields

The following Envoy access log fields are used throughout this guide:

Field	Description
`duration`	Total time consumed by a data plane component to process a request, from receiving the request through sending the complete response
`request_duration`	Time to receive the request from the downstream node
`request_tx_duration`	Time to forward the request to the upstream service
`response_duration`	Time from sending the request to receiving the first byte of the response
`response_tx_duration`	Time to forward the response to the downstream node
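As an illustration of how these fields might be pulled out of a log line, the following sketch parses one access log entry. It assumes the access log is emitted as JSON and that the keys match the field names in the table above; both are assumptions about your log configuration, and the sample line is hypothetical. Envoy reports these durations in milliseconds.

```python
import json

# Timing fields discussed in this guide. The key names assume a JSON access
# log that uses exactly these names; adjust them to your log format.
TIMING_FIELDS = (
    "duration",
    "request_duration",
    "request_tx_duration",
    "response_duration",
    "response_tx_duration",
)


def parse_timings(log_line: str) -> dict:
    """Return the timing fields of one JSON access log line as integers (ms).

    Fields that are missing, null, or logged as a non-numeric placeholder
    such as "-" are returned as None.
    """
    entry = json.loads(log_line)
    timings = {}
    for name in TIMING_FIELDS:
        value = entry.get(name)
        try:
            timings[name] = int(value)
        except (TypeError, ValueError):
            timings[name] = None
    return timings


# Hypothetical log line, not taken from a real system.
sample = (
    '{"duration": 1250, "request_duration": 3, "request_tx_duration": 4, '
    '"response_duration": 1240, "response_tx_duration": 10}'
)
print(parse_timings(sample))
```

Returning `None` for unavailable values keeps later comparisons explicit about missing data instead of silently treating it as zero.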

Step 1: Identify the component that causes high latency

The `duration` field represents the total time a data plane component spends on a single request, including:

Receiving and forwarding the request to the upstream service
Waiting for the upstream service to return a response
Receiving and forwarding the response to the downstream node

To isolate the problematic component, trace the request path from the entry point upstream:

Check the `duration` value at the entry point of the request path.
If the value is higher than expected, move to the next upstream component and check its `duration`.
If the upstream component shows a normal `duration`, the previous (downstream) component is the source of the delay.
If the upstream component also shows a high `duration`, continue upstream until you find the first component with a normal value.

The component immediately downstream of the first normal-duration component is the one causing the latency.
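The walk described above can be sketched as a small function. This is a minimal illustration, not part of any Service Mesh tooling: the component names, the single shared `threshold_ms`, and the function name `find_slow_component` are all hypothetical, and in practice "normal" may differ per component.

```python
from typing import Optional, Sequence, Tuple


def find_slow_component(
    path: Sequence[Tuple[str, int]], threshold_ms: int
) -> Optional[str]:
    """Return the component that introduces the delay, or None.

    `path` lists (component name, duration in ms) pairs ordered from the
    entry point toward the final upstream service. A duration above
    `threshold_ms` is treated as "higher than expected".
    """
    if not path or path[0][1] <= threshold_ms:
        return None  # The entry point itself looks normal.
    culprit = path[0][0]
    for name, duration in path[1:]:
        if duration <= threshold_ms:
            break  # First normal component: the previous one is the culprit.
        culprit = name
    return culprit


# Hypothetical request path and durations.
request_path = [
    ("ingress-gateway", 1250),
    ("service-a-sidecar", 1245),
    ("service-b-sidecar", 12),
]
print(find_slow_component(request_path, threshold_ms=200))
```

If every component on the path shows a high duration, the function returns the last one, which means the delay lies in that component or in whatever it calls.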

Step 2: Determine the root cause

After you identify the problematic component, examine its detailed timing fields to determine whether the root cause is slow network transmission or slow upstream processing.

Slow network transmission

Compare `request_duration` and `request_tx_duration`:

High `request_duration`: The data plane component (sidecar proxy or gateway) is slow to receive the request from the downstream node.
High `request_tx_duration`: The component is slow to forward the request to the upstream service.

For HTTP requests with a body, receiving and forwarding happen simultaneously: the body is streamed to the upstream service as it is received, rather than buffered first. A high `request_duration` can therefore cause a correspondingly high `request_tx_duration`.

Interpret the results based on the pattern:

Pattern	Likely cause	Action
Only `request_tx_duration` is high	The request is received quickly but forwarded slowly	Investigate the network path between the component and its upstream service
Both `request_duration` and `response_tx_duration` are high	The request is received slowly from the downstream node, and the response is received slowly from the upstream service or forwarded slowly to the downstream node	Investigate overall network conditions between the component and its peers
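The two patterns in the table can be expressed as a simple check. This is a sketch under stated assumptions: the single `threshold_ms` cutoff, the function name, and the returned labels are hypothetical, and "only `request_tx_duration` is high" is interpreted here as `request_tx_duration` being high while `request_duration` is not.

```python
def classify_network_pattern(
    request_duration: int,
    request_tx_duration: int,
    response_tx_duration: int,
    threshold_ms: int,
) -> str:
    """Map timing values (ms) to one of the patterns in the table above."""
    request_high = request_duration > threshold_ms
    request_tx_high = request_tx_duration > threshold_ms
    response_tx_high = response_tx_duration > threshold_ms

    if request_tx_high and not request_high:
        # Received quickly, forwarded slowly.
        return "upstream_network"
    if request_high and response_tx_high:
        # Slow in both directions.
        return "peer_network"
    return "no_match"


# Hypothetical values: fast receive, slow forward.
print(classify_network_pattern(3, 900, 5, threshold_ms=200))
```

Here `upstream_network` corresponds to investigating the path between the component and its upstream service, and `peer_network` to investigating overall network conditions between the component and its peers.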

Slow upstream processing

Calculate the upstream processing time:

upstream processing time = response_duration - request_tx_duration

`response_duration`: Time from sending the request to receiving the first byte of the response.
`request_tx_duration`: Time spent forwarding the request.

The difference approximates the time the upstream service spends processing the request; it also includes network latency on the upstream path. A large value therefore indicates slow upstream processing or high network latency between the component and the upstream service. Investigate both the upstream service's performance and that network path.
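As a worked example of the subtraction, the sketch below uses hypothetical values in milliseconds; the function name is illustrative only.

```python
def upstream_processing_time(response_duration: int, request_tx_duration: int) -> int:
    """Return response_duration - request_tx_duration, in milliseconds."""
    return response_duration - request_tx_duration


# Hypothetical values: first response byte at 1240 ms, request forwarded by 4 ms.
print(upstream_processing_time(response_duration=1240, request_tx_duration=4))  # 1236
```

With these numbers, almost all of the measured time is spent waiting on the upstream side, which points to the upstream service or the network path to it rather than to request forwarding.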