Circuit breaking is a traffic management mechanism that protects your system from further damage in the event of a failure or overload. In traditional Java services, frameworks such as Resilience4j can be used to implement circuit breaking. In contrast, Istio allows you to implement circuit breaking at the network level, without integrating circuit breaking into the application code of each service. You can configure the connectionPool field to implement circuit breaking. This improves system stability and reliability and protects destination services from being affected by abnormal requests.
Prerequisites
A Container Service for Kubernetes (ACK) cluster is added to your Service Mesh (ASM) instance. For more information, see Add a cluster to an ASM instance.
connectionPool settings
Before you enable the circuit breaking feature, you must create a destination rule to configure circuit breaking for the desired destination service. For more information about the fields in a destination rule, see Destination Rule.
The connectionPool field defines parameters related to circuit breaking. The following table describes these parameters.
Parameter | Type | Required | Description | Default value |
maxConnections | int32 | No | The maximum number of HTTP1 or TCP connections to a destination host. The limit on the number of connections takes effect on sidecar proxies on both the client and server sides. A single client pod cannot initiate more than the configured number of connections to the server. A single server pod cannot accept more than the configured number of connections. The number of connections that can be accepted by the application services on the server is calculated by using the following formula: min(Number of client pods, Number of server pods) × maxConnections. | 2³²-1 |
http1MaxPendingRequests | int32 | No | The maximum number of requests that can be queued while waiting for a ready connection from the connection pool. | 1024 |
http2MaxRequests | int32 | No | The maximum number of active requests to a backend service. | 1024 |
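For example, the following destination rule sketch combines the three parameters in one connectionPool configuration. The name, the host example-service, and the specific values are placeholders for illustration only; replace them with your own service and with values that match your workload.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: example-connection-pool      # placeholder name
spec:
  host: example-service              # placeholder destination service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5            # maximum number of HTTP1 or TCP connections to a destination host
      http:
        http1MaxPendingRequests: 1   # maximum number of requests queued while waiting for a connection
        http2MaxRequests: 10         # maximum number of active requests to the backend service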
It is clear how these parameters work in a simple scenario where only one client and one destination service instance exist. In Kubernetes environments, an instance is equivalent to a pod. However, in production environments, we are more likely to see the following scenarios:
One client instance and multiple destination service instances
Multiple client instances and one destination service instance
Multiple client instances and multiple destination service instances
In different scenarios, you need to adjust the values of these parameters based on your business requirements to ensure that the connection pool can adapt to high-load and complex environments and provide good performance and reliability. The following sections show how the connection pool configuration behaves in the preceding scenarios to help you understand the constraints that the configuration places on the client and the server. You can then configure a circuit breaking policy that suits your production environment.
Configuration examples
In this topic, two Python scripts are created: one for the destination service (server) and the other for the calling service (client).
The server script creates a Flask application and defines a single endpoint on the root route. When the root route is accessed, the server sleeps for 5 seconds and then returns a "Hello World!" string in JSON format.
The client script calls the server endpoint by sending 10 requests in parallel, and then sleeps for some time before sending the next batch of 10 requests. The script repeats this in an infinite loop. To ensure that all client pods send a batch of 10 requests at the same time when multiple client pods are running, the script sends batches at the 0th, 20th, and 40th second of every minute (according to the system time) in this example.
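The following minimal sketches show what such scripts might look like. They are not the exact scripts used by the sample applications; the use of the requests library, the file names server.py and client.py, and other implementation details are assumptions for illustration. The service name circuit-breaker-sample-server and port 9080 match the sample deployment used later in this topic.
# server.py (sketch): a Flask application whose root route sleeps for 5 seconds
import time
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    time.sleep(5)  # simulate slow processing so that each request takes about 5 seconds
    return jsonify(message="Hello World!")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9080)

# client.py (sketch): sends 10 parallel requests at the 0th, 20th, and 40th second of every minute
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests  # assumed HTTP client library

URL = "http://circuit-breaker-sample-server:9080/"

def call_server(_):
    start = datetime.now()
    status = requests.get(URL).status_code
    end = datetime.now()
    return status, start, end

while True:
    # Align batches to the 0th, 20th, and 40th second so that all client pods fire together.
    if int(time.time()) % 20 == 0:
        print("----------Info----------")
        with ThreadPoolExecutor(max_workers=10) as pool:
            for status, start, end in pool.map(call_server, range(10)):
                elapsed = (end - start).total_seconds()
                print(f"Status: {status}, Start: {start:%H:%M:%S}, End: {end:%H:%M:%S}, Elapsed Time: {elapsed}")
        time.sleep(1)  # prevent sending a second batch within the same second
    else:
        time.sleep(0.1)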
Deploy sample applications
Create a YAML file that contains the following content, and then run the kubectl apply -f ${name of the YAML file}.yaml command to deploy the sample applications.
Run the following command to view the client and server pods:
kubectl get po |grep circuit
Expected output:
circuit-breaker-sample-client-d4f64d66d-fwrh4   2/2   Running   0   1m22s
circuit-breaker-sample-server-6d6ddb4b-gcthv    2/2   Running   0   1m22s
If no limits are defined in the destination rule, the server can handle 10 concurrent requests from the client. Therefore, the response code returned by the server is always 200. The following code block shows the logs of the client:
----------Info----------
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.016539812088013
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012614488601685
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.015984535217285
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.015599012374878
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012874364852905
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.018714904785156
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.010422468185425
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012431621551514
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.011001348495483
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.01432466506958
Configure the connectionPool field
To enable circuit breaking for a destination service by using the service mesh technology, you only need to define a corresponding destination rule for the destination service.
Use the following content to create a destination rule for the sample destination service. For more information, see Manage destination rules. This destination rule limits the number of TCP connections to the destination service to 5.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: circuit-breaker-sample-server
spec:
host: circuit-breaker-sample-server
trafficPolicy:
connectionPool:
tcp:
maxConnections: 5
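If you manage Istio resources with kubectl (for example, by using the kubeconfig of your ASM instance), you can also save the preceding content to a file and apply it. The file name below is a placeholder:
kubectl apply -f circuit-breaker-sample-dr.yaml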
Scenario 1: One client pod and one pod for the destination service
Start the client pod and monitor logs.
We recommend that you restart the client to obtain more intuitive statistical results. You can see the following logs:
----------Info----------
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.0167787075042725
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.011920690536499
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.017078161239624
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.018405437469482
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.018689393997192
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.018936395645142
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.016417503356934
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.019930601119995
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.022735834121704
Status: 200, Start: 02:49:40, End: 02:49:55, Elapsed Time: 15.02303147315979
The preceding logs show that all the requests are successful. However, only five requests in each batch are responded to in about 5 seconds; the other requests are responded to in 10 or more seconds. This implies that using only tcp.maxConnections causes excess requests to be queued while they wait for connections to be freed up. By default, the number of requests that can be queued is 2³²-1.
Use the following content to update the destination rule to allow only one pending request. For more information, see Manage destination rules.
To realize circuit breaking (fail-fast), you must also set http.http1MaxPendingRequests to limit the number of requests that can be queued. The default value of the http1MaxPendingRequests parameter is 1024. If you set the value to 0, the default value is used instead. Therefore, you must set the value to at least 1.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5
      http:
        http1MaxPendingRequests: 1
Restart the client pod to obtain correct statistics and monitor logs.
Sample logs:
----------Info----------
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.005339622497558594
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.007254838943481445
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.0044133663177490234
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.008964776992797852
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.018309116363525
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.017424821853638
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.019804954528809
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.01643180847168
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.025975227355957
Status: 200, Start: 02:56:40, End: 02:56:50, Elapsed Time: 10.01716136932373
The logs indicate that four requests were immediately throttled, five requests were sent to the destination service, and one request was queued. This matches the configuration: five requests occupy the available connections (maxConnections: 5), one request waits in the pending queue (http1MaxPendingRequests: 1), and the remaining four of the ten requests are rejected with the 503 response code.
Run the following command to view the number of active connections that the Istio proxy of the client establishes with the pod of the destination service:
kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- curl -X POST http://localhost:15000/clusters | grep circuit-breaker-sample-server | grep cx_active
Expected output:
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.124:9080::cx_active::5
The output indicates that five active connections are established between the Istio proxy of the client and the pod of the destination service.
Scenario 2: One client pod and multiple pods for the destination service
This section verifies whether the connection limit is applied at the pod level or the service level. Assume that one client pod and three pods for the destination service exist.
If the connection limit is applied at the pod level, each pod of the destination service can accept a maximum of five connections.
In this case, no throttling or queuing is expected because the maximum number of connections allowed is 15 (3 pods multiplied by 5 connections per pod). Because only 10 requests are sent at a time, all requests should succeed and be responded to in about 5 seconds.
If the connection limit is applied at the service level, a maximum of five connections are allowed in total, no matter how many pods are running for the destination service.
In this case, four requests are immediately throttled, five requests are sent to the destination service, and one request is queued, as in Scenario 1.
Run the following command to scale the destination service deployment to three replicas:
kubectl scale deployment/circuit-breaker-sample-server --replicas=3
Restart the client pod and monitor logs.
Sample logs:
----------Info----------
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.011791706085205078
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.0032286643981933594
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.012153387069702148
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.011871814727783203
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.012892484664917
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.013102769851685
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.016939163208008
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.014261484146118
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.01246190071106
Status: 200, Start: 03:06:20, End: 03:06:30, Elapsed Time: 10.021712064743042
The logs show throttling and queuing similar to those in the preceding code block, which means that increasing the number of instances of the destination service does not increase the connection limit for the client. This indicates that the connection limit is applied at the service level.
Run the following command to view the number of active connections that the Istio proxy of the client establishes with the pods of the destination service:
kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- curl -X POST http://localhost:15000/clusters | grep circuit-breaker-sample-server | grep cx_active
Expected output:
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.124:9080::cx_active::2
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.158:9080::cx_active::2
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.26:9080::cx_active::2
The output indicates that the Istio proxy of the client establishes two active connections with each pod of the destination service. A total of six rather than five connections are established. As mentioned in both Envoy and Istio documentation, a proxy allows some leeway in terms of the number of connections.
Scenario 3: Multiple client pods and one pod for the destination service
Run the following commands to adjust the number of replicas for the destination service and the client:
kubectl scale deployment/circuit-breaker-sample-server --replicas=1
kubectl scale deployment/circuit-breaker-sample-client --replicas=3
Restart the client pods and monitor logs.
The logs indicate that the number of 503 errors on each client increases. The system allows only five concurrent requests from all three client pods combined.
View the logs of the client proxies.
You can see two different types of logs for the requests that were throttled. The error code 503 is returned for such requests. The logs indicate that the RESPONSE_FLAGS field has two values: UO and URX.
UO: indicates upstream overflow (circuit breaking).
URX: indicates that the request is rejected because the retry condition for upstream HTTP requests is not met or the maximum number of TCP connection attempts is reached.
Based on the values of other fields in the logs, such as DURATION, UPSTREAM_HOST, and UPSTREAM_CLUSTER, we can draw the following conclusion: requests with the UO flag are throttled locally by the client proxies, and requests with the URX flag are rejected by the destination service proxy.
To verify this conclusion, check the logs of the destination service proxy.
As expected, the response code 503 appears in the logs of the destination service proxy. That is why the logs of the client proxies contain "response_code":"503" and "response_flags":"URX".
In summary, each client proxy applies the limit of five connections independently and throttles or queues its own excess requests, marking throttled requests with the UO response flag. Therefore, the three client proxies can send up to 15 parallel requests in total at the start of a batch. However, only five requests can be sent successfully, because the destination service proxy also limits the number of connections to five: it accepts only five requests and throttles the rest, and the requests rejected by the destination service proxy are marked with the URX response flag in the logs of the client proxies.
The following figure shows how requests are sent from multiple client pods to a single destination service pod in the preceding scenario.
Scenario 4: Multiple pods for both the client and the destination service
When you increase the number of replicas of the destination service, the overall success rate of requests rises because each destination service proxy allows five parallel requests. In this way, throttling on both the client proxies and the destination service proxies can be observed.
Run the following commands to increase the number of replicas of the destination service to 2 and the number of replicas of the client to 3:
kubectl scale deployment/circuit-breaker-sample-server --replicas=2
kubectl scale deployment/circuit-breaker-sample-client --replicas=3
You can see that 10 requests are successful out of the 30 requests generated by all 3 client proxies in a batch.
Run the following command to increase the number of replicas of the destination service to 3:
kubectl scale deployment/circuit-breaker-sample-server --replicas=3
You can see that 15 requests are successful.
Run the following command to increase the number of replicas of the destination service to 4:
kubectl scale deployment/circuit-breaker-sample-server --replicas=4
After the number of replicas of the destination service is increased from 3 to 4, you still see only 15 successful requests. The limit on client proxies applies to the entire destination service regardless of the number of replicas that the destination service has. Therefore, each client proxy can send a maximum of five concurrent requests to the destination service, and the number of successful requests per batch is roughly min(Number of client pods × 5, Number of destination service pods × 5). With three client pods, this gives min(15, 10) = 10 successful requests for 2 destination pods, min(15, 15) = 15 for 3 destination pods, and min(15, 20) = 15 for 4 destination pods, which matches the observed results.
Related operations
View metrics related to circuit breaking of connection pools
Circuit breaking of connection pools is implemented by limiting the maximum number of TCP connections to a destination host. When circuit breaking occurs, a series of related metrics are generated. These metrics help you determine whether circuit breaking occurs. The following table describes some metrics.
Metric | Type | Description |
envoy_cluster_circuit_breakers_default_cx_open | Gauge | Indicates whether circuit breaking is triggered for a connection pool. The value 1 indicates that circuit breaking is triggered. The value 0 indicates that circuit breaking is not triggered. |
envoy_cluster_circuit_breakers_default_rq_pending_open | Gauge | Indicates whether the number of requests queued while waiting for a ready connection has exceeded the configured limit. The value 1 indicates that the limit is exceeded. The value 0 indicates that the limit is not exceeded. |
You can configure proxyStatsMatcher for a sidecar proxy to enable the sidecar proxy to report metrics related to circuit breaking. Then, you can use Prometheus to collect and view the metrics.
Configure proxyStatsMatcher to enable a sidecar proxy to report metrics related to circuit breaking. After you select proxyStatsMatcher, select Regular Expression Match and set the value to .*circuit_breaker.*. For more information, see proxyStatsMatcher. (For reference, a sketch of the equivalent open-source Istio pod annotation is provided at the end of this procedure.)
Redeploy the Deployments for circuit-breaker-sample-server and circuit-breaker-sample-client. For more information, see Redeploy workloads.
Complete the circuit breaking configuration of connection pools and perform request tests by following the preceding steps.
Run the following command to view the metrics related to circuit breaking of the connection pool for the circuit-breaker-sample-client service:
kubectl exec -it deploy/circuit-breaker-sample-client -c istio-proxy -- curl localhost:15090/stats/prometheus | grep circuit_breaker | grep circuit-breaker-sample-server
Expected output:
envoy_cluster_circuit_breakers_default_cx_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 1
envoy_cluster_circuit_breakers_default_cx_pool_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_default_remaining_cx{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_default_remaining_cx_pools{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 18446744073709551613
envoy_cluster_circuit_breakers_default_remaining_pending{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 1
envoy_cluster_circuit_breakers_default_remaining_retries{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 4294967295
envoy_cluster_circuit_breakers_default_remaining_rq{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 4294967295
envoy_cluster_circuit_breakers_default_rq_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_default_rq_pending_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_default_rq_retry_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_high_cx_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_high_cx_pool_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_high_rq_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_high_rq_pending_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
envoy_cluster_circuit_breakers_high_rq_retry_open{cluster_name="outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local"} 0
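For reference, the proxyStatsMatcher setting configured in the first step of this procedure corresponds, in open-source Istio, to a pod annotation similar to the following sketch. The Deployment name is only an example, and this sketch illustrates the underlying mechanism rather than replacing the ASM console procedure described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: circuit-breaker-sample-client   # example workload; other Deployment fields omitted
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          proxyStatsMatcher:
            inclusionRegexps:
            - ".*circuit_breaker.*"      # report Envoy circuit breaker metrics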
Configure metric collection and alerts for circuit breaking of connection pools
After you configure metrics related to circuit breaking of connection pools, you can configure settings to collect the metrics to Prometheus and configure alert rules based on key metrics. This way, alerts can be generated when circuit breaking occurs. The following section demonstrates how to configure metric collection and alerts for circuit breaking of connection pools. In this example, Managed Service for Prometheus is used.
In Managed Service for Prometheus, you can connect the cluster on the data plane to the Alibaba Cloud ASM component or upgrade the component to the latest version. This ensures that the exposed metrics related to circuit breaking can be collected by Managed Service for Prometheus. For more information about how to integrate components into ARMS, see Component management. (If you have configured a self-managed Prometheus instance to collect metrics of an ASM instance by referring to Monitor ASM instances by using a self-managed Prometheus instance, you do not need to perform this step.)
Create an alert rule for circuit breaking of connection pools. For more information, see Use a custom PromQL statement to create an alert rule. The following example demonstrates how to specify key parameters for configuring an alert rule. For more information about how to configure other parameters, see the preceding documentation.
Parameter | Example | Description |
Custom PromQL Statements | (sum by(cluster_name, pod_name,namespace) (envoy_cluster_circuit_breakers_default_cx_open)) != 0 | In the example, the envoy_cluster_circuit_breakers_default_cx_open metric is queried to determine whether circuit breaking is occurring in connection pools of the current cluster. Based on the hostname of the upstream service and the name of the pod that reports the metric, you can determine the location where circuit breaking occurs. |
Alert Message | Circuit breaking occurs for a connection pool. The number of TCP connections established by the sidecar proxy has reached the upper limit. Namespace: {{$labels.namespace}}, Pod in which circuit breaking occurs for a connection pool: {{$labels.pod_name}}, Information about the upstream service: {{ $labels.cluster_name }} | The alert information in the example indicates the pod in which circuit breaking for a connection pool occurs, the upstream service to which the pod connects, and the namespace to which the pod belongs. |
Constraints on connection pool configurations
The following table describes the constraints on the configurations of the connectionPool field on the client and the destination service.
Role | Description |
Client | Each client proxy implements the limit independently. If the limit on the number of requests is 100, each client proxy can have 100 outstanding requests before local throttling is applied. If N clients call the destination service, the maximum number of outstanding requests that are supported is the product of 100 and N. The limit on client proxies applies to the entire destination service, not to a single replica of the destination service. Even if the destination service runs in 200 active pods, a maximum of 100 requests are allowed. |
Destination service | The limit applies to each destination service proxy. If the service runs in 50 active pods, each pod can have up to 100 outstanding requests sent from client proxies before throttling is triggered and the response code 503 is returned. |