By Xining Wang (xining.wxn@alibaba-inc.com)
This is the first article in the series.
A service level indicator (SLI) is a metric that measures service health. A service level objective (SLO) is an objective or a range of objectives that a service needs to achieve. An SLO consists of one or more SLIs.
SLOs provide a formal way to describe, measure, and monitor the performance, quality, and reliability of microservice-oriented applications. SLOs are a shared quality benchmark for application developers, platform operators, and O&M personnel. They can use SLOs as a reference to measure and continuously improve the service quality. An SLO that consists of multiple SLIs helps describe the service health in a more accurate way.
Examples of SLOs:
• Average queries per second (QPS) > 100,000/s
• Latency of 99% of access requests < 500 ms
• Bandwidth per minute for 99% of access requests > 200 MB/s
Alibaba Cloud Service Mesh (ASM) provides out-of-the-box monitoring and alerting capabilities based on service level objectives (SLOs). You can monitor the performance metrics of calls between application services, such as the latency and error rate.
ASM supports the following types of SLIs:
• Service availability: indicates the proportion of access requests that receive a successful response. The plug-in type for this SLI type is availability. If the HTTP status code returned for an access request is 429 or 5XX, the request is counted as failed. 5XX means any status code that starts with 5.
• Latency: indicates the time required for the service to return a response to a request. The plug-in type for this SLI type is latency. You can specify the maximum latency. Responses that take longer than the specified maximum latency are counted as unqualified.
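The two SLI types above can be illustrated with a minimal Python sketch. It is not ASM code: the `Request` record, the function names, and the sample data are hypothetical and only show how each request would be classified under the rules described above (429 or 5XX counts as failed; a response slower than the maximum latency counts as unqualified).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one access request (not an ASM type)."""
    status_code: int   # HTTP status code returned to the caller
    latency_ms: float  # time taken to return the response, in milliseconds


def is_available(req: Request) -> bool:
    """A request is counted as failed if its status is 429 or any 5XX code."""
    return not (req.status_code == 429 or 500 <= req.status_code <= 599)


def availability_sli(requests: Iterable[Request]) -> float:
    """Proportion of requests that received a successful response."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests in the measurement window")
    return sum(is_available(r) for r in reqs) / len(reqs)


def latency_sli(requests: Iterable[Request], max_latency_ms: float) -> float:
    """Proportion of requests answered within the maximum latency."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests in the measurement window")
    return sum(r.latency_ms <= max_latency_ms for r in reqs) / len(reqs)


sample = [
    Request(200, 120.0),
    Request(429, 15.0),   # failed: throttled
    Request(503, 900.0),  # failed: 5XX, and also too slow
    Request(404, 80.0),   # 4XX other than 429 is not counted as failed
]
print(availability_sli(sample))    # 0.5
print(latency_sli(sample, 500.0))  # 0.75
```

Note that under this rule a 404 still counts as a successful response, because only 429 and 5XX are treated as failures.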
In addition to defining the SLI type, you need to set objectives. Objectives should be reasonable. For example, if your users cannot tell the difference between latencies of 200 ms and 600 ms, set the latency objective to 600 ms.
You also need to consider user requirements when you set objectives. Different applications need to achieve different objectives, which depend on user requirements for the applications. For example, users may require that some non-critical service systems meet a 99% availability objective and that critical service systems meet a 99.999% availability objective. 99% availability allows about 3.65 days of downtime per year, and 99.999% availability allows about 5.26 minutes of downtime per year.
You need to specify a period of time for an SLO, during which SLIs are measured. This period is called the compliance period. For example, a 99% availability objective measured over one day is different from the same objective measured over a month. A 99% availability objective measured over one day allows at most about 14.4 minutes of continuous downtime (24 hours x 1%). A 99% availability objective measured over a month allows up to about 7.2 hours of continuous downtime (30 days x 1%).
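The downtime figures above all follow from one multiplication: period x (1 - availability). The following Python sketch reproduces them; the function name `allowed_downtime` is illustrative, not part of any ASM API.

```python
from datetime import timedelta


def allowed_downtime(availability: float, period: timedelta) -> timedelta:
    """Maximum total downtime that still meets the availability objective."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return period * (1.0 - availability)


one_day = allowed_downtime(0.99, timedelta(days=1))
one_month = allowed_downtime(0.99, timedelta(days=30))
one_year = allowed_downtime(0.99999, timedelta(days=365))

print(round(one_day.total_seconds() / 60, 1))      # about 14.4 minutes
print(round(one_month.total_seconds() / 3600, 1))  # about 7.2 hours
print(round(one_year.total_seconds() / 60, 2))     # about 5.26 minutes
```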
To simplify the configuration, the compliance period can only be 7, 14, 28, or 30 days.
Another important monitoring concept is the error budget. The error budget is the amount of failure or technical debt that an SLO tolerates. Therefore, the error budget can be expressed as (1 - SLO).
Example:
• SLI Error: Request Status Code >= 500
• Duration: 30 days
• SLO: 99.9%
• Error budget: (100% - 99.9%) = 0.1%
• Total requests in 30 days: 10000
• Error requests allowed: (10000 * 0.1%) = 10
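The arithmetic in this example can be checked with a short Python sketch. The function names are illustrative, and `Fraction` is used only to avoid floating-point rounding when the budget is very small.

```python
from fractions import Fraction
from math import floor


def error_budget(slo_percent: str) -> Fraction:
    """Error budget as an exact fraction: (100% - SLO)."""
    slo = Fraction(slo_percent)
    if not 0 <= slo <= 100:
        raise ValueError("SLO must be between 0 and 100 percent")
    return (100 - slo) / 100


def allowed_error_requests(total_requests: int, slo_percent: str) -> int:
    """Number of error requests the budget tolerates, rounded down."""
    return floor(total_requests * error_budget(slo_percent))


print(error_budget("99.9"))                    # 1/1000, that is, 0.1%
print(allowed_error_requests(10_000, "99.9"))  # 10
```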
To achieve the SLO, a maximum of 10 error requests are allowed every 30 days. You can better plan and manage tasks by referring to the error budget. For example, you can decide when to deploy a new version based on the error budget:
• When you are about to run out of the error budget, we recommend that you do not deploy a new version.
• Perform the version update when the error budget is sufficient at the end of the compliance period. In this case, the probability of SLO violations is low.
The error budget is updated over a rolling window that has the same time span as the compliance period.
• If the remaining error budget is greater than or equal to 0, the SLO is achieved in the compliance period.
• If the remaining error budget is smaller than 0, the SLO is violated in the compliance period.
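As a rough sketch of the rolling-window check, the following Python code sums request counts over the most recent days and compares the errors seen with the errors allowed. The daily `(total, errors)` tuples and the function name are assumptions made for illustration; how ASM aggregates the data internally is not described in this article.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def remaining_error_budget(
    daily_counts: Sequence[Tuple[int, int]],  # (total requests, error requests) per day
    window_days: int,
    slo_percent: str,
) -> Fraction:
    """Allowed errors minus observed errors over the last `window_days` days."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    window = daily_counts[-window_days:]
    total = sum(t for t, _ in window)
    errors = sum(e for _, e in window)
    allowed = total * (100 - Fraction(slo_percent)) / 100
    return allowed - errors


healthy = [(1000, 0)] * 6 + [(1000, 4)]
degraded = [(1000, 0)] * 5 + [(1000, 3), (1000, 9)]

print(remaining_error_budget(healthy, 7, "99.9"))   # 3: SLO achieved
print(remaining_error_budget(degraded, 7, "99.9"))  # -5: SLO violated
```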
The burn rate indicates the speed at which your error budget is being consumed. The burn rate is the ratio of the current error rate to the specified error budget.
Calculation formula: Burn rate = Error rate/(1 - SLO)
Assume that the compliance period is 30 days:
• A burn rate of 1: If the error rate is kept at the current level, 100% of the error budget will be used during the entire compliance period. That is, the error budget is used up within 30 days.
• A burn rate of 2: If the error rate is kept at the current level, 200% of the error budget will be used during the entire compliance period. That is, the error budget is used up within 15 days.
• A burn rate of 60: If the error rate is kept at the current level, 6000% of the error budget will be used during the entire compliance period. That is, the error budget is used up within 12 hours.
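The formula and the three cases above can be reproduced with a few lines of Python. The function names are illustrative; the only inputs are the error rate, the SLO, and the compliance period.

```python
from datetime import timedelta


def burn_rate(error_rate: float, slo: float) -> float:
    """Burn rate = error rate / (1 - SLO), with both given as fractions."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate / budget


def time_to_exhaustion(rate: float, period: timedelta) -> timedelta:
    """Time until the budget is used up if the burn rate stays constant."""
    if rate <= 0.0:
        raise ValueError("the budget is not being consumed")
    return period / rate


period = timedelta(days=30)
for rate in (1, 2, 60):
    print(rate, time_to_exhaustion(rate, period))
# 1 30 days, 0:00:00
# 2 15 days, 0:00:00
# 60 12:00:00
```

For example, with a 99.9% SLO, an observed error rate of 0.2% corresponds to a burn rate of about 2.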
A higher burn rate indicates more severe faults. Alert rules are configured based on the burn rate. Alerts are triggered if the burn rate reaches the specified value.
Alert rules allow you to receive different levels of alerts based on the severity of faults. This way, you can handle the faults in a timely manner to prevent the error budget from being excessively consumed.
ASM allows you to configure alert rules that specify different burn rate thresholds for different time windows. This configuration method applies to most scenarios. Alerts are triggered both by a high error rate over a short period of time and by a low error rate sustained over a long period of time, and only by these two conditions. This keeps unnecessary alerts from distracting O&M personnel and causing them to miss critical issues. You can also configure a short time window over which the error rate is calculated in addition to the main time window. When the error rate in the short time window falls below the specified threshold, the alert ends. The short time window therefore ensures that alerts are cleared in a timely manner.
Example of a short time window for a 30-day SLO:
• If 2% of the error budget is consumed within one hour or 5% of the error budget is consumed within six hours, a Page-level alert is triggered. That is, a Page-level alert is triggered if the burn rate reaches 14.4 over one hour or 6 over six hours.
• If 10% of the error budget is consumed within one day or within three days, a Ticket-level alert is triggered. That is, a Ticket-level alert is triggered if the burn rate reaches 3 over one day or 1 over three days.
• The short time window is set to 1/12 of the corresponding time window.
Assume that the burn rate stays at 2 for 3 days, and the fault is fixed on the third day. For the three-day rule, the short time window is six hours (3 days x 1/12), so the alert is cleared within six hours after the fault is fixed. If no short time window is configured, the alert remains active long after the fault is fixed, until the earlier errors age out of the three-day window.
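The burn rate thresholds in this example follow from: threshold = consumed budget fraction x (compliance period / time window). The Python sketch below derives them and models the short-window behavior. It assumes that an alert is active only while both the long-window and the short-window burn rates are at or above the threshold; the `AlertRule` class and its fields are hypothetical and do not reflect the ASM configuration format.

```python
from dataclasses import dataclass
from datetime import timedelta

COMPLIANCE_PERIOD = timedelta(days=30)


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical model of one burn-rate alert rule."""
    severity: str
    long_window: timedelta
    budget_consumed: float  # fraction of the error budget consumed in long_window

    @property
    def burn_rate_threshold(self) -> float:
        # Consuming x of the budget in a window w means burning x * (period / w)
        # times faster than the rate that would exactly exhaust the budget.
        return self.budget_consumed * (COMPLIANCE_PERIOD / self.long_window)

    @property
    def short_window(self) -> timedelta:
        return self.long_window / 12

    def is_active(self, long_burn: float, short_burn: float) -> bool:
        """Assumed rule: both windows must be at or above the threshold."""
        threshold = self.burn_rate_threshold
        return long_burn >= threshold and short_burn >= threshold


RULES = [
    AlertRule("Page", timedelta(hours=1), 0.02),
    AlertRule("Page", timedelta(hours=6), 0.05),
    AlertRule("Ticket", timedelta(days=1), 0.10),
    AlertRule("Ticket", timedelta(days=3), 0.10),
]

for rule in RULES:
    print(rule.severity, rule.long_window, round(rule.burn_rate_threshold, 1))

three_day = RULES[3]
print(three_day.short_window)             # 6:00:00
print(three_day.is_active(2.0, 2.0))      # True: fault is ongoing
print(three_day.is_active(2.0, 0.0))      # False: fault fixed, short window is clean
```

The last two lines mirror the scenario above: once the six-hour window no longer contains errors, the alert clears even though the three-day burn rate is still elevated.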