Service Mesh (ASM) provides out-of-the-box monitoring and alerting capabilities based on service level objectives (SLOs). You can monitor the performance metrics of calls between application services, such as the latency and error rate. This topic describes SLO-related concepts.
SLI and SLO
A service level indicator (SLI) is a metric that measures service health. An SLO is an objective or a range of objectives that a service needs to achieve. An SLO consists of one or more SLIs.
SLOs provide a formal way to describe, measure, and monitor the performance, quality, and reliability of microservice-oriented applications. SLOs are a shared quality benchmark for application developers, platform operators, and O&M personnel. They can use SLOs as a reference to measure and continuously improve the service quality. An SLO that consists of multiple SLIs helps describe the service health in a more accurate way.
Examples of SLOs:
Average queries per second (QPS) > 100,000/s
Latency of 99% of access requests < 500 ms
Bandwidth per minute for 99% of access requests > 200 MB/s
SLI types and objectives
ASM supports the following SLI types:
Service availability: indicates the proportion of access requests that receive a successful response. The plug-in type for this SLI type is availability. If the HTTP status code returned for an access request is 429 or 5XX, the request is counted as failed. 5XX means any status code that starts with 5.
Latency: indicates the time required for the service to return a response to a request. The plug-in type for this SLI type is latency. You can specify the maximum latency. Responses that take longer than the specified maximum latency are counted as failing the objective.
In addition to defining the SLI type, you need to set objectives. Objectives should be reasonable. For example, if your users cannot tell the difference between 200 ms and 600 ms latencies, set the latency objective to 600 ms.
You also need to consider user requirements when you set objectives. Different applications need to achieve different objectives, which depend on user requirements for the applications. For example, users may require that some non-critical service systems meet a 99% availability objective and that critical service systems meet a 99.999% availability objective. 99% availability allows about 3.65 days of downtime per year, and 99.999% availability allows about 5 minutes of downtime per year.
Compliance period
You need to specify a period of time for an SLO, during which SLIs are measured. The same objective has different implications over different periods. For example, a 99% availability objective measured over one day allows at most about 14 minutes of continuous downtime (24 hours x 1%), whereas the same objective measured over 30 days allows up to about 7 hours of continuous downtime (30 days x 1%).
To simplify the configuration, the compliance period can only be 7, 14, 28, or 30 days.
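The downtime arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only, not part of ASM; the function name allowed_downtime is hypothetical, and the SLO is assumed to be given as a fraction (0.99 for 99%).

```python
from datetime import timedelta


def allowed_downtime(slo: float, period: timedelta) -> timedelta:
    """Return the total downtime that an availability objective tolerates.

    slo is a fraction such as 0.99; period is the time span being measured.
    (Hypothetical helper for illustration, not an ASM API.)
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    return period * (1.0 - slo)


# The cases discussed in this topic: one day, 30 days, and one year.
for label, slo, days in [
    ("99% over 1 day", 0.99, 1),
    ("99% over 30 days", 0.99, 30),
    ("99% over 365 days", 0.99, 365),
    ("99.999% over 365 days", 0.99999, 365),
]:
    print(f"{label}: {allowed_downtime(slo, timedelta(days=days))}")
```

The same objective yields very different allowances depending on the period, which is why the compliance period must be stated together with the objective.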
Error budget
Error budgets indicate an allowance for a certain amount of failure or technical debt within an SLO. You can calculate the error budget by using the following formula: Error budget = 100% - SLO. In the formula, the SLO is the percentage to achieve.
Example:
SLI failure: determined by the status code returned for each request
Compliance period: 30 days
SLO: 99.9%
Error budget: 0.1% (100% - 99.9%)
Total number of requests within 30 days: 10,000
Number of error requests allowed: 10 (10,000 x 0.1%)
To achieve the SLO, a maximum of 10 error requests are allowed every 30 days. You can better plan and manage tasks by referring to the error budget. For example, you can decide when to deploy a new version based on the error budget:
When you are about to run out of the error budget, we recommend that you do not deploy a new version.
Deploy the new version when a sufficient error budget remains near the end of the compliance period. In this case, the probability of an SLO violation is low.
The error budget is updated over a rolling window that has the same time span as the compliance period.
If the remaining error budget is greater than or equal to 0, the SLO is achieved in the compliance period.
If the remaining error budget is smaller than 0, the SLO is violated in the compliance period.
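The error budget example can be reproduced with a short Python sketch. The function names are hypothetical and are not ASM APIs. The SLO is passed as a decimal string and handled with fractions.Fraction so that the arithmetic is exact; that choice is an assumption made for this illustration.

```python
from fractions import Fraction


def error_budget(slo: str) -> Fraction:
    """Error budget = 100% - SLO, with the SLO given as a string such as "0.999"."""
    return 1 - Fraction(slo)


def allowed_errors(slo: str, total_requests: int) -> Fraction:
    """Number of failed requests that the SLO tolerates in the compliance period."""
    return total_requests * error_budget(slo)


def remaining_errors(slo: str, total_requests: int, failed_requests: int) -> Fraction:
    """Remaining budget in requests; a negative value means the SLO is violated."""
    return allowed_errors(slo, total_requests) - failed_requests


# Values from the example: SLO 99.9%, 10,000 requests within 30 days.
print("error budget:", float(error_budget("0.999")))
print("allowed errors:", allowed_errors("0.999", 10_000))
for failed in (4, 10, 11):
    left = remaining_errors("0.999", 10_000, failed)
    status = "SLO achieved" if left >= 0 else "SLO violated"
    print(f"{failed} failed requests -> remaining budget {left} ({status})")
```

In practice the totals would be taken over a rolling window of the same length as the compliance period, so the remaining budget changes as old requests leave the window.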
Burn rate
The burn rate indicates the speed at which your error budget is being consumed. The burn rate is the ratio of the current error rate to the specified error budget. A higher burn rate indicates more severe faults. Alert rules are configured based on the burn rate. Alerts are triggered if the burn rate reaches the specified value.
Calculation formula: Burn rate = Error rate/(1 - SLO)
Assume that the compliance period is 30 days:
A burn rate of 1: If the error rate is kept at the current level, 100% of the error budget will be used during the entire compliance period. That is, the error budget is used up within 30 days.
A burn rate of 2: If the error rate is kept at the current level, 200% of the error budget will be used during the entire compliance period. That is, the error budget is used up within 15 days.
A burn rate of 60: If the error rate is kept at the current level, 6000% of the error budget will be used during the entire compliance period. That is, the error budget is used up within 12 hours.
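As a minimal sketch of the formula and the three cases above, the following Python code computes a burn rate and the time until the error budget is used up. The function names are hypothetical, and the error rate and SLO are assumed to be fractions (0.002 for 0.2%, 0.999 for 99.9%).

```python
import math


def burn_rate(error_rate: float, slo: float) -> float:
    """Burn rate = error rate / (1 - SLO)."""
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in the range [0, 1)")
    return error_rate / (1.0 - slo)


def days_until_budget_exhausted(rate: float, compliance_days: float = 30.0) -> float:
    """Days until the whole error budget is consumed at a constant burn rate."""
    if rate <= 0:
        return math.inf  # no errors, so the budget is never consumed
    return compliance_days / rate


# The three burn rates discussed for a 30-day compliance period.
for rate in (1, 2, 60):
    days = days_until_budget_exhausted(rate)
    print(f"burn rate {rate}: budget lasts {days:g} days ({days * 24:g} hours)")

# An error rate of 0.2% against a 99.9% SLO burns the budget about twice as fast as allowed.
print(round(burn_rate(0.002, 0.999), 6))
```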
Alert rule
Alert rules allow you to receive different levels of alerts based on the severity of faults. This way, you can handle the faults in a timely manner to prevent the error budget from being excessively consumed.
ASM allows you to configure alert rules that specify different burn rate thresholds for different time windows. This configuration method applies to most scenarios. Both a high error rate over a short period of time and a low error rate over a long period of time trigger alerts, and only these two conditions trigger alerts. This prevents O&M personnel from missing critical issues because unnecessary alerts distract them. You can also set a short time window for each time window over which the error rate is calculated. When the error rate in the short time window drops below the specified threshold, the alert ends. The short time window ensures that alerts are cleared in a timely manner.
Example of a short time window for a 30-day SLO:
If 2% of the error budget is consumed within one hour or 5% of the error budget is consumed within six hours, a Page-level alert is triggered. That is, a Page-level alert is triggered if the burn rate reaches 14.4 over one hour (2% x 720 hours / 1 hour) or 6 over six hours (5% x 720 hours / 6 hours). If 10% of the error budget is consumed within one day or three days, a Ticket-level alert is triggered. That is, a Ticket-level alert is triggered if the burn rate reaches 3 over one day or 1 over three days.
The short time window is set to 1/12 of the corresponding time window. For example, the short time window for a one-hour window is five minutes, and the short time window for a three-day window is six hours.
Assume that the burn rate remains at 2 for 3 days, and the fault is fixed on the third day. The short time window enables the alert to be cleared within about six hours after the fix. If no short time window is configured, the alert can persist for days after the fault is fixed, because the earlier errors are still inside the 3-day window.
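The thresholds in this example can be derived and evaluated with the following Python sketch. It is an illustration only and does not reflect how ASM implements alert rules: the AlertRule class and is_firing function are hypothetical, and the code assumes that an alert is active only while the burn rate in both the long window and its short window (1/12 of the long window) is at or above the threshold.

```python
from dataclasses import dataclass

COMPLIANCE_HOURS = 30 * 24  # 30-day compliance period


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical model of one burn-rate alert rule (not an ASM API)."""
    level: str
    budget_fraction: float    # share of the error budget consumed in the window
    long_window_hours: float  # window over which that share is consumed

    @property
    def burn_rate_threshold(self) -> float:
        # Consuming budget_fraction in long_window_hours corresponds to this burn rate.
        return self.budget_fraction * COMPLIANCE_HOURS / self.long_window_hours

    @property
    def short_window_hours(self) -> float:
        return self.long_window_hours / 12


RULES = [
    AlertRule("Page", 0.02, 1),
    AlertRule("Page", 0.05, 6),
    AlertRule("Ticket", 0.10, 24),
    AlertRule("Ticket", 0.10, 72),
]


def is_firing(rule: AlertRule, long_burn_rate: float, short_burn_rate: float) -> bool:
    """Assumed condition: both windows must be at or above the threshold."""
    threshold = rule.burn_rate_threshold
    return long_burn_rate >= threshold and short_burn_rate >= threshold


for rule in RULES:
    print(
        f"{rule.level}: burn rate >= {rule.burn_rate_threshold:g} over "
        f"{rule.long_window_hours:g} h (short window {rule.short_window_hours:g} h)"
    )

# Fault fixed: the 3-day window still averages a burn rate of 2,
# but the 6-hour short window has dropped to 0, so the Ticket alert clears.
three_day_rule = RULES[3]
print(is_firing(three_day_rule, long_burn_rate=2.0, short_burn_rate=2.0))
print(is_firing(three_day_rule, long_burn_rate=2.0, short_burn_rate=0.0))
```

Requiring both windows keeps the long window's resistance to brief spikes while letting the short window end the alert soon after recovery.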