Alibaba Cloud Service Mesh: SLO overview

Last Updated: Mar 11, 2026

Service Mesh (ASM) provides built-in monitoring and alerting based on service level objectives (SLOs). SLO monitoring tracks the performance of calls between application services, including latency and error rate, and triggers alerts when reliability degrades.

Key concepts

Service level indicator (SLI): A quantitative metric that measures service health. For example: the percentage of requests that return a successful response, or the percentage of requests served within a latency threshold.

Service level objective (SLO): A target value or range for an SLI over a defined period. An SLO consists of one or more SLIs and serves as a shared reliability benchmark for application developers, platform operators, and O&M personnel to measure and continuously improve service quality.

Error budget: The allowable amount of failure or technical debt derived from the SLO target. It is calculated as 100% - SLO target. For example, an SLO of 99.9% gives an error budget of 0.1%.

Burn rate: The speed at which the error budget is consumed. A burn rate of 1 means the error budget will be fully consumed by the end of the compliance period. Higher burn rates indicate faster depletion and more severe issues.

SLI types

ASM supports two SLI types:

| SLI type | What it measures | Plug-in type | Failure condition |
| --- | --- | --- | --- |
| Service availability | Did the service respond successfully? | availability | HTTP status code is 429 or 5XX (any code starting with 5) |
| Latency | How long did the service take to respond? | latency | Response time exceeds the specified maximum latency |
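
The failure conditions in the table can be expressed as two small predicates. The following Python sketch is illustrative only: the function names and the sample status codes are hypothetical, and it does not reflect how the ASM availability and latency plug-ins are implemented.

```python
def is_availability_failure(status_code: int) -> bool:
    """True if the response counts against the availability SLI (429 or 5XX)."""
    return status_code == 429 or 500 <= status_code <= 599


def is_latency_failure(response_ms: float, max_latency_ms: float) -> bool:
    """True if the response took longer than the configured maximum latency."""
    return response_ms > max_latency_ms


def sli_percent(total: int, failures: int) -> float:
    """Percentage of requests that did not fail."""
    if total == 0:
        return 100.0  # no traffic, so nothing has failed
    return 100.0 * (total - failures) / total


# Hypothetical sample of ten responses; 404 is not a failure under this definition.
statuses = [200, 200, 503, 429, 404, 200, 200, 200, 200, 200]
failures = sum(is_availability_failure(s) for s in statuses)
availability = sli_percent(len(statuses), failures)
print(failures, availability)
```

Note that client errors other than 429 (such as 404) do not count as failures under the availability definition above.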

Set realistic objectives

An SLO that consists of multiple SLIs describes service health more accurately than a single SLI alone.

Examples of SLOs:

  • Average queries per second (QPS) > 100,000/s

  • Latency of 99% of requests < 500 ms

  • Bandwidth per minute for 99% of requests > 200 MB/s

When setting objectives, focus on what users actually perceive. If your users cannot distinguish between 200 ms and 600 ms latency, set the latency objective to 600 ms. A tighter target consumes engineering effort without improving the user experience.

Different services warrant different targets. For example:

| Availability target | Approximate downtime per year | Typical use case |
| --- | --- | --- |
| 99% | ~3.65 days | Non-critical services |
| 99.999% | ~5 minutes | Mission-critical systems |

Compliance period

The compliance period defines the time window over which SLIs are measured against the SLO target. The same target percentage means very different things depending on the compliance period:

  • 99% availability over 1 day: no more than ~14 minutes of total downtime (24 hours x 1%)

  • 99% availability over 30 days: up to ~7 hours of total downtime (30 days x 1%)

ASM supports compliance periods of 7, 14, 28, and 30 days.
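
The arithmetic behind the two bullets above can be reproduced with a short helper. This is a minimal sketch that treats availability as time-based; the function name `allowed_downtime` is hypothetical and is not part of ASM.

```python
from datetime import timedelta


def allowed_downtime(slo_target_percent: float, period: timedelta) -> timedelta:
    """Downtime budget for a compliance period, treating availability as time-based."""
    if not 0 <= slo_target_percent <= 100:
        raise ValueError("SLO target must be between 0 and 100")
    budget_fraction = (100.0 - slo_target_percent) / 100.0
    return period * budget_fraction


one_day = allowed_downtime(99.0, timedelta(days=1))
thirty_days = allowed_downtime(99.0, timedelta(days=30))
print(one_day.total_seconds() / 60)         # minutes of budget in a 1-day period
print(thirty_days.total_seconds() / 3600)   # hours of budget in a 30-day period
```

The same target yields a budget that scales linearly with the period, which is why a longer compliance period tolerates a longer single outage.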

Error budget

The error budget quantifies how much unreliability a service can tolerate while still meeting its SLO:

Error budget = 100% - SLO target

Worked example

| Parameter | Value |
| --- | --- |
| SLI failure definition | HTTP status code is 429 or 5XX |
| Compliance period | 30 days |
| SLO target | 99.9% |
| Error budget | 0.1% (100% - 99.9%) |
| Total requests in 30 days | 10,000 |
| Allowed error requests | 10 (10,000 x 0.1%) |

To meet this SLO, the service must have no more than 10 failed requests in a 30-day window.
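
The worked example can be checked in a few lines of Python. This is a sketch only: `allowed_failures` is a hypothetical helper, and the SLO target is passed as a string so that `Decimal` can represent 99.9 exactly (with binary floats, 10,000 x 0.1% evaluates to slightly less than 10 and would truncate to 9).

```python
from decimal import Decimal


def allowed_failures(total_requests: int, slo_target_percent: str) -> int:
    """Number of failed requests the error budget permits, rounded down."""
    # Error budget = 100% - SLO target, expressed as a fraction.
    budget = (Decimal("100") - Decimal(slo_target_percent)) / Decimal("100")
    if budget < 0 or budget > 1:
        raise ValueError("SLO target must be between 0 and 100")
    return int(total_requests * budget)


print(allowed_failures(10_000, "99.9"))
```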

Use error budgets to guide decisions

The remaining error budget is recalculated over a rolling window with the same duration as the compliance period:

  • Remaining error budget >= 0: The SLO is met during the compliance period.

  • Remaining error budget < 0: The SLO is violated.

Use the remaining error budget to decide when to deploy changes:

  • Budget nearly exhausted: Postpone deployments. The risk of an SLO violation is high.

  • Budget sufficient at the end of the compliance period: Deploy with confidence. Even if the new version introduces some errors, the SLO is unlikely to be violated.
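
The decision rule above could be sketched as follows. The function names and the 10% "nearly exhausted" cutoff are assumptions chosen for illustration; the source does not define a specific cutoff, so pick one that suits your release process.

```python
from decimal import Decimal


def remaining_budget_fraction(total_requests: int, failed_requests: int,
                              slo_target_percent: str) -> Decimal:
    """Share of the error budget still unspent (1 = untouched, below 0 = violated)."""
    budget = (Decimal("100") - Decimal(slo_target_percent)) / Decimal("100")
    allowed = total_requests * budget
    if allowed <= 0:
        raise ValueError("no error budget: check the request count and SLO target")
    return (allowed - failed_requests) / allowed


def deployment_advice(remaining: Decimal,
                      min_remaining: Decimal = Decimal("0.1")) -> str:
    """Map the remaining budget to an action; min_remaining is a hypothetical cutoff."""
    if remaining < 0:
        return "violated"   # the SLO is already broken in this window
    if remaining < min_remaining:
        return "postpone"   # budget nearly exhausted, a new release is risky
    return "deploy"         # enough budget left to absorb some errors


remaining = remaining_budget_fraction(10_000, 4, "99.9")
print(remaining, deployment_advice(remaining))
```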

Burn rate

Burn rate measures how fast the error budget is consumed relative to the compliance period. It is the ratio of the current error rate to the error budget:

Burn rate = Error rate / (1 - SLO target)

A burn rate of 1 means the error budget will be fully consumed exactly at the end of the compliance period. A burn rate of 2 means the budget will run out in half the time.

Burn rate examples (30-day compliance period)

| Burn rate | Budget consumed per compliance period | Time to exhaust budget |
| --- | --- | --- |
| 1 | 100% | 30 days |
| 2 | 200% | 15 days |
| 60 | 6,000% | 12 hours |

Higher burn rates indicate more severe faults and trigger higher-severity alerts.
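
The burn rate formula and the time-to-exhaustion figures can be reproduced with the sketch below. The function names are hypothetical, and the SLO target and error rate are given as fractions (0.999 for 99.9%).

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO target), both given as fractions."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_rate / budget


def days_to_exhaust(rate: float, period_days: float = 30) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    if rate <= 0:
        raise ValueError("the budget is never exhausted at a non-positive burn rate")
    return period_days / rate


# An error rate of 0.2% against a 99.9% target burns the budget about twice as fast as allowed.
print(burn_rate(0.002, 0.999))
print(days_to_exhaust(60) * 24)   # hours until exhaustion at a burn rate of 60
```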

Alert rules

Alert rules notify you when the error budget is consumed too quickly, so that you can respond before an SLO violation occurs.

ASM uses multi-window burn rate alerting, which combines a long window for detection with a short window for auto-resolution. This approach catches both sharp spikes and slow-building issues, while clearing alerts promptly after recovery.

How multi-window alerting works

Each alert rule defines two windows:

  1. Long window: Detects when the burn rate exceeds the threshold over a longer period. This triggers the alert.

  2. Short window (1/12 of the long window): Checks whether the elevated error rate persists. When the error rate drops below the threshold in the short window, the alert clears automatically.

This design prevents two common problems:

  • Missing a gradual increase that depletes the error budget over days

  • Keeping alerts active long after a fault is resolved
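
One way to model the two-window check is shown below. This is a sketch under stated assumptions, not the ASM implementation: it assumes the alert is active only while the burn rate exceeds the threshold in both windows, so that a recovery visible in the short window clears it. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    threshold: float          # burn rate that must be exceeded
    long_window_hours: float  # detection window

    @property
    def short_window_hours(self) -> float:
        # The short window is 1/12 of the long window.
        return self.long_window_hours / 12


def is_firing(rule: BurnRateRule, long_burn: float, short_burn: float) -> bool:
    """Assumed rule: active only while both windows exceed the threshold."""
    return long_burn > rule.threshold and short_burn > rule.threshold


rule = BurnRateRule(threshold=14.4, long_window_hours=1.0)
print(is_firing(rule, long_burn=20.0, short_burn=20.0))  # fault ongoing
print(is_firing(rule, long_burn=20.0, short_burn=0.5))   # fault fixed, short window recovered
```

In the second call the long window still reflects the earlier fault, but the short window has already dropped below the threshold, so the modeled alert clears.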

Alert severity levels (30-day SLO example)

| Severity | Condition | Burn rate | Response |
| --- | --- | --- | --- |
| Page-level | 2% of error budget consumed in 1 hour | 14.4x | Immediate action required |
| Page-level | 5% of error budget consumed in 6 hours | 6x | Immediate action required |
| Ticket-level | 10% of error budget consumed in 1 day | 3x | Track via ticket |
| Ticket-level | 10% of error budget consumed in 3 days | 1x (threshold) | Track via ticket |
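
The burn rate in each row follows from the condition: it is the fraction of the budget consumed, multiplied by the ratio of the compliance period to the window. The sketch below recomputes the four values for a 30-day period; `burn_rate_threshold` is a hypothetical helper name.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: float = 30) -> float:
    """Burn rate at which budget_fraction of the budget is spent within window_hours."""
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours


# (severity, fraction of budget consumed, long window in hours)
rules = [
    ("Page-level", 0.02, 1),
    ("Page-level", 0.05, 6),
    ("Ticket-level", 0.10, 24),
    ("Ticket-level", 0.10, 72),
]
for severity, fraction, hours in rules:
    print(severity, hours, round(burn_rate_threshold(fraction, hours), 1))
```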

Short window in practice

Suppose the error rate stays at twice the threshold for 3 days, and the fault is fixed on day 3:

  • With a short window: The alert clears ~6 hours after the fix, because the short window detects the recovery.

  • Without a short window: The alert persists for 3 more days, even though the fault no longer exists.