Configure SLOs for applications in ASM to implement application performance monitoring and management - Alibaba Cloud Service Mesh

If you need to manage and monitor the service level of an application, you can configure service level objectives (SLOs) and alert rules in the Service Mesh (ASM) console to ensure that the application runs as expected. When the service level of the application drops to or below the preset threshold, ASM sends alerts whose level depends on the severity of the fault. This helps you manage the service level of the application more efficiently and handle issues more quickly.

Prerequisites

A Container Service for Kubernetes (ACK) cluster is added to an ASM instance of v1.15.3 or later. For more information, see The cluster is added to the ASM instance.
An ingress gateway is deployed.
Automatic sidecar proxy injection is enabled. For more information, see Enable automatic sidecar proxy injection.

Step 1: Deploy the HTTPBin application

Create an httpbin.yaml file that contains the following content:

##################################################################################################
# httpbin service
##################################################################################################
apiVersion: v1
kind: ServiceAccount
metadata:
  name: httpbin
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  labels:
    app: httpbin
    service: httpbin
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 80
  selector:
    app: httpbin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpbin
      version: v1
  template:
    metadata:
      labels:
        app: httpbin
        version: v1
    spec:
      serviceAccountName: httpbin
      containers:
      - image: docker.io/kennethreitz/httpbin
        imagePullPolicy: IfNotPresent
        name: httpbin
        ports:
        - containerPort: 80

Use kubectl to connect to the ACK cluster and run the following command to deploy the HTTPBin application.
For more information about how to use kubectl to connect to the ACK cluster, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
```
kubectl apply -f httpbin.yaml
```

Step 2: Create a virtual service and an Istio gateway

Create an httpbin-gateway.yaml file that contains the following content:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: httpbin-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "*"
  gateways:
  - httpbin-gateway
  http:
  - route:
    - destination:
        host: httpbin
        port:
          number: 8000

Use kubectl to connect to the ASM instance and run the following command to deploy the virtual service and Istio gateway.
For more information about how to use kubectl to connect to the ASM instance, see Use kubectl on the control plane to access Istio resources.
```
kubectl apply -f httpbin-gateway.yaml
```
In the address bar of your browser, enter http://{IP address of the ingress gateway}.
For more information about how to obtain the IP address of the ingress gateway, see Use Istio resources to route traffic to different versions of a service. If you can view the page of the HTTPBin application, the HTTPBin application is successfully deployed.
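If you prefer to check reachability from a script instead of a browser, the following is a minimal sketch in Python. It assumes that the ingress gateway listens on port 80, as in the Istio gateway above, and it uses the /status/200 endpoint of HTTPBin. The function names status_url and is_reachable are illustrative and are not part of ASM.

```python
import urllib.error
import urllib.request


def status_url(gateway_ip: str, code: int = 200) -> str:
    """Build the URL of the HTTPBin /status/<code> endpoint behind the gateway."""
    return f"http://{gateway_ip}/status/{code}"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and HTTP error responses (4xx/5xx)
        # are all treated as "not reachable".
        return False
```

To use the sketch, call is_reachable(status_url("{IP address of the ingress gateway}")). A return value of True suggests that the gateway routes requests to the HTTPBin application.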

Step 3: Configure an SLO

In this example, an availability SLO is configured for the HTTPBin application in the default namespace. The objective is 99%, and the period of the SLO is 30 days. Alerts are configured at two severity levels: Page and Ticket. For more information about the concepts related to SLOs, see SLO overview.
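To make the objective concrete, the following sketch works out the error budget that a 99% objective over 30 days implies. It is plain arithmetic, not ASM code, and the function names are illustrative.

```python
def error_budget_ratio(objective_percent: float) -> float:
    """Fraction of requests (or time) that may fail without violating the SLO."""
    return (100.0 - objective_percent) / 100.0


def allowed_downtime_minutes(objective_percent: float, period_days: int) -> float:
    """Minutes of complete outage that would use up the whole error budget."""
    period_minutes = period_days * 24 * 60
    return error_budget_ratio(objective_percent) * period_minutes


budget = error_budget_ratio(99)
minutes = allowed_downtime_minutes(99, 30)
print(f"error budget: {budget:.2%}")
print(f"allowed full outage: {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

For the values in this example, the error budget is 1%, which corresponds to about 432 minutes (7.2 hours) of complete unavailability in 30 days.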

Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Observability Management Center > SLO Configuration.
In the upper part of the SLO Configuration page, select the default namespace from the Namespace drop-down list and click Create in the Actions column of the httpbin service.
In the Basic Information section of the Create page, set Duration to 30d.
Click the SLO rule tab. Set Name to asm-slo, Plugin type to availability, and Objective to 99. Turn on Enable alerting rules and set Alerting rules name to asm-alert. Turn on Enable alerting rule with Ticket level and Enable alerting rule with Page level.
(Optional) In the lower part of the page, click Preview to view the configurations. Confirm that the configurations are correct and click Submit.
For more information about the fields in the configuration file, see Description of SLO CRD fields.
In the lower part of the page, click Create.

Step 4: View the automatically generated Prometheus rule

After the SLO is configured, you can view the automatically generated Prometheus rule. On the SLO Configuration page, find the httpbin service and click View Prometheus rules in the Actions column.

Figure: View Prometheus rules

The following code shows a sample Prometheus rule:

groups:
- name: asm-slo-sli-recordings-httpbin-asm-slo
  rules:
  - record: slo:sli_error:ratio_rate5m
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[5m])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[5m])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 5m
  - record: slo:sli_error:ratio_rate30m
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[30m])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[30m])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 30m
  - record: slo:sli_error:ratio_rate1h
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[1h])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[1h])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 1h
  - record: slo:sli_error:ratio_rate2h
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[2h])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[2h])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 2h
  - record: slo:sli_error:ratio_rate6h
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[6h])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[6h])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 6h
  - record: slo:sli_error:ratio_rate1d
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[1d])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[1d])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 1d
  - record: slo:sli_error:ratio_rate3d
    expr: "(\n(\n  sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\"
      }[3d])) \n  /          \n  (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\"
      }[3d])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 3d
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}[30d])
      / ignoring (slo_window)
      count_over_time(slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}[30d])
    labels:
      slo_window: 30d
- name: asm-slo-meta-recordings-httpbin-asm-slo
  rules:
  - record: slo:objective:ratio
    expr: vector(0.99)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:error_budget:ratio
    expr: vector(1-0.99)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:time_period:days
    expr: vector(30)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:current_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
      / on(slo_id, asm_slo, slo_service) group_left
      slo:error_budget:ratio{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
      / on(slo_id, asm_slo, slo_service) group_left
      slo:error_budget:ratio{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{asm_slo="asm-slo", slo_id="httpbin-asm-slo",
      slo_service="httpbin"}
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: asm_slo_info
    expr: vector(1)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_mode: cli-gen-prom
      slo_objective: "99"
      slo_service: httpbin
      slo_spec: prometheus/v1
      slo_version: dev
- name: asm-slo-alerts-httpbin-asm-slo
  rules:
  - alert: asm-alert
    expr: |
      (
          (slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (14.4 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate1h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (14.4 * 0.01))
      )
      or ignoring (slo_window)
      (
          (slo:sli_error:ratio_rate30m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (6 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate6h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (6 * 0.01))
      )
    labels:
      slo_severity: page
    annotations:
      summary: '{{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn
        rate is over expected.'
      title: (page) {{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn
        rate is too fast.
  - alert: asm-alert
    expr: |
      (
          (slo:sli_error:ratio_rate2h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (3 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate1d{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (3 * 0.01))
      )
      or ignoring (slo_window)
      (
          (slo:sli_error:ratio_rate6h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (1 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate3d{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (1 * 0.01))
      )
    labels:
      slo_severity: ticket
    annotations:
      summary: '{{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn
        rate is over expected.'
      title: (ticket) {{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget
        burn rate is too fast.
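The alert expressions in the sample rule compare error ratios over a short window and a long window with a burn-rate factor multiplied by the error budget (0.01). The following sketch re-implements that logic in Python to show how the thresholds are derived and when each alert would fire. It is an illustration under the values of this example, not the way Prometheus evaluates the rule, and all function names are hypothetical.

```python
PERIOD_HOURS = 30 * 24   # SLO period of 30 days (slo:time_period:days)
ERROR_BUDGET = 0.01      # 1 - 0.99 (slo:error_budget:ratio)

# (severity, short window, long window, burn-rate factor, long window in hours)
WINDOWS = [
    ("page", "5m", "1h", 14.4, 1),
    ("page", "30m", "6h", 6.0, 6),
    ("ticket", "2h", "1d", 3.0, 24),
    ("ticket", "6h", "3d", 1.0, 72),
]


def sli_error_ratio(error_rate: float, total_rate: float) -> float:
    """Ratio of failed requests (5xx or 429) to all requests; 0 when there is no traffic."""
    return error_rate / total_rate if total_rate > 0 else 0.0


def threshold(factor: float) -> float:
    """Error ratio above which the budget burns faster than factor times the sustainable rate."""
    return factor * ERROR_BUDGET


def budget_consumed(factor: float, long_hours: float) -> float:
    """Share of the 30-day budget used if the burn rate lasts for the whole long window."""
    return factor * long_hours / PERIOD_HOURS


def fires(short_ratio: float, long_ratio: float, factor: float) -> bool:
    """Both windows must exceed the threshold, mirroring the 'and' in the rule."""
    limit = threshold(factor)
    return short_ratio > limit and long_ratio > limit


def page_alert(r5m: float, r1h: float, r30m: float, r6h: float) -> bool:
    return fires(r5m, r1h, 14.4) or fires(r30m, r6h, 6.0)


def ticket_alert(r2h: float, r1d: float, r6h: float, r3d: float) -> bool:
    return fires(r2h, r1d, 3.0) or fires(r6h, r3d, 1.0)


for severity, short, long_, factor, hours in WINDOWS:
    print(f"{severity:6} {short:>3}/{long_:<2} threshold={threshold(factor):.3f} "
          f"budget used over long window={budget_consumed(factor, hours):.0%}")
```

Requiring both windows to exceed the threshold keeps a brief spike from triggering an alert: the short window shows that the problem is still ongoing, and the long window shows that a significant share of the error budget has already been used.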

What to do next

You can import the generated Prometheus rule to the Prometheus system for the SLO to take effect and use Grafana to view SLO-related metrics. For more information, see Import the generated Prometheus rule to the Prometheus system for the SLOs to take effect and Use Grafana to view SLOs.