Container Service for Kubernetes: Monitor and alert on control plane components with a self-managed Prometheus

Last updated: Jun 19, 2024

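This topic describes how to configure a self-managed Prometheus instance to collect metrics from the control plane components of an ACK Pro cluster (APIServer, etcd, Scheduler, KCM, and CCM), and lists the recommended alert rules.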

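Prerequisites

  • The self-managed Prometheus instance can access the APIServer of the ACK Pro cluster and has read permission on /metrics.

  • The self-managed Prometheus instance can run either inside or outside the ACK Pro cluster.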
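
The read permission on /metrics is typically granted through Kubernetes RBAC. The following is a minimal sketch, not an official ACK manifest. The ServiceAccount name `prometheus`, the namespace `monitoring`, and the role name `prometheus-metrics-reader` are hypothetical; replace them with the identity that your Prometheus instance actually uses.

```yaml
# Hypothetical names throughout; adjust to your own deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-metrics-reader
rules:
# Allow HTTP GET on the APIServer's non-resource /metrics path.
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
# Allow target discovery for kubernetes_sd_configs with role: endpoints.
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-metrics-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```

The second rule is needed only because the scrape jobs in this topic discover their targets through the Kubernetes API. If your Prometheus instance already has broader read access, you may not need a separate role.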

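Background information

ACK Pro exposes the monitoring metrics of its core control plane components: APIServer, Cloud Controller Manager, etcd, Kube Controller Manager, and Scheduler. Dashboards for these components are preconfigured in ARMS. If you use ARMS monitoring, the ARMS agent collects the metrics automatically and displays them on the dashboards in real time. If you instead want to collect these metrics with a self-managed Prometheus instance and configure alerts on them, so that they integrate with your own monitoring system, follow the configuration in this topic.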

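Prometheus scrape configuration

To collect the metrics of the core control plane components of an ACK Pro cluster with a self-managed Prometheus instance, first add scrape jobs to the Prometheus configuration file prometheus.yaml. The file has the following format: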

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Scrape configurations: one job for each control plane component.
# The job bodies are elided here; see the component sections in this topic.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......

  - job_name: ack-etcd
    ......

  - job_name: ack-scheduler
    ......

            

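Each core component has its own job. For the specific configuration, see the section for the corresponding component in this topic and its metric list. For more information about how to configure prometheus.yaml for open source Prometheus, see Configuration.

For more information about the open source Prometheus Operator solution and the ack-prometheus-operator component in the ACK marketplace, see Open source Prometheus monitoring. For more information about custom scrape configurations, see Prometheus Operator in the official Prometheus Operator community documentation.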
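
If you run Prometheus through Prometheus Operator, one common way to supply raw scrape jobs such as the ones in this topic is the `additionalScrapeConfigs` field of the Prometheus custom resource, which points to a key in a Secret. The following is a sketch under assumptions: the Secret name, the key name, the namespace `monitoring`, and the Prometheus resource name `k8s` are all hypothetical, and the Prometheus resource is shown only as a fragment.

```yaml
# Secret that holds the raw scrape jobs (hypothetical name and key).
apiVersion: v1
kind: Secret
metadata:
  name: ack-control-plane-scrape-configs
  namespace: monitoring
stringData:
  additional-scrape-configs.yaml: |
    # Paste the job definitions from the component sections here, for example:
    # - job_name: ack-api-server
    #   ...
---
# Fragment of a Prometheus custom resource; other fields are omitted.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  additionalScrapeConfigs:
    name: ack-control-plane-scrape-configs   # Secret name
    key: additional-scrape-configs.yaml      # key inside the Secret
```

The Secret must be in the same namespace as the Prometheus resource that references it.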

Prometheus alert rule configuration

For more information about how to configure alerts in open source Prometheus, see Alerting rules.
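
The alert rules in this topic are shown as bare `- alert:` entries. In open source Prometheus, such entries live inside a rule group in a separate rule file that prometheus.yaml references through `rule_files`. The sketch below shows that wiring with the kube-apiserver rule from this topic; the file path and the group name are hypothetical.

```yaml
# File 1: prometheus.yaml (fragment). The rule file path is hypothetical.
rule_files:
  - /etc/prometheus/rules/ack-control-plane.yaml
---
# File 2: /etc/prometheus/rules/ack-control-plane.yaml
groups:
  - name: ack-control-plane   # hypothetical group name
    rules:
      # Add the "- alert: ..." entries from the component sections here.
      - alert: AckApiServerWarning
        annotations:
          message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
        expr: |
          (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
        for: 5m
        labels:
          severity: critical
```

You can check the syntax of a rule file with `promtool check rules` before you reload Prometheus.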

In-cluster monitoring of an ACK Pro cluster

In-cluster monitoring means that Prometheus is deployed inside the ACK Pro cluster that it monitors.

kube-apiserver

For more information about the kube-apiserver component, see kube-apiserver component monitoring.

  • Prometheus scrape configuration

    - job_name: ack-api-server  
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["apiserver"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
                   insecure_skip_verify: false}
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
  • Prometheus alert rules

    - alert: AckApiServerWarning
      annotations:
        message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
      for: 5m
      labels:
        severity: critical
  • Collected metrics

    For the list of collected kube-apiserver metrics, see kube-apiserver metrics.

etcd

For more information about the etcd component, see etcd component monitoring.

  • Prometheus scrape configuration

    - job_name: ack-etcd 
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["etcd"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
                   insecure_skip_verify: false}
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
  • Prometheus alert rules

    - alert: AckETCDWarning
      annotations:
        message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
      expr: |
        sum_over_time(etcd_server_has_leader[5m]) == 0
      for: 5m
      labels:
        severity: critical

    - alert: AckETCDWarning
      annotations:
        message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
      for: 5m
      labels:
        severity: critical
  • Collected metrics

    For the list of collected etcd metrics, see etcd metrics.

kube-scheduler

For more information about the kube-scheduler component, see kube-scheduler component monitoring.

  • Prometheus scrape configuration

    - job_name: ack-scheduler
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["ack-scheduler"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
                   insecure_skip_verify: false}
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
  • Prometheus alert rules

    - alert: AckSchedulerWarning
      annotations:
        message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
      for: 3m
      labels:
        severity: critical
  • Collected metrics

    For the list of collected kube-scheduler metrics, see kube-scheduler metrics.

kube-controller-manager

For more information about the kube-controller-manager component, see kube-controller-manager component monitoring.

  • Prometheus scrape configuration

    - job_name: ack-kcm
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["ack-kube-controller-manager"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
                   insecure_skip_verify: false}
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
  • Prometheus alert rules

    - alert: AckKCMWarning
      annotations:
        message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-kcm",pod!=""}) or (count(up{job="ack-kcm",pod!=""}) <= 0)) >= 1
      for: 3m
      labels:
        severity: critical
  • Collected metrics

    For the list of collected kube-controller-manager metrics, see kube-controller-manager metrics.

cloud-controller-manager

For more information about the cloud-controller-manager component, see cloud-controller-manager component monitoring.

  • Prometheus scrape configuration

    - job_name: ack-cloud-controller-manager
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["ack-cloud-controller-manager"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
                   insecure_skip_verify: false}
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
  • Prometheus alert rules

    - alert: AckCCMWarning
      annotations:
        message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
      for: 3m
      labels:
        severity: critical
  • Collected metrics

    For the list of collected cloud-controller-manager metrics, see cloud-controller-manager metrics.

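Monitoring an ACK Pro cluster from outside the cluster

To monitor the Kubernetes cluster with a Prometheus instance that runs outside the ACK Pro cluster, see Configuration and Monitoring kubernetes with prometheus from outside of k8s cluster. The main configuration is as follows: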

  - job_name: 'out-of-k8s-scrape-job'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/kubernetes-ca.crt
    bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'

    kubernetes_sd_configs:
      - api_server: 'https://<KUBERNETES URL>'
        role: node
        tls_config:
          ca_file: /etc/prometheus/kubernetes-ca.crt
        bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
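
The configuration above needs a ServiceAccount bearer token and the cluster CA certificate. One way to obtain a long-lived token is to create a Secret of type `kubernetes.io/service-account-token` for a dedicated ServiceAccount. The following is a sketch under assumptions: the names `prometheus-external` and `prometheus-external-token` and the namespace `monitoring` are hypothetical.

```yaml
# Hypothetical ServiceAccount used by the external Prometheus instance.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-external
  namespace: monitoring
---
# Long-lived token Secret bound to the ServiceAccount above.
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-external-token
  namespace: monitoring
  annotations:
    kubernetes.io/service-account.name: prometheus-external
type: kubernetes.io/service-account-token
```

After the Secret is created, Kubernetes fills in its `token` and `ca.crt` data keys. Base64-decode them and use the values for `bearer_token` and `ca_file`. The ServiceAccount still needs RBAC read permission on /metrics, as described in the prerequisites.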
            

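Verify the result

  1. Log on to the console of the self-managed Prometheus instance and go to the Graph page.

  2. Enter up and check whether all control plane components are displayed.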

    up

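    Expected output:

    (Screenshot of the up query result)

    Important
    • up{instance="x.x.x.x:6443", job="ack-api-server"} is the status of the endpoint that acts as a proxy. x.x.x.x is the IP address of the kubernetes Service in the default namespace of the Kubernetes cluster. This IP address differs from cluster to cluster.

    • up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"} is the status of an individual control plane Pod. You can use this up metric to probe the liveness of control plane Pods.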

  3. Enter the following metric and check whether it is displayed correctly.

    apiserver_request_total{job="ack-api-server"}

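    Expected output:

    (Screenshot of the apiserver_request_total query result)

    If the page displays the queried metric and its data, the self-managed Prometheus instance is collecting the metrics of the core control plane components as expected.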
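
As noted in step 2, the Pod-level up series can serve as a liveness probe for control plane Pods. The following is a minimal sketch of such a rule, not one of the recommended ACK alerts. The group name, the alert name, the severity, and the 5-minute duration are hypothetical choices.

```yaml
groups:
  - name: ack-control-plane-liveness   # hypothetical group name
    rules:
      - alert: AckApiServerPodDown     # hypothetical alert name
        # Fires for each control plane Pod whose up series stays at 0.
        expr: up{job="ack-api-server", pod!=""} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          message: 'Control plane Pod {{ $labels.pod }} has reported up == 0 for 5 minutes.'
```

Unlike the count-based rules in the component sections, this rule fires per Pod, so it complements those rules rather than replacing them.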