全部產品
Search
文件中心

Container Service for Kubernetes:通過自建Prometheus採集控制面組件指標並配置警示

更新時間:Aug 06, 2025

ACK託管叢集Pro版對外透出了控制面組件的監控指標並提供了組件監控大盤。您可以參見本文在自建的Prometheus中採集叢集核心組件指標,並基於指標配置監控警示,實現與自建監控系統的整合。

使用須知

配置Prometheus採集檔案

使用自建Prometheus採集叢集控制面組件指標前,需要在Prometheus的設定檔prometheus.yaml中配置對應的指標採集Job。樣本設定檔中,每個核心組件對應一個Job配置,具體配置可參見對應核心組件指標說明文檔。

關於如何配置社區Prometheus的prometheus.yaml,請參見Configuration
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......

  - job_name: ack-etcd
    ......

  - job_name: ack-scheduler
    ......

            

社區Prometheus Operator方案以及ACK應用市場ack-prometheus-operator組件的相關資訊,請參見開源Prometheus監控。關於自訂採集配置,請參見Prometheus Operator社區官方文檔Prometheus Operator進行資料擷取配置。

配置Prometheus警示規則

關於如何為開源Prometheus警示配置,請參見Alerting_rules

叢集內部監控

如果您的Prometheus部署在待監控的叢集內部,您可以參見下文完成叢集核心組件的監控和資料擷取。

kube-apiserver

請參見kube-apiserver組件監控指標說明瞭解監控採集指標清單。

針對自2023年02月起建立的、1.20及以上版本的叢集,訪問 default 命名空間中的 kubernetes 服務時,服務訪問路徑已從傳統型負載平衡(CLB)轉寄升級為彈性網卡(ENI)直連架構,詳情請參見Kube API Server。變更後,kube-apiserver全部副本對資料面可見。您可以配置監控採集任務直接採集kube-apiserver指標,採集鏈路更直接,指標覆蓋更全面。

您可執行命令kubectl get endpoints kubernetes判斷叢集kubernetes Service的後端鏈路類型。

展開查看預期輸出

  • ENI直連架構:預期輸出顯示 2 個及以上IP地址(如 a.b.c.d:6443,w.x.y.z:6443)。

    NAME         ENDPOINTS                               AGE
    kubernetes   a.b.c.d:6443,w.x.y.z:6443               27h
  • CLB轉寄架構:預期輸出僅顯示 1 個IP地址(如 a.b.c.d:6443),該IP為CLB的內網IP地址。

    NAME         ENDPOINTS                               AGE
    kubernetes   a.b.c.d:6443                            27h

請根據叢集Kubernetes服務的後端鏈路類型選擇Prometheus採集配置和警示規則。

  • Prometheus採集配置

    • ENI直連架構

      - job_name: ack-api-server  
        scrape_interval: 30s
        scrape_timeout: 30s
        metrics_path: /metrics
        scheme: https
        #  scheme: https
        honor_labels: true
        honor_timestamps: true
        kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names: [default]
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          insecure_skip_verify: false
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          server_name: kubernetes
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_label_component]
          separator: ;
          regex: apiserver
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_service_label_provider]
          separator: ;
          regex: kubernetes
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_endpoint_port_name]
          separator: ;
          regex: https
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_namespace]
          separator: ;
          regex: (.*)
          target_label: namespace
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
          separator: ;
          regex: Node;(.*)
          target_label: node
          replacement: ${1}
          action: replace
        - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
          separator: ;
          regex: Pod;(.*)
          target_label: pod
          replacement: ${1}
          action: replace
        - source_labels: [__meta_kubernetes_service_name]
          separator: ;
          regex: (.*)
          target_label: service
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_service_name]
          separator: ;
          regex: (.*)
          target_label: job
          replacement: ${1}
          action: replace
        - source_labels: [__meta_kubernetes_service_label_component]
          separator: ;
          regex: (.+)
          target_label: job
          replacement: ${1}
          action: replace
        - separator: ;
          regex: (.*)
          target_label: endpoint
          replacement: https
          action: replace
    • CLB轉寄架構

      - job_name: ack-api-server  
        scrape_interval: 30s
        scrape_timeout: 30s
        metrics_path: /metrics
        scheme: https
        #  scheme: https
        honor_labels: true
        honor_timestamps: true
        params:
          hosting: ["true"]
          job: ["apiserver"]
        kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names: [default]
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          insecure_skip_verify: false
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          server_name: kubernetes
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_label_component]
          separator: ;
          regex: apiserver
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_service_label_provider]
          separator: ;
          regex: kubernetes
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_endpoint_port_name]
          separator: ;
          regex: https
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_namespace]
          separator: ;
          regex: (.*)
          target_label: namespace
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
          separator: ;
          regex: Node;(.*)
          target_label: node
          replacement: ${1}
          action: replace
        - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
          separator: ;
          regex: Pod;(.*)
          target_label: pod
          replacement: ${1}
          action: replace
        - source_labels: [__meta_kubernetes_service_name]
          separator: ;
          regex: (.*)
          target_label: service
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_service_name]
          separator: ;
          regex: (.*)
          target_label: job
          replacement: ${1}
          action: replace
        - source_labels: [__meta_kubernetes_service_label_component]
          separator: ;
          regex: (.+)
          target_label: job
          replacement: ${1}
          action: replace
        - separator: ;
          regex: (.*)
          target_label: endpoint
          replacement: https
          action: replace
  • Prometheus警示規則

    - alert: AckApiServerWarning
      annotations:
        message:  APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
      for: 5m
      labels:
        severity: critical

etcd

請參見etcd組件監控指標說明瞭解監控採集指標清單。
  • Prometheus採集配置

    - job_name: ack-etcd 
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      #  scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["etcd"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      authorization:
        credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        insecure_skip_verify: false
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server_name: kubernetes
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https
        action: replace
  • Prometheus警示規則

    - alert: AckETCDWarning
      annotations:
        message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
      expr: |
        sum_over_time(etcd_server_has_leader[5m]) == 0
      for: 5m
      labels:
        severity: critical
    
    
    - alert: AckETCDWarning
      annotations:
        message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
      for: 5m
      labels:
        severity: critical

kube-scheduler

請參見kube-scheduler組件監控指標說明瞭解監控採集指標清單。
  • Prometheus採集配置

    - job_name: ack-scheduler
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      #  scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["ack-scheduler"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      authorization:
        credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        insecure_skip_verify: false
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server_name: kubernetes
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https
        action: replace
  • Prometheus警示規則

    - alert: AckSchedulerWarning
      annotations:
        message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
      for: 3m
      labels:
        severity: critical

kube-controller-manager

請參見kube-controller-manager組件監控指標說明瞭解監控採集指標清單。
  • Prometheus採集配置

    - job_name: ack-kcm
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      #  scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["ack-kube-controller-manager"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      authorization:
        credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        insecure_skip_verify: false
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server_name: kubernetes
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https
        action: replace
  • Prometheus警示規則

    - alert: AckKCMWarning
      annotations:
        message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-kcm",pod!=""})or(count(up{job="ack-kcm",pod!=""})<=0))>=1
      for: 3m
      labels:
        severity: critical

cloud-controller-manager

請參見cloud-controller-manager組件監控指標說明瞭解監控採集指標清單。
  • Prometheus採集配置

    - job_name: ack-cloud-controller-manager
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: https
      #  scheme: https
      honor_labels: true
      honor_timestamps: true
      params:
        hosting: ["true"]
        job: ["ack-cloud-controller-manager"]
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
                   insecure_skip_verify: false}
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https
        action: replace
  • Prometheus警示規則

    - alert: AckCCMWarning
      annotations:
        message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
      expr: |
        (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
      for: 3m
      labels:
        severity: critical

叢集外部監控

如果您的Prometheus部署在待監控的叢集外部,請參見ConfigurationMonitoring kubernetes with prometheus from outside of k8s cluster完成叢集核心組件的監控和資料擷取。主要配置如下。

  - job_name: 'out-of-k8s-scrape-job'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/kubernetes-ca.crt
    bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'

    kubernetes_sd_configs:
      - api_server: 'https://<KUBERNETES URL>'
        role: node
        tls_config:
          ca_file: /etc/prometheus/kubernetes-ca.crt
        bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
            

驗證效果

  1. 登入自建的Prometheus控制台,切換到Graph頁面。

  2. 輸入up,查看是否全部控制面組件資料顯示正常。

    up

    預期輸出:

    自訂

    • up{instance="XX.XX.XX.XX:6443", job="ack-api-server"}:代理Endpoint狀態。其中,XX.XX.XX.XX是叢集default命名空間下kubernetes Service的IP,不同叢集對應的IP不同。

    • up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"}:控制面Pod的狀態。該up指標可用於對控制面Pod進行探活檢測。

  3. 輸入以下指標,查看是否可以正常顯示。

    apiserver_request_total{job="ack-api-server"}

    預期輸出:

    顯示2

    如果介面能正常顯示查詢的指標和資料,表明自建Prometheus可以正常採集核心組件指標。