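Usage notes
Before you use this feature, make sure that your self-managed Prometheus instance can access the API server of the ACK managed Pro cluster and has read permission on /metrics.
The self-managed Prometheus instance can be deployed inside or outside the cluster.
ACK managed Pro clusters expose monitoring metrics for the managed components kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and cloud-controller-manager. Before you use this feature, we recommend that you read the metric description document of each component (referenced in the component sections below) to learn which metrics are exposed and what they mean.
You can also use Alibaba Cloud Managed Service for Prometheus in the cluster. It monitors the cluster, collects data automatically, provides real-time Grafana dashboards, and lets you create alerts for monitoring jobs, with notifications delivered in real time through channels such as email, SMS, and DingTalk.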
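The read permission on /metrics mentioned in the notes above is typically granted through Kubernetes RBAC. The following is a minimal sketch under stated assumptions, not an official ACK manifest: the role name prometheus-metrics-reader, the ServiceAccount prometheus, and the namespace monitoring are hypothetical and should be replaced with the identity that your Prometheus instance actually uses. The second rule is included because the endpoints role of kubernetes_sd_configs lists and watches these resources to discover targets.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-metrics-reader      # hypothetical name
rules:
  # Read access to the non-resource URL served by the API server.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
  # Needed by kubernetes_sd_configs with role: endpoints for target discovery.
  - apiGroups: [""]
    resources: ["endpoints", "services", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader      # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus                   # hypothetical ServiceAccount
    namespace: monitoring              # hypothetical namespace
```

If cluster-wide discovery is broader than you want, the second rule could instead be placed in a Role bound in the default namespace, since the scrape jobs in this document only discover endpoints there.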
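Configure the Prometheus scrape file
Before you use a self-managed Prometheus instance to collect metrics from the cluster's control plane components, configure the corresponding scrape jobs in the Prometheus configuration file prometheus.yaml. In the sample configuration file, each core component corresponds to one job. For the details of each job, see the metric description document of the corresponding component.
For information about how to configure prometheus.yaml for open source Prometheus, see Configuration.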
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# One scrape job per control plane component. The elided job bodies are
# given in the component sections of this document.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......
  - job_name: ack-etcd
    ......
  - job_name: ack-scheduler
    ......
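For information about the open source Prometheus Operator solution and the ack-prometheus-operator component in the ACK marketplace, see Open source Prometheus monitoring. To customize the scrape configuration, see the official Prometheus Operator community documentation.
In-cluster monitoring
If your Prometheus instance is deployed inside the cluster to be monitored, follow the instructions below to monitor the cluster's core components and collect their metrics.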
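With Prometheus Operator, scrape jobs are usually declared as ServiceMonitor resources rather than edited into prometheus.yaml. The following is a rough sketch of how the kube-apiserver job might be expressed, modeled on the ServiceMonitor that the community kube-prometheus project ships for the API server. It is not taken from the ACK documentation. The name, namespace, and release label are hypothetical, and field availability (for example bearerTokenFile) depends on the operator version you run.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ack-api-server        # hypothetical name
  namespace: monitoring       # hypothetical namespace
  labels:
    release: prometheus       # hypothetical; must match your Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: [default]     # the kubernetes Service lives in the default namespace
  selector:
    matchLabels:              # labels carried by the kubernetes Service
      component: apiserver
      provider: kubernetes
  endpoints:
    - port: https
      scheme: https
      path: /metrics
      interval: 30s
      honorLabels: true
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        serverName: kubernetes
      # Components that are scraped through the API server with query
      # parameters could use the endpoint's params map, for example:
      # params:
      #   hosting: ["true"]
      #   job: ["etcd"]
```

The operator generates its own job label for ServiceMonitor targets, so it may differ from the job names used in the alert rules of this document. Check the resulting labels on the Targets page and adjust the rule selectors if needed.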
kube-apiserver
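For the list of metrics that can be collected, see the kube-apiserver component metric description.
For clusters of version 1.20 or later that were created in or after February 2023, access to the kubernetes Service in the default namespace has been upgraded from forwarding through a Classic Load Balancer (CLB) instance to a direct connection over elastic network interfaces (ENIs). For details, see Kube API Server. After this change, all kube-apiserver replicas are visible to the data plane, so you can configure a scrape job that collects kube-apiserver metrics directly. The collection path is shorter and the metric coverage is more complete.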
Run the command kubectl get endpoints kubernetes to determine the backend path type of the cluster's kubernetes Service.
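ENI direct connection architecture: the expected output shows two or more IP addresses (for example, a.b.c.d:6443,w.x.y.z:6443).
NAME         ENDPOINTS                   AGE
kubernetes   a.b.c.d:6443,w.x.y.z:6443   27h
CLB forwarding architecture: the expected output shows only one IP address (for example, a.b.c.d:6443), which is the internal IP address of the CLB instance.
NAME         ENDPOINTS      AGE
kubernetes   a.b.c.d:6443   27h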
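Select the Prometheus scrape configuration and alert rules that match the backend path type of the cluster's kubernetes Service.
Prometheus scrape configuration
ENI direct connection architecture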
- job_name: ack-api-server
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - separator: ;
      regex: (.*)
      target_label: endpoint
      replacement: https
      action: replace
CLB forwarding architecture
- job_name: ack-api-server
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["apiserver"]
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - separator: ;
      regex: (.*)
      target_label: endpoint
      replacement: https
      action: replace
Prometheus alert rules
- alert: AckApiServerWarning
  annotations:
    message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
  for: 5m
  labels:
    severity: critical
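The alert rule above is shown as a bare list item. Prometheus loads rules from rule files, where each rule belongs to a named group, and the files are referenced from prometheus.yaml through rule_files. The following sketch shows one way to wire this up. The file path /etc/prometheus/rules/ack-control-plane.yaml and the group name ack-control-plane are assumptions for illustration only.

```yaml
# /etc/prometheus/rules/ack-control-plane.yaml (hypothetical path)
groups:
  - name: ack-control-plane            # hypothetical group name
    rules:
      - alert: AckApiServerWarning
        annotations:
          message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
        expr: |
          (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
        for: 5m
        labels:
          severity: critical
```

```yaml
# prometheus.yaml: tell Prometheus where to find the rule file.
rule_files:
  - /etc/prometheus/rules/*.yaml       # hypothetical directory
```

The rules for the other components in this document can be appended to the same rules list in the same way.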
etcd
For the list of metrics that can be collected, see the etcd component metric description.
Prometheus scrape configuration
- job_name: ack-etcd
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["etcd"]
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - separator: ;
      regex: (.*)
      target_label: endpoint
      replacement: https
      action: replace
Prometheus alert rules
- alert: AckETCDWarning
  annotations:
    message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
  expr: |
    sum_over_time(etcd_server_has_leader[5m]) == 0
  for: 5m
  labels:
    severity: critical
- alert: AckETCDWarning
  annotations:
    message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
  for: 5m
  labels:
    severity: critical
kube-scheduler
For the list of metrics that can be collected, see the kube-scheduler component metric description.
Prometheus scrape configuration
- job_name: ack-scheduler
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-scheduler"]
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - separator: ;
      regex: (.*)
      target_label: endpoint
      replacement: https
      action: replace
Prometheus alert rules
- alert: AckSchedulerWarning
  annotations:
    message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
kube-controller-manager
For the list of metrics that can be collected, see the kube-controller-manager component metric description.
Prometheus scrape configuration
- job_name: ack-kcm
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-kube-controller-manager"]
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - separator: ;
      regex: (.*)
      target_label: endpoint
      replacement: https
      action: replace
Prometheus alert rules
- alert: AckKCMWarning
  annotations:
    message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-kcm",pod!=""}) or (count(up{job="ack-kcm",pod!=""}) <= 0)) >= 1
  for: 3m
  labels:
    severity: critical
cloud-controller-manager
For the list of metrics that can be collected, see the cloud-controller-manager component metric description.
Prometheus scrape configuration
- job_name: ack-cloud-controller-manager
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-cloud-controller-manager"]
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
    insecure_skip_verify: false
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - separator: ;
      regex: (.*)
      target_label: endpoint
      replacement: https
      action: replace
Prometheus alert rules
- alert: AckCCMWarning
  annotations:
    message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
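Out-of-cluster monitoring
If your Prometheus instance is deployed outside the cluster to be monitored, see Configuration and Monitoring kubernetes with prometheus from outside of k8s cluster to monitor the cluster's core components and collect their metrics. The main configuration is as follows.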
- job_name: 'out-of-k8s-scrape-job'
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/kubernetes-ca.crt
  bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
  kubernetes_sd_configs:
    - api_server: 'https://<KUBERNETES URL>'
      role: node
      tls_config:
        ca_file: /etc/prometheus/kubernetes-ca.crt
      bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
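The job above discovers nodes. For the control plane components, one possible adaptation is to point a static target at the API server endpoint and reuse the query parameters from the in-cluster jobs. This is only a sketch. It assumes that the API server address reachable from your Prometheus host serves the same /metrics path and honors the same hosting and job parameters, which the text above does not confirm. The target placeholder and the token file path /etc/prometheus/kubernetes-token are hypothetical.

```yaml
- job_name: ack-etcd
  scheme: https
  metrics_path: /metrics
  honor_labels: true
  params:                      # same parameters as the in-cluster ack-etcd job
    hosting: ["true"]
    job: ["etcd"]
  authorization:
    # Hypothetical path: a file holding the ServiceAccount token
    # that has read permission on /metrics.
    credentials_file: /etc/prometheus/kubernetes-token
  tls_config:
    ca_file: /etc/prometheus/kubernetes-ca.crt
  static_configs:
    # Placeholder: host and port of the API server as seen from outside the cluster.
    - targets: ['<API SERVER HOST>:6443']
```

Reading the token from a file instead of inlining it with bearer_token keeps the secret out of the configuration file, which is usually easier to rotate and to keep out of version control.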
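Verify the result
Log on to the console of the self-managed Prometheus instance and switch to the Graph page.
Enter up and check whether data for all control plane components is displayed as expected.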
up
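Expected output:

up{instance="XX.XX.XX.XX:6443", job="ack-api-server"}: the status of the proxy endpoint. XX.XX.XX.XX is the IP address of the kubernetes Service in the default namespace of the cluster, and differs from cluster to cluster.
up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"}: the status of a control plane Pod. This up metric can be used for liveness checks on control plane Pods.
Enter the following metric and check whether it is displayed as expected.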
apiserver_request_total{job="ack-api-server"}
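Expected output:

If the page displays the queried metric and its data, the self-managed Prometheus instance is collecting core component metrics correctly.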