本文介绍如何基于用户自建的Prometheus,采集ACK Pro集群的控制平面组件监控APIServer、etcd、Scheduler、KCM、CCM指标配置,并介绍推荐的报警配置。
背景信息
ACK Pro提供控制平面核心组件监控对外透出的功能,并基于ARMS预置了相关的组件监控大盘,具体包括APIServer、Cloud Controller Manager、etcd、Kube Controller Manager和Scheduler,如果您选用了ARMS监控能力,监控数据会被ARMS代理自动采集并在监控大盘上实时展示。如果您希望通过自建Prometheus采集ACK Pro集群的控制平面核心组件指标并配置相应告警,实现与自建监控系统的集成,可以基于本文进行配置。
Prometheus采集配置
使用自建的Prometheus采集ACK Pro集群控制平面核心组件指标,首先需要在Prometheus的配置文件prometheus.yaml中配置指标采集Job,配置文件格式如下:
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds.
# Attach these labels to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
external_labels:
monitor: 'codelab-monitor'
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: ack-api-server
......
- job_name: ack-etcd
......
- job_name: ack-scheduler
......
其中,每个核心组件对应一个Job配置,具体配置可参见对应核心组件的指标清单。社区Prometheus配置Prometheus.yaml方法,请参见Configuration。
社区Prometheus Operator方案以及ACK应用市场ack-prometheus-operator组件的相关信息,请参见开源Prometheus监控。关于自定义采集配置,请参见Prometheus Operator社区官方文档Prometheus Operator进行数据采集配置。
ACK Pro集群内部监控
内部监控是将Prometheus部署在待监控的ACK Pro集群内的监控形式。
kube-apiserver
关于kube-apiserver组件的更多信息,请参见kube-apiserver组件监控。
Prometheus采集配置
- job_name: ack-api-server
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: https
# scheme: https
honor_labels: true
honor_timestamps: true
params:
hosting: ["true"]
job: ["apiserver"]
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [default]
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
insecure_skip_verify: false}
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: apiserver
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_label_provider]
separator: ;
regex: kubernetes
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
separator: ;
regex: https
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Node;(.*)
target_label: node
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Pod;(.*)
target_label: pod
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: job
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: (.+)
target_label: job
replacement: ${1}
action: replace
- {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Prometheus告警规则
- alert: AckApiServerWarning
annotations:
message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
expr: |
(absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
for: 5m
labels:
severity: critical
etcd
关于etcd组件的更多信息,请参见etcd组件监控。
Prometheus采集配置
- job_name: ack-etcd
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: https
# scheme: https
honor_labels: true
honor_timestamps: true
params:
hosting: ["true"]
job: ["etcd"]
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [default]
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
insecure_skip_verify: false}
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: apiserver
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_label_provider]
separator: ;
regex: kubernetes
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
separator: ;
regex: https
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Node;(.*)
target_label: node
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Pod;(.*)
target_label: pod
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: job
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: (.+)
target_label: job
replacement: ${1}
action: replace
- {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Prometheus告警规则
- alert: AckETCDWarning
annotations:
message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
expr: |
sum_over_time(etcd_server_has_leader[5m]) == 0
for: 5m
labels:
severity: critical
- alert: AckETCDWarning
annotations:
message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
expr: |
(absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
for: 5m
labels:
severity: critical
kube-scheduler
关于kube-scheduler组件的更多信息,请参见kube-scheduler组件监控。
Prometheus采集配置
- job_name: ack-scheduler
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: https
# scheme: https
honor_labels: true
honor_timestamps: true
params:
hosting: ["true"]
job: ["ack-scheduler"]
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [default]
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
insecure_skip_verify: false}
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: apiserver
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_label_provider]
separator: ;
regex: kubernetes
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
separator: ;
regex: https
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Node;(.*)
target_label: node
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Pod;(.*)
target_label: pod
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: job
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: (.+)
target_label: job
replacement: ${1}
action: replace
- {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Prometheus告警规则
- alert: AckSchedulerWarning
annotations:
message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
expr: |
(absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
for: 3m
labels:
severity: critical
kube-controller-manager
关于kube-controller-manager组件的更多信息,请参见kube-controller-manager组件监控。
Prometheus采集配置
- job_name: ack-kcm
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: https
# scheme: https
honor_labels: true
honor_timestamps: true
params:
hosting: ["true"]
job: ["ack-kube-controller-manager"]
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [default]
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
insecure_skip_verify: false}
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: apiserver
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_label_provider]
separator: ;
regex: kubernetes
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
separator: ;
regex: https
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Node;(.*)
target_label: node
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Pod;(.*)
target_label: pod
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: job
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: (.+)
target_label: job
replacement: ${1}
action: replace
- {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Prometheus告警规则
- alert: AckKCMWarning
annotations:
message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
expr: |
(absent(up{job="ack-kcm",pod!=""})or(count(up{job="ack-kcm",pod!=""})<=0))>=1
for: 3m
labels:
severity: critical
cloud-controller-manager
关于cloud-controller-manager组件的更多信息,请参见cloud-controller-mananger组件监控。
Prometheus采集配置
- job_name: ack-cloud-controller-manager
scrape_interval: 30s
scrape_timeout: 30s
metrics_path: /metrics
scheme: https
# scheme: https
honor_labels: true
honor_timestamps: true
params:
hosting: ["true"]
job: ["ack-cloud-controller-manager"]
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: [default]
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
insecure_skip_verify: false}
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: apiserver
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_label_provider]
separator: ;
regex: kubernetes
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
separator: ;
regex: https
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Node;(.*)
target_label: node
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Pod;(.*)
target_label: pod
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: job
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_service_label_component]
separator: ;
regex: (.+)
target_label: job
replacement: ${1}
action: replace
- {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Prometheus告警规则
- alert: AckCCMWarning
annotations:
message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
expr: |
(absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
for: 3m
labels:
severity: critical
ACK Pro集群外部监控
如果需要使用ACK Pro集群外的Prometheus来监控Kubernetes集群,具体操作,请参见Configuration和Monitoring kubernetes with prometheus from outside of k8s cluster。主要配置如下:
- job_name: 'out-of-k8s-scrape-job'
scheme: https
tls_config:
ca_file: /etc/prometheus/kubernetes-ca.crt
bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
kubernetes_sd_configs:
- api_server: 'https://<KUBERNETES URL>'
role: node
tls_config:
ca_file: /etc/prometheus/kubernetes-ca.crt
bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
验证效果
登录自建的Prometheus控制台,切换到Graph页面。
输入up,查看是否全部控制平面组件都可以显示。
up
预期输出:
重要 up{instance="x.x.x.x:6443", job="ack-api-server"}
是作为代理的Endpoint状态。其中,x.x.x.x
是K8s集群default命名空间下Kubernetes Service的IP,不同用户集群该IP不同。
up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"}
是具体控制面Pod的状态。可以使用该up
指标为控制面Pod做探活检测。
输入以下指标,查看是否可以正常显示。
apiserver_request_total{job="ack-api-server"}
预期输出:
如果界面能正常显示查询的指标和数据,说明自建Prometheus可以正常采集控制平面核心组件指标。