This topic describes how to use a self-managed Prometheus system to collect metrics from kube-apiserver, etcd, kube-scheduler, cloud-controller-manager, and kube-controller-manager in a Container Service for Kubernetes (ACK) Pro cluster. This topic also provides recommended alerting configurations.
Prerequisites
The self-managed Prometheus system can access the API server of your ACK Pro cluster and has permissions to read /metrics.
The self-managed Prometheus system can be deployed inside or outside the ACK Pro cluster.
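The read permission on /metrics is usually granted through Kubernetes RBAC. The following manifest is a minimal sketch under stated assumptions, not an official ACK configuration: the ServiceAccount name prometheus, the namespace monitoring, and the role name prometheus-control-plane-metrics are hypothetical. It grants read access to the /metrics non-resource URL and to the objects that endpoints-based service discovery reads.

```yaml
# Hypothetical names: adjust the ServiceAccount, namespace, and role names to your deployment.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-control-plane-metrics
rules:
# Allow HTTP GET requests on the /metrics path of the API server.
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
# Allow kubernetes_sd_configs with role: endpoints to discover targets.
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-control-plane-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-control-plane-metrics
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```

If Prometheus runs inside the cluster under this ServiceAccount, the token and CA certificate are mounted at /var/run/secrets/kubernetes.io/serviceaccount/, which is the path that the scrape jobs in this topic read.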
Background information
ACK Pro allows you to export the monitoring data of key control plane components to external systems and provides built-in dashboards for these components in Application Real-Time Monitoring Service (ARMS). These control plane components are kube-apiserver, etcd, kube-scheduler, cloud-controller-manager, and kube-controller-manager. If you use ARMS to monitor the components, the ARMS agent automatically collects the monitoring data and displays it on the built-in dashboards in real time. If you instead want to use a self-managed Prometheus system to collect the metrics of these components and generate alerts, use the configurations provided in this topic.
Configure metric collection for the Prometheus system
To use a self-managed Prometheus system to collect the metrics of the key control plane components in an ACK Pro cluster, add metric collection Jobs to the configuration file prometheus.yaml of the Prometheus system. The following code block shows an example of the Job:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# Scrape configurations: one job per control plane component.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......
  - job_name: ack-etcd
    ......
  - job_name: ack-scheduler
    ......
Each Job corresponds to one component. For the metrics that each component supports, refer to the monitoring topic of that component. For more information about how to configure prometheus.yaml for open source Prometheus, see Configuration.
For more information about open source Prometheus Operator and the ack-prometheus-operator component provided by the marketplace of ACK, see Open source Prometheus monitoring. For more information about how to configure custom metric collection configurations, see Prometheus Operator.
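With open source Prometheus Operator, raw scrape jobs such as the ones in this topic are commonly attached through the additionalScrapeConfigs field of the Prometheus custom resource. The following fragment is a sketch only: the resource name, namespace, Secret name, and key are hypothetical, and whether ack-prometheus-operator exposes this field in the same way is an assumption that you should check against its chart values.

```yaml
# Sketch only: the resource name, namespace, Secret name, and key are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # Reference to a Secret key that holds a YAML list of scrape jobs,
  # for example the ack-api-server job from this topic.
  additionalScrapeConfigs:
    name: ack-control-plane-scrape-configs
    key: scrape-configs.yaml
```

The referenced Secret key is expected to contain the list of jobs itself (the entries that start with - job_name), not a complete prometheus.yaml file.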
Configure alert rules for the Prometheus system
For more information about how to configure alert rules for open source Prometheus, see Alerting_rules.
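The alert rules in this topic are list entries that belong inside a rule group. The following rule file is a minimal sketch of that wrapper, using the kube-apiserver alert from this topic; the file name and the group name are hypothetical.

```yaml
# ack-control-plane.rules.yaml (hypothetical file name)
# Load this file by listing its path under rule_files in prometheus.yaml.
groups:
- name: ack-control-plane   # hypothetical group name
  rules:
  - alert: AckApiServerWarning
    annotations:
      message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
    for: 5m
    labels:
      severity: critical
```

The rules of the other components can be appended to the same rules list.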
Use an internal Prometheus system to monitor an ACK Pro cluster
You can deploy a Prometheus system inside an ACK Pro cluster and then use the Prometheus system to monitor the cluster.
kube-apiserver
For more information about kube-apiserver, see Monitor kube-apiserver.
Metric collection configurations
- job_name: ack-api-server
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  # scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["apiserver"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
    insecure_skip_verify: false}
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Alert rules
- alert: AckApiServerWarning
  annotations:
    message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
  for: 5m
  labels:
    severity: critical
etcd
For more information about etcd, see Monitor the etcd.
Metric collection configurations
- job_name: ack-etcd
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  # scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["etcd"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
    insecure_skip_verify: false}
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Alert rules
- alert: AckETCDWarning
  annotations:
    message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
  expr: |
    sum_over_time(etcd_server_has_leader[5m]) == 0
  for: 5m
  labels:
    severity: critical
- alert: AckETCDWarning
  annotations:
    message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
  for: 5m
  labels:
    severity: critical
kube-scheduler
For more information about kube-scheduler, see Monitor kube-scheduler.
Metric collection configurations
- job_name: ack-scheduler
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  # scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-scheduler"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
    insecure_skip_verify: false}
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Alert rules
- alert: AckSchedulerWarning
  annotations:
    message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
kube-controller-manager
For more information about kube-controller-manager, see Monitor kube-controller-manager.
Metric collection configurations
- job_name: ack-kcm
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  # scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-kube-controller-manager"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
    insecure_skip_verify: false}
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Alert rules
- alert: AckKCMWarning
  annotations:
    message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-kcm",pod!=""}) or (count(up{job="ack-kcm",pod!=""}) <= 0)) >= 1
  for: 3m
  labels:
    severity: critical
cloud-controller-manager
For more information about cloud-controller-manager, see Monitor cloud-controller-manager.
Metric collection configurations
- job_name: ack-cloud-controller-manager
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  # scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-cloud-controller-manager"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes,
    insecure_skip_verify: false}
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
Alert rules
- alert: AckCCMWarning
  annotations:
    message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
Use an external Prometheus system to monitor an ACK Pro cluster
If you want to use an external Prometheus system to monitor an ACK Pro cluster, refer to Configuration and Monitoring kubernetes with prometheus from outside of k8s cluster. The following code block shows the configurations:
- job_name: 'out-of-k8s-scrape-job'
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/kubernetes-ca.crt
  bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
  kubernetes_sd_configs:
  - api_server: 'https://<KUBERNETES URL>'
    role: node
    tls_config:
      ca_file: /etc/prometheus/kubernetes-ca.crt
    bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
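The bearer token placeholder above has to be filled with a token that an external Prometheus system can keep using. One common way to obtain a long-lived token is a Secret of the type kubernetes.io/service-account-token. The following manifest is a sketch under assumptions: the Secret name, the namespace, and the ServiceAccount name prometheus are hypothetical, and the ServiceAccount must already exist with the required read permissions.

```yaml
# Hypothetical names: the ServiceAccount "prometheus" must already exist in this namespace.
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-external-token
  namespace: monitoring
  annotations:
    # Binds the generated token to the ServiceAccount.
    kubernetes.io/service-account.name: prometheus
type: kubernetes.io/service-account-token
```

After the Secret is created, Kubernetes populates its token and ca.crt keys. The decoded token can be used as the bearer token, and the decoded ca.crt can be saved as the CA file that the scrape job references.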
Verify the monitoring and alerting features
Log on to the console of the self-managed Prometheus system and go to the Graph page.
Enter the following query to check whether all control plane components are displayed:
up
Expected output
Important
up{instance="x.x.x.x:6443", job="ack-api-server"} indicates the status of the agent endpoint. x.x.x.x is the IP address of the Kubernetes Service in the default namespace of the ACK Pro cluster. The IP address varies by cluster.
up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"} indicates the status of the control plane pods. You can use the up metric to probe the liveness of the control plane pods.
Enter the following metric and check whether the metric value is displayed as normal:
apiserver_request_total{job="ack-api-server"}
Expected output
If the metric and value are displayed as normal, the self-managed Prometheus system can collect the metrics of the control plane components as expected.
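The alert rules can also be checked offline with the rule unit tests that promtool supports (promtool test rules). The following test file is a sketch under assumptions: it assumes that the AckApiServerWarning rule from this topic is saved, inside a rule group, in a hypothetical file named ack-control-plane.rules.yaml, and the pod names are made up. The expected job label relies on absent() copying the equality matcher job="ack-api-server" onto its result.

```yaml
# ack-control-plane.test.yaml (hypothetical file name)
rule_files:
  - ack-control-plane.rules.yaml   # hypothetical rule file that contains AckApiServerWarning

evaluation_interval: 1m

tests:
  # Case 1: no up series exists for the job, so absent() matches and the alert fires.
  - interval: 1m
    input_series:
      - series: 'up{job="ack-etcd", pod="controlplane-a"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: AckApiServerWarning
        exp_alerts:
          - exp_labels:
              severity: critical
              job: ack-api-server
            exp_annotations:
              message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.

  # Case 2: two API server pods report up, so the alert is expected to stay silent.
  - interval: 1m
    input_series:
      - series: 'up{job="ack-api-server", pod="controlplane-a"}'
        values: '1x10'
      - series: 'up{job="ack-api-server", pod="controlplane-b"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: AckApiServerWarning
        exp_alerts: []
```

This kind of test exercises only the rule expressions. It does not replace the check against live targets described above.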