Use metrics and dashboards of etcd - Container Service for Kubernetes

etcd is the persistent store that holds the state and metadata of Kubernetes clusters. As a distributed key-value store, etcd ensures strong consistency and high availability of cluster data. This topic describes the metrics of etcd, explains how to use the etcd dashboards, and provides suggestions on how to troubleshoot common metric anomalies.

Usage notes

Dashboard access

For more information, see View control plane component dashboards in ACK Pro clusters.

Metrics

Metrics can indicate the status and parameter settings of a component. The following table describes the metrics supported by etcd.

Metric	Type	Description
cpu_utilization_core	Gauge	The used CPU capacity. Unit: core.
etcd_server_has_leader	Gauge	Indicates whether the etcd server has a leader. etcd implements data consistency by using the Raft algorithm. The Raft algorithm ensures that an etcd node is elected as the leader and the other etcd nodes are followers. The leader sends heartbeats to all members on a regular basis to ensure that the cluster is stable. Valid values: 1: The etcd server has a leader. 0: The etcd server does not have a leader.
etcd_server_is_leader	Gauge	Indicates whether the etcd member is the leader. Valid values: 1: The etcd member is the leader. 0: The etcd member is not the leader.
etcd_server_leader_changes_seen_total	Counter	The cumulative number of leader changes that the etcd member has seen. To obtain the number of changes within a specific period of time, query the counter over a time range.
etcd_mvcc_db_total_size_in_bytes	Gauge	The total size of the etcd member database. Unit: bytes.
etcd_mvcc_db_total_size_in_use_in_bytes	Gauge	The size of the etcd member database that is in use. Unit: bytes.
etcd_disk_backend_commit_duration_seconds_bucket	Histogram	The etcd backend commit delay, which is the time that etcd takes to write a data change to the storage backend and commit the data. The bucket thresholds, in seconds, are defined as the set `{0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2.048, 4.096, 8.192}`.
etcd_debugging_mvcc_keys_total	Gauge	The total number of keys stored in etcd.
etcd_server_proposals_committed_total	Gauge	The total number of proposals committed to the Raft log. etcd implements data consistency by using the Raft algorithm. Under Raft, every operation that attempts to change the system state is submitted as a proposal.
etcd_server_proposals_applied_total	Gauge	The total number of applied Raft proposals.
etcd_server_proposals_pending	Gauge	The current number of pending Raft proposals.
etcd_server_proposals_failed_total	Counter	The total number of failed Raft proposals.
memory_utilization_byte	Gauge	The memory usage. Unit: bytes.
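
The bucket thresholds listed for etcd_disk_backend_commit_duration_seconds_bucket form a doubling sequence that starts at 0.001 seconds. The following Python sketch reproduces that set and shows which bucket a single observed commit duration falls into. The function names `exponential_buckets` and `bucket_for` are hypothetical helpers introduced for illustration, not part of etcd or Prometheus.

```python
# Reproduce the 14 bucket thresholds listed in the metrics table:
# they start at 0.001 s and double at each step.
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    # Rounding removes floating-point noise such as 0.064000000000000001.
    return [round(start * factor**i, 6) for i in range(count)]


BUCKETS = exponential_buckets(0.001, 2, 14)


def bucket_for(duration_seconds: float) -> float:
    """Return the smallest threshold that is >= the observed duration.

    Durations above the largest threshold belong to the implicit +Inf bucket.
    """
    for upper in BUCKETS:
        if duration_seconds <= upper:
            return upper
    return float("inf")


if __name__ == "__main__":
    print(BUCKETS)
    print(bucket_for(0.003))  # a 3 ms commit is counted under the 0.004 threshold
```

Because histogram buckets are cumulative, an observation is also counted in every bucket with a larger threshold. The sketch returns only the smallest matching threshold.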

Note

The following resource utilization metrics are deprecated. Remove any alerts and monitoring data that depend on these metrics at the earliest opportunity:

cpu_utilization_ratio: CPU utilization.
memory_utilization_ratio: Memory utilization.

Usage notes for dashboards

Dashboards are generated based on metrics and Prometheus Query Language (PromQL). The following sections describe the observability and features of the dashboards of etcd.

Observability

etcd

Features

Dashboard	PromQL	Description
etcd alive status	etcd_server_has_leader	Indicates whether each etcd member is alive. In a healthy three-member cluster, the values sum to 3.
	etcd_server_is_leader == 1	Indicates which etcd member is the leader. In normal cases, one etcd member must be elected as the leader.
Number of leader changes in the past day	changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])	The number of leader changes within the previous day.
Memory Usage	memory_utilization_byte{container="etcd"}	The memory usage. Unit: bytes.
CPU Usage	cpu_utilization_core{container="etcd"}*1000	The used CPU capacity. Unit: millicore.
Disk Size	etcd_mvcc_db_total_size_in_bytes	The size of the etcd backend database.
Disk Size	etcd_mvcc_db_total_size_in_use_in_bytes	The usage of the etcd backend database.
total kv	etcd_debugging_mvcc_keys_total	The total number of key-value pairs in the etcd cluster.
backend commit delay	histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))	The 99th percentile of the backend commit delay for each instance over the previous 5 minutes. The backend commit delay is the time required to persist proposals to the etcd database.
Raft proposals	rate(etcd_server_proposals_failed_total{job="etcd"}[1m])	The per-second rate of failed Raft proposals, averaged over the previous minute.
	etcd_server_proposals_pending{job="etcd"}	The current number of pending Raft proposals.
	etcd_server_proposals_committed_total{job="etcd"} - etcd_server_proposals_applied_total{job="etcd"}	The difference between the number of committed Raft proposals and that of applied Raft proposals.
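
The backend commit delay panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The following Python sketch illustrates the idea with linear interpolation inside the bucket that contains the target rank. It is a simplified approximation for illustration, not the Prometheus implementation, and the sample bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Assumptions of this sketch: 0 < q <= 1, buckets are sorted by upper
    bound, counts are cumulative, and the last bucket is the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Invented sample: 90 commits took 4-8 ms and 10 commits took 8-16 ms.
    sample = [(0.004, 0), (0.008, 90), (0.016, 100), (math.inf, 100)]
    print(histogram_quantile(0.99, sample))
```

The estimate can never be more precise than the bucket boundaries allow, so a p99 that sits between 0.008 and 0.016 only tells you that the slowest commits took roughly 8 to 16 milliseconds.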

Common metric anomalies

If the metrics of etcd become abnormal, check whether the metric anomalies described in the following sections exist. If metric anomalies that are not described in the following sections occur, submit a ticket.

etcd alive status

Normal case	Anomaly	Anomaly description
sum(etcd_server_has_leader) = 3 and etcd_server_is_leader == 1 for one member are displayed when you query the health status of etcd members. This means that one of the etcd members is the leader and all three etcd members know that a leader is elected.	One etcd member is abnormal.	etcd_server_has_leader != 1 is displayed for the etcd member. The remaining two members still form a majority, so this anomaly does not affect the external services provided by the etcd cluster.
	More than one etcd member is abnormal.	etcd_server_has_leader != 1 is displayed for more than one etcd member. In this case, the etcd cluster cannot provide external services.

Check whether etcd_server_is_leader == 1 is displayed for any etcd member. If it is not displayed for any member, the etcd members do not have a leader and cannot provide external services.
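
The checks above can be expressed as a small decision function. The following Python sketch assumes a three-member cluster and takes the latest value of etcd_server_has_leader and etcd_server_is_leader for each member. The function name, the member names, and the returned strings are hypothetical and chosen only for illustration.

```python
def classify_etcd_health(has_leader: dict[str, int], is_leader: dict[str, int]) -> str:
    """Classify cluster health from per-member gauge values.

    has_leader maps member name -> etcd_server_has_leader (1 or 0).
    is_leader maps member name -> etcd_server_is_leader (1 or 0).
    """
    abnormal = [name for name, value in has_leader.items() if value != 1]
    leaders = [name for name, value in is_leader.items() if value == 1]

    if not leaders:
        # No member reports itself as the leader.
        return "no leader: the cluster cannot provide external services"
    if len(abnormal) > 1:
        return "unavailable: more than one member is abnormal"
    if len(abnormal) == 1:
        return f"degraded: {abnormal[0]} is abnormal, external services are not affected"
    return "healthy"


if __name__ == "__main__":
    # Hypothetical member names and values.
    print(classify_etcd_health(
        {"etcd-a": 1, "etcd-b": 1, "etcd-c": 0},
        {"etcd-a": 1, "etcd-b": 0, "etcd-c": 0},
    ))
```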

backend commit delay

Normal case	Anomaly	Anomaly description
The metric indicates a delay of several to tens of milliseconds.	The metric indicates a delay of hundreds of milliseconds or even several seconds for a period of time.	Disk reads and writes are abnormal.

Raft proposals

Normal case	Anomaly	Anomaly description
The rate of failed Raft proposals is 0.	The rate of failed Raft proposals is greater than 0.	Raft proposals failed to be committed. If a large number of Raft proposals failed to be committed, troubleshoot the issue.
The number of pending Raft proposals is 0.	The number of pending Raft proposals is greater than 0.	Raft proposals are pending because proposals are applied slowly. Check the backend commit delay metric and troubleshoot the issue.
The difference between the number of committed Raft proposals and that of applied Raft proposals is 0.	The difference between the number of committed Raft proposals and that of applied Raft proposals is greater than 0.	etcd is overwhelmed by a large number of client requests. If the difference is greater than 5,000, etcd denies subsequent requests and returns the `too many requests` message. etcd can accept new requests only after all pending proposals are processed.
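
The three Raft proposal checks in the preceding table can be combined into one evaluation step. The following Python sketch assumes you have already read the current values of the four series. The `RaftSample` type, the `raft_findings` function, and the message texts are hypothetical. The 5,000 limit is taken from the table.

```python
from dataclasses import dataclass

# Gap between committed and applied proposals above which etcd denies requests.
APPLY_COMMIT_GAP_LIMIT = 5000


@dataclass
class RaftSample:
    failed_rate: float  # rate(etcd_server_proposals_failed_total[1m])
    pending: float      # etcd_server_proposals_pending
    committed: float    # etcd_server_proposals_committed_total
    applied: float      # etcd_server_proposals_applied_total


def raft_findings(sample: RaftSample) -> list[str]:
    """Return one finding per anomaly; an empty list means the normal case."""
    findings = []
    if sample.failed_rate > 0:
        findings.append("failed proposals: some proposals failed to be committed")
    if sample.pending > 0:
        findings.append("pending proposals: check the backend commit delay")
    gap = sample.committed - sample.applied
    if gap > APPLY_COMMIT_GAP_LIMIT:
        findings.append("apply lag over limit: etcd returns 'too many requests'")
    elif gap > 0:
        findings.append("apply lag: etcd is under heavy client load")
    return findings


if __name__ == "__main__":
    # Invented values for illustration.
    print(raft_findings(RaftSample(0.0, 2, 120_500, 120_400)))
```

Brief non-zero values can occur under bursts of load, so in practice you would likely evaluate these conditions over a time window before alerting.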

References

For more information about the metrics, usage notes for using the dashboards, and suggestions on how to troubleshoot common metric anomalies for other control plane components, see the following topics: Metrics of kube-apiserver, Metrics of kube-scheduler, kube-controller-manager, and cloud-controller-manager.