etcd is a distributed key-value store that persists the state and metadata of Kubernetes clusters. It provides strong consistency and high availability for cluster data. This topic describes the etcd metrics, explains how to use the etcd dashboards, and suggests how to troubleshoot common metric anomalies.
Usage notes
Dashboard access
For more information about how to access the dashboards, see View control plane component dashboards in ACK Pro clusters.
Metrics
Metrics can indicate the status and parameter settings of a component. The following table describes the metrics supported by etcd.
Metric | Type | Description |
cpu_utilization_core | Gauge | The used CPU capacity. Unit: core. |
etcd_server_has_leader | Gauge | Indicates whether the etcd member has a leader. etcd uses the Raft algorithm to keep data consistent: one member is elected as the leader, the other members are followers, and the leader periodically sends heartbeats to all followers to keep the cluster stable. Valid values: 1 (a leader exists) and 0 (no leader exists). |
etcd_server_is_leader | Gauge | Indicates whether the etcd member is the leader. Valid values: 1 (the member is the leader) and 0 (the member is not the leader). |
etcd_server_leader_changes_seen_total | Counter | The cumulative number of leader changes that the etcd member has seen. |
etcd_mvcc_db_total_size_in_bytes | Gauge | The total size physically allocated to the etcd member database. Unit: bytes. |
etcd_mvcc_db_total_size_in_use_in_bytes | Gauge | The size of the etcd member database that is logically in use. Unit: bytes. |
etcd_disk_backend_commit_duration_seconds_bucket | Histogram | The etcd backend commit delay, which is the time that etcd uses to write a data change to the storage backend and commit data. The bucket thresholds are defined as the set |
etcd_debugging_mvcc_keys_total | Gauge | The total number of keys stored in etcd. |
etcd_server_proposals_committed_total | Gauge | The total number of proposals committed to the Raft log. etcd uses the Raft algorithm to keep data consistent, and Raft submits every operation that changes the system state as a proposal. |
etcd_server_proposals_applied_total | Gauge | The total number of applied Raft proposals. |
etcd_server_proposals_pending | Gauge | The current number of pending Raft proposals. |
etcd_server_proposals_failed_total | Counter | The total number of failed Raft proposals. |
memory_utilization_byte | Gauge | The memory usage. Unit: bytes. |
The following resource utilization metrics are deprecated. Remove any alerts and monitoring data that depend on these metrics at the earliest opportunity:
cpu_utilization_ratio: CPU utilization.
memory_utilization_ratio: Memory utilization.
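As a rough illustration of how the metrics in the table can be read programmatically, the following Python sketch parses a few lines in the Prometheus text exposition format and derives the space that is allocated to the database but not in use. The sample values and the parse_metrics helper are hypothetical and exist only for this example. The sketch assumes that samples carry no timestamps and that labels can be ignored.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {metric_name: value}.

    Assumes samples have no trailing timestamp. Labels are dropped and
    later samples overwrite earlier ones, which is enough for the
    unlabeled etcd server metrics used in this sketch.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip any {label="..."} block
        values[name] = float(value_part)
    return values


# Hypothetical sample, not real cluster output.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 0
etcd_mvcc_db_total_size_in_bytes 104857600
etcd_mvcc_db_total_size_in_use_in_bytes 52428800
"""

metrics = parse_metrics(SAMPLE)
unused_bytes = (metrics["etcd_mvcc_db_total_size_in_bytes"]
                - metrics["etcd_mvcc_db_total_size_in_use_in_bytes"])
print("has leader:", metrics["etcd_server_has_leader"] == 1)
print("allocated but unused bytes:", int(unused_bytes))
```

In practice, a Prometheus client library or PromQL query is usually a better fit than hand-written parsing. The sketch only shows how the gauges relate to each other.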
Usage notes for dashboards
Dashboards are generated based on metrics and Prometheus Query Language (PromQL). The following sections describe the observability and features of the dashboards of etcd.
Observability
Features
Dashboard | PromQL | Description |
etcd alive status |
|
|
Number of main cuts in the past day | changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d]) | The number of leader changes ("main cuts" in the dashboard title) within the previous day. |
Memory Usage | memory_utilization_byte{container="etcd"} | The memory usage. Unit: bytes. |
CPU Usage | cpu_utilization_core{container="etcd"}*1000 | The used CPU capacity. Unit: millicore. |
Disk Size | etcd_mvcc_db_total_size_in_bytes | The size of the etcd backend database. |
Disk Size | etcd_mvcc_db_total_size_in_use_in_bytes | The usage of the etcd backend database. |
total kv | etcd_debugging_mvcc_keys_total | The total number of key-value pairs in the etcd cluster. |
backend commit delay | histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le)) | The 99th percentile of the backend commit delay for each instance over the previous 5 minutes. The backend commit delay is the time that etcd takes to persist proposals to the backend database. |
Raft proposals | rate(etcd_server_proposals_failed_total{job="etcd"}[1m]) | The per-second rate of failed Raft proposals, averaged over the previous minute. |
Raft proposals | etcd_server_proposals_pending{job="etcd"} | The current number of pending Raft proposals. |
Raft proposals | etcd_server_proposals_committed_total{job="etcd"} - etcd_server_proposals_applied_total{job="etcd"} | The difference between the number of committed Raft proposals and the number of applied Raft proposals. |
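The backend commit delay query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation within the bucket that contains the target rank. The following Python sketch reproduces that calculation to show where the reported value comes from. The bucket bounds and counts are hypothetical, and the function is a simplified approximation of the documented Prometheus behavior for q between 0 and 1 and positive bucket bounds, not the Prometheus implementation itself.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be +Inf, as in Prometheus
    histograms. Returns NaN when the input cannot yield an estimate.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Interpolate linearly between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts per upper bound, in seconds.
SAMPLE_BUCKETS = [
    (0.001, 0), (0.002, 10), (0.004, 60), (0.008, 90),
    (0.016, 99), (0.032, 100), (math.inf, 100),
]

print("p50 (s):", histogram_quantile(0.5, SAMPLE_BUCKETS))
print("p99 (s):", histogram_quantile(0.99, SAMPLE_BUCKETS))
```

Because the result is interpolated, the reported percentile is only as precise as the bucket boundaries allow. The dashboard query applies the same math to per-second bucket rates instead of raw counts.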
Common metric anomalies
If the metrics of etcd become abnormal, check whether any of the metric anomalies described in the following sections exist. If you encounter a metric anomaly that is not described in the following sections, submit a ticket.
etcd alive status
Normal case | Anomaly | Anomaly description |
| One etcd member is abnormal. |
|
More than one etcd member is abnormal. |
Check whether |
backend commit delay
Normal case | Anomaly | Anomaly description |
The delay ranges from several milliseconds to tens of milliseconds. | The delay stays at hundreds of milliseconds or even several seconds for a period of time. | Disk I/O is abnormal. |
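The anomaly above depends on the delay staying high "for a period of time", not on a single spike. The following Python sketch shows one possible way to express that condition over a series of delay samples. The 0.1-second threshold and the run length of 5 samples are assumptions chosen for illustration and are not values defined by this topic.

```python
def sustained_high_latency(samples, threshold_s=0.1, min_consecutive=5):
    """Return True if min_consecutive samples in a row are >= threshold_s.

    samples: backend commit delay values in seconds, oldest first.
    threshold_s and min_consecutive are illustrative assumptions.
    """
    run = 0
    for value in samples:
        # Extend the current run on a high sample, otherwise reset it.
        run = run + 1 if value >= threshold_s else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical series: one brief spike versus a sustained slowdown.
brief_spike = [0.004, 0.006, 0.350, 0.005, 0.004, 0.007]
sustained = [0.004, 0.220, 0.310, 0.450, 1.200, 0.380, 0.006]

print(sustained_high_latency(brief_spike))
print(sustained_high_latency(sustained))
```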
Raft proposals
Normal case | Anomaly | Anomaly description |
The rate of failed Raft proposals is 0. | The rate of failed Raft proposals is greater than 0. | Raft proposals failed to be committed. If a large number of proposals fail, troubleshoot the issue. |
The number of pending Raft proposals is 0. | The number of pending Raft proposals is greater than 0. | Proposals are pending because they are applied slowly. Check the backend commit delay metric and troubleshoot the issue. |
The difference between the number of committed Raft proposals and that of applied Raft proposals is 0. | The difference between the number of committed Raft proposals and that of applied Raft proposals is greater than 0. | etcd is overwhelmed by a large number of client requests. If the difference is greater than 5,000, etcd denies subsequent requests and returns the |
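The three checks in the preceding table can be combined into a single evaluation. The following Python sketch is a minimal illustration: the function name, argument names, and finding messages are hypothetical, and only the 5,000 limit comes from the table above.

```python
MAX_COMMIT_APPLY_GAP = 5000  # limit stated in the table above


def check_raft_proposals(failed_rate, pending, committed, applied):
    """Return human-readable findings for the Raft proposal metrics.

    failed_rate: rate of etcd_server_proposals_failed_total
    pending:     etcd_server_proposals_pending
    committed:   etcd_server_proposals_committed_total
    applied:     etcd_server_proposals_applied_total
    """
    findings = []
    if failed_rate > 0:
        findings.append("failed proposals observed")
    if pending > 0:
        findings.append("proposals pending; check backend commit delay")
    gap = committed - applied
    if gap > MAX_COMMIT_APPLY_GAP:
        findings.append("commit/apply gap above 5000; etcd denies new requests")
    elif gap > 0:
        findings.append("applies lag commits; etcd may be overloaded")
    return findings


# Hypothetical readings from a healthy member and an overloaded member.
print(check_raft_proposals(0.0, 0, 120000, 120000))
print(check_raft_proposals(0.2, 3, 130000, 124000))
```

An empty list corresponds to the normal case in every row of the table.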
References
For information about the metrics, dashboard usage notes, and troubleshooting suggestions for other control plane components, see the following topics: Metrics of kube-apiserver, Metrics of kube-scheduler, kube-controller-manager, and cloud-controller-manager.