This topic describes the current situation of Kubernetes cluster observability, the challenges for the observability of multi-cloud Kubernetes clusters, and the solutions to these challenges. This topic also provides an example to describe how to use Alibaba Cloud Managed Service for Prometheus and registered Kubernetes clusters to manage and monitor multi-cloud Kubernetes clusters.
Current situation of Kubernetes cluster observability
As a container management and orchestration tool, Kubernetes has become a common technical base in the cloud computing industry. Prometheus, which has proven itself against a range of alternative solutions, has become the standard solution for Kubernetes cluster monitoring.
Prometheus collects and stores metrics from the system layer, application layer, and business layer. Prometheus is commonly paired with Grafana to display metrics and deliver alert events. Together, Prometheus and Grafana allow you to collect, store, and display the monitoring metrics of Kubernetes clusters and configure alerting for them. This helps you identify issues, analyze their causes, and protect cloud-native applications.
You can use one of the following solutions to monitor Kubernetes clusters:
Solution 1: Build a monitoring system
You can use Prometheus and Grafana to build a monitoring system for your production environment. In the early stage, this requires a large amount of labor, and you must make the parts of the monitoring system work together, including metric collection, metric storage, metric display, dashboards, and alerting with alert deduplication. In the later stage, high O&M cost is incurred.
Solution 2: Use monitoring services provided by cloud service vendors
You can use a monitoring service provided by a cloud service vendor, such as Alibaba Cloud Managed Service for Prometheus. Managed Service for Prometheus supports two billing methods: subscription and pay-as-you-go. This reduces the upfront cost of monitoring system setup and provides technical O&M support to reduce O&M cost.
Challenges for the observability of multi-cloud Kubernetes clusters
Enterprises are deploying more and more diversified and complex services on the cloud. In some scenarios, Kubernetes clusters may be used across cloud services or regions, and you must address the O&M challenges for multi-cloud Kubernetes clusters.
You can use one of the following solutions to monitor multi-cloud Kubernetes clusters:
Solution 1: Build a monitoring system based on self-managed Prometheus and Grafana
- To build your own monitoring system, you must integrate functional modules such as collection, storage, display, and alerting in the early stage. In the later stage, you must assign more O&M personnel, which increases O&M cost.
- The time series database (TSDB) of open source Prometheus uses SSD storage. Each instance stores its data on a single node, which may result in data loss.
- Open source Prometheus has collection bottlenecks. Because it runs as a single process, it does not support auto scaling, and metric collection may become a performance bottleneck during peak hours.
Solution 2: Use managed Prometheus services provided by cloud service vendors
- Multiple cloud service vendors: Different cloud service vendors provide different monitoring capabilities and access methods. This increases your learning cost.
- Decentralized management: Managed Prometheus services from different vendors cannot be managed in a unified manner. This may lead to inefficient and chaotic management and duplicate O&M workloads. You may also be unable to identify business issues at the earliest opportunity.
In the preceding solutions, you cannot query or analyze scattered metrics or configure alerting for the metrics in a unified manner.
Benefits of Alibaba Cloud Managed Service for Prometheus
To address the preceding challenges, registered Kubernetes clusters provide unified management capabilities for the Kubernetes clusters of third-party cloud service vendors. This helps you manage multi-cloud Kubernetes clusters in a unified manner. Alibaba Cloud Managed Service for Prometheus provides a complete Kubernetes cluster monitoring system with metric collection, Grafana display, and alerting capabilities. Managed Service for Prometheus supports the pay-as-you-go and subscription billing methods to improve the monitoring efficiency of Kubernetes clusters and reduce the O&M cost of user-created monitoring services.
- Powerful capabilities: The combination of Alibaba Cloud Managed Service for Prometheus and registered Kubernetes clusters can resolve the issues that exist in multi-cloud Kubernetes cluster monitoring, such as scattered management, difficulty in monitoring system construction, low O&M efficiency, inability to jointly query metrics, and scattered alerting. You can implement unified management, configuration, query, and alerting for multi-cloud Kubernetes cluster monitoring with improved efficiency at low O&M cost. This way, O&M teams can focus on business without doing repetitive work.
- Lower cost: Alibaba Cloud Managed Service for Prometheus provides basic-metric collection for free to meet the basic monitoring requirements on Kubernetes clusters. For small-scale Kubernetes clusters, you can use the pay-as-you-go billing method to monitor your business at minimum cost. For information about the pay-as-you-go billing method of Alibaba Cloud Managed Service for Prometheus, see Pay-as-you-go. For large-scale Kubernetes clusters, you can use the subscription billing method. Compared with the pay-as-you-go billing method, the subscription billing method can effectively reduce the monitoring cost of large-scale clusters by about 67%.
- Less resource usage: If you use Alibaba Cloud Managed Service for Prometheus, you only need to deploy a lightweight agent in your Kubernetes cluster. The agent supports auto scaling. A single agent replica with 2 CPU cores and 4 GB of memory can collect 6 million data points. Alibaba Cloud Managed Service for Prometheus has also optimized the service discovery module to reduce the pressure that the service discovery module of open source Prometheus places on the API server of Kubernetes clusters. This minimizes resource usage, maximizes the collection of monitoring metrics from Kubernetes clusters, and protects your business.
Advantage 1: improved performance
Item | Alibaba Cloud Managed Service for Prometheus | Self-built Prometheus service |
---|---|---|
High availability | Alibaba Cloud Managed Service for Prometheus provides high availability and supports horizontal scaling. You can deploy multiple replicas for the collection and storage components. | Self-built Prometheus services provide low availability and do not support horizontal scaling. You can run only one process at a time. |
Data storage | Cloud-based storage has unlimited storage capacity. | The storage capacity is limited. |
Data visualization | Grafana is built into the ARMS console and common monitoring templates are available out of the box. | You must deploy Grafana and configure dashboards on your own. |
Alert management | The alert center of ARMS is integrated with Managed Service for Prometheus to improve alert efficiency and accuracy. | You must deploy and configure Alertmanager on your own. |
Collection performance of a single replica (2-core CPU, 4 GB of memory) | 6 million data points | 1 million data points |
Data query performance (600 million time points) | 8 to 10 seconds | 180 seconds |
Other capabilities | Managed Service for Prometheus provides pre-aggregation, downsampling, and GlobalView capabilities. | Not supported |
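The single-replica collection figures in the preceding table can be turned into a rough sizing estimate. The following Python sketch only illustrates the arithmetic. The 15 million data point workload is a hypothetical value, and the result for the self-built service is a comparison figure only, because the table states that a self-built service does not support horizontal scaling.

```python
import math

# Per-replica capacities quoted in the comparison table (2-core CPU, 4 GB of memory).
MANAGED_POINTS_PER_REPLICA = 6_000_000
SELF_BUILT_POINTS_PER_REPLICA = 1_000_000


def replicas_needed(total_points: int, points_per_replica: int) -> int:
    """Return the smallest replica count whose combined capacity covers total_points."""
    if total_points < 0 or points_per_replica <= 0:
        raise ValueError("total_points must be >= 0 and points_per_replica must be > 0")
    return math.ceil(total_points / points_per_replica)


# Hypothetical workload: 15 million data points to collect.
workload = 15_000_000
print(replicas_needed(workload, MANAGED_POINTS_PER_REPLICA))     # 3
print(replicas_needed(workload, SELF_BUILT_POINTS_PER_REPLICA))  # 15
```

Actual capacity depends on scrape intervals, label cardinality, and the workload, so treat the result as a starting point rather than a guarantee.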
Advantage 2: aggregated Prometheus multi-cluster query
ARMS provides a virtual aggregation instance for multiple Alibaba Cloud Prometheus instances or self-managed Prometheus clusters. The virtual aggregation instance can be used to query Prometheus metrics, manage Grafana data sources, and manage alerts in a unified manner.
- To manage the scattered data of the open source Prometheus service, Alibaba Cloud Managed Service for Prometheus allows you to configure multiple data source addresses in Grafana. Without this capability, the isolation of data sources makes it difficult to analyze the running status of applications in different regions around the world from an overall perspective.
- You do not need to deploy Prometheus Server in each region or deploy a large number of Thanos components. You only need to use Remote Write to report data to Alibaba Cloud Managed Service for Prometheus.
- Alibaba Cloud Managed Service for Prometheus provides global, distributed, stable, and high-performance query capabilities. Horizontal and vertical scaling can be implemented at any time for a large number of queries.
- Aggregated Prometheus multi-cluster query can be implemented out of the box. You do not need to deploy any components in addition to Alibaba Cloud Managed Service for Prometheus. This helps you reduce O&M cost.
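As an illustration of what a unified query could look like, the following Python sketch builds a request URL for the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). It assumes that the aggregation instance exposes a Prometheus-compatible HTTP API and that every series carries a `cluster` label. The endpoint `https://prometheus.example.com` and the label name are hypothetical placeholders, not values taken from this topic.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical endpoint of the aggregation instance; replace it with your own.
BASE_URL = "https://prometheus.example.com"

# Assumes that a `cluster` label identifies the source cluster of each series.
PROMQL = "sum by (cluster) (up)"

print(build_instant_query_url(BASE_URL, PROMQL))
```

One query of this kind returns one result per cluster, which is the sort of overall view that isolated data sources cannot provide.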
Advantage 3: lightweight installation
Compared with the open source Prometheus service, Alibaba Cloud Managed Service for Prometheus is easy to deploy. You only need to install a lightweight agent in your Kubernetes cluster. Backend storage is hosted by the service, which saves cluster resources for your business workloads.
Advantage 4: integration of the Grafana service
Alibaba Cloud Grafana Service is a cloud-native O&M data visualization platform that provides maintenance-free and quick startup capabilities. Alibaba Cloud Grafana Service provides the following benefits:
- By default, the data sources of various Alibaba Cloud services, such as Managed Service for Prometheus and Log Service, are integrated. Third-party data sources and user-created data sources are also supported. This allows you to quickly build integrated O&M dashboards.
- Alibaba Cloud Grafana Service provides exclusive instances, service-level agreement (SLA) assurance, and reliable O&M. Grafana Service also ensures high availability and elasticity of the monitoring system at lower maintenance cost.
- Alibaba Cloud Grafana Service supports Alibaba Cloud Single Sign-On (SSO) and self-managed account systems to implement fine-grained management of data sources and dashboards without compromising data security.
- Alibaba Cloud Grafana Service can resolve the following issues:
  - Difficulty in data aggregation: The monitoring data of various cloud services is difficult to aggregate and unify, which increases the difficulty of O&M.
  - Difficulty in O&M: The core metrics in the monitoring charts of various cloud services must be repeatedly configured.
  - Difficulty in alert management: The alert rules of various cloud services are scattered and difficult to manage.
- Alibaba Cloud Grafana Service can provide the following capabilities:
  - Default integrations: By default, Alibaba Cloud Grafana Service is integrated with key Alibaba Cloud services, such as elastic computing services and database services.
  - Unified dashboards: A unified dashboard system is established across data sources to optimize visualized O&M.
  - Unified alerting: You can easily build an integrated alerting system to improve the efficiency of alert management.
Advantage 5: integration of the Alibaba Cloud alerting system
By default, Alibaba Cloud Managed Service for Prometheus is integrated with the Alert Management sub-service of ARMS. The Alert Management sub-service has the following features:
- Globalization
  - You can globalize alert rule templates to configure alerting for global events.
  - You can globalize contacts and notification policies by configuring simple settings.
- Event collection with higher management efficiency
  - You can integrate Alert Management with common monitoring services of Alibaba Cloud. You can also integrate Alert Management with third-party monitoring services for centralized management.
  - Alert Management provides stable alert event handling capabilities. You can handle alert events 24/7.
  - Alert Management ensures low latency for handling a large number of alert events.
- Timely and accurate alert notifications
  - You can configure notification policies and compress alert events. This reduces the O&M workloads.
  - You can select one or more notification methods based on the urgency of an alert. For example, you can send alert notifications to contacts by email, SMS, phone call, or DingTalk to remind the contacts to handle the alert.
  - You can configure an escalation policy to send notifications to contacts multiple times if an alert remains unhandled for a long period of time.
- Efficient alert management
  - Contacts can use DingTalk to handle alerts anytime.
  - Alerts use a common format, which allows contacts to better analyze alerts.
  - Multiple contacts can work together through DingTalk to handle alerts.
- Alert event reprocessing
  - You can use event processing flows to orchestrate simple procedures and process alert events that are reported by an alert source. This meets your specific requirements on event handling in various scenarios.
  - You can deduplicate, compress, denoise, and silence alerts that are reported by an alert source. This converges alerts and reduces alert storms.
- Alert configuration management
  - Alert Management provides monitoring templates that contain common core metrics of Kubernetes clusters. Alert Management also provides the alert template feature to automatically generate and send alert templates. This way, you can configure multiple alerts at a time.
  - Alert Management provides a visualized alert configuration wizard and preview. You can view and precisely configure alert conditions and events in real time.
  - You can view alerting statistics, analyze alert handling results in real time, improve alert handling efficiency, and monitor business status.
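To make the ideas of alert deduplication and compression more concrete, the following Python sketch shows one simple way that duplicate alerts can be dropped and the remaining alerts grouped by name. It is a generic illustration, not the implementation of Alert Management. The alert structure (`name` plus `labels`) and the sample alert names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fingerprint(alert: dict) -> Tuple:
    """Identify an alert by its name plus its sorted label pairs."""
    return (alert["name"], tuple(sorted(alert["labels"].items())))


def deduplicate(alerts: List[dict]) -> List[dict]:
    """Keep the first occurrence of each distinct alert, preserving order."""
    seen = set()
    unique = []
    for alert in alerts:
        key = fingerprint(alert)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def group_by_name(alerts: List[dict]) -> Dict[str, List[dict]]:
    """Compress alerts into one group per alert name."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["name"]].append(alert)
    return dict(groups)


# Hypothetical alerts reported by an alert source; the second one is a duplicate.
alerts = [
    {"name": "PodCrashLooping", "labels": {"pod": "web-1"}},
    {"name": "PodCrashLooping", "labels": {"pod": "web-1"}},
    {"name": "PodCrashLooping", "labels": {"pod": "web-2"}},
    {"name": "NodeNotReady", "labels": {"node": "node-1"}},
]

unique = deduplicate(alerts)
groups = group_by_name(unique)
print(len(unique))                                         # 3
print({name: len(items) for name, items in groups.items()})
```

Sending one notification per group instead of one per alert is what reduces the volume during an alert storm.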
Example: Monitor a multi-cloud Kubernetes cluster in Alibaba Cloud Managed Service for Prometheus
Prerequisites
- Resource Access Management (RAM) is activated in the RAM console. Auto Scaling is activated in the Auto Scaling console.
- The Kubernetes cluster is connected to Alibaba Cloud over the Internet or an internal network. For more information, see What are the requirements for connecting an external cluster to the cluster registration proxy?
Step 1: Create a registered Kubernetes cluster
- Log on to the Container Service for Kubernetes (ACK) console.
- In the left-side navigation pane, click Clusters.
- In the upper-right corner of the Clusters page, click Create Kubernetes Cluster.
- On the Register Cluster tab, set the parameters. For more information, see Register an external Kubernetes cluster.
- On the right side of the page, click Create. You can view the registered cluster on the Clusters page.
Step 2: Manage a multi-cloud Kubernetes cluster in the registered Kubernetes cluster
In this example, Tencent Kubernetes Engine (TKE) is used to describe how to manage a TKE cluster in a registered Kubernetes cluster, and capture and display metrics in Alibaba Cloud Managed Service for Prometheus.
- On the Clusters page, find the registered cluster that you created in Step 1: Create a registered Kubernetes cluster and click Details in the Actions column.
- Click the Connection Information tab. On the Public Access tab, view the cluster credentials for connecting to the cluster over the Internet and click Copy.
- Log on to the Tencent Cloud TKE console. On the Clusters page, click the name of the TKE cluster. In the upper-right corner of the page, click Create Resource in YAML. In the dialog box that appears, paste the cluster credentials that you copied in the previous step to the editor, and click OK. Then, check the running status of the ack-cluster-agent Deployment on the Clusters page. If the ack-cluster-agent Deployment is running as expected, the installation is successful.
- Log on to the Container Service for Kubernetes (ACK) console. On the Clusters page, check the status of the registered Kubernetes cluster that you created in Step 1: Create a registered Kubernetes cluster. If the registered Kubernetes cluster is in the Running state, the TKE cluster is managed.
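If you prefer to verify the agent from a command line instead of the console, the following Python sketch checks whether a Deployment reports all desired replicas as ready. It reads the standard `spec.replicas` and `status.readyReplicas` fields of a Deployment object, for example from the output of `kubectl get deployment ack-cluster-agent -n <namespace> -o json`. The namespace is left as a placeholder because this topic does not state it, and the sample object is hypothetical.

```python
import json


def deployment_is_ready(deployment: dict) -> bool:
    """Return True if every desired replica of the Deployment is ready."""
    # spec.replicas defaults to 1 when it is not set.
    desired = deployment.get("spec", {}).get("replicas", 1)
    # status.readyReplicas is omitted by the API server when no replica is ready.
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready >= desired


# Hypothetical, trimmed output of `kubectl get deployment ... -o json`.
sample = """
{
  "metadata": {"name": "ack-cluster-agent"},
  "spec": {"replicas": 2},
  "status": {"readyReplicas": 2}
}
"""

print(deployment_is_ready(json.loads(sample)))  # True
```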
Step 3: Install Prometheus components
- Log on to the Container Service for Kubernetes (ACK) console.
- In the left-side navigation pane, click Clusters. On the Clusters page, click the registered Kubernetes cluster.
- In the left-side navigation pane, choose .
- In the Logs and Monitoring section of the Add-ons page, click Install in the ack-arms-prometheus component card to install the ack-arms-prometheus component.
- Log on to the Tencent Cloud TKE console. On the Clusters page, click the name of the TKE cluster. In the left-side navigation pane, choose . Then, select the arms-prom namespace to view the status of the arms-prometheus-ack-arms-prometheus component. If the component is running as expected, the installation is successful.
- Log on to the ARMS console.
- In the left-side navigation pane, choose .
- Click the Prometheus instance that monitors the registered Kubernetes cluster created in Step 1: Create a registered Kubernetes cluster.
- In the left-side navigation pane, click Service Discovery. On the Targets tab, view the status of the Target configured by default. If the Target is in the Collecting state, Managed Service for Prometheus is collecting metric data. Click the name of the Target to view the specific source data.
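The component check in the arms-prom namespace can also be scripted. The following Python sketch lists the pods that are not in the Running phase, based on the standard `items[].status.phase` field in the output of `kubectl get pods -n arms-prom -o json`. The pod names in the sample are hypothetical.

```python
from typing import List


def pods_not_running(pod_list: dict) -> List[str]:
    """Return the names of pods whose phase is not Running."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


# Hypothetical, trimmed output of `kubectl get pods -n arms-prom -o json`.
sample_pods = {
    "items": [
        {"metadata": {"name": "arms-prometheus-example-1"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "arms-prometheus-example-2"}, "status": {"phase": "Pending"}},
    ]
}

print(pods_not_running(sample_pods))  # ['arms-prometheus-example-2']
```

An empty result means that every pod is in the Running phase. It does not prove that metrics are being collected, so the Targets tab remains the authoritative check.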
Step 4: View monitoring data
By default, Managed Service for Prometheus is integrated with Grafana dashboards to allow you to view monitoring data, such as the Deployment dashboard and DaemonSet dashboard. You can perform the following steps to view monitoring data on dashboards:
- Log on to the ARMS console.
- In the left-side navigation pane, choose .
- Click the Prometheus instance that monitors the registered Kubernetes cluster created in Step 1: Create a registered Kubernetes cluster.
- In the left-side navigation pane, click Dashboards. On the Dashboards page, you can click the name of a dashboard to view detailed metrics.
Step 5: View Alibaba Cloud Managed Service for Prometheus alerts
By default, Managed Service for Prometheus enables the monitoring of core metrics for Kubernetes clusters. This prevents errors that may occur if you enable the monitoring of these metrics manually. In addition, Managed Service for Prometheus is integrated with a variety of alert templates for core metrics. You can use these alert templates based on your business requirements without writing PromQL statements. To view Managed Service for Prometheus alerts, perform the following steps:
- Log on to the ARMS console.
- In the left-side navigation pane, choose .
- Click the Prometheus instance that monitors the registered Kubernetes cluster created in Step 1: Create a registered Kubernetes cluster.
- In the left-side navigation pane, click Alert Rules. On the Prometheus Alert Rules page, view the alerts.
Activation
- Registered Kubernetes clusters: For information about how to activate a registered Kubernetes cluster, see Register an external Kubernetes cluster.
- Managed Service for Prometheus: Managed Service for Prometheus provides the subscription billing method. Compared with the pay-as-you-go billing method, the subscription billing method saves at least 67% of your cost.