Monitoring has always been a core component of IT systems: it is responsible for discovering problems and helping to locate their causes. Whether the practice is traditional O&M, SRE, or DevOps, developers need to pay attention to the monitoring system and participate in its construction and optimization. Starting from basic indicators of mainframe operating systems and Linux, monitoring systems gradually appeared and evolved. Today there are hundreds of monitoring systems, which can be divided into categories from many different points of view.
There are many options for building an internal monitoring platform, whether self-built on open-source solutions or based on commercial SaaS products. In either case, the actual implementation must answer the same question: how is the data passed to the monitoring platform, or how does the platform obtain it? This comes down to the choice of data acquisition method: Pull or Push.
A Pull-based monitoring system, as the name implies, actively fetches indicators, so the monitored objects must be remotely accessible. A Push-based monitoring system does not fetch data itself; instead, the monitored objects proactively push their indicators to it. The two methods differ in many aspects. When building or selecting a monitoring system, we must understand the advantages and disadvantages of both in advance and choose the appropriate scheme; otherwise, the later costs of keeping the monitoring system stable and of deployment and O&M will be huge.
The following comparison summarizes the aspects in which Pull and Push monitoring systems differ: deployment, scalability, liveness detection, data completeness, support for short-lifecycle applications, flexibility, coupling, and cost. Each aspect is elaborated below.
As shown in the figure above, the core of the Pull model's data acquisition is the Pull module, which is generally deployed together with the monitoring backend, such as Prometheus. The core components include a service discovery module that maintains the list of monitoring targets, the Pull (scrape) module that periodically requests indicators from each target, and the metrics endpoints or exporters exposed by the monitored applications.
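To make the application's role in the Pull model concrete, here is a minimal sketch using the Prometheus Go client (client_golang). The metric name, handler path, and port are illustrative rather than taken from any specific deployment.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is a sample counter that the Pull module can scrape.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "app_requests_total",
	Help: "Total number of handled requests.",
})

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc() // business logic would go here
	})
	// The application only exposes /metrics; the Pull module (e.g. Prometheus)
	// finds this target through service discovery and scrapes it periodically.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```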
The Push model is simpler: the monitored applications, or a Push Agent deployed next to them, proactively send indicators to the monitoring backend at their own reporting interval.
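Correspondingly, here is a minimal sketch of the Push side: the application (or a co-located Push Agent) reports on its own schedule to whatever address it has been given. The environment variable, endpoint, and line-protocol payload are assumptions for illustration, not a specific product API.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Backend (or local Push Agent) address, e.g. injected by a configuration center.
	// The URL is a placeholder, such as "http://push-agent:8186/write".
	endpoint := os.Getenv("METRICS_ENDPOINT")

	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// InfluxDB line protocol is one common push format (used by the TICK stack).
		line := fmt.Sprintf("app_requests_total,host=%s value=%d %d\n",
			hostname(), currentRequestCount(), time.Now().UnixNano())
		// The application decides when to report; the backend just receives.
		if resp, err := http.Post(endpoint, "text/plain", bytes.NewBufferString(line)); err == nil {
			resp.Body.Close()
		}
	}
}

func hostname() string           { h, _ := os.Hostname(); return h }
func currentRequestCount() int64 { return 42 } // placeholder for a real counter
```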
For monitoring middleware and similar systems, the Pull model's deployment is complicated (an exporter is usually needed for each component) and its maintenance cost is high, whereas the Push mode is relatively convenient. For applications themselves, the deployment cost of exposing a metrics port and that of proactively pushing are almost the same.
In terms of scalability, Push data acquisition is natively distributed: as long as the monitoring backend has sufficient capacity, it can scale horizontally almost without limit. Scaling the Pull method is more troublesome: it requires decoupling service discovery from the Pull modules and sharding the monitoring targets across multiple Pull agents, each of which scrapes only its own share of targets.
As you might have seen, this distributed approach still has problems: the sharding logic must stay consistent with service discovery as targets come and go, a failed Pull agent leaves its whole shard unmonitored until targets are rebalanced, and adding or removing Pull agents forces targets to be reshuffled.
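As a sketch of how Pull targets are typically sharded once a single puller is no longer enough, the following hashes each target address onto a fixed number of Pull agents, similar in spirit to Prometheus's hashmod relabeling. The target addresses and shard count are made up for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor assigns a scrape target to one of totalShards Pull agents
// by hashing its address, so every agent scrapes a disjoint subset.
func shardFor(target string, totalShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(target))
	return h.Sum32() % totalShards
}

func main() {
	targets := []string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}
	const pullAgents = 2
	for _, t := range targets {
		// Each Pull agent scrapes only the targets whose shard matches its own ID;
		// when agents are added or removed, targets must be re-sharded.
		fmt.Printf("target %s -> pull agent %d\n", t, shardFor(t, pullAgents))
	}
}
```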
Liveness (survivability) detection is the first and most basic task of monitoring. In Pull mode it is relatively simple: the central Pull module knows immediately whether a target's indicators can be requested, and when a request fails, it also knows the basic cause, such as a network timeout or a refused connection.
The Push mode is more troublesome: if an application stops reporting, the cause may be an application crash, a network problem, or simply that the application has been migrated to another node. Unlike the Pull module, which can cross-check its targets against service discovery in real time, a Push-based server can only determine the specific cause after it has consulted service discovery itself.
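The extra work on the Push side can be sketched roughly as follows: the server keeps a last-seen timestamp per instance and cross-checks it against the instance list from service discovery. The staleness threshold, instance names, and map-based bookkeeping are illustrative assumptions, not a particular product's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// lastSeen records the most recent push time per instance, updated on every report.
var lastSeen = map[string]time.Time{}

// checkLiveness cross-checks the expected instance list from service discovery
// against the push timestamps. Only with this extra step can a Push-based server
// tell "instance is down" apart from "instance was legitimately removed".
func checkLiveness(expected []string, staleAfter time.Duration) {
	now := time.Now()
	for _, inst := range expected {
		t, ok := lastSeen[inst]
		switch {
		case !ok:
			fmt.Printf("%s: never reported (crashed before first push, or network issue?)\n", inst)
		case now.Sub(t) > staleAfter:
			fmt.Printf("%s: last push %v ago, marking as down\n", inst, now.Sub(t).Round(time.Second))
		}
	}
}

func main() {
	lastSeen["app-1"] = time.Now().Add(-10 * time.Second)
	lastSeen["app-2"] = time.Now().Add(-5 * time.Minute)
	// "app-3" was migrated away: it only stops alerting once service discovery drops it.
	checkLiveness([]string{"app-1", "app-2", "app-3"}, time.Minute)
}
```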
Data completeness is a crucial concept in large-scale monitoring systems. For example, to monitor the QPS of a transaction application with 1,000 replicas, we need to aggregate this indicator from 1,000 data points. Suppose an alarm is configured to fire when QPS drops by 2% and data completeness is not taken into account: if the data from more than 20 replicas is delayed by a few seconds due to network fluctuations, the aggregated QPS drops artificially and a false alarm is triggered. Therefore, data completeness must be considered when configuring alarms.
Calculating data completeness also depends on the service discovery module. The Pull method pulls data round by round, so the data set is complete once a round finishes; even if some pulls fail, the proportion of missing data is known.
In Push mode, by contrast, each agent and application pushes on its own schedule, and every client has a different push interval and network latency. The server has to estimate data completeness from historical reporting behavior, which is very costly.
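A rough sketch of how an alert evaluator might guard against this, assuming the expected replica count is known from service discovery; the 98% completeness requirement and the numbers are illustrative.

```go
package main

import "fmt"

// evaluateQPSAlert only evaluates the "QPS dropped by 2%" alert when enough
// replicas have reported, so a few delayed pushes do not trigger a false alarm.
func evaluateQPSAlert(reported, expected int, currentQPS, baselineQPS float64) {
	completeness := float64(reported) / float64(expected)
	if completeness < 0.98 {
		fmt.Printf("only %.1f%% of replicas reported, postponing evaluation\n", completeness*100)
		return
	}
	if currentQPS < baselineQPS*0.98 { // more than a 2% drop
		fmt.Println("ALERT: QPS dropped by more than 2%")
	}
}

func main() {
	// 25 of 1,000 replicas are late due to network jitter: evaluation is postponed
	// instead of alerting on the artificially lowered aggregate QPS.
	evaluateQPSAlert(975, 1000, 97500, 100000)
}
```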
Real scenarios involve many short-lifecycle and serverless applications. Especially when optimizing for cost, we run large numbers of jobs, elastic instances, and serverless applications. For example, an elastic computing instance is started when a rendering task arrives and is destroyed and released immediately after completion. The same is true of machine learning training jobs, event-driven serverless workflows, and regularly executed jobs (such as resource cleanup, capacity checks, and security scans). These applications usually live for a very short time, sometimes only seconds or milliseconds, which makes them hard to capture with the Pull model's periodic scraping. Generally, we need Push so that the applications proactively report their monitoring data.
To handle such short-lifecycle applications, pure Pull systems introduce an intermediate layer (such as the Prometheus Push Gateway) that accepts proactive pushes from applications and then exposes a Pull endpoint to the monitoring system. This adds the management and O&M cost of an extra layer; and because Push is being simulated on top of Pull, reporting latency increases, and the metrics of instances that have already disappeared must be cleaned up from the gateway in time.
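For reference, pushing from a short-lived job to a Prometheus Push Gateway with the Go client (client_golang's push package) looks roughly like this; the gateway address, job name, and metric are placeholders.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	duration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "render_job_duration_seconds",
		Help: "How long the rendering task took.",
	})
	duration.Set(42.5) // value produced by the (hypothetical) rendering task

	// The short-lived job pushes once before exiting; Prometheus later pulls
	// from the gateway as if it were a normal target.
	if err := push.New("http://pushgateway:9091", "render_job").
		Collector(duration).
		Grouping("instance", "job-42").
		Push(); err != nil {
		log.Printf("could not push to Pushgateway: %v", err)
	}
	// Note: such metrics persist on the gateway until deleted, which is the
	// cleanup cost mentioned above.
}
```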
In terms of flexibility, the Pull mode has the advantage: in the Pull module we can configure exactly which indicators we want and apply simple calculations and secondary processing to them. However, the advantage is not overwhelming, because a Push SDK or agent can be configured in the same way, and with a configuration center, managing that configuration is simple.
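As a simplified illustration of such filtering, the sketch below keeps only allow-listed metric names from a payload in the Prometheus text exposition format; the same kind of filter could equally run in a Push SDK or agent before reporting. The parsing is deliberately naive and the names are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// keepMetrics drops scraped lines whose metric name is not on the allow-list,
// a simplified version of the filtering a Pull module (or Push agent) can apply.
func keepMetrics(scraped string, allow map[string]bool) []string {
	var kept []string
	for _, line := range strings.Split(scraped, "\n") {
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and comments in the exposition format
		}
		// The metric name ends at the first '{' (labels) or ' ' (value).
		name := strings.FieldsFunc(line, func(r rune) bool { return r == '{' || r == ' ' })[0]
		if allow[name] {
			kept = append(kept, line)
		}
	}
	return kept
}

func main() {
	scraped := "# HELP app_requests_total ...\napp_requests_total 1027\napp_debug_cache_entries 9000\n"
	allow := map[string]bool{"app_requests_total": true}
	fmt.Println(keepMetrics(scraped, allow)) // only the allowed indicator survives
}
```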
In the Pull model, the coupling between applications and the backend is low: an application only needs to expose an interface the backend understands, without knowing which backend is connected or which indicators it requires. The division of labor is clear; application developers expose the indicators, and Site Reliability Engineers (SREs) collect them. The Push model is more tightly coupled, because the application must be configured with the backend address and authentication information. However, if a local Push Agent is used, the application only needs to push to the local address, which keeps this cost small.
In terms of overall cost, the two approaches end up roughly equal, but the cost is distributed differently:
In terms of O&M, the Pull mode costs more: the components to operate include the various exporters, service discovery, the Pull agents, and the monitoring backend. The Push mode only requires operating the Push agents, the monitoring backend, and optionally a configuration center (which is usually deployed together with the monitoring backend).
Note that in Pull mode the server initiates requests to the clients, so cross-cluster network connectivity and the ACLs protecting the application side must be taken into account. Push, by comparison, has simpler network requirements: the server only needs to provide a domain name or VIP that every node can reach.
Among current open-source solutions, the representative of the Pull pattern is the Prometheus family (called a family mainly because the default single-node Prometheus has limited scalability, and the community offers many distributed solutions around it, such as Thanos, VictoriaMetrics, and Cortex). The Push pattern is represented by InfluxDB's TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). Both have advantages and disadvantages. In the cloud-native era, Prometheus has thrived with the popularity of CNCF and Kubernetes, so much open-source software already provides Pull endpoints in the Prometheus format. However, many systems were not designed to expose Pull endpoints in the first place, and for them the Push-agent approach is more appropriate.
There is no definitive answer to whether an application should use Pull or Push; the choice depends on the company's actual scenarios. For example, if the company's network topology is very complex, Push is simpler; applications with short life cycles need Push; mobile applications can only use Push; and if the system already uses Consul for service discovery, exposing a Pull endpoint is easy to implement.
Therefore, the optimal solution for a company's internal monitoring system is to support both Pull and Push.
Alibaba Cloud Log Service (SLS) provides unified storage and analysis for logs, time-series monitoring data, and distributed traces. Its time-series monitoring solution is compatible with the Prometheus format standards and supports standard PromQL syntax. SLS serves hundreds of thousands of users with widely varying scenarios, and neither Pull nor Push alone can meet all of their needs, so SLS does not pick one side: it is compatible with both the Pull and the Push model. Moreover, SLS's strategy toward open-source communities and agents is to be compatible with the open-source ecosystem rather than to build a closed one.
Compared with Pull agents such as VMAgent and Prometheus, and with native Telegraf, SLS adds the capabilities that are most urgently needed: an agent configuration center and agent monitoring. The server side can manage each agent's collection configuration and monitor the agents' running status, minimizing O&M and management costs.
Therefore, building a monitoring solution with SLS is very simple in practice: deploy the agents or expose Prometheus-compatible endpoints, manage the collection configuration centrally in SLS, and query the collected metrics with PromQL.
This article has discussed the choice between Pull and Push in monitoring systems, comparing the two approaches from various angles based on several years of practical experience across many customer scenarios. We hope it is helpful when you build your own monitoring system, and we would love to hear your thoughts and questions on the topic.