This topic describes the metrics supported by fully managed Flink.
Notes
Data discrepancies between Cloud Monitor and the Flink console
Differences in display dimensions
The Flink console uses Prometheus Query Language (PromQL) queries to display only the maximum latency. This is because in real-time computing scenarios, the average latency can easily mask critical issues such as data skew or single-partition blocking, whereas the maximum latency provides information that is actually useful for operations and maintenance (O&M).
Value drift
Cloud Monitor uses a pre-aggregation mechanism to calculate metrics. Because of differences in aggregation windows, sampling times, or calculation logic, the maximum value displayed in Cloud Monitor may differ slightly from the real-time value displayed in the Flink console. When troubleshooting, treat the data in the Flink console as authoritative.
Data latency and Watermark configuration
Latency calculation logic
The current monitoring metric, Emit Delay, is calculated based on event time. The formula is as follows:
Delay = Current system time - Logical time field in the data payload (for example, PriceData.time)
This means the metric reflects the freshness of the data, not the processing speed of the system. The metric value will be high if the data itself is old or if the system pauses output to wait for watermark alignment.
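The formula can be illustrated with a minimal Java sketch. The PriceData class and its time field are simplified placeholders modeled on the example above; the actual metric is computed inside the Flink runtime.

import java.time.Duration;
import java.time.Instant;

public class EmitDelayExample {

    // Hypothetical payload with an event-time field, analogous to PriceData.time.
    record PriceData(Instant time) {}

    // Delay = current system time - logical time field in the data payload.
    static Duration emitDelay(PriceData data) {
        return Duration.between(data.time(), Instant.now());
    }

    public static void main(String[] args) {
        // A record whose event time is one hour old shows a delay of about one hour,
        // even if the job processes it within milliseconds of receiving it.
        PriceData stale = new PriceData(Instant.now().minus(Duration.ofHours(1)));
        System.out.println("Emit delay (minutes): " + emitDelay(stale).toMinutes());
    }
}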
Recommended adjustments
Scenario 1: The business logic relies heavily on watermarks for correctness, but the data source is old
Typical situations:
Upstream data transmission has inherent latency, such as slow instrumentation reporting.
Historical data is being backfilled, for example, reprocessing data from the previous day.
The business logic must rely on watermarks to handle out-of-order events, so watermarks cannot be disabled.
Symptom: Monitoring alerts show high latency, but the Kafka consumer group has no message accumulation (Lag ≈ 0) and the CPU load is low.
Recommendations:
Ignore this latency metric: A high delay in this case is normal business behavior because it reflects that the data is old. This does not indicate a system fault.
Change the monitoring metric: O&M engineers should monitor the Kafka Consumer Lag (message accumulation) instead. As long as the accumulation does not continuously increase, the system's processing capability is normal and no intervention is needed.
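One way to check accumulation outside the console is to compare the consumer group's committed offsets with the latest log-end offsets. The following is a rough Java sketch using the Kafka AdminClient; the broker address and consumer group ID are placeholders that depend on your deployment.

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; replace with your Kafka cluster endpoint.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the job's consumer group (group ID is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-flink-job-group")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets of the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag per partition = log-end offset - committed offset.
            committed.forEach((tp, offset) -> {
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.println(tp + " lag=" + lag);
            });
        }
    }
}

If the printed lag stays flat or shrinks over time, the job is keeping up with the source even though the event-time latency metric is high.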
Scenario 2: Real-time performance is prioritized, and minor out-of-order events or data loss is tolerable
Typical situations:
For dashboards or real-time risk control, output is slow because data is waiting for watermarks.
The business cares more about 'when the data was received' than 'the timestamp within the data'.
Symptom: The data arrives in real time, but because the watermark is configured with a large tolerance window, such as a 10-second allowed delay, the data output is delayed by 10 seconds.
Recommendations:
Remove or disable watermarks: You can switch to processing time for calculations or set the watermark waiting threshold to 0 (see the sketch after this scenario).
Expected result: The latency metric will drop instantly, close to the physical processing time, and data will be processed and output as it arrives without waiting for alignment.
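For a DataStream job, the two adjustments above could look roughly like the following sketch. The PriceData type is a hypothetical payload; SQL jobs would instead remove or tighten the WATERMARK clause in the source DDL.

import java.time.Duration;
import java.time.Instant;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class WatermarkTuning {

    // Hypothetical payload, matching the PriceData.time example above.
    record PriceData(Instant time) {}

    // Option 1: generate no event-time watermarks at all and rely on
    // processing-time semantics in the downstream logic.
    static WatermarkStrategy<PriceData> processingTimeOnly() {
        return WatermarkStrategy.noWatermarks();
    }

    // Option 2: keep event time but set the out-of-orderness tolerance to 0,
    // so output is not held back while waiting for late data.
    static WatermarkStrategy<PriceData> zeroTolerance() {
        return WatermarkStrategy
                .<PriceData>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner((event, ts) -> event.time().toEpochMilli());
    }
}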
Metric characteristics in typical scenarios
Metrics reflect only the current state of a component and are not sufficient to determine the root cause of a problem. You should always use the backpressure detection feature in the Flink UI and other supporting tools for a comprehensive diagnosis.
1. Operator backpressure
Symptom: Insufficient downstream processing capacity causes the source's sending rate to drop.
Detection method: Use the backpressure monitoring panel in the Flink UI.
Metric characteristics:
sourceIdleTime periodically increases.
currentFetchEventTimeLag and currentEmitEventTimeLag continuously increase.
Extreme case: If an operator is completely stuck, sourceIdleTime will continuously increase.
2. Source performance bottleneck
Symptom: The source's read speed has reached its limit but still cannot meet data processing demands.
Detection method: No backpressure is detected in the job.
Metric characteristics:
sourceIdleTime remains at a very low value, which indicates that the source is operating at full capacity.
currentFetchEventTimeLag and currentEmitEventTimeLag are both high and close to each other.
3. Data skew or empty partitions
Symptom: Data is unevenly distributed across upstream Kafka partitions, or empty partitions exist.
Detection method: Observe the metric differences between various sources.
Metric characteristics:
The sourceIdleTime of a specific source subtask is significantly higher than that of the others, which indicates that this parallel instance is idle, for example, because it consumes an empty partition.
4. Data latency analysis
Symptom: The overall job latency is high, and you need to locate whether the bottleneck is within the source or in an external system.
Detection method: Analyze the combination of idle time, lag difference, and message accumulation.
Metric characteristics:
High sourceIdleTime:
This indicates that the source is idle. It usually means that the data output rate of the external system is low, not that Flink processing is slow.
Lag difference analysis:
Compare the difference between currentEmitEventTimeLag and currentFetchEventTimeLag, which is the time that data resides within the source operator (a worked example follows this list):
Small difference (the two metrics are close): Insufficient pull capability. The bottleneck is network I/O bandwidth or insufficient source parallelism.
Large difference: Insufficient processing capability. The bottleneck is inefficient data parsing or backpressure from downstream operators.
pendingRecords (if supported by the connector):
This metric directly reflects the amount of data retained externally. A higher value indicates more severe data accumulation in the external system.
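As a worked illustration of the lag difference analysis, using hypothetical values: if currentFetchEventTimeLag is 55 seconds and currentEmitEventTimeLag is 60 seconds, records spend only about 5 seconds inside the source operator, so most of the lag accrues before the data is fetched and the likely bottleneck is pull capability (network I/O or source parallelism). If instead the fetch lag is 5 seconds and the emit lag is 60 seconds, records linger for roughly 55 seconds inside the source operator, which points to inefficient parsing or downstream backpressure.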