Realtime Compute for Apache Flink:Monitoring Metrics

Last Updated: Jan 21, 2026

This topic describes the metrics supported by fully managed Flink.

Notes

Data discrepancies between Cloud Monitor and the Flink console

  1. Differences in display dimensions
    The Flink console uses Prometheus Query Language (PromQL) queries to display only the maximum latency. This is because in real-time computing scenarios, the average latency can easily mask critical issues such as data skew or single-partition blocking. Only the maximum latency provides valuable information for operations and maintenance (O&M).

  2. Value drift
    Cloud Monitor uses a pre-aggregation mechanism to calculate metrics. Because of differences in aggregation windows, sampling times, or calculation logic, the maximum value displayed in Cloud Monitor may differ slightly from the real-time value displayed in the Flink console. When troubleshooting, treat the data in the Flink console as authoritative.

Data latency and Watermark configuration

  1. Latency calculation logic
    The current monitoring metric, Emit Delay, is calculated based on event time. The formula is as follows:

    Delay = Current system time - Logical time field in the data payload (for example, PriceData.time)

    This means the metric reflects the freshness of the data, not the processing speed of the system. The metric value will be high if the data itself is old or if the system pauses output to wait for watermark alignment.

  2. Recommended adjustments

    Scenario 1: The business logic relies heavily on watermarks for correctness, but the data source is old

    • Typical situations:

      • Upstream data transmission has inherent latency, such as slow instrumentation reporting.

      • Historical data is being backfilled, for example, reprocessing data from the previous day.

      • The business logic must rely on watermarks to handle out-of-order events, so watermarks cannot be disabled.

    • Symptom: Monitoring alerts show high latency, but the Kafka consumer group has no message accumulation (Lag ≈ 0) and the CPU load is low.

    • Recommendations:

      1. Ignore this latency metric: A high delay in this case is normal business behavior because it reflects that the data is old. This does not indicate a system fault.

      2. Change the monitoring metric: O&M engineers should monitor the Kafka Consumer Lag (message accumulation) instead. As long as the accumulation does not continuously increase, the system's processing capability is normal and no intervention is needed.
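
      A minimal sketch of such a lag check, assuming hypothetical placeholder values for the broker address and the consumer group ID used by the Flink job. Note that Flink commits offsets back to Kafka only when a checkpoint completes, so the committed offsets can trail the job's actual read position by up to one checkpoint interval. The same number appears in the LAG column of kafka-consumer-groups.sh --describe, and as the pendingRecords metric for connectors that report it.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "broker-1:9092";          // placeholder broker address
        String groupId = "my-flink-consumer-group";  // placeholder group ID used by the job

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
            // Offsets the group last committed (for Flink, updated when checkpoints complete).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();
            // Latest offsets currently available in the subscribed topic partitions.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

            long totalLag = 0;
            for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
                totalLag += Math.max(0, endOffsets.get(e.getKey()) - e.getValue().offset());
            }
            // If this value does not keep growing, processing capacity is sufficient,
            // even when the event-time based latency metric looks high.
            System.out.println("Total consumer lag: " + totalLag + " records");
        }
    }
}
```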

    Scenario 2: Real-time performance is prioritized, and minor out-of-order events or data loss is tolerable

    • Typical situations:

      • For dashboards or real-time risk control, output is slow because data is waiting for watermarks.

      • The business cares more about 'when the data was received' than 'the timestamp within the data'.

    • Symptom: The data stream arrives in real time, but because the watermark is configured with a large tolerance window, such as allowing a 10-second delay, the data output is delayed by 10 seconds.

    • Recommendations:

      1. Remove or disable watermarks: You can switch to processing time for calculations or set the watermark waiting threshold to 0, as shown in the sketch after this list.

      2. Expected result: The latency metric drops immediately to a value close to the actual processing time, and data is processed and output as it arrives without waiting for alignment.
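
A minimal sketch of both adjustments using the Flink DataStream API, assuming a hypothetical PriceData class whose time field carries the event timestamp in milliseconds (mirroring the PriceData.time example above). In SQL jobs, the equivalent change is to remove the delay in the WATERMARK clause or to compute on a PROCTIME() column instead.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class WatermarkTuning {

    // Hypothetical payload type; its time field mirrors PriceData.time from the formula above.
    public static class PriceData {
        public long time; // event timestamp in epoch milliseconds
    }

    public static void main(String[] args) {
        // Option 1: keep event time but set the watermark waiting threshold to 0,
        // so output is no longer held back by an out-of-orderness tolerance window.
        WatermarkStrategy<PriceData> zeroDelay =
            WatermarkStrategy.<PriceData>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner((event, recordTimestamp) -> event.time);

        // Option 2: do not generate event-time watermarks at all and rely on
        // processing-time semantics (for example, processing-time windows) downstream.
        WatermarkStrategy<PriceData> noWatermarks = WatermarkStrategy.noWatermarks();

        // Either strategy is attached when the source is added, for example:
        // env.fromSource(kafkaSource, zeroDelay, "price-source");
    }
}
```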

Metric characteristics in typical scenarios

Metrics reflect only the current state of a component and are not sufficient to determine the root cause of a problem. You should always use the backpressure detection feature in the Flink UI and other supporting tools for a comprehensive diagnosis.

1. Operator backpressure

Symptom: Insufficient downstream processing capacity causes the source's sending rate to drop.

  • Detection method: Use the backpressure monitoring panel in the Flink UI.

  • Metric characteristics:

    • sourceIdleTime periodically increases.

    • currentFetchEventTimeLag and currentEmitEventTimeLag continuously increase.

    • Extreme case: If an operator is completely stuck, sourceIdleTime will continuously increase.

2. Source performance bottleneck

Symptom: The source's read speed has reached its limit but still cannot meet data processing demands.

  • Detection method: No backpressure is detected in the job.

  • Metric characteristics:

    • sourceIdleTime remains at a very low value, which indicates that the source is operating at full capacity.

    • currentFetchEventTimeLag and currentEmitEventTimeLag have similar and high values.

3. Data skew or empty partitions

Symptom: Data is unevenly distributed across upstream Kafka partitions, or empty partitions exist.

  • Detection method: Observe the metric differences between various sources.

  • Metric characteristics:

    • The sourceIdleTime of a specific source subtask is significantly higher than that of the others, which indicates that this parallel instance is idle, for example, because it is assigned an empty partition.

4. Data latency analysis

Symptom: The overall job latency is high, and you need to locate whether the bottleneck is within the source or in an external system.

  • Detection method: Analyze the combination of idle time, lag difference, and message accumulation.

  • Metric characteristics:

    • High sourceIdleTime:
      This indicates that the source is idle. It usually means that the data output rate of the external system is low, not that Flink processing is slow.

    • Lag difference analysis:
      Compare the difference between currentEmitEventTimeLag and currentFetchEventTimeLag, which is the time that data resides within the source operator (see the sketch after this list):

      • Small difference (the two metrics are close): Insufficient pull capability. The bottleneck is network I/O bandwidth or insufficient source parallelism.

      • Large difference: Insufficient processing capability. The bottleneck is inefficient data parsing or downstream backpressure.

    • pendingRecords (if supported by the connector):
      This metric directly reflects the amount of data retained externally. A higher value indicates more severe data accumulation in the external system.
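
The lag difference analysis above can be expressed as a small helper. This is an illustrative sketch only; the 1-second threshold used to decide whether the two lags are "close" is an assumption, not a value from this document.

```java
public class SourceLagDiagnosis {

    /**
     * Interprets the gap between currentEmitEventTimeLag and currentFetchEventTimeLag,
     * which approximates how long records stay inside the source operator.
     * Both inputs are in milliseconds.
     */
    static String diagnose(long fetchLagMs, long emitLagMs) {
        long residenceMs = emitLagMs - fetchLagMs;
        if (residenceMs < 1_000) { // assumed threshold for "the two metrics are close"
            return "Pull-side bottleneck: check network I/O bandwidth and source parallelism.";
        }
        return "Processing-side bottleneck (" + residenceMs + " ms spent in the source): "
            + "check data parsing efficiency and downstream backpressure.";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(58_000, 60_000)); // lags close together -> pull capability limited
        System.out.println(diagnose(5_000, 60_000));  // large gap -> processing capability limited
    }
}
```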

Overview

Metric

Definition

Details

Unit

Supported connectors

Num of Restarts

The number of times the job restarted due to errors.

The number of times a job restarts due to errors, excluding JobManager (JM) failovers. Use this to check job availability and status.

Count

Not applicable

currentEmitEventTimeLag

Business latency.

A high value indicates potential latency in data pulling or processing.

ms

  • Kafka

  • RocketMQ

  • SLS

  • DataHub

  • Postgres CDC

  • Hologres (Binlog Source)

currentFetchEventTimeLag

Transmission latency.

A high value indicates potential latency in data pulling. Check the network I/O or upstream system. By comparing this with currentEmitEventTimeLag, you can analyze the source's processing capability based on the difference (the time data stays in the source). Details are as follows:

  • If the two latency values are very close, the source's ability to pull data from the external system is insufficient due to network I/O or concurrency issues.

  • If the difference between the two latency values is large, the job's processing capability is insufficient, causing data to be retained in the source. On the details page of the target job, click the Status tab. On the BackPressure page, locate the problematic Vertex topology. Then, go to the Thread Dump page to analyze the stack and identify the bottleneck.

ms

  • Kafka

  • RocketMQ

  • SLS

  • DataHub

  • Postgres CDC

  • Hologres (Binlog Source)

numRecordsIn

Total number of input records for all operators.

If the numRecordsIn value for an operator does not increase for a long time, all upstream data might already have been consumed and no new data is arriving. Check the upstream data source.

Count

All built-in connectors are supported.

numRecordsOut

Total number of output records.

If the numRecordsOut value for an operator does not increase for a long time, there might be a logic error in the job code causing data to be dropped. Check the job code logic.

Count

All built-in connectors are supported.

numRecordsInOfSource

Input records for the source operator only.

Check the upstream data input status.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSink

Total number of output records from the sink.

Check the data output status.

Count

  • Kafka

  • SLS

  • DataHub

  • Hologres

  • HBase

  • Tablestore

  • Redis

numRecordsInPerSecond

Number of input records per second for the entire data stream.

Use this for scenarios that require monitoring the processing speed of the entire data stream. For example, you can use numRecordsInPerSecond to observe whether the processing speed of the entire data stream meets expectations and how performance changes under different input data loads.

records/s

All built-in connectors are supported.

numRecordsOutPerSecond

Number of output records per second for the entire data stream.

Measures the number of records output per second for the entire data stream. Use this for scenarios that require monitoring the output speed of the entire data stream.

For example, you can use numRecordsOutPerSecond to observe whether the output speed of the entire data stream meets expectations and how performance changes under different output data loads.

records/s

All connectors are supported.

numRecordsInOfSourcePerSecond (IN RPS)

Number of input records per second at the data source.

Measures the number of records generated per second by each data source. This is useful for understanding the generation speed of each source. For example, in a data stream, different data sources may produce different numbers of records. Use numRecordsInOfSourcePerSecond to understand the generation speed of each data source and adjust the data stream for better performance. This data is also used for monitoring and alerting.

If this value is 0, the upstream might have consumed all the data, or output is blocked because upstream data has not been consumed. Check the upstream data.

records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSinkPerSecond (OUT RPS)

Number of output records per second at the data sink.

Measures the number of records output per second by each sink. This is useful for understanding the output speed of each sink. For example, in a data stream, different sinks may output different numbers of records.

Use numRecordsOutOfSinkPerSecond to understand the output speed of each sink and adjust the data stream for better performance. This data is used for monitoring and alerting. If this value is 0, there might be a logic error in the job code that filters out all data. Check the job code logic.

records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • SLS

  • DataHub

  • Hologres

  • HBase

  • Tablestore

  • Redis

pendingRecords

Number of unread records at the source.

The number of data records in the external system that have not yet been pulled by the source.

Count

  • Kafka

  • ElasticSearch

sourceIdleTime

The duration for which the source has not processed any new data.

This metric indicates whether the source is idle. A large value indicates that the data generation rate in the external system is low.

ms

  • Kafka

  • RocketMQ

  • Postgres CDC

  • Hologres (Binlog Source)

System checkpoints

Metric

Definition

Details

Unit

Num of Checkpoints

Number of checkpoints.

Provides an overview of checkpoint status to help you set up checkpoint alerts.

Count

lastCheckpointDuration

Duration of the last checkpoint.

If a checkpoint takes too long or times out, it may be due to a large state, temporary network issues, unaligned barriers, or data backpressure.

ms

lastCheckpointSize

Size of the last checkpoint.

The actual uploaded size of the last checkpoint. This helps analyze checkpoint performance when bottlenecks occur.

Bytes

State

Note

The state latency metrics are available only after you enable them. In the advanced Flink configurations, set state.backend.latency-track.keyed-state-enabled: true. Enabling state latency metrics may affect the runtime performance of the job.

Metric

Definition

Details

Unit

Version limitations

State Clear Latency

Maximum latency of a single state clear operation.

View the performance of state cleanup.

Nanosecond (ns)

VVR 4.0.0 or later.

Value State Latency

Maximum latency of a single Value State access.

View the performance of Value State access.

ns

Aggregating State Latency

Maximum latency of a single Aggregating State access.

View the performance of Aggregating State access.

ns

Reducing State Latency

Maximum latency of a single Reducing State access.

View the performance of Reducing State access.

ns

Map State Latency

Maximum latency of a single Map State access.

View the performance of Map State access.

ns

List State Latency

Maximum latency of a single List State access.

View the performance of List State access.

ns

Sorted Map State Latency

Maximum latency of a single Sorted Map State access.

View the performance of Sorted Map State access.

ns

State Size

Size of the state data.

By observing this metric, you can:

  • Directly or proactively locate nodes that may have state bottlenecks.

  • Determine if TTL is effective.

Bytes

VVR 4.0.12 or later.

State File Size

Size of the state data file.

By observing this metric, you can:

  • View the disk space occupied by the state on the local disk and take measures in advance if the size is large.

  • Determine if insufficient local disk space is caused by excessively large state data.

Bytes

VVR 4.0.13 or later.

I/O

Metric

Definition

Details

Unit

Supported connectors

numBytesIn

Total input bytes.

View the input throughput from the upstream to observe job traffic.

Bytes

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

numBytesInPerSecond

Total input bytes per second.

View the input stream rate from the upstream to observe job traffic.

Bytes/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

numBytesOut

Total output bytes.

View the output throughput to observe job traffic.

Bytes

  • Kafka

  • RocketMQ

  • DataHub

  • HBase

numBytesOutPerSecond

Total output bytes per second.

View the output throughput rate to observe job traffic.

Bytes/s

  • Kafka

  • RocketMQ

  • DataHub

  • HBase

Task numRecords I/O

The total number of records that each subtask receives and outputs.

Use this metric to determine if the job has an I/O bottleneck.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

  • HBase

  • Tablestore

  • Redis

Task numRecords I/O PerSecond

The total number of records received and sent by each subtask per second.

Determine if the job has an I/O bottleneck and assess its severity based on the rate.

records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

  • HBase

  • Tablestore

  • Redis

currentSendTime

Time taken for each subtask to send the last record to the downstream system.

A large value for this metric indicates that the subtask writes to the downstream system slowly.

ms

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

  • SLS

  • DataHub

  • Hologres

    Note

    Supported in JDBC and RPC modes. Not supported in BHClient mode.

  • HBase

  • Tablestore

  • Redis

Watermark

Metric

Definition

Details

Unit

Supported connectors

Task InputWatermark

Time when each task received the latest watermark.

Indicates the data receiving latency at the TaskManager (TM).

None

Not applicable to connectors

watermarkLag

Watermark latency.

Determine the job latency at the subtask level.

ms

  • Kafka

  • RocketMQ

  • SLS

  • DataHub

  • Hologres (Binlog Source)

CPU

Metric

Definition

Details

Unit

JM CPU Usage

CPU usage of a single JM.

This value reflects Flink's use of CPU time slices. 100% means one CPU core is fully used. 400% means four cores are fully used. If this value is consistently above 100%, the CPU is busy. If the load is high but CPU usage is low, it may be due to too many processes in an uninterruptible sleep state caused by frequent I/O operations.

Note

This metric is supported only in VVR 6.0.6 or later.

None

TM CPU Usage

CPU usage of a single TM.

This value reflects Flink's use of CPU time slices. 100% means one CPU core is fully used. 400% means four cores are fully used. If this value is consistently above 100%, the CPU is busy. If the load is high but CPU usage is low, it may be due to too many processes in an uninterruptible sleep state caused by frequent I/O operations.

None

Memory

Metric

Definition

Details

Unit

JM Heap Memory

Heap memory of the JM.

View changes in the JM's heap memory.

Bytes

JM NonHeap Memory

Non-heap memory of the JM.

View changes in the JM's non-heap memory.

Bytes

TM Heap Memory

Heap memory of the TM.

View changes in the TM's heap memory.

Bytes

TM NonHeap Memory

Non-heap memory of the TM.

View changes in the TM's non-heap memory.

Bytes

TM Mem (RSS)

Resident set size (RSS) of the entire TM process, as reported by Linux.

View changes in the process memory.

Bytes

JVM

Metric

Definition

Details

Unit

JM Threads

Number of JM threads.

Too many JM threads can consume excessive memory and reduce job stability.

Count

TM Threads

Number of TM threads.

Too many TM threads can consume excessive memory and reduce job stability.

Count

JM GC Count

Number of JM GC events.

Too many GC events can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot job-level failures.

Count

JM GC Time

Duration of each JM GC event.

Long GC times can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot job-level failures.

ms

TM GC Count

Number of TM GC events.

Too many GC events can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot task-level failures.

Count

TM GC Time

Duration of each TM GC event.

Long GC times can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot task-level failures.

ms

JM ClassLoader

Total number of classes loaded or unloaded by the JM's JVM since its creation.

If the total number of loaded or unloaded classes by the JM's JVM is too large, it can consume excessive memory and affect job performance.

None

TM ClassLoader

Total number of classes loaded or unloaded by the TM's JVM since its creation.

If the total number of classes loaded or unloaded by the TM's JVM is too large, it can consume excessive memory and affect job performance.

None

Connector - MySQL

Metric

Definition

Unit

Application Scenario

Version Limitations

isSnapshotting

Indicates whether the job is in the full data processing phase (1 means it is).

None

Determine the job processing phase.

VVR 8.0.9 or later.

isBinlogReading

Indicates whether the job is in the incremental data processing phase (1 means it is).

None

Determine the job processing phase.

Num of remaining tables

Number of tables waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed tables.

Num of snapshotted tables

Number of tables already processed in the full data phase.

Count

View the number of processed tables.

Num of remaining SnapshotSplits

Number of shards waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed shards.

Num of processed SnapshotSplits

Number of shards already processed in the full data phase.

Count

View the number of processed shards.

currentFetchEventTimeLag

Latency between data generation and being read from the database.

ms

View the latency of reading binary logs from the database.

currentReadTimestampMs

Timestamp of the latest data record read.

ms

View the time of the latest data read.

numRecordsIn

Number of records already read.

Count

View the total volume of processed data.

numSnapshotRecords

Number of records processed in the full data phase.

Count

View the data volume processed in the full data phase.

numRecordsInPerTable

Number of records read from each table.

Count

View the total data volume processed for each table.

numSnapshotRecordsPerTable

Number of records processed for each table in the full data phase.

Count

View the data volume processed for each table in the full data phase.

Connector - Kafka

Metric

Definition

Unit

Application Scenario

Limitations

commitsSucceeded

Number of successful offset commits.

Count

Verify that the offset commit was successful.

VVR 8.0.9 or later.

commitsFailed

Number of failed offset commits.

Count

Verify that the offset commit is successful.

Fetch Rate

Data pull frequency.

times/s

Determine data pull latency and speed.

Fetch Latency Avg

Average latency of data pull operations.

ms

Determine data pull latency and speed.

Fetch Size Avg

Average bytes per pull.

Bytes

Determine data pull latency and speed.

Avg Records In Per-Request

Average number of messages per pull.

Count

Determine data pull latency and speed.

currentSendTime

Time the last record was sent.

None

Determine consumption progress.

batchSizeAvg

Average bytes per batch.

Bytes

Determine data write latency and speed.

requestLatencyAvg

Average request latency.

ms

Determine data write latency and speed.

requestsInFlight

Number of in-flight requests.

None

Determine data write latency and speed.

recordsPerRequestAvg

Average number of messages per request.

Count

Determine data write latency and speed.

recordSizeAvg

Average message size.

Bytes

Determine data write latency and speed.

Connector - Paimon

Metric

Definition

Unit

Application Scenario

Limitations

Number of Writers

Number of writer instances.

Count

Indicates how many buckets are currently being written to. A large number can affect write efficiency and increase memory consumption. Analyze whether the number of buckets or partition key settings are reasonable.

VVR 8.0.9 or later.

Max Compaction Thread Busy

Maximum busy level of small file compaction threads.

Ratio

Among the buckets currently being written to, this is the maximum percentage of time that compaction threads were active in the last minute. It reflects the pressure on small file compaction.

Average Compaction Thread Busy

Average busy level of small file compaction threads.

Ratio

Among the buckets currently being written to, this is the average percentage of time that compaction threads were active in the last minute. It reflects the pressure on small file compaction.

Max Number of Level 0 Files

Maximum number of Level 0 files.

Count

The maximum number of Level 0 files (small files) among the buckets currently being written to. This is only meaningful for primary key tables and reflects whether the compaction efficiency can keep up with the write efficiency.

Average Number of Level 0 Files

Average number of Level 0 files.

Count

The average number of Level 0 files (small files) among the buckets currently being written to. This is only meaningful for primary key tables and reflects whether the compaction efficiency can keep up with the write efficiency.

Last Commit Duration

Duration of the last commit.

ms

If the duration is too long, check if too many buckets are being written to simultaneously.

Number of Partitions Last Committed

Number of partitions written to in the last commit.

Count

A large number can affect write efficiency and increase memory consumption. Analyze whether the number of buckets or partition key settings are reasonable.

Number of Buckets Last Committed

Number of buckets written to in the last commit.

Count

A large number can affect write efficiency and increase memory consumption. Analyze whether the number of buckets or partition key settings are reasonable.

Used Write Buffer

Used write buffer memory size.

Bytes

The used buffer size for writer nodes across all task managers. This buffer occupies Java heap memory. If set too large, it may cause an out-of-memory (OOM) error.

Total Write Buffer

Total allocated write buffer memory size.

Bytes

The configured buffer size for writer nodes across all task managers. This buffer occupies Java heap memory. If set too large, it may cause an OOM error.

Data ingestion

Metric

Definition

Unit

Application Scenario

Version limitations

isSnapshotting

Indicates whether the job is in the full data processing phase (1 means it is).

None

Determine the job processing phase.

VVR 8.0.9 or later.

isBinlogReading

Indicates whether the job is in the incremental data processing phase (1 means it is).

None

Determine the job processing phase.

Num of remaining tables

Number of tables waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed tables.

Num of snapshotted tables

Number of tables already processed in the full data phase.

Count

View the number of processed tables.

Num of remaining SnapshotSplits

Number of shards waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed shards.

Num of processed SnapshotSplits

Number of shards already processed in the full data phase.

Count

View the number of processed shards.

currentFetchEventTimeLag

Latency between data generation and being read from the database.

ms

View the latency of reading binary logs from the database.

currentReadTimestampMs

Timestamp of the latest data record read.

ms

View the time of the latest data read.

numRecordsIn

Number of records already read.

Count

View the total volume of processed data.

numRecordsInPerTable

Number of records read from each table.

Count

View the total data volume processed for each table.

numSnapshotRecords

Number of records processed in the full data phase.

Count

View the data volume processed in the full data phase.

numSnapshotRecordsPerTable

Number of records processed for each table in the full data phase.

Count

View the data volume processed for each table in the full data phase.