Realtime Compute for Apache Flink:Monitoring Metrics

Last Updated: Jan 21, 2026

This topic describes the metrics supported by fully managed Flink.

Notes

Data discrepancies between Cloud Monitor and the Flink console

  1. Differences in display dimensions
    The Flink console uses Prometheus Query Language (PromQL) queries to display only the maximum latency. This is because in real-time computing scenarios, the average latency can easily mask critical issues such as data skew or single-partition blocking. Only the maximum latency provides valuable information for operations and maintenance (O&M).

  2. Value drift
    Cloud Monitor uses a pre-aggregation mechanism to calculate metrics. Because of differences in aggregation windows, sampling times, or calculation logic, the maximum value displayed in Cloud Monitor may differ slightly from the real-time value displayed in the Flink console. When troubleshooting, treat the data in the Flink console as authoritative.

Data latency and Watermark configuration

  1. Latency calculation logic
    The current monitoring metric, Emit Delay, is calculated based on event time. The formula is as follows:

    Delay = Current system time - Logical time field in the data payload (for example, PriceData.time)

    This means the metric reflects the freshness of the data, not the processing speed of the system. The metric value will be high if the data itself is old or if the system pauses output to wait for watermark alignment.

  2. Recommended adjustments

    Scenario 1: The business logic relies heavily on watermarks for correctness, but the data source is old

    • Typical situations:

      • Upstream data transmission has inherent latency, such as slow instrumentation reporting.

      • Historical data is being backfilled, for example, reprocessing data from the previous day.

      • The business logic must rely on watermarks to handle out-of-order events, so watermarks cannot be disabled.

    • Symptom: Monitoring alerts show high latency, but the Kafka consumer group has no message accumulation (Lag ≈ 0) and the CPU load is low.

    • Recommendations:

      1. Ignore this latency metric: A high delay in this case is normal business behavior because it reflects that the data is old. This does not indicate a system fault.

      2. Change the monitoring metric: O&M engineers should monitor the Kafka Consumer Lag (message accumulation) instead. As long as the accumulation does not continuously increase, the system's processing capability is normal and no intervention is needed.
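
      A minimal sketch of such a lag check, assuming hypothetical placeholder values for the broker address and the consumer group ID used by the Flink job. Note that Flink commits offsets back to Kafka only when a checkpoint completes, so the committed offsets can trail the job's actual read position by up to one checkpoint interval. The same number appears in the LAG column of kafka-consumer-groups.sh --describe, and as the pendingRecords metric for connectors that report it.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "broker-1:9092";          // placeholder broker address
        String groupId = "my-flink-consumer-group";  // placeholder group ID used by the job

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
            // Offsets the group last committed (for Flink, updated when checkpoints complete).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();
            // Latest offsets currently available in the subscribed topic partitions.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

            long totalLag = 0;
            for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
                totalLag += Math.max(0, endOffsets.get(e.getKey()) - e.getValue().offset());
            }
            // If this value does not keep growing, processing capacity is sufficient,
            // even when the event-time based latency metric looks high.
            System.out.println("Total consumer lag: " + totalLag + " records");
        }
    }
}
```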

    Scenario 2: Real-time performance is prioritized, and minor out-of-order events or data loss is tolerable

    • Typical situations:

      • For dashboards or real-time risk control, output is slow because data is waiting for watermarks.

      • The business cares more about 'when the data was received' than 'the timestamp within the data'.

    • Symptom: The data stream arrives in real time, but because the watermark is configured with a large tolerance window, such as allowing a 10-second delay, the data output is delayed by 10 seconds.

    • Recommendations:

      1. Remove or disable watermarks: You can switch to processing time for calculations or set the watermark waiting threshold to 0, as shown in the sketch after this list.

      2. Expected result: The latency metric drops immediately to a value close to the actual processing time, and data is processed and output as it arrives without waiting for alignment.
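
A minimal sketch of both adjustments using the Flink DataStream API, assuming a hypothetical PriceData class whose time field carries the event timestamp in milliseconds (mirroring the PriceData.time example above). In SQL jobs, the equivalent change is to remove the delay in the WATERMARK clause or to compute on a PROCTIME() column instead.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class WatermarkTuning {

    // Hypothetical payload type; its time field mirrors PriceData.time from the formula above.
    public static class PriceData {
        public long time; // event timestamp in epoch milliseconds
    }

    public static void main(String[] args) {
        // Option 1: keep event time but set the watermark waiting threshold to 0,
        // so output is no longer held back by an out-of-orderness tolerance window.
        WatermarkStrategy<PriceData> zeroDelay =
            WatermarkStrategy.<PriceData>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner((event, recordTimestamp) -> event.time);

        // Option 2: do not generate event-time watermarks at all and rely on
        // processing-time semantics (for example, processing-time windows) downstream.
        WatermarkStrategy<PriceData> noWatermarks = WatermarkStrategy.noWatermarks();

        // Either strategy is attached when the source is added, for example:
        // env.fromSource(kafkaSource, zeroDelay, "price-source");
    }
}
```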

Metric characteristics in typical scenarios

Metrics reflect only the current state of a component and are not sufficient to determine the root cause of a problem. You should always use the backpressure detection feature in the Flink UI and other supporting tools for a comprehensive diagnosis.

1. Operator backpressure

Symptom: Insufficient downstream processing capacity causes the source's sending rate to drop.

  • Detection method: Use the backpressure monitoring panel in the Flink UI.

  • Metric characteristics:

    • sourceIdleTime periodically increases.

    • currentFetchEventTimeLag and currentEmitEventTimeLag continuously increase.

    • Extreme case: If an operator is completely stuck, sourceIdleTime will continuously increase.

2. Source performance bottleneck

Symptom: The source's read speed has reached its limit but still cannot meet data processing demands.

  • Detection method: No backpressure is detected in the job.

  • Metric characteristics:

    • sourceIdleTime remains at a very low value, which indicates that the source is operating at full capacity.

    • currentFetchEventTimeLag and currentEmitEventTimeLag have similar and high values.

3. Data skew or empty partitions

Symptom: Data is unevenly distributed across upstream Kafka partitions, or empty partitions exist.

  • Detection method: Observe the metric differences between various sources.

  • Metric characteristics:

    • The sourceIdleTime of a specific source subtask is significantly higher than that of the others, which indicates that this parallel instance is idle, for example, because it is assigned an empty partition.

4. Data latency analysis

Symptom: The overall job latency is high, and you need to locate whether the bottleneck is within the source or in an external system.

  • Detection method: Analyze the combination of idle time, lag difference, and message accumulation.

  • Metric characteristics:

    • High sourceIdleTime:
      This indicates that the source is idle. It usually means that the data output rate of the external system is low, not that Flink processing is slow.

    • Lag difference analysis:
      Compare the difference between currentEmitEventTimeLag and currentFetchEventTimeLag, which is the time that data resides within the source operator (see the sketch after this list):

      • Small difference (the two metrics are close): Insufficient pull capability. The bottleneck is network I/O bandwidth or insufficient source parallelism.

      • Large difference: Insufficient processing capability. The bottleneck is inefficient data parsing or downstream backpressure.

    • pendingRecords (if supported by the connector):
      This metric directly reflects the amount of data retained externally. A higher value indicates more severe data accumulation in the external system.
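
The lag difference analysis above can be expressed as a small helper. This is an illustrative sketch only; the 1-second threshold used to decide whether the two lags are "close" is an assumption, not a value from this document.

```java
public class SourceLagDiagnosis {

    /**
     * Interprets the gap between currentEmitEventTimeLag and currentFetchEventTimeLag,
     * which approximates how long records stay inside the source operator.
     * Both inputs are in milliseconds.
     */
    static String diagnose(long fetchLagMs, long emitLagMs) {
        long residenceMs = emitLagMs - fetchLagMs;
        if (residenceMs < 1_000) { // assumed threshold for "the two metrics are close"
            return "Pull-side bottleneck: check network I/O bandwidth and source parallelism.";
        }
        return "Processing-side bottleneck (" + residenceMs + " ms spent in the source): "
            + "check data parsing efficiency and downstream backpressure.";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(58_000, 60_000)); // lags close together -> pull capability limited
        System.out.println(diagnose(5_000, 60_000));  // large gap -> processing capability limited
    }
}
```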

Overview

Metric

Definition

Details

Unit

Supported connectors

Num of Restarts

The number of times the job restarted due to errors.

The number of times a job restarts due to errors, excluding JobManager (JM) failovers. Use this to check job availability and status.

Count

Not applicable

currentEmitEventTimeLag

Business latency.

A high value indicates potential latency in data pulling or processing.

ms

  • Kafka

  • RocketMQ

  • SLS

  • DataHub

  • Postgres CDC

  • Hologres (Binlog Source)

currentFetchEventTimeLag

Transmission latency.

A high value indicates potential latency in data pulling. Check the network I/O or upstream system. By comparing this with currentEmitEventTimeLag, you can analyze the source's processing capability based on the difference (the time data stays in the source). Details are as follows:

  • If the two latency values are very close, the source's ability to pull data from the external system is insufficient due to network I/O or concurrency issues.

  • If the difference between the two latency values is large, the job's processing capability is insufficient, causing data to be retained in the source. On the details page of the target job, click the Status tab. On the BackPressure page, locate the problematic Vertex topology. Then, go to the Thread Dump page to analyze the stack and identify the bottleneck.

ms

  • Kafka

  • RocketMQ

  • SLS

  • DataHub

  • Postgres CDC

  • Hologres (Binlog Source)

numRecordsIn

Total number of input records for all operators.

If the numRecordsIn value for an operator does not increase for a long time, all upstream data might already have been consumed and no new data is arriving. Check the upstream data source.

Count

All built-in connectors are supported.

numRecordsOut

Total number of output records.

If the numRecordsOut value for an operator does not increase for a long time, there might be a logic error in the job code causing data to be dropped. Check the job code logic.

Count

All built-in connectors are supported.

numRecordsInOfSource

Input records for the source operator only.

Check the upstream data input status.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSink

Total number of output records from the sink.

Check the data output status.

Count

  • Kafka

  • SLS

  • DataHub

  • Hologres

  • HBase

  • Tablestore

  • Redis

numRecordsInPerSecond

Number of input records per second for the entire data stream.

Use this for scenarios that require monitoring the processing speed of the entire data stream. For example, you can use numRecordsInPerSecond to observe whether the processing speed of the entire data stream meets expectations and how performance changes under different input data loads.

records/s

All built-in connectors are supported.

numRecordsOutPerSecond

Number of output records per second for the entire data stream.

Measures the number of records output per second for the entire data stream. Use this for scenarios that require monitoring the output speed of the entire data stream.

For example, you can use numRecordsOutPerSecond to observe whether the output speed of the entire data stream meets expectations and how performance changes under different output data loads.

records/s

All connectors are supported.

numRecordsInOfSourcePerSecond (IN RPS)

Number of input records per second at the data source.

Measures the number of records generated per second by each data source. This is useful for understanding the generation speed of each source. For example, in a data stream, different data sources may produce different numbers of records. Use numRecordsInOfSourcePerSecond to understand the generation speed of each data source and adjust the data stream for better performance. This data is also used for monitoring and alerting.

If this value is 0, the upstream might have consumed all the data, or output is blocked because upstream data has not been consumed. Check the upstream data.

records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSinkPerSecond (OUT RPS)

Number of output records per second at the data sink.

Measures the number of records output per second by each sink. This is useful for understanding the output speed of each sink. For example, in a data stream, different sinks may output different numbers of records.

Use numRecordsOutOfSinkPerSecond to understand the output speed of each sink and adjust the data stream for better performance. This data is used for monitoring and alerting. If this value is 0, there might be a logic error in the job code that filters out all data. Check the job code logic.

records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • SLS

  • DataHub

  • Hologres

  • HBase

  • Tablestore

  • Redis

pendingRecords

Number of unread records at the source.

The number of data records in the external system that have not yet been pulled by the source.

Count

  • Kafka

  • ElasticSearch

sourceIdleTime

The duration for which the source has not processed any new data.

This metric indicates whether the source is idle. A large value indicates that the data generation rate in the external system is low.

ms

  • Kafka

  • RocketMQ

  • Postgres CDC

  • Hologres (Binlog Source)

System checkpoints

Metric

Definition

Details

Unit

Num of Checkpoints

Number of checkpoints.

Provides an overview of checkpoint status to help you set up checkpoint alerts.

Count

lastCheckpointDuration

Duration of the last checkpoint.

If a checkpoint takes too long or times out, it may be due to a large state, temporary network issues, unaligned barriers, or data backpressure.

ms

lastCheckpointSize

Size of the last checkpoint.

The actual uploaded size of the last checkpoint. This helps analyze checkpoint performance when bottlenecks occur.

Bytes

State

Note

The state latency metrics are available only after you enable them. In the advanced Flink configurations, set state.backend.latency-track.keyed-state-enabled: true. Enabling state latency metrics may affect the runtime performance of the job.

Metric

Definition

Details

Unit

Version limitations

State Clear Latency

Maximum latency of a single state clear operation.

View the performance of state cleanup.

Nanosecond (ns)

VVR 4.0.0 or later.

Value State Latency

Maximum latency of a single Value State access.

View the performance of Value State access.

ns

Aggregating State Latency

Maximum latency of a single Aggregating State access.

View the performance of Aggregating State access.

ns

Reducing State Latency

Maximum latency of a single Reducing State access.

View the performance of Reducing State access.

ns

Map State Latency

Maximum latency of a single Map State access.

View the performance of Map State access.

ns

List State Latency

Maximum latency of a single List State access.

View the performance of List State access.

ns

Sorted Map State Latency

Maximum latency of a single Sorted Map State access.

View the performance of Sorted Map State access.

ns

State Size

Size of the state data.

By observing this metric, you can:

  • Directly or proactively locate nodes that may have state bottlenecks.

  • Determine if TTL is effective.

Bytes

VVR 4.0.12 or later.

State File Size

Size of the state data file.

By observing this metric, you can:

  • View the disk space occupied by the state on the local disk and take measures in advance if the size is large.

  • Determine if insufficient local disk space is caused by excessively large state data.

Bytes

VVR 4.0.13 or later.

I/O

Metric

Definition

Details

Unit

Supported connectors

numBytesIn

Total input bytes.

View the input throughput from the upstream to observe job traffic.

Bytes

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

numBytesInPerSecond

Total input bytes per second.

View the input stream rate from the upstream to observe job traffic.

Bytes/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

numBytesOut

Total output bytes.

View the output throughput to observe job traffic.

Bytes

  • Kafka

  • RocketMQ

  • DataHub

  • HBase

numBytesOutPerSecond

Total output bytes per second.

View the output throughput rate to observe job traffic.

Bytes/s

  • Kafka

  • RocketMQ

  • DataHub

  • HBase

Task numRecords I/O

The total number of records that each subtask receives and outputs.

Use this metric to determine if the job has an I/O bottleneck.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

  • HBase

  • Tablestore

  • Redis

Task numRecords I/O PerSecond

The total number of records received and sent by each subtask per second.

Determine if the job has an I/O bottleneck and assess its severity based on the rate.

records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • SLS

  • DataHub

  • ElasticSearch

  • Hologres

  • HBase

  • Tablestore

  • Redis

currentSendTime

Time taken for each subtask to send the last record to the downstream system.

A large value for this metric indicates that the subtask writes to the downstream system slowly.

ms

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • RocketMQ

  • SLS

  • DataHub

  • Hologres

    Note

    Supported in JDBC and RPC modes. Not supported in BHClient mode.

  • HBase

  • Tablestore

  • Redis

Watermark

Metric

Definition

Details

Unit

Supported connectors

Task InputWatermark

Time when each task received the latest watermark.

Indicates the data receiving latency at the TaskManager (TM).

None

Not applicable to connectors

watermarkLag

Watermark latency.

Determine the job latency at the subtask level.

ms

  • Kafka

  • RocketMQ

  • SLS

  • DataHub

  • Hologres (Binlog Source)

CPU

Metric

Definition

Details

Unit

JM CPU Usage

CPU usage of a single JM.

This value reflects Flink's use of CPU time slices. 100% means one CPU core is fully used. 400% means four cores are fully used. If this value is consistently above 100%, the CPU is busy. If the load is high but CPU usage is low, it may be due to too many processes in an uninterruptible sleep state caused by frequent I/O operations.

Note

This metric is supported only in VVR 6.0.6 or later.

None

TM CPU Usage

CPU usage of a single TM.

This value reflects Flink's use of CPU time slices. 100% means one CPU core is fully used. 400% means four cores are fully used. If this value is consistently above 100%, the CPU is busy. If the load is high but CPU usage is low, it may be due to too many processes in an uninterruptible sleep state caused by frequent I/O operations.

None

Memory

Metric

Definition

Details

Unit

JM Heap Memory

Heap memory of the JM.

View changes in the JM's heap memory.

Bytes

JM NonHeap Memory

Non-heap memory of the JM.

View changes in the JM's non-heap memory.

Bytes

TM Heap Memory

Heap memory of the TM.

View changes in the TM's heap memory.

Bytes

TM NonHeap Memory

Non-heap memory of the TM.

View changes in the TM's non-heap memory.

Bytes

TM Mem (RSS)

Resident set size (RSS) of the entire TM process, as reported by Linux.

View changes in the process memory.

Bytes

JVM

Metric

Definition

Details

Unit

JM Threads

Number of JM threads.

Too many JM threads can consume excessive memory and reduce job stability.

Count

TM Threads

Number of TM threads.

Too many TM threads can consume excessive memory and reduce job stability.

Count

JM GC Count

Number of JM GC events.

Too many GC events can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot job-level failures.

Count

JM GC Time

Duration of each JM GC event.

Long GC times can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot job-level failures.

ms

TM GC Count

Number of TM GC events.

Too many GC events can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot task-level failures.

Count

TM GC Time

Duration of each TM GC event.

Long GC times can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot task-level failures.

ms

JM ClassLoader

Total number of classes loaded or unloaded by the JM's JVM since its creation.

If the total number of loaded or unloaded classes by the JM's JVM is too large, it can consume excessive memory and affect job performance.

None

TM ClassLoader

Total number of classes loaded or unloaded by the TM's JVM since its creation.

If the total number of classes loaded or unloaded by the TM's JVM is too large, it can consume excessive memory and affect job performance.

None

Connector - MySQL

Metric

Definition

Unit

Application Scenario

Version Limitations

isSnapshotting

Indicates whether the job is in the full data processing phase (1 means it is).

None

Determine the job processing phase.

VVR 8.0.9 or later.

isBinlogReading

Indicates whether the job is in the incremental data processing phase (1 means it is).

None

Determine the job processing phase.

Num of remaining tables

Number of tables waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed tables.

Num of snapshotted tables

Number of tables already processed in the full data phase.

Count

View the number of processed tables.

Num of remaining SnapshotSplits

Number of shards waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed shards.

Num of processed SnapshotSplits

Number of shards already processed in the full data phase.

Count

View the number of processed shards.

currentFetchEventTimeLag

Latency between data generation and being read from the database.

ms

View the latency of reading binary logs from the database.

currentReadTimestampMs

Timestamp of the latest data record read.

ms

View the time of the latest data read.

numRecordsIn

Number of records already read.

Count

View the total volume of processed data.

numSnapshotRecords

Number of records processed in the full data phase.

Count

View the data volume processed in the full data phase.

numRecordsInPerTable

Number of records read from each table.

Count

View the total data volume processed for each table.

numSnapshotRecordsPerTable

Number of records processed for each table in the full data phase.

Count

View the data volume processed for each table in the full data phase.

Connector - Kafka

Metric

Definition

Unit

Application Scenario

Limitations

commitsSucceeded

Number of successful offset commits.

Count

Verify that the offset commit was successful.

VVR 8.0.9 or later.

commitsFailed

Number of failed offset commits.

Count

Verify that the offset commit is successful.

Fetch Rate

Data pull frequency.

times/s

Determine data pull latency and speed.

Fetch Latency Avg

Average latency of data pull operations.

ms

Determine data pull latency and speed.

Fetch Size Avg

Average bytes per pull.

Bytes

Determine data pull latency and speed.

Avg Records In Per-Request

Average number of messages per pull.

Count

Determine data pull latency and speed.

currentSendTime

Time the last record was sent.

None

Determine consumption progress.

batchSizeAvg

Average bytes per batch.

Bytes

Determine data write latency and speed.

requestLatencyAvg

Average request latency.

ms

Determine data write latency and speed.

requestsInFlight

Number of in-flight requests.

None

Determine data write latency and speed.

recordsPerRequestAvg

Average number of messages per request.

Count

Determine data write latency and speed.

recordSizeAvg

Average message size.

Bytes

Determine data write latency and speed.

Connector - Paimon

Metric

Definition

Unit

Application Scenario

Limitations

Number of Writers

Number of writer instances.

Count

Indicates how many buckets are currently being written to. A large number can affect write efficiency and increase memory consumption. Analyze whether the number of buckets or partition key settings are reasonable.

VVR 8.0.9 or later.

Max Compaction Thread Busy

Maximum busy level of small file compaction threads.

Ratio

Among the buckets currently being written to, this is the maximum percentage of time that compaction threads were active in the last minute. It reflects the pressure on small file compaction.

Average Compaction Thread Busy

Average busy level of small file compaction threads.

Ratio

Among the buckets currently being written to, this is the average percentage of time that compaction threads were active in the last minute. It reflects the pressure on small file compaction.

Max Number of Level 0 Files

Maximum number of Level 0 files.

Count

The maximum number of Level 0 files (small files) among the buckets currently being written to. This is only meaningful for primary key tables and reflects whether the compaction efficiency can keep up with the write efficiency.

Average Number of Level 0 Files

Average number of Level 0 files.

Count

The average number of Level 0 files (small files) among the buckets currently being written to. This is only meaningful for primary key tables and reflects whether the compaction efficiency can keep up with the write efficiency.

Last Commit Duration

Duration of the last commit.

ms

If the duration is too long, check if too many buckets are being written to simultaneously.

Number of Partitions Last Committed

Number of partitions written to in the last commit.

Count

A large number can affect write efficiency and increase memory consumption. Analyze whether the number of buckets or partition key settings are reasonable.

Number of Buckets Last Committed

Number of buckets written to in the last commit.

Count

A large number can affect write efficiency and increase memory consumption. Analyze whether the number of buckets or partition key settings are reasonable.

Used Write Buffer

Used write buffer memory size.

Bytes

The used buffer size for writer nodes across all task managers. This buffer occupies Java heap memory. If set too large, it may cause an out-of-memory (OOM) error.

Total Write Buffer

Total allocated write buffer memory size.

Bytes

The configured buffer size for writer nodes across all task managers. This buffer occupies Java heap memory. If set too large, it may cause an OOM error.

Data ingestion

Metric

Definition

Unit

Application Scenario

Version limitations

isSnapshotting

Indicates whether the job is in the full data processing phase (1 means it is).

None

Determine the job processing phase.

VVR 8.0.9 or later.

isBinlogReading

Indicates whether the job is in the incremental data processing phase (1 means it is).

None

Determine the job processing phase.

Num of remaining tables

Number of tables waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed tables.

Num of snapshotted tables

Number of tables already processed in the full data phase.

Count

View the number of processed tables.

Num of remaining SnapshotSplits

Number of shards waiting to be processed in the full data phase.

Count

View the number of remaining unprocessed shards.

Num of processed SnapshotSplits

Number of shards already processed in the full data phase.

Count

View the number of processed shards.

currentFetchEventTimeLag

Latency between data generation and being read from the database.

ms

View the latency of reading binary logs from the database.

currentReadTimestampMs

Timestamp of the latest data record read.

ms

View the time of the latest data read.

numRecordsIn

Number of records already read.

Count

View the total volume of processed data.

numRecordsInPerTable

Number of records read from each table.

Count

View the total data volume processed for each table.

numSnapshotRecords

Number of records processed in the full data phase.

Count

View the data volume processed in the full data phase.

numSnapshotRecordsPerTable

Number of records processed for each table in the full data phase.

Count

View the data volume processed for each table in the full data phase.