Realtime Compute for Apache Flink: Metrics

Last Updated: Oct 17, 2024

This topic describes the metrics that are supported by fully managed Flink.

Usage notes

  • Metrics that are reported by a source reflect only the current state of the source and cannot, on their own, identify the root cause of an issue. You need auxiliary metrics or tools for root cause analysis. The following list describes how the metrics behave in specific scenarios.

    • An operator in a deployment has backpressure: The backpressure detection feature provided by the Flink UI, rather than metrics, is the most direct way to detect backpressure. If backpressure exists, the rate at which the source sends data to downstream operators decreases. In this case, the value of the sourceIdleTime metric may increase periodically, and the values of the currentFetchEventTimeLag and currentEmitEventTimeLag metrics may increase continuously. In extreme cases, such as when an operator is stuck, the value of the sourceIdleTime metric increases continuously.

    • The source has a performance bottleneck: If only the throughput of the source is insufficient, no backpressure is detected in the deployment. The sourceIdleTime metric remains small because the source keeps running, while the currentFetchEventTimeLag and currentEmitEventTimeLag metrics are both large and close to each other.

    • Data skew occurs upstream, or a partition is empty: One or more sources are idle, so the value of the sourceIdleTime metric for those sources is large.

  • If the latency of a deployment is high, you can use the following metrics to analyze the data processing capability of fully managed Flink and the amount of data that is retained in the external system (see the illustrative sketch after this list).

    • sourceIdleTime: Indicates whether the source is idle. If the value of this metric is large, your data is generated in the external system at a low rate.

    • currentFetchEventTimeLag and currentEmitEventTimeLag: Indicate the latency with which fully managed Flink processes data. You can analyze the data processing capability of a source based on the difference between the two values, which indicates how long data is retained inside the source.

      • If the difference is small, the source does not pull data from the external system efficiently, typically because of network I/O issues or insufficient parallelism.

      • If the difference is large, the source does not process data efficiently, typically because of data parsing, insufficient parallelism, or backpressure.

    • pendingRecords: Indicates the amount of data that is retained in the external system.
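The following sketch illustrates this reasoning in Python. It is illustrative only: the metric values are assumed to have been read from the monitoring page or a metrics reporter, and the function name and the threshold are invented for the example.

    # Illustrative helper: interpret the gap between the two lag metrics.
    # fetch_lag_ms = currentFetchEventTimeLag (milliseconds)
    # emit_lag_ms  = currentEmitEventTimeLag (milliseconds)
    # small_gap_ms = arbitrary example threshold, not a recommended value
    def diagnose_source_latency(fetch_lag_ms: float,
                                emit_lag_ms: float,
                                small_gap_ms: float = 1000.0) -> str:
        retained_in_source_ms = emit_lag_ms - fetch_lag_ms
        if retained_in_source_ms <= small_gap_ms:
            # Data leaves the source almost as soon as it is fetched, so the
            # bottleneck is pulling data: check network I/O and source parallelism.
            return "slow fetch from the external system (network I/O or parallelism)"
        # Data stays inside the source before it is emitted, so the bottleneck is
        # processing: check data parsing, parallelism, and backpressure.
        return "slow processing inside the source (parsing, parallelism, or backpressure)"

    # Example: fetch lag 55,000 ms and emit lag 60,000 ms point to slow processing.
    print(diagnose_source_latency(55_000, 60_000))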

Overview

Metric

Definition

Description

Unit

Supported connector

Num of Restarts

The number of times that a deployment is restarted when a deployment failover occurs.

This metric indicates the number of times that a deployment is restarted when a deployment failover occurs, excluding restarts that are caused by JobManager failovers. This metric is used to check the availability and status of the deployment.

Count

N/A

currentEmitEventTimeLag

The processing latency.

If the value of this metric is large, latency may occur in the deployment when the system pulls or processes data.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Postgres Change Data Capture (CDC)

  • Hologres (Hologres binlog source table)

currentFetchEventTimeLag

The transmission latency.

If the value of this metric is large, latency may occur in the deployment when the system pulls data. In this case, check the network I/O and the source. You can analyze the data processing capability of a source based on the difference between the values of this metric and the currentEmitEventTimeLag metric. The difference indicates the duration for which data is retained in the source:

  • If the difference between the values of the two metrics is small, the source does not pull data from the external system efficiently, typically because of network I/O issues or insufficient parallelism.

  • If the difference between the values of the two metrics is large, the processing capability of the deployment is insufficient, which causes data to be retained in the source. To locate the bottleneck, go to the Deployments page, find the deployment that you want to manage, and click its name. In the deployment details panel, click the Status tab and then click the value in the Name column. On the page that appears, click the BackPressure tab to locate the vertex topology that causes the issue. Then, click Dump in the Thread Dump column to open the Thread Dump tab and analyze the stack that has a performance bottleneck.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Postgres CDC

  • Hologres (Hologres binlog source table)

numRecordsIn

The total number of input data records of all operators.

If the value of this metric does not increase for a long period of time for an operator, data may be missing from the source. In this case, you must check the data of the source.

Count

All built-in connectors

numRecordsOut

The total number of output data records.

If the value of this metric does not increase for a long period of time for an operator, an error may occur in the code logic of the deployment and data is missing. In this case, you must check the code logic of the deployment.

Count

All built-in connectors

numRecordsInOfSource

The total number of data records that flow into the source operator.

This metric is used to check the number of data records that flow into the source.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSink

The total number of output data records in the sink.

This metric is used to check the number of data records that are exported by the sink.

Count

  • Kafka

  • Simple Log Service

  • DataHub

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

numRecordsInPerSecond

The number of input data records per second for all data streams.

This metric is used to monitor the overall data stream processing speed. For example, you can use the value of the numRecordsInPerSecond metric to check whether the overall data stream processing speed meets the expected requirements and how the deployment performance changes under different input data loads.

Data records/s

All built-in connectors

numRecordsOutPerSecond

The number of output data records per second for all data streams.

This metric is used to monitor the overall data stream output speed.

For example, you can use the value of the numRecordsOutPerSecond metric to check whether the overall data stream output speed meets the expected requirements and how the deployment performance changes under different output data loads.

Data records/s

All built-in connectors

numRecordsInOfSourcePerSecond (IN RPS)

The number of input data records per second in a source.

This metric is used to monitor the number of input data records per second in a source and monitor the speed at which data records are generated in the source. For example, the number of data records that can be generated varies based on the type of the data source. You can use the value of the numRecordsInOfSourcePerSecond metric to check the speed at which data records are generated in a source and adjust data streams to improve performance. This metric is also used for monitoring and alerting.

If the value of this metric is 0, data may be missing from the source. In this case, you must check whether data output is blocked because the data of the source is not consumed.

Data records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSinkPerSecond (OUT RPS)

The number of output data records per second in a sink.

This metric is used to monitor the number of output data records per second in a sink and monitor the speed at which data records are exported from the sink. For example, the number of data records that can be exported varies based on the type of the sink.

You can use the value of the numRecordsOutOfSinkPerSecond metric to check the speed at which data records are exported from a sink and adjust data streams to improve performance. This metric is also used for monitoring and alerting. If the value of this metric is 0, all data may have been filtered out due to a defect in the code logic of the deployment. In this case, you must check the code logic of the deployment.

Data records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

pendingRecords

The number of data records that are not read by the source.

This metric is used to check the number of data records that are not pulled by the source from the external system.

Count

  • Kafka

  • ElasticSearch

sourceIdleTime

The duration for which data is not processed in the source.

This metric indicates whether the source is idle. If the value of this metric is large, your data is generated at a low rate in the external system.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Postgres CDC

  • Hologres (Hologres binlog source table)

Checkpoint

Metric

Definition

Description

Unit

Num of Checkpoints

The number of checkpoints.

This metric is used to obtain an overview of checkpoints and to configure alerts for checkpoints.

Count

lastCheckpointDuration

The duration of the last checkpoint.

If the last checkpoint takes a long time to complete or times out, possible causes include the following: the storage space occupied by state data is excessively large, a transient network error occurred, barriers are not aligned, or backpressure exists.

Milliseconds

lastCheckpointSize

The size of the last checkpoint.

This metric shows the size of the last checkpoint that is uploaded. You can use this metric to analyze checkpoint performance when a checkpoint bottleneck occurs.

Bytes

State

Note

The state access latency metrics are disabled by default. To enable them, set the state.backend.latency-track.keyed-state-enabled parameter to true in the Additional Configuration section of the Advanced tab on the Draft Editor page in the console of fully managed Flink. After these metrics are enabled, deployment performance may be affected while the deployment is running.
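For reference, the entry added to the Additional Configuration section is a single key-value pair. The parameter name and value are taken from the note above; the plain key: value layout shown here assumes the section accepts standard Flink configuration lines.

    state.backend.latency-track.keyed-state-enabled: true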

Metric

Definition

Description

Unit

Supported version

State Clear Latency

The maximum latency of a single state data cleanup operation.

You can view the performance of state data cleanup.

Nanoseconds

Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.0 or later supports these metrics.

Value State Latency

The maximum latency of a single ValueState access.

You can view the ValueState access performance.

Nanoseconds

Aggregating State Latency

The maximum latency of a single AggregatingState access.

You can view the AggregatingState access performance.

Nanoseconds

Reducing State Latency

The maximum latency of a single ReducingState access.

You can view the ReducingState access performance.

Nanoseconds

Map State Latency

The maximum latency of a single MapState access.

You can view the MapState access performance.

Nanoseconds

List State Latency

The maximum latency of a single ListState access.

You can view the ListState access performance.

Nanoseconds

Sorted Map State Latency

The maximum latency of a single SortedMapState access.

You can view the SortedMapState access performance.

Nanoseconds

State Size

The size of the state data.

This metric helps you perform the following operations:

  • Identify the nodes in which state data bottlenecks occur, or anticipate such nodes in advance.

  • Check whether the time to live (TTL) of state data takes effect.

Bytes

Only Realtime Compute for Apache Flink that uses VVR 4.0.12 or later supports this metric.

State File Size

The size of the state data file.

This metric helps you perform the following operations:

  • Check the size of the state data file in the local disk. You can take actions in advance if the size is large.

  • Determine whether the state data is excessively large if the local disk space is insufficient.

Bytes

Only Realtime Compute for Apache Flink that uses VVR 4.0.13 or later supports this metric.

I/O

Metric

Definition

Description

Unit

Supported connector

numBytesIn

The total number of input bytes.

This metric is used to check the size of the input data records of the source. This can help you observe the deployment throughput.

Bytes

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

numBytesInPerSecond

The total number of input bytes per second.

This metric is used to check the rate at which data flows into the source. This can help you observe the deployment throughput.

Bytes/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

numBytesOut

The total number of output bytes.

This metric is used to check the size of the output data records. This can help you observe the deployment throughput.

Bytes

  • Kafka

  • ApsaraMQ for RocketMQ

  • DataHub

  • ApsaraDB for HBase

numBytesOutPerSecond

The total number of output bytes per second.

This metric is used to check the rate at which data is exported. This can help you observe the deployment throughput.

Bytes/s

  • Kafka

  • ApsaraMQ for RocketMQ

  • DataHub

  • ApsaraDB for HBase

Task numRecords I/O

The total number of data records that flow into each subtask and data records that are exported by each subtask.

This metric is used to check whether I/O bottlenecks exist in the deployment.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

Task numRecords I/O PerSecond

The total number of data records that flow into each subtask and data records that are exported by each subtask per second.

This metric is used to check whether I/O bottlenecks exist in the deployment and determine the severity of the I/O bottlenecks based on the input and output rate of each subtask.

Data records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

currentSendTime

The duration that is required by each subtask to export the last data record to the sink.

If the value of this metric is large, data records are exported by each subtask to the sink at an excessively slow rate.

Milliseconds

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Hologres

    Note

    When you write data to Hologres in remote procedure call (RPC) mode or by using a Java Database Connectivity (JDBC) driver, the Hologres connector supports this metric. When you write data to Hologres in BHClient mode, the Hologres connector does not support this metric.

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

Watermark

Metric

Definition

Description

Unit

Supported connector

Task InputWatermark

The time when each task receives the latest watermark.

This metric is used to check the latency of data receiving by TaskManagers.

N/A

N/A

watermarkLag

The latency of watermarks.

This metric is used to determine the latency of subtasks.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Hologres (Hologres binlog source table)

CPU

Metric

Definition

Description

Unit

JM CPU Usage

The CPU utilization of a JobManager.

This metric indicates the utilization of the CPU time slices that are occupied by fully managed Flink. A value of 100% means that one CPU core is used, and a value of 400% means that four CPU cores are used. If the value of this metric is greater than 100% for a long period of time, the CPU of the JobManager is busy. If the CPU load is high but the CPU utilization is low, a large number of processes may be in the uninterruptible sleep state due to frequent read and write operations.

Note

Only Realtime Compute for Apache Flink that uses VVR 6.0.6 or later supports this metric.

N/A

TM CPU Usage

The CPU utilization of a TaskManager.

This metric indicates the utilization of the CPU time slices that are occupied by fully managed Flink. A value of 100% means that one CPU core is used, and a value of 400% means that four CPU cores are used. If the value of this metric is greater than 100% for a long period of time, the CPU of the TaskManager is busy. If the CPU load is high but the CPU utilization is low, a large number of processes may be in the uninterruptible sleep state due to frequent read and write operations.

N/A

Memory

Metric

Definition

Description

Unit

JM Heap Memory

The heap memory of a JobManager.

This metric is used to check the change in the heap memory of the JobManager.

Bytes

JM NonHeap Memory

The non-heap memory of a JobManager.

This metric is used to check the change in the non-heap memory of the JobManager.

Bytes

TM Heap Memory

The heap memory of a TaskManager.

This metric is used to check the change in the heap memory of the TaskManager.

Bytes

TM nonHeap Memory

The non-heap memory of a TaskManager.

This metric is used to check the change in the non-heap memory of the TaskManager.

Bytes

TM Mem (RSS)

The resident set size (RSS) of the entire TaskManager process, as reported by Linux.

This metric is used to check changes in the memory usage of the process.

Bytes

JVM

Metric

Definition

Description

Unit

JM Threads

The number of threads of a JobManager.

A large number of threads of the JobManager occupy excessive memory space. This reduces the deployment stability.

Count

TM Threads

The number of threads of a TaskManager.

A large number of threads of the TaskManager occupy excessive memory space. This reduces the deployment stability.

Count

JM GC Count

The number of times that garbage collection (GC) of a JobManager occurs.

If GC of the JobManager occurs a large number of times, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Count

JM GC Time

The duration of each GC of a JobManager.

If each GC lasts for a long period of time, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Milliseconds

TM GC Count

The number of times that GC of a TaskManager occurs.

If GC of the TaskManager occurs a large number of times, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Count

TM GC Time

The duration of each GC of a TaskManager.

If each GC lasts for a long period of time, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Milliseconds

JM ClassLoader

The total number of classes that are loaded or unloaded after the Java Virtual Machine (JVM) where a JobManager resides is created.

After the JVM where the JobManager resides is created, if the total number of the classes that are loaded or unloaded is excessively large, excessive memory space is occupied. This affects the deployment performance.

N/A

TM ClassLoader

The total number of classes that are loaded or unloaded after the JVM where a TaskManager resides is created.

After the JVM where the TaskManager resides is created, if the total number of the classes that are loaded or unloaded is excessively large, excessive memory space is occupied. This affects the deployment performance.

N/A

MySQL connector

Metric

Definition

Unit

Scenario

Supported version

isSnapshotting

Indicates whether the job is in the snapshot phase. A value of 1 indicates that the job is in the snapshot phase.

N/A

This metric indicates the job processing phase.

Only Realtime Compute for Apache Flink that uses VVR 8.0.9 or later supports these metrics.

isBinlogReading

Indicates whether the job is in the incremental phase. A value of 1 indicates that the job is in the incremental phase.

N/A

This metric indicates the job processing phase.

Num of remaining tables

The number of tables that are waiting to be processed in the snapshot phase.

Count

This metric measures the number of unprocessed tables.

Num of snapshotted tables

The number of processed tables in the snapshot phase.

Count

This metric measures the number of processed tables.

Num of remaining SnapshotSplits

The number of shards that are waiting to be processed in the snapshot phase.

Count

This metric measures the number of unprocessed shards.

Num of processed SnapshotSplits

The number of processed shards in the snapshot phase.

Count

This metric measures the number of processed shards.

currentFetchEventTimeLag

The difference between the time when data is generated and the time when data is read from the database.

Milliseconds

This metric indicates the latency of reading binary logs from the database.

currentReadTimestampMs

The timestamp of the latest data record that is read.

Milliseconds

This metric indicates the timestamp of the latest data record that is read.

numRecordsIn

The number of data records that are read.

Count

This metric measures the total number of processed data records.

numSnapshotRecords

The number of processed data records in the snapshot phase.

Count

This metric measures the number of processed data records in the snapshot phase.

numRecordsInPerTable

The number of data records that are read from each table.

Count

This metric measures the total number of processed data records in each table.

numSnapshotRecordsPerTable

The number of processed data records in each table in the snapshot phase.

Count

This metric measures the number of processed data records in each table in the snapshot phase.

Kafka connector

Metric

Definition

Unit

Scenario

Supported version

commitsSucceeded

The number of times that the offset is committed.

Count

This metric is used to determine whether the offset is committed.

Only Realtime Compute for Apache Flink that uses VVR 8.0.9 or later supports these metrics.

commitsFailed

The number of times that the offset fails to be committed.

Count

This metric is used to determine whether the offset is committed.

Fetch Rate

The frequency at which data is pulled.

Count/s

This metric is used to determine the latency that occurs when data is pulled and the rate at which data is pulled.

Fetch Latency Avg

The latency that occurs when data is pulled.

Milliseconds

This metric is used to determine the latency that occurs when data is pulled and the rate at which data is pulled.

Fetch Size Avg

The average number of bytes that are pulled each time.

Bytes

This metric is used to determine the latency that occurs when data is pulled and the rate at which data is pulled.

Avg Records In Per-Request

The average number of messages that are pulled each time.

Count

This metric is used to determine the latency that occurs when data is pulled and the rate at which data is pulled.

currentSendTime

The time when the last data record was sent.

N/A

This metric is used to determine the consumption progress.

batchSizeAvg

The average number of bytes that are written in each batch.

Bytes

This metric is used to determine the latency that occurs when data is written and the rate at which data is written.

requestLatencyAvg

The average latency of requests.

Milliseconds

This metric is used to determine the latency that occurs when data is written and the rate at which data is written.

requestsInFlight

The number of requests in progress.

N/A

This metric is used to determine the latency that occurs when data is written and the rate at which data is written.

recordsPerRequestAvg

The average number of messages that are processed per request.

Count

This metric is used to determine the latency that occurs when data is written and the rate at which data is written.

recordSizeAvg

The average number of bytes in a message.

Bytes

This metric is used to determine the latency that occurs when data is written and the rate at which data is written.

Paimon connector

Metric

Definition

Unit

Scenario

Supported version

Number of Writers

The number of Writers.

Count

This metric indicates the number of buckets to which data is being written. If the value of this metric is large, the write performance may be affected and more memory may be consumed. This metric is used to determine whether the number of buckets or the partition key settings are reasonable.

Only Realtime Compute for Apache Flink that uses VVR 8.0.9 or later supports these metrics.

Max Compaction Thread Busy

The ratio of the maximum time consumed by a thread to merge small files to the total time consumed to write data into a bucket.

Percentage

This metric indicates the maximum ratio of time spent merging small files within one minute and reflects how busy the threads that merge small files are.

Average Compaction Thread Busy

The ratio of the total time consumed by the threads to merge small files to the total time consumed to write data into buckets.

Percentage

This metric indicates the average ratio of time spent merging small files within one minute and reflects how busy the threads that merge small files are.

Max Number of Level 0 Files

The maximum number of Level 0 files.

Count

The maximum number of Level 0 files in the buckets to which data is being written. This metric works only for primary key tables and is used to determine whether the efficiency of merging small files can keep up with the writing efficiency.

Average Number of Level 0 Files

The average number of Level 0 files.

Count

The average number of Level 0 files in the buckets to which data is being written. This metric works only for primary key tables and is used to determine whether the efficiency of merging small files can keep up with the writing efficiency.

Last Commit Duration

The time consumed by the last commit.

Milliseconds

If the value of this metric is large, check whether data is being written to excessive buckets at the same time.

Number of Partitions Last Committed

The number of partitions to which data is written in the last commit.

Count

If the value of this metric is large, the write performance may be affected and more memory may be consumed. This metric is used to determine whether the number of buckets or the partition key settings are reasonable.

Number of Buckets Last Committed

The number of buckets to which data is written in the last commit.

Count

If the value of this metric is large, the write performance may be affected and more memory may be consumed. This metric is used to determine whether the number of buckets or the partition key settings are reasonable.

Used Write Buffer

The buffer size that is used by the writer nodes.

Bytes

The buffer size that is used by the writer nodes of all TaskManagers. This buffer occupies the Java heap memory. If the value of this metric is large, an out of memory (OOM) error may occur.

Total Write Buffer

The total buffer size that is allocated to the writer nodes.

Bytes

The buffer size that is allocated to the writer nodes of all TaskManagers. This buffer occupies the Java heap memory. If the value of this metric is large, an OOM error may occur.

Data ingestion

Metric

Definition

Unit

Scenario

Supported version

isSnapshotting

Indicates whether the job is in the snapshot phase. A value of 1 indicates that the job is in the snapshot phase.

N/A

This metric indicates the job processing phase.

Only Realtime Compute for Apache Flink that uses VVR 8.0.9 or later supports these metrics.

isBinlogReading

Indicates whether the job is in the incremental phase. A value of 1 indicates that the job is in the incremental phase.

N/A

This metric indicates the job processing phase.

Num of remaining tables

The number of tables that are waiting to be processed in the snapshot phase.

Count

This metric measures the number of unprocessed tables.

Num of snapshotted tables

The number of processed tables in the snapshot phase.

Count

This metric measures the number of processed tables.

Num of remaining SnapshotSplits

The number of shards that are waiting to be processed in the snapshot phase.

Count

This metric measures the number of unprocessed shards.

Num of processed SnapshotSplits

The number of processed shards in the snapshot phase.

Count

This metric measures the number of processed shards.

currentFetchEventTimeLag

The difference between the time when data is generated and the time when data is read from the database.

Milliseconds

This metric indicates the latency of reading binary logs from the database.

currentReadTimestampMs

The timestamp of the latest data record that is read.

Milliseconds

This metric indicates the timestamp of the latest data record that is read.

numRecordsIn

The number of data records that are read.

Count

This metric measures the total number of processed data records.

numRecordsInPerTable

The number of data records that are read from each table.

Count

This metric measures the total number of processed data records in each table.

numSnapshotRecords

The number of processed data records in the snapshot phase.

Count

This metric measures the number of processed data records in the snapshot phase.

numSnapshotRecordsPerTable

The number of processed data records in each table in the snapshot phase.

Count

This metric measures the number of processed data records in each table in the snapshot phase.