This topic describes the Flink metrics that Managed Service for Prometheus supports.
Managed Service for Prometheus charges fees based on the volume of data written or the number of data points reported. Metrics are classified into the following two types:
Basic metrics: Reporting and writing basic metrics from Alibaba Cloud Realtime Compute for Apache Flink to Managed Service for Prometheus is free of charge. This benefit does not apply to other Flink services, such as self-managed ones.
Custom metrics: All other metrics are custom metrics. You are charged for custom metrics starting from January 6, 2020.
Metric descriptions
Metric | Meaning | Details | Unit | Metric type |
| Number of job restarts due to errors | The number of times a job is restarted due to an error. This does not include restarts from JobManager failovers. | Count | Custom metric |
| Business latency | A high latency indicates that a delay may occur when the job pulls or processes data. | ms | Custom metric |
| Transmission latency | A high latency indicates that a delay may occur when the job pulls data. | ms | Custom metric |
| Total number of records that all operators receive | If the value of this metric does not increase for a long time for an operator, the upstream source may have dropped data, which prevents successful data transmission. In this case, check the upstream data. | Count | Custom metric |
| Total number of output records | If the value of this metric does not increase for a long time for an operator, a logic error in the job code may have caused data to be dropped. This prevents successful data transmission. In this case, check the job code logic. | Count | Custom metric |
| Total number of input bytes | View the input throughput from the upstream source to observe job traffic. | Bytes | Custom metric |
| Total number of output bytes | View the output throughput to the downstream sink to observe job traffic. | Bytes | Custom metric |
| Number of records that the entire data stream receives per second | Use this metric to monitor the processing speed of the entire data stream. For example, you can use this metric to check whether the processing speed of the entire data stream meets expectations and how the performance changes under different input data loads. | Count/s | Custom metric |
| Number of records that the entire data stream outputs per second | Use this metric to monitor the output speed of the entire data stream. This metric measures the number of records that the entire data stream outputs per second. For example, you can use this metric to check whether the output speed of the entire data stream meets expectations and how the performance changes under different output data loads. | Count/s | Custom metric |
| Input records for the Source operator only | View the upstream data input. | Count | Custom metric |
| Total number of output records at the sink | View the downstream data output. | Count | Custom metric |
| Number of records that the data source receives per second | Use this metric to understand the generation speed of each data source. This metric measures the number of records that each data source generates per second. For example, in a data stream, different data sources may produce different numbers of records. The value of this metric helps you understand the generation speed of each data source and adjust the data stream for better performance. This data is also used for alert monitoring. If the value is 0, the upstream source may have dropped data. Check whether the output is blocked because the upstream data has not been consumed. | Count/s | Custom metric |
| Number of records that the data sink outputs per second | Use this metric to understand the output speed of each sink. This metric measures the number of records that each sink outputs per second. For example, in a data stream, different sinks may output different numbers of records. The value of this metric helps you understand the output speed of each sink and adjust the data stream for better performance. This data is also used for alert monitoring. If the value is 0, a logic error in the job code may have caused all data to be filtered out. In this case, check the job code logic. | Count/s | Custom metric |
| Number of data buffers consumed locally per second | A high value for this metric indicates frequent local communication between tasks, which means communication on the same node. | Count/s | Custom metric |
| Number of buffers received from remote TaskManagers per second | This metric reflects the frequency of communication across TaskManagers. | Count/s | Custom metric |
| Number of buffers sent to other tasks per second | This metric helps you understand the output pressure of tasks and the network bandwidth usage. | Count/s | Custom metric |
| Total number of input bytes per second (Local) | View the input stream rate from the upstream source to observe job traffic. | Bytes/s | Custom metric |
| Total number of output bytes per second | View the output throughput to the downstream sink to observe job traffic. | Bytes/s | Custom metric |
| Number of data records not read by the source | The number of data records in the external system that have not yet been pulled by the Source operator. | Count | Custom metric |
| Time the source has not processed data | This metric reflects whether the Source operator is idle. A large value indicates a low data generation rate in the external system. | ms | Custom metric |
| Total number of input bytes per second | None | Bytes/s | Custom metric |
| Total number of output bytes per second | None | Bytes/s | Custom metric |
| Time taken to send the latest record | None | ms | Custom metric |
| Total number of checkpoints | None | Count | Custom metric |
| Number of failed checkpoints | None | Count | Custom metric |
| Number of completed checkpoints | None | Count | Custom metric |
| Number of checkpoints in progress | None | Count | Custom metric |
| Duration of the last checkpoint | If a checkpoint takes too long or times out, the state may be too large, a temporary network issue may have occurred, barriers may not be aligned, or data backpressure may exist. | ms | Custom metric |
| Size of the last checkpoint | The size of the last checkpoint that was actually uploaded. This helps you analyze checkpoint performance when a bottleneck occurs. | Bytes | Custom metric |
| Maximum latency of a single state clear operation | View the performance of clearing state. | ns | Custom metric |
| Maximum latency of a single Value State access | View the performance of an operator accessing Value State. | ns | Custom metric |
| Maximum latency of a single Value State update | View the performance of a Value State update. | ns | Custom metric |
| Maximum latency of a single Aggregating State access | View the performance of an operator accessing Aggregating State. | ns | Custom metric |
| Maximum latency of a single Aggregating State add operation | View the performance of an Aggregating State add operation. | ns | Custom metric |
| Maximum latency of a single Aggregating State merge namespace operation | View the performance of an Aggregating State merge namespace operation. | ns | Custom metric |
| Maximum latency of a single Reducing State access | View the performance of an operator accessing Reducing State. | ns | Custom metric |
| Maximum latency of a single Reducing State add operation | View the performance of a Reducing State add operation. | ns | Custom metric |
| Maximum latency of a single Reducing State merge namespace operation | View the performance of a Reducing State merge namespace operation. | ns | Custom metric |
| Maximum latency of a single Map State access | View the performance of an operator accessing Map State. | ns | Custom metric |
| Maximum latency of a single Map State put operation | View the performance of a Map State put operation. | ns | Custom metric |
| Maximum latency of a single Map State put all operation | View the performance of a Map State put all operation. | ns | Custom metric |
| Maximum latency of a single Map State remove operation | View the performance of a Map State remove operation. | ns | Custom metric |
| Maximum latency of a single Map State contains operation | View the performance of a Map State contains operation. | ns | Custom metric |
| Maximum latency of a single Map State entries init operation | View the performance of a Map State entries init operation. | ns | Custom metric |
| Maximum latency of a single Map State keys init operation | View the performance of a Map State keys init operation. | ns | Custom metric |
| Maximum latency of a single Map State values init operation | View the performance of a Map State values init operation. | ns | Custom metric |
| Maximum latency of a single Map State iterator init operation | View the performance of a Map State iterator init operation. | ns | Custom metric |
| Maximum latency of a single Map State empty operation | View the performance of a Map State empty operation. | ns | Custom metric |
| Maximum latency of a single Map State iterator hasNext operation | View the performance of a Map State iterator hasNext operation. | ns | Custom metric |
| Maximum latency of a single Map State iterator next operation | View the performance of a Map State iterator next operation. | ns | Custom metric |
| Maximum latency of a single Map State iterator remove operation | View the performance of a Map State iterator remove operation. | ns | Custom metric |
| Maximum latency of a single List State access | View the performance of an operator accessing List State. | ns | Custom metric |
| Maximum latency of a single List State add operation | View the performance of a List State add operation. | ns | Custom metric |
| Maximum latency of a single List State add all operation | View the performance of a List State add all operation. | ns | Custom metric |
| Maximum latency of a single List State update operation | View the performance of a List State update operation. | ns | Custom metric |
| Maximum latency of a single List State merge namespace operation | View the performance of a List State merge namespace operation. | ns | Custom metric |
| Maximum latency of accessing the first entry of a Sorted Map State | View the performance of an operator accessing Sorted Map State. | ns | Custom metric |
| Maximum latency of accessing the last entry of a Sorted Map State | View the performance of an operator accessing Sorted Map State. | ns | Custom metric |
| Size of state data | Observe this metric to track the size of the state data over time. | Bytes | Custom metric |
| Size of the state data file | Observe this metric to track the size of the state data files over time. | Bytes | Custom metric |
| Time when each task last received a watermark | The data receiving latency of the TaskManager. | N/A | Custom metric |
| Watermark latency | The job latency at the subtask level. | ms | Custom metric |
| CPU load of a single JobManager | If this value is consistently greater than 100%, the CPU is very busy and the load is high. This may affect system performance, leading to system stuttering or long response times. | N/A | Basic metric |
| Heap memory of the JobManager | None | Bytes | Basic metric |
| Committed heap memory of the JobManager | None | Bytes | Basic metric |
| Maximum heap memory of the JobManager | None | Bytes | Basic metric |
| Non-heap memory of the JobManager | None | Bytes | Basic metric |
| Committed non-heap memory of the JobManager | None | Bytes | Basic metric |
| Maximum non-heap memory of the JobManager | None | Bytes | Basic metric |
| Number of JobManager threads | Too many JobManager threads can consume excessive memory and reduce job stability. | Count | Basic metric |
| Number of JobManager GCs | Too many GCs can consume excessive memory and affect job performance. This metric can help you diagnose jobs and troubleshoot job-level failures. | Count | Basic metric |
| Number of young generation space GCs for the JobManager (G1 garbage collector) | None | Count | Custom metric |
| Number of old generation space GCs for the JobManager (G1 garbage collector) | None | Count | Custom metric |
| Time spent on young generation space GCs for the JobManager (G1 garbage collector) | None | ms | Custom metric |
| Time spent on old generation space GCs for the JobManager (G1 garbage collector) | None | ms | Custom metric |
| Number of collections by the JobManager CMS garbage collector | None | Count | Basic metric |
| Duration of each JobManager GC | Long GCs can consume excessive memory and affect job performance. This metric can help you diagnose jobs and troubleshoot job-level failures. | ms | Basic metric |
| Time spent on collection by the JobManager CMS garbage collector | None | ms | Basic metric |
| Total number of classes loaded by the JobManager JVM after creation | If too many classes are loaded after the JobManager JVM is created, it can consume excessive memory and affect job performance. | N/A | Basic metric |
| Total number of classes unloaded by the JobManager JVM after creation | If too many classes are unloaded after the JobManager JVM is created, it can consume excessive memory and affect job performance. | N/A | Basic metric |
| CPU load of a single TaskManager | The number of processes that the CPU is running or waiting to run over a period of time, which indicates how busy the CPU is. CPU busyness is related to the number of CPU cores. In Flink, CPU load is calculated as `CPU Usage / Number of CPU cores`. If this value is consistently greater than 100%, the CPU is very busy and the load is high. | N/A | Basic metric |
| CPU utilization of a single JobManager | This value reflects Flink's occupation of CPU time slices. If this value is consistently greater than 100%, the CPU is very busy. If the load is high but the CPU utilization is low, it may be because frequent read and write operations have led to too many processes in an uninterruptible sleep state. | N/A | Basic metric |
| CPU utilization of a single TaskManager | This value reflects Flink's occupation of CPU time slices. If this value is consistently greater than 100%, the CPU is very busy. If the load is high but the CPU utilization is low, it may be because frequent read and write operations have led to too many processes in an uninterruptible sleep state. | N/A | Basic metric |
| Heap memory of the TaskManager | None | Bytes | Basic metric |
| Committed heap memory of the TaskManager | None | Bytes | Basic metric |
| Maximum heap memory of the TaskManager | None | Bytes | Basic metric |
| Non-heap memory of the TaskManager | None | Bytes | Basic metric |
| Committed non-heap memory of the TaskManager | None | Bytes | Basic metric |
| Maximum non-heap memory of the TaskManager | None | Bytes | Basic metric |
| Memory of the entire process, as reported by Linux | View changes in the process memory. | Bytes | Basic metric |
| Number of TaskManager threads | Too many TaskManager threads can consume excessive memory and reduce job stability. | Count | Basic metric |
| Number of TaskManager GCs | Too many GCs can consume excessive memory and affect job performance. This metric can help you diagnose jobs and troubleshoot task-level failures. | Count | Basic metric |
| Number of young generation space GCs for the TaskManager (G1 garbage collector) | None | Count | Custom metric |
| Number of old generation space GCs for the TaskManager (G1 garbage collector) | None | Count | Custom metric |
| Time spent on young generation space GCs for the TaskManager (G1 garbage collector) | None | ms | Custom metric |
| Time spent on old generation space GCs for the TaskManager (G1 garbage collector) | None | ms | Custom metric |
| Number of collections by the TaskManager CMS garbage collector | None | Count | Basic metric |
| Duration of each TaskManager GC | Long GCs can consume excessive memory and affect job performance. This metric helps you diagnose jobs and troubleshoot job-level failures. | ms | Basic metric |
| Time spent on collection by the TaskManager CMS garbage collector | None | ms | Basic metric |
| Total number of classes loaded by the TaskManager JVM after creation | If too many classes are loaded after the TaskManager JVM is created, it can consume excessive memory and affect job performance. | N/A | Basic metric |
| Total number of classes unloaded by the TaskManager JVM after creation | If too many classes are unloaded after the TaskManager JVM is created, it can consume excessive memory and affect job performance. | N/A | Basic metric |
| Job runtime | None | ms | Custom metric |
| Number of running jobs | None | N/A | Custom metric |
| Number of available task slots | None | N/A | Custom metric |
| Total number of task slots | None | N/A | Custom metric |
| Number of registered TMs | None | N/A | Custom metric |
| Number of bytes that the job reads from a remote source per second | None | Bytes/s | Custom metric |
| Number of records dropped due to window latency | None | Count | Custom metric |
| Window latency ratio | None | N/A | Custom metric |
| Whether the job is in the full data processing phase | Determine the job processing phase. | N/A | Custom metric |
| Whether the job is in the incremental data processing phase | Determine the job processing phase. | N/A | Custom metric |
| Number of unprocessed tables in the full data phase | View the number of remaining unprocessed tables. | Count | Custom metric |
| Number of tables waiting to be processed in the full data phase | View the number of remaining unprocessed tables. | Count | Custom metric |
| Number of processed tables in the full data phase | View the number of processed tables. | Count | Custom metric |
| Number of processed shards in the full data phase | View the number of processed shards. | Count | Custom metric |
| Number of shards waiting to be processed in the full data phase | View the number of unprocessed shards. | Count | Custom metric |
| Timestamp of the latest data record that is read | View the timestamp of the latest binary log data. | ms | Custom metric |
| Number of processed data records in the full data phase | View the amount of data processed in the full data phase. | Count | Custom metric |
| Number of data records read from each table | View the total amount of data processed for each table. | Count | Custom metric |
| Number of processed data records for each table in the full data phase | View the amount of data processed for each table in the full data phase. | Count | Custom metric |
| Number of insert DML statements processed for each table in the incremental phase | View the data volume of insert statements for each table. | Count | Custom metric |
| Number of update DML statements processed for each table in the incremental phase | View the data volume of update statements for each table. | Count | Custom metric |
| Number of delete DML statements processed for each table in the incremental phase | View the data volume of delete statements for each table. | Count | Custom metric |
| Number of DDL statements processed for each table in the incremental phase | View the data volume of DDL statements for each table. | Count | Custom metric |
| Number of insert DML statements processed in the incremental phase | View the data volume of insert statements. | Count | Custom metric |
| Number of update DML statements processed in the incremental phase | View the data volume of update statements. | Count | Custom metric |
| Number of delete DML statements processed in the incremental phase | View the data volume of delete statements. | Count | Custom metric |
| Number of DDL statements processed in the incremental phase | View the data volume of DDL statements. | Count | Custom metric |
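The table above gives the formula for TaskManager CPU load (`CPU Usage / Number of CPU cores`) and notes that a source input rate of 0 records per second can indicate blocked or dropped upstream data. The following minimal Python sketch illustrates both checks; the function names are hypothetical helpers, not part of any Flink or Prometheus API:

```python
def cpu_load(cpu_usage: float, num_cores: int) -> float:
    # Normalize CPU usage by core count, per the formula in the table above:
    # CPU load = CPU Usage / Number of CPU cores.
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return cpu_usage / num_cores


def source_stalled(records_in_per_second: float) -> bool:
    # A value of 0 for the per-second source input metric suggests that
    # upstream data is blocked or has been dropped (see the table above).
    return records_in_per_second == 0


# Example: a TaskManager using 3.2 cores' worth of CPU on a 4-core host
# has a load of 0.8 (80%); a source reporting 0 records/s should alert.
load = cpu_load(3.2, 4)
stalled = source_stalled(0)
```

For alerting, the same threshold logic would typically be expressed as a Prometheus alert rule rather than application code; this sketch only makes the arithmetic explicit.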
Common metric labels
Label | Description |
| The project namespace. |
| The deployment name. |
| The deployment ID. |
| The job ID. |
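The labels above can be used as matchers when querying these metrics through the Prometheus HTTP API. The sketch below builds an instant-query URL with Python's standard library; the metric name, label key, and endpoint are hypothetical placeholders — substitute the actual names from the tables above:

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, metric: str, labels: dict) -> str:
    # Assemble a PromQL selector such as metric{key="value",...} and wrap it
    # in a Prometheus HTTP API instant query (/api/v1/query?query=...).
    matchers = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    selector = f"{metric}{{{matchers}}}" if matchers else metric
    return f"{base_url}/api/v1/query?{urlencode({'query': selector})}"


# Hypothetical example: filter a restart-count metric by deployment name.
url = build_instant_query_url(
    "http://localhost:9090",        # assumed Prometheus endpoint
    "flink_job_restarts_total",     # hypothetical metric name
    {"deploymentName": "my-job"},   # hypothetical label from the table above
)
```

Sending the resulting URL with any HTTP client returns the current sample for the matching series.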
References
For more information about metrics for ARMS Application Monitoring, see Application Monitoring metrics.