This topic describes the Flink metrics that Managed Service for Prometheus provides.
Managed Service for Prometheus charges fees based on the amount of observable data that is written or the number of data reports. Metrics are classified into two types: basic metrics and custom metrics. Custom metrics are all metrics other than basic metrics. Basic metrics are free of charge. You are charged for custom metrics starting from January 6, 2020.
Metrics
Metric | Definition | Description | Unit | Type |
| Number of times a job was restarted when a job failover occurred | This metric indicates the number of times that a job was restarted when a job failover occurred. The number of times that the job was restarted when a JobManager failover occurred is not included. | N/A | Custom metric |
| Processing latency | If the value of this metric is large, a delay may occur in the job when the system pulls or processes data. | Milliseconds | Custom metric |
| Transmission latency | If the value of this metric is large, a delay may occur in the job when the system pulls data. | Milliseconds | Custom metric |
| Total number of input data records of all operators | If the value of this metric does not increase for an extended period of time for an operator, data may be missing from the source and no data is transmitted downstream. In this case, you must check the data of the source. | N/A | Custom metric |
| Total number of output records | If the value of this metric does not increase for an extended period of time for an operator, an error may exist in the code logic of the job and data is missing, so no data is transmitted downstream. In this case, you must check the code logic of the job. | N/A | Custom metric |
| Total number of input bytes | This metric measures the size of the input data records of the source. This helps observe the job throughput. | Bytes | Custom metric |
| Total number of output bytes | This metric measures the size of the output data records of the source. This helps observe the job throughput. | Bytes | Custom metric |
| Total number of input data records of all operators | If the value of this metric does not increase for an extended period of time for an operator, data may be missing from the source and no data is transmitted downstream. In this case, you must check the data of the source. | N/A | Custom metric |
| Number of input data records per second for the data stream | This metric measures the overall processing speed of a data stream. For example, the value of this metric helps determine whether the overall processing speed of the data stream meets the expected requirements and how the job performance changes under different input data loads. | Count/s | Custom metric |
| Total number of output records | If the value of this metric does not increase for an extended period of time for an operator, an error may exist in the code logic of the job and data is missing, so no data is transmitted downstream. In this case, you must check the code logic of the job. | N/A | Custom metric |
| Number of output data records per second for the data stream | This metric measures the overall output speed of a data stream. The speed indicates the number of output data records per second for the data stream. For example, the value of this metric helps determine whether the overall output speed of the data stream meets the expected requirements and how the job performance changes under different output data loads. | Count/s | Custom metric |
| Total number of data records flowing into the source operator | This metric measures the number of data records that flow into the source. | N/A | Custom metric |
| Total number of output data records in a sink | This metric measures the number of data records that were exported by the sink. | N/A | Custom metric |
| Number of input data records per second for the data stream | This metric measures the overall processing speed of a data stream. For example, the value of this metric helps determine whether the overall processing speed of the data stream meets the expected requirements and how the job performance changes under different input data loads. | Count/s | Custom metric |
| Number of output data records per second for the data stream | This metric measures the overall output speed of a data stream. The speed indicates the number of output data records per second for the data stream. For example, the value of this metric helps determine whether the overall output speed of the data stream meets the expected requirements and how the job performance changes under different output data loads. | Count/s | Custom metric |
| Number of input data records per second in a source | This metric measures the speed at which data records were generated in a source. The speed indicates the number of input data records per second in the source. For example, the number of data records that can be generated varies based on the type of each source in a data stream. The value of this metric helps determine the speed at which data records were generated in a source and adjust the data stream to improve performance. This metric is also used for monitoring and alerting. If the value of this metric is 0, data may be missing from the source. In this case, you must check whether data output was blocked because the data of the source was not consumed. | Count/s | Custom metric |
| Number of output data records per second in a sink | This metric measures the speed at which data records were exported from a sink. The speed indicates the number of output data records per second in the sink. For example, the number of data records that can be exported varies based on the type of each sink in a data stream. The value of the numRecordsOutOfSinkPerSecond metric helps determine the speed at which data records were exported from a sink and adjust the data stream to improve performance. This metric is also used for monitoring and alerting. If the value of this metric is 0, the code logic of the job may be invalid and all data is filtered out. In this case, you must check the code logic of the job. For a sample monitoring query based on a metric of this type, see the sketch after the table. | Count/s | Custom metric |
| Number of locally consumed data buffers per second | If the value of this metric is large, inter-task communication is frequent on the local node. | Count/s | Custom metric |
| Number of buffers received from a remote TaskManager per second | This metric indicates the frequency of inter-TaskManager communication. | Count/s | Custom metric |
| Number of buffers sent to other tasks per second | This metric measures the output pressure of tasks and the usage of network bandwidth. | Count/s | Custom metric |
| Total number of input bytes per second | This metric measures the rate at which data flowed into the source. This helps observe the job throughput. | Bytes/s | Custom metric |
| Total number of output bytes per second | This metric measures the rate at which data was exported by the source. This helps observe the job throughput. | Bytes/s | Custom metric |
| Number of data records not read by the source | This metric measures the number of data records that were not pulled by the source from the external system. | N/A | Custom metric |
| Duration for which data was not processed in the source | This metric specifies whether the source was idle. If the value of this metric is large, your data is generated at a low speed in the external system. | Milliseconds | Custom metric |
| Total number of input bytes per second | None | Bytes/s | Custom metric |
| Total number of output bytes per second | None | Bytes/s | Custom metric |
| Time consumed to send the latest record | None | Milliseconds | Custom metric |
| Total number of checkpoints | None | N/A | Custom metric |
| Number of failed checkpoints | None | N/A | Custom metric |
| Number of completed checkpoints | None | N/A | Custom metric |
| Number of checkpoints in progress | None | N/A | Custom metric |
| Duration of the last checkpoint | If the checkpoint takes an extended period of time or times out, the possible cause is that the storage space occupied by state data was excessively large, a temporary network error occurred, barriers were not aligned, or data backpressure existed. | Milliseconds | Custom metric |
| Size of the last checkpoint | This metric measures the size of the last checkpoint that was uploaded. This metric helps analyze the checkpoint performance when a bottleneck occurs. | Bytes | Custom metric |
| Maximum latency of a Clear operation on state data | This metric measures the performance of a Clear operation on state data. | Nanoseconds | Custom metric |
| Maximum latency of a Get operation on ValueState data | This metric measures the performance of accessing ValueState data by an operator. | Nanoseconds | Custom metric |
| Maximum latency of an Update operation on ValueState data | This metric measures the performance of an Update operation on ValueState data. | Nanoseconds | Custom metric |
| Maximum latency of a Get operation on AggregatingState data | This metric measures the performance of accessing AggregatingState data by an operator. | Nanoseconds | Custom metric |
| Maximum latency of an Add operation on AggregatingState data | This metric measures the performance of an Add operation on AggregatingState data. | Nanoseconds | Custom metric |
| Maximum latency of a Merge Namespace operation on AggregatingState data | This metric measures the performance of a Merge Namespace operation on AggregatingState data. | Nanoseconds | Custom metric |
| Maximum latency of a Get operation on ReducingState data | This metric measures the performance of accessing ReducingState data by an operator. | Nanoseconds | Custom metric |
| Maximum latency of an Add operation on ReducingState data | This metric measures the performance of an Add operation on ReducingState data. | Nanoseconds | Custom metric |
| Maximum latency of a Merge Namespace operation on ReducingState data | This metric measures the performance of a Merge Namespace operation on ReducingState data. | Nanoseconds | Custom metric |
| Maximum latency of a Get operation on MapState data | This metric measures the performance of accessing MapState data by an operator. | Nanoseconds | Custom metric |
| Maximum latency of a Put operation on MapState data | This metric measures the performance of a Put operation on MapState data. | Nanoseconds | Custom metric |
| Maximum latency of a PutAll operation on MapState data | This metric measures the performance of a PutAll operation on MapState data. | Nanoseconds | Custom metric |
| Maximum latency of a Remove operation on MapState data | This metric measures the performance of a Remove operation on MapState data. | Nanoseconds | Custom metric |
| Maximum latency of a Contains operation on MapState data | This metric measures the performance of a Contains operation on MapState data. | Nanoseconds | Custom metric |
| Maximum latency of an Init operation on MapState entries | This metric measures the performance of an Init operation on MapState entries. | Nanoseconds | Custom metric |
| Maximum latency of an Init operation on MapState keys | This metric measures the performance of an Init operation on MapState keys. | Nanoseconds | Custom metric |
| Maximum latency of an Init operation on MapState values | This metric measures the performance of an Init operation on MapState values. | Nanoseconds | Custom metric |
| Maximum latency of an Init operation on MapState Iterator | This metric measures the performance of an Init operation on MapState Iterator. | Nanoseconds | Custom metric |
| Maximum latency of an Empty operation on MapState data | This metric measures the performance of an Empty operation on MapState data. | Nanoseconds | Custom metric |
| Maximum latency of a HasNext operation on MapState Iterator | This metric measures the performance of a HasNext operation on MapState Iterator. | Nanoseconds | Custom metric |
| Maximum latency of a Next operation on MapState Iterator | This metric measures the performance of a Next operation on MapState Iterator. | Nanoseconds | Custom metric |
| Maximum latency of a Remove operation on MapState Iterator | This metric measures the performance of a Remove operation on MapState Iterator. | Nanoseconds | Custom metric |
| Maximum latency of a Get operation on ListState data | This metric measures the performance of accessing ListState data by an operator. | Nanoseconds | Custom metric |
| Maximum latency of an Add operation on ListState data | This metric measures the performance of an Add operation on ListState data. | Nanoseconds | Custom metric |
| Maximum latency of an AddAll operation on ListState data | This metric measures the performance of an AddAll operation on ListState data. | Nanoseconds | Custom metric |
| Maximum latency of an Update operation on ListState data | This metric measures the performance of an Update operation on ListState data. | Nanoseconds | Custom metric |
| Maximum latency of a Merge Namespace operation on ListState data | This metric measures the performance of a Merge Namespace operation on ListState data. | Nanoseconds | Custom metric |
| Maximum latency of accessing the first entry of SortedMapState data | This metric measures the performance of accessing SortedMapState data by an operator. | Nanoseconds | Custom metric |
| Maximum latency of accessing the last entry of SortedMapState data | This metric measures the performance of accessing SortedMapState data by an operator. | Nanoseconds | Custom metric |
| State data size | This metric helps you track the size of state data and assess its impact on job performance. | Bytes | Custom metric |
| Size of the state data file | This metric helps you track the size of state data files and assess their impact on job performance. | Bytes | Custom metric |
| Time when each task received the latest watermark | This metric measures the latency of data receiving by the TaskManager. | None | Custom metric |
| Watermark latency | This metric measures the latency of subtasks. | Milliseconds | Custom metric |
| CPU load of the JobManager | If the value of this metric is greater than 100% for an extended period of time, the CPU is busy and the CPU load is high. This may affect the system performance. As a result, issues such as system stuttering and slow response occur. | None | Basic metric |
| Amount of JobManager heap memory | None | Bytes | Basic metric |
| Amount of heap memory committed by the JobManager | None | Bytes | Basic metric |
| Maximum amount of heap memory of the JobManager | None | Bytes | Basic metric |
| Amount of non-heap memory of the JobManager | None | Bytes | Basic metric |
| Amount of non-heap memory committed by the JobManager | None | Bytes | Basic metric |
| Maximum amount of non-heap memory of the JobManager | None | Bytes | Basic metric |
| Number of threads of the JobManager | A large number of threads of the JobManager occupy excessive memory space, reducing the job stability. | N/A | Basic metric |
| Number of GCs performed within the JobManager | Frequent GCs lead to excessive memory consumption and negatively affect job performance. This metric helps diagnose job issues and identify the causes of job failures. | N/A | Basic metric |
| Number of JobManager young GCs | None | N/A | Custom metric |
| Number of JobManager old GCs | None | N/A | Custom metric |
| Duration of JobManager young GCs | None | Milliseconds | Custom metric |
| Duration of JobManager old GCs | None | Milliseconds | Custom metric |
| Number of GCs performed by the Concurrent Mark Sweep (CMS) garbage collector of the JobManager | None | N/A | Basic metric |
| Duration of each JobManager GC | If JobManager GCs last for an extended period of time, excessive memory space is occupied, affecting the job performance. This metric helps diagnose job issues and identify the causes of job failures. | Milliseconds | Basic metric |
| GC duration of the JobManager CMS garbage collector | None | Milliseconds | Basic metric |
| Total number of classes that were loaded after the Java virtual machine (JVM) in which the JobManager resides was created | If the total number of classes that were loaded is excessively large after the JVM in which the JobManager resides was created, excessive memory space is occupied, affecting the job performance. | None | Basic metric |
| Total number of classes that were unloaded after the JVM in which the JobManager resides was created | If the total number of classes that were unloaded is excessively large after the JVM in which the JobManager resides was created, excessive memory space is occupied, affecting the job performance. | None | Basic metric |
| CPU load of the TaskManager | This metric indicates the total number of processes that were running on the CPU and processes that were waiting to be run by the CPU. In most cases, this metric indicates how busy the CPU is. The value of this metric is related to the number of CPU cores that were used. The CPU load in Flink is calculated by using the following formula: CPU load = CPU utilization/Number of CPU cores. For example, if the CPU utilization is 200% and four CPU cores are used, the CPU load is 50%. If the value of this metric remains high for an extended period of time, the CPU is busy, which may affect the system performance. | None | Basic metric |
| CPU utilization of the JobManager | This metric indicates the utilization of CPU time slices that were occupied by Flink. If the value of this metric is greater than 100% for an extended period of time, the CPU is busy. If the CPU load is high but the CPU utilization is low, a large number of processes that were in the uninterruptible sleep state may be running due to frequent read and write operations. | None | Basic metric |
| CPU utilization of the TaskManager | This metric indicates the utilization of CPU time slices that were occupied by Flink. If the value of this metric is greater than 100% for an extended period of time, the CPU is busy. If the CPU load is high but the CPU utilization is low, a large number of processes that were in the uninterruptible sleep state may be running due to frequent read and write operations. | None | Basic metric |
| Amount of heap memory of the TaskManager | None | Bytes | Basic metric |
| Amount of heap memory committed by the TaskManager | None | Bytes | Basic metric |
| Maximum amount of heap memory of the TaskManager | None | Bytes | Basic metric |
| Amount of non-heap memory of the TaskManager | None | Bytes | Basic metric |
| Amount of non-heap memory committed by the TaskManager | None | Bytes | Basic metric |
| Maximum amount of non-heap memory of the TaskManager | None | Bytes | Basic metric |
| Amount of memory consumed by the entire process on Linux | This metric tracks changes in memory consumption of the process. | Bytes | Basic metric |
| Number of threads of the TaskManager | A large number of threads of the TaskManager occupy excessive memory space, reducing the job stability. | N/A | Basic metric |
| Number of GCs performed within the TaskManager | Frequent GCs lead to excessive memory consumption and negatively affect job performance. This metric helps diagnose job issues and identify the causes of job failures. | N/A | Basic metric |
| Number of TaskManager young GCs | None | N/A | Custom metric |
| Number of TaskManager old GCs | None | N/A | Custom metric |
| Duration of TaskManager young GCs | None | Milliseconds | Custom metric |
| Duration of TaskManager old GCs | None | Milliseconds | Custom metric |
| Number of GCs performed by the CMS garbage collector of the TaskManager | None | N/A | Basic metric |
| Duration of each TaskManager GC | If TaskManager GCs last for an extended period of time, excessive memory space is occupied, affecting the job performance. This metric helps diagnose job issues and identify the causes of job failures. | Milliseconds | Basic metric |
| GC duration of the TaskManager CMS garbage collector | None | Milliseconds | Basic metric |
| Total number of classes that were loaded after the JVM in which the TaskManager resides was created | If the total number of classes that were loaded is excessively large after the JVM in which the TaskManager resides was created, excessive memory space is occupied, affecting the job performance. | None | Basic metric |
| Total number of classes that were unloaded after the JVM in which the TaskManager resides was created | If the total number of classes that were unloaded is excessively large after the JVM in which the TaskManager resides was created, excessive memory space is occupied, affecting the job performance. | None | Basic metric |
| Duration for which the job has been running | None | Milliseconds | Custom metric |
| Number of running jobs | None | None | Custom metric |
| Number of available task slots | None | None | Custom metric |
| Total number of task slots | None | None | Custom metric |
| Number of registered TaskManagers | None | None | Custom metric |
| Number of bytes read from the remote source per second | None | Bytes/s | Custom metric |
| Number of packets dropped due to window latency | None | N/A | Custom metric |
| Window latency rate | None | None | Custom metric |
| Whether the job was in the full data phase | This metric indicates the job processing phase. | None | Custom metric |
| Whether the job was in the incremental phase | This metric indicates the job processing phase. | None | Custom metric |
| Number of unprocessed tables in the full data phase | This metric measures the number of unprocessed tables. | N/A | Custom metric |
| Number of tables waiting to be processed in the full data phase | This metric measures the number of unprocessed tables. | N/A | Custom metric |
| Number of processed tables in the full data phase | This metric measures the number of processed tables. | N/A | Custom metric |
| Number of processed shards in the full data phase | This metric measures the number of processed shards. | N/A | Custom metric |
| Number of shards waiting to be processed in the full data phase | This metric measures the number of unprocessed shards. | N/A | Custom metric |
| Number of shards waiting to be processed in the full data phase | This metric measures the number of unprocessed shards. | N/A | Custom metric |
| Timestamp of the latest read data record | This metric measures the time of the latest binary log data. | Milliseconds | Custom metric |
| Number of processed data records in the full data phase | This metric measures the number of processed data records in the full data phase. | N/A | Custom metric |
| Number of data records read from each table | This metric measures the total number of processed data records in each table. | N/A | Custom metric |
| Number of processed data records in each table in the full data phase | This metric measures the number of processed data records in each table in the full data phase. | N/A | Custom metric |
| Number of executed INSERT DML statements for each table in the incremental phase | This metric measures the number of executed INSERT statements for each table. | N/A | Custom metric |
| Number of executed UPDATE DML statements for each table in the incremental phase | This metric measures the number of executed UPDATE statements for each table. | N/A | Custom metric |
| Number of executed DELETE DML statements for each table in the incremental phase | This metric measures the number of executed DELETE statements for each table. | N/A | Custom metric |
| Number of executed DDL statements for each table in the incremental phase | This metric measures the number of executed DDL statements for each table. | N/A | Custom metric |
| Number of executed INSERT DML statements in the incremental phase | This metric measures the number of executed INSERT statements. | N/A | Custom metric |
| Number of executed UPDATE DML statements in the incremental phase | This metric measures the number of executed UPDATE statements. | N/A | Custom metric |
| Number of executed DELETE DML statements in the incremental phase | This metric measures the number of executed DELETE statements. | N/A | Custom metric |
| Number of executed DDL statements in the incremental phase | This metric measures the number of executed DDL statements. | N/A | Custom metric |
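The throughput metrics in the preceding table, such as the per-second input rate of a source and the per-second output rate of a sink, are described as inputs for monitoring and alerting. The following Python sketch shows how such a check could be automated against the Prometheus HTTP API (/api/v1/query). The endpoint URL, the metric name, and the deploymentName label are assumptions for illustration only; the names that are actually exposed depend on your workspace and reporter configuration.

```python
import requests  # generic HTTP client; any HTTP library works

# Assumed values for illustration only. Replace them with your Prometheus
# endpoint and the metric name that your Flink deployment actually exposes.
PROMETHEUS_URL = "http://prometheus.example.com/api/v1/query"
METRIC = "flink_taskmanager_job_task_operator_numRecordsInOfSourcePerSecond"


def source_is_stalled(deployment_name: str) -> bool:
    """Return True if the source input rate of the deployment is 0.

    A sustained value of 0 suggests that no data is being pulled from the
    source, as described for the per-second input-rate metric in the table.
    """
    # Instant query filtered by a hypothetical deploymentName label.
    response = requests.get(
        PROMETHEUS_URL,
        params={"query": f'{METRIC}{{deploymentName="{deployment_name}"}}'},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()["data"]["result"]
    # Each result carries a [timestamp, value] pair; the value is a string.
    # An empty result set is also treated as a stall because no data points
    # are being reported for the deployment.
    return all(float(sample["value"][1]) == 0.0 for sample in results)


if __name__ == "__main__":
    if source_is_stalled("my-flink-deployment"):
        print("Source input rate is 0. Check whether the source has data to consume.")
```

In practice, the same check is often expressed as a Prometheus alerting rule; the sketch only illustrates how the reported values can be read programmatically.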
Common metric labels
Label | Description |
| The name of the namespace. |
| The deployment name. |
| The deployment ID. |
| The job ID. |
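When you query these metrics, the common labels can be combined into a PromQL label selector to narrow the results to one namespace, deployment, or job. The following is a minimal sketch; the label keys (namespace, deploymentName, deploymentId, jobId) and the metric name are placeholders for illustration, so replace them with the keys and names that your workspace actually attaches to the reported series.

```python
def build_selector(metric: str, **labels: str) -> str:
    """Build a PromQL instant-vector selector, for example
    metric{namespace="ns",deploymentName="dep"}, from the given labels."""
    matchers = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{matchers}}}"


# Example: select a job-restart metric for one deployment in one namespace.
# The metric name and label keys are illustrative placeholders.
query = build_selector(
    "flink_jobmanager_job_numRestarts",
    namespace="my-namespace",
    deploymentName="my-deployment",
)
print(query)
# flink_jobmanager_job_numRestarts{namespace="my-namespace",deploymentName="my-deployment"}
```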
References
For information about the metrics of Application Real-Time Monitoring Service (ARMS) Application Monitoring, see Application Monitoring metrics.