You can monitor the performance of a running deployment based on the metrics related to its JobManager and TaskManagers. For example, you can view the utilization of CPU, memory, and threads. This helps you identify potential issues, such as code errors, slow class initialization, and high resource utilization by specific classes. This topic describes how to view the performance of the JobManager and TaskManagers of a running deployment.
Prerequisites
The required permissions are granted to the Alibaba Cloud accounts or Resource Access Management (RAM) users used to access your Realtime Compute for Apache Flink namespace. For more information, see Grant namespace permissions.
Limits
You can view the deployment performance only in Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.11 or later.
The performance monitoring feature applies only to running deployments. You cannot view the performance of historical deployments.
Use performance analysis tools
Flame Graph
Flame graphs may fail to capture the complete execution context because flame graphs are generated based on sampled data in most cases. To enhance the accuracy of bottleneck diagnosis, we recommend that you use flame graphs together with other performance analysis tools while re-examining your actual code. You can identify performance bottlenecks based on the following factors:
CPU consumption: In most cases, a wider frame in the graph indicates that the corresponding function, together with the functions that it calls, consumes more CPU resources than other functions, which may point to a performance issue.
Memory allocation: the memory usage of different functions.
Lock: potential performance issues caused by lock contention or deadlocks.
ITimer: the CPU consumption of all threads within a specific interval.
Procedure for using a flame graph to identify potential performance bottlenecks:
View the flame graph structure.
A flame graph consists of multiple layers of stack frames. Each layer represents a level in the call stack. The bottom layer indicates the entry point to the application, and each upper layer indicates the functions that are called by the frames directly below it.
Focus on the width and frequency of stack frames.
A wider stack frame indicates that the function appears in more samples and therefore consumes more CPU time than other functions, which points to a bottleneck in most cases. If specific stack frames appear frequently across the graph, the corresponding functions are on many of the sampled call paths, which may also indicate performance issues.
Determine the call stack level.
The vertical position of a stack frame indicates the call stack level. In most cases, wide frames at the bottom indicate that issues occur at an early stage or in the main part of the program, whereas wide frames at the top indicate issues related to a specific function.
Re-examine your code.
Re-examine your code and optimize the implementation of the hot spots that you identified in the previous steps. For example, you can reduce the number of loop iterations, use more efficient data structures, and reduce synchronization operations.
Run performance tests.
Run performance tests to validate your code optimization. You can compare the flame graphs before and after optimization to check whether the bottlenecks are eliminated.
If non-Java functions exist in your code, the corresponding stack frames are labeled with the keyword "unknown" in the flame graph. For more information, visit GitHub.
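The relationship between sampled call stacks and frame widths can be illustrated with a minimal Python sketch. It is not how the console generates its flame graphs; it only assumes that each sample is a call stack listed from the entry point upward. The function names in `samples` and the helper `frame_widths` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def frame_widths(samples):
    """Return the share of samples covered by each frame.

    A frame is identified by its full path from the entry point,
    which mirrors how a flame graph draws one box per call path.
    """
    counts = Counter()
    for stack in samples:
        # Every prefix of the stack is a frame that this sample passes through.
        for depth in range(1, len(stack) + 1):
            counts[tuple(stack[:depth])] += 1
    total = len(samples)
    return {path: n / total for path, n in counts.items()}


# Hypothetical samples: each list is one sampled call stack, entry point first.
samples = [
    ["main", "processElement", "parseJson"],
    ["main", "processElement", "parseJson"],
    ["main", "processElement", "updateState"],
    ["main", "checkpoint"],
]

widths = frame_widths(samples)

# Print the widest frames first; ties are ordered by path for stable output.
for path, share in sorted(widths.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{' > '.join(path)}: {share:.0%}")
```

In this sketch, the frame for `parseJson` is half as wide as the entry point because it appears in two of the four samples. Running the same calculation on samples that are taken before and after an optimization gives a rough numeric counterpart to comparing the two flame graphs visually.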
Threads
Go to the Debug tab.
Performance of the JobManager
On the Logs tab, click the Job Manager tab and click the Debug tab.
Performance of running TaskManagers
On the Logs tab, click the Running Task Managers tab, click the value in the Path, ID column, and then click Debug.
Go to the Threads tab, find the operator that you want to analyze, and click Sample in the Actions column. In the window that appears, wait until the sampling of the operator is complete. Then, check the thread stacks. The following figure shows the thread stacks that are accessed by Gemini State.
Thread Dump
On the Logs tab, click the Running Task Managers tab and click the value in the Path, ID column.
Go to the Thread Dump tab and search by name for an operator that processes state data. Then, check whether the thread stacks displayed under the operator show interactions between the operator and GeminiStateBackend or RocksDBStateBackend.
You can view the name of the operator on the Status tab.
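If you copy the thread dump text out of the console, the same check can be scripted. The following Python sketch is only an illustration: it assumes that the copied text uses a jstack-like layout in which each thread block starts with a quoted thread name and blocks are separated by blank lines. The helper `threads_touching_state`, the sample dump, and every stack frame in it are made up for this example and are not real output.

```python
import re

# Names to look for in the stack frames, taken from the step above.
STATE_BACKEND_MARKERS = ("GeminiStateBackend", "RocksDBStateBackend")


def threads_touching_state(dump_text, operator_name):
    """Return the names of the operator's threads whose stacks mention a state backend."""
    matches = []
    # Assumption: thread blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text.strip()):
        # Assumption: each block starts with the thread name in double quotes.
        header = re.match(r'"([^"]+)"', block)
        if not header:
            continue
        name = header.group(1)
        if operator_name in name and any(m in block for m in STATE_BACKEND_MARKERS):
            matches.append(name)
    return matches


# Made-up sample text with placeholder frames, for illustration only.
SAMPLE_DUMP = '''
"GroupAggregate -> Sink (1/2)#0" prio=5 tid=0x1 RUNNABLE
    at example.placeholder.RocksDBStateBackend.read(Placeholder.java)
    at example.placeholder.Operator.processElement(Placeholder.java)

"Source: orders (1/2)#0" prio=5 tid=0x2 WAITING
    at example.placeholder.Queue.take(Placeholder.java)
'''

hits = threads_touching_state(SAMPLE_DUMP, "GroupAggregate")
print(hits)
```

If the list is not empty for an operator, the threads of the operator were interacting with the state backend when the dump was taken. If your dump uses a different layout, adjust the two regular expressions accordingly.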
References
The intelligent deployment diagnostics feature can help you monitor the health status of your deployments and ensure the stability and reliability of your business. For more information, see Perform intelligent deployment diagnostics.
You can use deployment configurations and Flink SQL optimization to improve the performance of Flink SQL deployments. For more information, see Optimize Flink SQL.