Monitor deployment performance - Realtime Compute for Apache Flink

You can monitor the performance of a running deployment based on the metrics related to its JobManager and TaskManagers. For example, you can view the utilization of CPU, memory, and threads. This helps you identify potential issues, such as code errors, slow class initialization, and high resource utilization by specific classes. This topic describes how to view the performance of the JobManager and TaskManagers of a running deployment.

Prerequisites

The required permissions are granted to the Alibaba Cloud accounts or Resource Access Management (RAM) users used to access your Realtime Compute for Apache Flink namespace. For more information, see Grant namespace permissions.

Limits

Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.11 or later allows you to view the deployment performance.
The performance monitoring feature applies only to running deployments. You cannot view the performance of historical deployments.

Procedure

Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click O&M > Deployments.
On the Deployments page, find the deployment that you want to manage and click its name. On the page that appears, click the Logs tab.

View deployment performance on the tabs that are described in the following table.

Tab	Description
Flame Graph	Flame graphs visualize performance bottlenecks during the execution of a program. They use a layered structure to display call stacks and highlight the most frequently executed code segments. This provides an intuitive way to identify CPU-intensive functions, which are also known as hot spots, so that you can perform targeted optimization. For more information, see Flame Graphs.
Memory	You can view the memory usage in different spaces of the Java Virtual Machine (JVM).
Threads	You can view the details of each thread and select a thread for sampling and analysis.
Thread Dump	You can view information about all threads at the current time.
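
The memory spaces shown on the Memory tab correspond to the memory pools that the JVM itself reports. The following is a minimal sketch, not the console's implementation, that lists those pools from inside any Java process by using the standard java.lang.management API. The class name JvmMemorySpaces and the output format are illustrative assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for illustration.
final class JvmMemorySpaces {

    /** Returns one line per JVM memory pool: name, type (HEAP or NON_HEAP), used bytes, max bytes. */
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // The pool is no longer valid.
            }
            // getMax() returns -1 when the pool has no defined upper limit.
            String max = usage.getMax() < 0 ? "undefined" : Long.toString(usage.getMax());
            lines.add(String.format("%-35s %-8s used=%d max=%s",
                    pool.getName(), pool.getType().name(), usage.getUsed(), max));
        }
        return lines;
    }

    public static void main(String[] args) {
        describePools().forEach(System.out::println);
    }
}
```

The pool names depend on the JVM version and the garbage collector in use, so the output differs between environments.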

Use performance analysis tools

Flame Graph

Because flame graphs are generated from sampled data in most cases, they may fail to capture the complete execution context. To improve the accuracy of bottleneck diagnosis, we recommend that you use flame graphs together with other performance analysis tools and re-examine your actual code. You can identify performance bottlenecks based on the following factors:

CPU consumption: In most cases, a wider frame in the graph indicates that the related top-level function call consumes more CPU resources than other calls, which may result in a performance issue.
Memory allocation: the memory usage of different functions.
Lock: potential performance issues caused by lock contention or deadlocks.
ITimer: the CPU consumption of all threads within a specific interval.

[Figure: View deployment performance]

Procedure for using a flame graph to identify potential performance bottlenecks:

View the flame graph structure.
A flame graph consists of multiple layers of stack frames. Each layer represents a level in the call stack. The bottom layer indicates the entry point to the application, and each layer above it indicates the functions that are called by the layer below, so the call depth increases toward the top.
Focus on the width and frequency of stack frames.
A wider stack frame indicates that the function appears in more samples and therefore consumes more CPU time than other functions, which points to a bottleneck in most cases. If specific stack frames appear frequently, the corresponding functions are called repeatedly, which may cause performance issues.
Determine the call stack level.
The vertical position of a stack frame indicates its depth in the call stack. In most cases, wide frames at the bottom indicate that issues occur at an early stage or in the main part of the program, and their width mostly reflects the time spent in the functions that they call. Wide frames at the top indicate issues related to a specific function that consumes CPU time itself.
Re-examine your code.
Re-examine your code and optimize the implementation of the hot spots that you identified in the previous steps. For example, you can reduce the number of loops, improve the data structure, and reduce synchronization operations.
Run performance tests.
Run performance tests to validate your code optimization. You can compare the flame graphs before and after optimization to check whether the bottlenecks are eliminated.
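
To make the relationship between samples and frame width concrete, the following minimal sketch aggregates a few stack samples into the folded format (frames joined from root to leaf, followed by a count) and computes the share of samples in which a function appears. It is an illustration under stated assumptions, not the profiler that the console uses. The class name FoldedStacks and the frame names such as processElement and parseJson are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, used only for illustration.
final class FoldedStacks {

    /** Counts identical stacks. Each key is the frames of one stack joined root-first with ';'. */
    static Map<String, Integer> fold(List<List<String>> samples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> stack : samples) {
            counts.merge(String.join(";", stack), 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns the fraction of samples whose stack contains the given frame.
     * This is the total relative width of the function across all call paths.
     */
    static double width(List<List<String>> samples, String frame) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long hits = samples.stream().filter(stack -> stack.contains(frame)).count();
        return (double) hits / samples.size();
    }

    public static void main(String[] args) {
        // Four made-up samples, each listed from the entry point to the function on CPU.
        List<List<String>> samples = List.of(
                List.of("main", "processElement", "parseJson"),
                List.of("main", "processElement", "parseJson"),
                List.of("main", "processElement", "updateState"),
                List.of("main", "checkpoint"));
        fold(samples).forEach((stack, count) -> System.out.println(stack + " " + count));
        System.out.println("parseJson width: " + width(samples, "parseJson"));
    }
}
```

In this made-up data set, parseJson appears in two of the four samples, so its frame spans half the width of the graph, while main spans the full width because every sample passes through it.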

Note

If non-Java functions exist in your code, the corresponding stack frames are labeled with the keyword "unknown" in the flame graph. For more information, visit GitHub.

Threads

Go to the Debug tab.
- Performance of the JobManager
  On the Logs tab, click the Job Manager tab and click the Debug tab.
- Performance of running TaskManagers
  On the Logs tab, click the Running Task Managers tab, click the value in the Path, ID column, and then click Debug.
Go to the Threads tab, find the operator that you want to manage, and click Sample in the Actions column. In the window that appears, wait for the operator to be sampled, and then check the thread stacks. The following figure shows the thread stacks that access Gemini State.
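
Sampling a thread means reading its stack repeatedly and counting which frames show up. The following minimal sketch illustrates the idea with the standard ThreadMXBean API. It is not the mechanism behind the Sample action in the console, and the class name ThreadSampler, the thread name busy-worker, and the sampling parameters are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, used only for illustration.
final class ThreadSampler {

    /** Reads the top stack frame of one thread `rounds` times and counts how often each frame is seen. */
    static Map<String, Integer> sampleTopFrames(long threadId, int rounds, long intervalMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            ThreadInfo info = threads.getThreadInfo(threadId, 1); // Stack depth 1: top frame only.
            if (info == null) {
                break; // The thread has terminated or does not exist.
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length == 0
                    ? "<no frames>"
                    : stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt and stop sampling.
                break;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A made-up busy thread that stands in for an operator thread.
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7;
            }
        }, "busy-worker");
        worker.setDaemon(true);
        worker.start();
        System.out.println(sampleTopFrames(worker.getId(), 20, 10));
        worker.interrupt();
    }
}
```

Frames that dominate the counts are the ones in which the thread spends most of its time during the sampling window.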

Thread Dump

On the Logs tab, click the Running Task Managers tab and click the value in the Path, ID column.
Go to the Thread Dump tab and search by name for an operator that processes state data. Then, check whether thread stacks that contain interaction information between the operator and GeminiStateBackend or RocksDBStateBackend are displayed under the operator.
You can view the name of the operator on the Status tab.
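
The check described above can be sketched in code as a filter over a thread dump: keep the threads whose name contains the operator name and whose stack contains a state backend class. The following example is a minimal sketch that runs against the local JVM. It assumes that the thread name contains the operator name and that matching a keyword such as Gemini or RocksDB against class names is sufficient. The class name ThreadDumpFilter is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for illustration.
final class ThreadDumpFilter {

    /** Returns true if the thread name contains the operator name and any frame's class name contains the keyword. */
    static boolean matches(String threadName, StackTraceElement[] stack,
                           String operatorName, String keyword) {
        if (!threadName.contains(operatorName)) {
            return false;
        }
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Dumps all live threads of this JVM and returns the names of the threads that match. */
    static List<String> find(String operatorName, String keyword) {
        List<String> hits = new ArrayList<>();
        // false, false: do not collect locked monitors or ownable synchronizers.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if (matches(info.getThreadName(), info.getStackTrace(), operatorName, keyword)) {
                hits.add(info.getThreadName());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed operator name and keyword; replace them with the values from your deployment.
        System.out.println(find("GroupAggregate", "RocksDB"));
    }
}
```

When run outside a TaskManager, find typically returns an empty list because no thread matches. The sketch only shows the shape of the search that you perform manually on the Thread Dump tab.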

References

The intelligent deployment diagnostics feature can help you monitor the health status of your deployments and ensure the stability and reliability of your business. For more information, see Perform intelligent deployment diagnostics.
You can use deployment configurations and Flink SQL optimization to improve the performance of Flink SQL deployments. For more information, see Optimize Flink SQL.