FAQ about the monitoring and alerting features and logs - Realtime Compute for Apache Flink

This topic provides answers to some frequently asked questions (FAQ) about the monitoring and alerting features and logs.

What do I do if the TaskManager logs of a DataStream deployment contain a NullPointerException error but do not provide the details of the error stack?
How do I configure parameters at the log level for a single class?
How do I enable GC logging in the Realtime Compute for Apache Flink console?
What do I do if a DataStream deployment is not delayed but the values of delay-related metrics for output data indicate a delay in the deployment?
How do I resolve the issue that the logs generated by using a non-static method cannot be exported to Simple Log Service?
What do I do if Kafka can receive data that is written from Realtime Compute for Apache Flink but the value in the Records Received column on the Status tab of the related deployment is 0?
What do I do if a deployment startup error is reported after I configure parameters to export the logs of the deployment to Simple Log Service?
What are the limits of the alerting feature of CloudMonitor compared with ARMS?

What do I do if the TaskManager logs of a DataStream deployment contain a NullPointerException error but do not provide the details of the error stack?

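By default, the JVM can omit the stack trace of a built-in exception, such as NullPointerException, that is thrown frequently from the same code location. To retain the full stack trace, disable this optimization. On the O&M > Deployments page, find the deployment that you want to manage and click its name. On the Configuration tab, click Edit in the upper-right corner of the Parameters section and add the following code to the Other Configuration field: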

env.java.opts: "-XX:-OmitStackTraceInFastThrow"
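
The following minimal sketch illustrates the JVM behavior that this option controls. It is not part of Realtime Compute for Apache Flink, and the class and method names (FastThrowSketch, lengthOf, firstIterationWithEmptyTrace) are hypothetical. It repeatedly triggers an implicit NullPointerException and reports the first iteration, if any, at which the exception arrives without a stack trace. Whether and when that happens depends on the JVM version and the JIT compiler, so no specific output is assumed.

```java
class FastThrowSketch {
    // Dereferences its argument; a null argument makes the JVM throw an implicit NullPointerException.
    static int lengthOf(String s) {
        return s.length();
    }

    // Returns the first iteration at which the caught exception has an empty
    // stack trace, or -1 if every exception kept its stack trace.
    static int firstIterationWithEmptyTrace(int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            try {
                lengthOf(null);
            } catch (NullPointerException e) {
                if (e.getStackTrace().length == 0) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int iteration = firstIterationWithEmptyTrace(500_000);
        System.out.println(iteration < 0
            ? "Every NullPointerException kept its stack trace"
            : "Stack trace first omitted at iteration " + iteration);
    }
}
```

Running the sketch once with the default JVM options and once with -XX:-OmitStackTraceInFastThrow is one way to compare the two behaviors locally.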

How do I configure parameters at the log level for a single class?

To set the log level for a single class or package, configure it in the Log Levels field in the Logging section of the Configuration tab. You cannot configure it in the Other Configuration field of the Parameters section. For example, when you use the Kafka connector, you can specify log4j.logger.org.apache.kafka.clients.consumer=trace for an ApsaraMQ for Kafka source table and log4j.logger.org.apache.kafka.clients.producer=trace for an ApsaraMQ for Kafka result table.

How do I enable GC logging in the Realtime Compute for Apache Flink console?

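On the O&M > Deployments page, find the deployment that you want to manage and click its name. On the Configuration tab, click Edit in the upper-right corner of the Parameters section and add the following code to the Other Configuration field:

env.java.opts: >-
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/flink/log/gc.log
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=50M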

What do I do if a DataStream deployment is not delayed but the values of delay-related metrics for output data indicate a delay in the deployment?

Problem description
A Flink source continuously reads data, and the Kafka connector continuously writes the data to each partition of an ApsaraMQ for Kafka physical table. However, the values of the CurrentEmitEventTimeLag and CurrentFetchEventTimeLag metrics for the DataStream deployment indicate a delay of 52 years.
Cause
The Kafka connector in the DataStream deployment is provided by the Apache Flink community and is not a built-in connector that is supported by Ververica Platform (VVP). Connectors that are provided by the Apache Flink community do not support metric-based monitoring on VVP. As a result, the values of the metrics are abnormal.
Solution
Use the dependencies of connectors that are supported by VVP. For more information, see Ververica Maven Repository.
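
One plausible reading of the 52-year value, offered as an assumption that this topic does not confirm: an event-time lag is typically computed as the current time minus the event time of a record. If the event time is missing and defaults to 0 (January 1, 1970 UTC), the lag equals the whole time elapsed since the epoch. The sketch below works through that arithmetic. The class and method names are hypothetical, and the sample date in 2022 is chosen only for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class EventTimeLagSketch {
    // Assumed definition of an event-time lag: emit time minus event time, in milliseconds.
    static long lagMillis(long emitTimeMillis, long eventTimeMillis) {
        return emitTimeMillis - eventTimeMillis;
    }

    // Number of whole calendar years between two epoch-millisecond timestamps (UTC).
    static long wholeYearsBetween(long fromMillis, long toMillis) {
        return ChronoUnit.YEARS.between(
            Instant.ofEpochMilli(fromMillis).atZone(ZoneOffset.UTC),
            Instant.ofEpochMilli(toMillis).atZone(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        // Illustrative emit time; any moment in 2022 gives the same number of whole years.
        long emitTime = Instant.parse("2022-06-01T00:00:00Z").toEpochMilli();
        // Assumption: the connector does not supply an event time, so it defaults to 0.
        long missingEventTime = 0L;

        System.out.println("Lag in ms: " + lagMillis(emitTime, missingEventTime));
        System.out.println("Lag in whole years: " + wholeYearsBetween(missingEventTime, emitTime));
    }
}
```

Under these assumptions, the lag is 52 whole years even though the deployment itself processes data without delay.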

How do I resolve the issue that the logs generated by using a non-static method cannot be exported to Simple Log Service?

Problem description
Log export to Simple Log Service relies on the Logger and Appender logic of the Log4j Appender. As a result, the logs that are generated by a Logger that is not declared as a static field cannot be exported to Simple Log Service.
Solution
Declare the Logger as a static field, for example: private static final Logger LOG = LoggerFactory.getLogger(xxx.class);

What do I do if Kafka can receive data that is written from Realtime Compute for Apache Flink but the value in the Records Received column on the Status tab of the related deployment is 0?

Problem description
The deployment topology contains only one operator because the source and sink operators are chained together. A source operator has only output and no input, and a sink operator has only input and no output. In this case, the amount of data that is read and written cannot be viewed in the deployment topology.
Solution
Split the operators to view the amount of data in the deployment topology. After the source and sink operators are split from the chain into independent operators, each of them is connected to other operators in a new topology, in which you can view the data flow and traffic.
To split the operators, go to the O&M > Deployments page, find the deployment that you want to manage, and click its name. On the Configuration tab, click Edit in the upper-right corner of the Parameters section and add pipeline.operator-chaining: 'false' to the Other Configuration field.

What do I do if a deployment startup error is reported after I configure parameters to export the logs of the deployment to Simple Log Service?

Problem description

After the parameters are configured to export the logs of the deployment to Simple Log Service, the "Failed to start the deployment. Try again." error message appears during the startup of the deployment and the following error message is also reported:

Unknown ApiException {exceptionType=com.ververica.platform.appmanager.controller.domain.TemplatesRenderException, exceptionMessage=Failed to render {userConfiguredLoggers={}, jobId=3fd090ea-81fc-4983-ace1-0e0e7b******, rootLoggerLogLevel=INFO, clusterName=f7dba7ec27****, deploymentId=41529785-ab12-405b-82a8-1b1d73******, namespace=flinktest-default, priorityClassName=flink-p5, deploymentName=test}}
029999 202312121531-8SHEUBJUJU

Cause
The values of variables in the Twig templates, such as namespace and deploymentId, were changed when you configured the parameters to export the logs of the deployment to Simple Log Service. As a result, the template cannot be rendered.
Solution
Reconfigure the parameters based on your business requirements. For more information, see Configure parameters to export logs of a deployment.

What are the limits of the alerting feature of CloudMonitor compared with ARMS?

Query and analysis statements are not supported.
Only the current metric charts of a deployment are displayed. Historical metric charts are unavailable, which makes it inefficient to compare records per second (RPS) across multiple rounds of optimization.
The metric charts of individual subtasks are unavailable. In scenarios that involve multiple sources and subtasks, latency issues cannot be identified intuitively or efficiently because only aggregated metrics are displayed.
You cannot view the metrics that are reported by using internal code instrumentation, which can make troubleshooting inconvenient.