Configure monitoring and alerting - Realtime Compute for Apache Flink

Realtime Compute for Apache Flink allows you to use CloudMonitor or Managed Service for Prometheus of Application Real-Time Monitoring Service (ARMS) to monitor deployments and send alerts. You can configure alert rules for metrics or subscribe to event-triggered alerts so that you can detect and handle exceptions at the earliest opportunity. CloudMonitor is free of charge. This topic describes how to configure alert rules and event subscriptions by using CloudMonitor or Managed Service for Prometheus of ARMS.

Limits

You cannot configure alert rules for Realtime Compute for Apache Flink deployments that are deployed in session clusters.
You cannot configure alert rules for batch deployments of Realtime Compute for Apache Flink.

Configuration guide

CloudMonitor: You must go to the CloudMonitor console to configure alert rules for metrics or subscribe to event-triggered alerts.
- For more information about how to configure alert rules for metrics for a single deployment or multiple deployments at a time, see the Configure alert rules for metrics section of this topic.
- For more information about how to subscribe to event-triggered alerts for deployments or workflows, see the Subscribe to event-triggered alerts section of this topic.
Managed Service for Prometheus of ARMS:
- You can configure alert rules for six specific metrics and subscribe to alerts triggered by deployment failures for a single deployment in the development console of Realtime Compute for Apache Flink. For more information, see Development console of Realtime Compute for Apache Flink in the "Configure alert rules for metrics" section of this topic and Development console of Realtime Compute for Apache Flink in the "Subscribe to event-triggered alerts" section of this topic.
- You can use static thresholds and the PromQL syntax in the ARMS console to configure alert rules for other metrics for a single deployment or multiple deployments at a time. For more information, see ARMS in the "Configure alert rules for metrics" section of this topic.
- You can subscribe to event-triggered alerts in the CloudMonitor console. You can subscribe to alerts only for the Elastic Compute Service (ECS) failure handling events, ECS proactive O&M events, and workflow events. For more information, see CloudMonitor in the "Subscribe to event-triggered alerts" section of this topic.

Configure alert rules for metrics

CloudMonitor

Important

Only the Alibaba Cloud account that is used to purchase the specified workspace and the Resource Access Management (RAM) users and RAM roles that have permissions on the namespaces within the Alibaba Cloud account can be used to configure alert rules for metrics in the CloudMonitor console.

Configure alert rules for metrics for a single deployment

Log on to the Realtime Compute for Apache Flink console. Find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, choose O&M > Deployments. Find the deployment that you want to manage and click its name.
On the page that appears, click the Alarm tab. In the upper-right corner of the Alarm tab, click Subscribe to indicator alerts to go to the CloudMonitor console.
In the Configure Rule Description panel of the CloudMonitor console, configure the parameters and click OK.
Select Simple Metric or Combined Metrics for the Metric Type parameter. In the Dimension section, select a namespace and a deployment ID to specify the monitoring scope. Set the namespace parameter to the name of the specified Realtime Compute for Apache Flink namespace. Set the deploymentId parameter to the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page. For more information about other alert parameters, see Create an alert rule.
Note
The namespace drop-down list displays only the namespaces in which monitoring data is generated, and the deploymentId drop-down list displays only the IDs of the deployments in which monitoring data is generated. You can manually specify values for the namespace and deploymentId parameters if no values can be selected.
In the Create Alert Rule panel, configure other alert parameters.
By default, Instances is selected for the Resource Range parameter, and the value of the Associated Resources parameter is the ID of the specified workspace. You cannot change the value of the Associated Resources parameter after the alert rule is created. For more information about how to obtain the workspace ID, see Console operations. For more information about other alert parameters, see Create an alert rule.
Click Confirm.

Configure alert rules for metrics for multiple deployments at a time

Log on to the CloudMonitor console.
In the left-side navigation pane, choose Alerts > Alert Rules.

Click Create Alert Rule and configure the parameters. For more information, see Create an alert rule.

Parameter	Description
Product	Select Flink.
Resource Range	All Resources: The alert rule applies to all resources in Realtime Compute for Apache Flink. Instances: The alert rule applies to the specified workspace of Realtime Compute for Apache Flink. Click Add Instance. In the Add Instance dialog box, select a workspace in the region in which your workspace resides and click OK.
Rule Description	Click Add Rule and select Simple Metric or Combined Metrics. The Configure Rule Description panel appears. For more information about the parameters, see Create an alert rule. In the Dimension section, select a namespace and a deployment ID to specify the monitoring scope. Set the namespace parameter to the name of the specified Realtime Compute for Apache Flink namespace. Set the deploymentId parameter to the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page. If you configure only the namespace parameter, the settings apply to all deployments in the specified namespace. If you leave both parameters empty, the settings apply to all deployments in the specified workspace. Note: You can manually specify values for the namespace and deploymentId parameters if no values can be selected.
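
The scoping behavior of the namespace and deploymentId dimensions described in the table can be pictured with a minimal Python sketch. This is only an illustration of the rules stated above, not CloudMonitor code. The function name `rule_applies` and its parameters are hypothetical, and the case in which only a deployment ID is set is an assumption because the table does not describe it.

```python
from typing import Optional


def rule_applies(dep_namespace: str, dep_id: str,
                 namespace: Optional[str] = None,
                 deployment_id: Optional[str] = None) -> bool:
    """Return True if a deployment falls inside the dimension scope of a rule.

    dep_namespace / dep_id describe the deployment being checked.
    namespace / deployment_id are the dimension values of the rule
    (None means that the dimension is left empty).
    """
    # Both dimensions empty: the rule covers every deployment in the workspace.
    if namespace is None and deployment_id is None:
        return True
    # Namespace set: the deployment must belong to that namespace.
    if namespace is not None and dep_namespace != namespace:
        return False
    # Deployment ID set: the deployment must match it.
    # (Assumption: an ID without a namespace is matched by ID alone.)
    if deployment_id is not None and dep_id != deployment_id:
        return False
    return True
```

For example, `rule_applies("ns1", "d1", namespace="ns1")` is true for every deployment in `ns1`, whereas adding `deployment_id="d1"` narrows the scope to a single deployment.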

Development console of Realtime Compute for Apache Flink

Note

In the development console of Realtime Compute for Apache Flink, you can view only the alert events that were generated within the last 48 hours. To view earlier alert events, go to the Alert Management page in the ARMS console.

In the development console of Realtime Compute for Apache Flink, you can configure alert rules for metrics only for a single deployment. You can create an alert rule for a deployment directly or by using an existing alert rule template. The use of an alert rule template helps improve your configuration efficiency.

Go to the Alarm tab.
1. Log on to the Realtime Compute for Apache Flink console. Find the workspace that you want to manage and click Console in the Actions column.
2. In the left-side navigation pane, choose O&M > Deployments. Find the deployment that you want to manage and click its name.
3. Click the Alarm tab.
On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Custom Rule.
You can also create an alert rule by using an alert rule template. To do so, choose Add Rule > Create Rule by Template, click the name of the template that you want to use, and then perform the subsequent steps. You can modify the template parameters based on your business requirements.

In the Create Rule panel, configure the parameters. The following table describes the parameters.

Section	Parameter	Description
Rule	Name	The name of the alert rule. The name must be 3 to 64 characters in length and can contain lowercase letters, digits, and underscores (_). It must start with a lowercase letter.
	Description	The description of the alert rule.
	Content	The conditions that trigger an alert. After you configure the conditions, Realtime Compute for Apache Flink compares the values of specified metrics with the thresholds that are specified in the conditions at the interval you specify. If one of the conditions resolves to true, an alert is triggered.
		- Metric:
		  - Restart Count in 1 Minute: the number of times that the JobManager restarts deployments in 1 minute.
		  - Checkpoint Count in 5 Minutes: the number of times that checkpointing succeeds in 5 minutes.
		  - Emit Delay: the processing delay. This parameter specifies the difference between the time when data is generated and the time when data leaves the source operator. Unit: seconds. Important: The time when the data is generated depends on the timestamp that is recorded in the external system. If no timestamp is recorded in the external system or the timestamp that is recorded when data is written to the external system is incorrect, the value of the Emit Delay parameter is invalid and cannot be used to determine the true processing delay.
		  - IN RPS: the number of input data records per second.
		  - OUT RPS: the number of output data records per second.
		  - Source Idle Time: the duration for which data is not processed in the source. Unit: milliseconds.
		- Time Interval: the interval within which data of a metric is collected every minute. Realtime Compute for Apache Flink obtains data of the metric within the last interval and compares the obtained data with the specified threshold. If the historical data meets the specified conditions of the alert rule, an alert is triggered.
		- Comparator: The greater-than-or-equal-to sign (>=) and the less-than-or-equal-to sign (<=) are supported.
		- Thresholds: the value that is used to compare with the value of a metric.
		  - If you set the Comparator parameter to the greater-than-or-equal-to sign (>=), the maximum value of the metric on the vertical axis within the last interval is used. If the maximum value of the metric within the last interval is greater than or equal to the threshold, an alert is triggered.
		  - If you set the Comparator parameter to the less-than-or-equal-to sign (<=), the minimum value of the metric on the vertical axis within the last interval is used. If the minimum value of the metric within the last interval is less than or equal to the threshold, an alert is triggered.
		  For example, you can set the Time Interval parameter to 5 minutes, the Comparator parameter to the less-than-or-equal-to sign (<=), and the Thresholds parameter to 2. In this case, Realtime Compute for Apache Flink obtains the values of a metric within the last 5 minutes on the vertical axis and compares the minimum value of the metric with the specified threshold. If the minimum value of the metric within the specified interval is less than or equal to the threshold, an alert is triggered.
	Effective Time	The time period during which the alert rule is effective. If you do not specify a time period, the alert rule is effective throughout the day. For example, you can specify a time period from 09:00 to 18:00.
	Alarm Rate	The interval at which an alert is reported. Unit: minutes. You can set this parameter to a value in a range from 1 minute to 1440 minutes (24 hours).
Notification	Notification	The method that is used to send a notification. You can select multiple notification methods. Valid values: DingTalk, Email, SMS, Webhook, and Phone. Make sure that the contact you added can receive alert notifications. Otherwise, alert notifications cannot be sent. Click Notification object management on the right side of Notification object to check whether the phone number passes verification. If Unverified appears in the Phone column of a contact on the Contact tab, click Unverified to complete verification. Important: Make sure that one or more contacts are created and added. Otherwise, alert notifications cannot be sent. For example, if you select DingTalk for the Notification parameter, a DingTalk chatbot must be added as a contact.
	Notification object	The contacts to which alert notifications are sent. You can select multiple contacts. You can directly select or search for a contact. You must create contacts before you select contacts. To manage contacts, perform the following operations: Click Notification object management on the right side of Notification object. In the Edit Contact Group dialog box, click Edit in the Actions column on the Contact Group, Contact, Webhook, and DingTalk tabs separately, edit information, and then click Save. For more information about how to add a DingTalk or Lark chatbot and a webhook, see the FAQ section of this topic.
	Alarm Noise Reduction	After you click Advanced Settings, you can turn on Alarm Noise Reduction. After you turn on Alarm Noise Reduction, the system does not send alert notifications if a deployment quickly recovers from a short failover. For example, in cluster scheduling or automatic tuning scenarios, a deployment may perform a failover for a short period of time. The system sends alert notifications only when the specified threshold condition is continuously met.
	No Data Alarms	After you click Advanced Settings, you can turn on No Data Alarms and specify the time period during which no data is generated. After you turn on this switch, data that is monitored based on codeless tracking is reported. If no data is reported during the specified time period, the system sends an alert notification. In most cases, if an issue, such as an exception of the JobManager, abnormal deployment cancellation, or an exception of the report trace, occurs, data that is monitored based on codeless tracking is reported.

Click OK.
After you create an alert rule, the rule is immediately effective. You can stop, edit, or delete the alert rule in the alert rule list.
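
The way the Comparator and Thresholds parameters are evaluated over the last interval can be sketched in a few lines of Python. This is a minimal illustration of the rule described in the Content row, not the actual implementation of the service. The function name `should_alert` is hypothetical, and returning False for an empty window is an assumption (the No Data Alarms switch covers that case separately).

```python
from typing import Sequence


def should_alert(samples: Sequence[float], comparator: str, threshold: float) -> bool:
    """Evaluate one alert condition over the samples of the last interval.

    ">=" compares the maximum value in the window with the threshold.
    "<=" compares the minimum value in the window with the threshold.
    """
    if not samples:
        # Assumption: an empty window does not trigger a threshold alert.
        return False
    if comparator == ">=":
        return max(samples) >= threshold
    if comparator == "<=":
        return min(samples) <= threshold
    raise ValueError(f"unsupported comparator: {comparator!r}")
```

With a 5-minute interval, a `<=` comparator, and a threshold of 2, the samples `[5, 3, 2, 4]` trigger an alert because the minimum value 2 is less than or equal to the threshold.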

ARMS

Note

If you access Realtime Compute for Apache Flink as a RAM user or by using a RAM role, the RAM user or RAM role must have the permissions to access ARMS. For more information, see Overview.

Configure alert rules for metrics for a single deployment

Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and choose More > Monitoring Indicator Configuration in the Actions column to go to the ARMS console.
On the page that appears, view the workspace name, workspace ID, and the name of the Prometheus instance that corresponds to the workspace.
In the left-side navigation pane, click Alert rules. On the page that appears, click Create Prometheus Alert Rule to create an alert rule.
- Check Type: Select Static Threshold or Custom PromQL. Use these check types for metrics other than the six metrics that you can configure in the development console of Realtime Compute for Apache Flink.
- Filter Conditions: Set the Namespace parameter to the name of the specified Realtime Compute for Apache Flink namespace. Set the Deployment Name parameter to the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page. If you select Equal, the settings apply to the specified deployment.
For more information about the parameters, see Create an alert rule for a Prometheus instance.

Configure alert rules for metrics for multiple deployments at a time

Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and choose More > Monitoring Indicator Configuration in the Actions column to go to the ARMS console.
On the page that appears, view the workspace name, workspace ID, and the name of the Prometheus instance that corresponds to the workspace.
In the left-side navigation pane, click Alert rules. On the page that appears, click Create Prometheus Alert Rule to create an alert rule.
- Check Type: Select Static Threshold or Custom PromQL. Use these check types for metrics other than the six metrics that you can configure in the development console of Realtime Compute for Apache Flink.
- Filter Conditions: Configure filter conditions for selecting multiple deployments. Set the namespace parameter to the name of the specified Realtime Compute for Apache Flink namespace. Set the deploymentID parameter to the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page. If you select ALL, the settings apply to all deployments in the specified namespace.
For more information about the parameters, see Create an alert rule for a Prometheus instance. You can also create an alert rule template for a Prometheus instance. For more information, see Create and manage an alert rule template.
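
If you select Custom PromQL, the alert condition is a PromQL expression. The following Python sketch shows one way to assemble such an expression from a metric name, label filters, a time window, and a threshold. Everything named here is an assumption for illustration: the helper `build_promql` is hypothetical, and the metric name and the label names `namespace` and `deploymentId` in the usage example are placeholders. Check the actual metric and label names of your Prometheus instance in the ARMS console before you use an expression.

```python
def build_promql(metric: str, labels: dict, window: str,
                 comparator: str, threshold: float) -> str:
    """Build a PromQL threshold expression over a time window.

    ">=" uses max_over_time and "<=" uses min_over_time, which mirrors the
    maximum/minimum comparison of static threshold rules.
    Label values are inserted as-is; quotes in values are not escaped.
    """
    aggregations = {">=": "max_over_time", "<=": "min_over_time"}
    if comparator not in aggregations:
        raise ValueError(f"unsupported comparator: {comparator!r}")
    # Sort the labels so that the generated expression is deterministic.
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{aggregations[comparator]}({metric}{{{matchers}}}[{window}]) {comparator} {threshold}"


# Hypothetical metric and label names, for illustration only.
expr = build_promql(
    "example_restart_count",
    {"namespace": "default", "deploymentId": "abc"},
    "5m", ">=", 2,
)
```

Omitting the deployment label from `labels` yields an expression that matches every deployment in the namespace, which corresponds to the multi-deployment case.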

Subscribe to event-triggered alerts

CloudMonitor

Important

Only the Alibaba Cloud account that is used to purchase the specified workspace and the RAM users and RAM roles that have permissions on the namespaces within the Alibaba Cloud account can be used to subscribe to event-triggered alerts in the CloudMonitor console.

Subscribe to event-triggered alerts for deployments

You can configure alert conditions to subscribe to system event-triggered alerts. You can subscribe to multiple event-triggered alerts at a time.

Log on to the CloudMonitor console.
In the left-side navigation pane, choose Event Center > Event Subscription.
On the Subscription Policy tab, click Create Subscription Policy.
On the Create Subscription Policy page, configure the parameters.
For more information about the parameters, see Manage event subscription policies (recommended).
- Subscription Type: Select System events.
- Products: Select Realtime Compute for Apache Flink.
- Event name: You can select JOB_FAILED, ECS.SystemFailure, and ECS.SystemMaintenance if you set the Subscription Type parameter to System events. Managed Service for Prometheus of ARMS does not support the JOB_FAILED event. If you select the JOB_FAILED event, you can select only Critical for the Event Level parameter.
- Event Content: Enter the following information about Realtime Compute for Apache Flink to subscribe to event-triggered alerts for the specified one or more deployments:
  - Workspace ID: The system generates event-triggered alerts for all deployments in all namespaces of the specified workspace. For more information about how to obtain the workspace ID, see How do I view the information about a workspace, such as the workspace ID?
  - Namespace name: The system generates event-triggered alerts for all deployments in the specified namespace.
  - Deployment name: The system generates event-triggered alerts for the specified deployments. Separate multiple deployment names with commas (,). Check whether deployments with the same name exist within your account. If so, specify the deployment ID instead of the deployment name.
  - Deployment ID: The system generates event-triggered alerts for the specified deployments. Separate multiple deployment IDs with commas (,). You can obtain the ID of a deployment from the Deployment ID parameter on the Configuration tab.
Note
If you do not configure the Application grouping, Event Content, and Event Resources parameters, the subscription scope covers all workspaces within your account.

Subscribe to event-triggered alerts for workflows

You can configure alert conditions to subscribe to system event-triggered alerts for workflows of Realtime Compute for Apache Flink. You can subscribe to multiple event-triggered alerts at a time. For more information about workflows, see Manage workflows.

Obtain the resource ID of a task in a workflow.

Log on to the CloudMonitor console.
In the left-side navigation pane, choose Event Center > System Event.
On the Event Monitoring tab, select Realtime Compute for Apache Flink as the service, set the event name to flink:Workflow:TaskStateChange, and then click Search.

Find the resource ID of the task that you want to manage in the workflow.


A resource ID is in the following format: `acs:flink:cn-hangzhou:<workspaceAlibabaCloudAccountId>:resourceId/workspaceId/<workspaceId-namespaceId>#workflowDefinitionName/<workflowDefinitionName>#taskDefinitionName/<taskDefinitionName>`. You can also obtain the resource ID of a task in a workflow by concatenating the relevant values in this format.

Parameter	Description
`<workspaceAlibabaCloudAccountId>`	The ID of the Alibaba Cloud account that is used to create the Realtime Compute for Apache Flink workspace.
`<workspaceId-namespaceId>`	The `workspaceId` and `namespaceId` that are concatenated by using a hyphen (-). `workspaceId`: the ID of the workspace. To obtain the workspace ID, log on to the Realtime Compute for Apache Flink console, find the workspace that you want to manage, and then choose More > Workspace Details in the Actions column. `namespaceId`: the name of the namespace.
`<workflowDefinitionName>`	The name of the workflow.
`<taskDefinitionName>`	The name of the task in the workflow.
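
The concatenation can be sketched in Python as follows. This is a minimal illustration based only on the format shown above. The function name `build_task_resource_id` and the sample values are hypothetical. Treating `cn-hangzhou` as a replaceable region ID is an assumption, because the format in this topic shows only that literal value.

```python
def build_task_resource_id(account_id: str, workspace_id: str, namespace: str,
                           workflow_name: str, task_name: str,
                           region: str = "cn-hangzhou") -> str:
    """Concatenate the parts of a workflow task resource ID.

    The workspace ID and the namespace name are joined with a hyphen (-),
    as described in the parameter table.
    """
    return (
        f"acs:flink:{region}:{account_id}:resourceId/workspaceId/"
        f"{workspace_id}-{namespace}"
        f"#workflowDefinitionName/{workflow_name}"
        f"#taskDefinitionName/{task_name}"
    )


# Placeholder values, for illustration only.
resource_id = build_task_resource_id(
    "1234567890", "ws-example", "default", "daily_wf", "load_task"
)
```

Compare the generated value with a resource ID displayed on the Event Monitoring tab before you use it in a subscription policy.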

Note

Typically, a latency of several minutes exists before CloudMonitor displays state change events of workflows.

Subscribe to event notifications.
1. In the left-side navigation pane, choose Event Center > Event Subscription.
2. On the Subscription Policy tab, click Create Subscription Policy.
3. On the Create Subscription Policy page, configure the parameters.
  For more information about the parameters, see Manage event subscription policies (recommended).
  - Name: Enter a name for the subscription policy.
  - Subscription Type: Select System events.
  - Subscription Scope:
    - Products: Select Realtime Compute for Apache Flink.
    - Event name: Select flink:Workflow:TaskStateChange.
    - Event Content: Enter toState: FAILED.
      The following values can be configured:
      - toState: FAILED: indicates that the workflow failed.
      - toState: SUCCESS: indicates that the workflow succeeded.
      - fromState: SCHEDULED, toState: RUNNING: indicates that the workflow state changed from SCHEDULED to RUNNING.
    - Event Resources: Enter the resource ID obtained in Step 1.
    - Leave the Event Type, Event Level, and Application grouping parameters empty.

Development console of Realtime Compute for Apache Flink

Note

In the development console of Realtime Compute for Apache Flink, you can configure a deployment failure-triggered alert rule only for a single deployment.

Go to the Alarm tab.
1. Log on to the Realtime Compute for Apache Flink console. Find the workspace that you want to manage and click Console in the Actions column.
2. In the left-side navigation pane, choose O&M > Deployments. Find the deployment that you want to manage and click its name.
3. Click the Alarm tab.
On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Custom Rule.
You can also create a deployment failure-triggered alert rule by using an alert rule template, which helps improve your configuration efficiency. To do so, choose Add Rule > Create Rule by Template, click the name of the template that you want to use, and then perform the subsequent steps.

In the Create Rule panel, configure the parameters. The following table describes the parameters.

Section	Parameter	Description
Rule	Name	The name of the alert rule. The name must be 3 to 64 characters in length and can contain lowercase letters, digits, and underscores (_). It must start with a lowercase letter.
	Description	The description of the alert rule.
	Content	The metric to be monitored. In this example, Job Failed is selected.
	Effective Time	The time period during which the alert rule is effective. If you do not specify a time period, the alert rule is effective throughout the day. For example, you can specify a time period from 09:00 to 18:00.
	Alarm Rate	The interval at which an alert is reported. Unit: minutes. You can set this parameter to a value in a range from 1 minute to 1440 minutes (24 hours).
Notification	Notification	The method that is used to send a notification. You can select multiple notification methods. Valid values: DingTalk, Email, SMS, Webhook, and Phone. Make sure that the contact you added can receive alert notifications. Otherwise, alert notifications cannot be sent. Click Notification object management on the right side of Notification object to check whether the phone number passes verification. If Unverified appears in the Phone column of a contact on the Contact tab, click Unverified to complete verification. Important: Make sure that one or more contacts are created and added. Otherwise, alert notifications cannot be sent. For example, if you select DingTalk for the Notification parameter, a DingTalk chatbot must be added as a contact.
Notification	Notification object	The contacts to which alert notifications are sent. You can select multiple contacts. You can directly select or search for a contact. You must create contacts before you select contacts. To manage contacts, perform the following operations: Click Notification object management on the right side of Notification object. In the Edit Contact Group dialog box, click Edit in the Actions column on the Contact Group, Contact, Webhook, and DingTalk tabs separately, edit information, and then click Save. For more information about how to add a DingTalk chatbot and a webhook, see the FAQ section of this topic.

Click OK.
After you create an alert rule, the rule is immediately effective. You can stop, edit, or delete the alert rule in the alert rule list.
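
The naming constraints of the Name parameter (3 to 64 characters, lowercase letters, digits, and underscores, starting with a lowercase letter) can be checked before you submit a rule, for example when rule names are generated by a script. The following Python sketch is one possible check. The function name `is_valid_rule_name` is hypothetical, and the console performs its own validation.

```python
import re

# One lowercase letter followed by 2 to 63 lowercase letters, digits, or
# underscores, which gives a total length of 3 to 64 characters.
_RULE_NAME = re.compile(r"[a-z][a-z0-9_]{2,63}")


def is_valid_rule_name(name: str) -> bool:
    """Return True if the name satisfies the constraints in the parameter table."""
    return _RULE_NAME.fullmatch(name) is not None
```

For example, `job_failed_1` passes the check, whereas `1abc` (starts with a digit), `ab` (too short), and `Abc` (uppercase letter) do not.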

ARMS

If a workspace uses ARMS to provide the monitoring and alerting service, you can subscribe to event-triggered alerts in the CloudMonitor console. You can subscribe to alerts only for the ECS failure handling events and ECS proactive O&M events. For more information about how to subscribe to alerts for deployment failures, see Development console of Realtime Compute for Apache Flink in the "Subscribe to event-triggered alerts" section of this topic. For more information about how to subscribe to other alerts, see CloudMonitor in the "Subscribe to event-triggered alerts" section of this topic.

FAQ

How do I add a DingTalk chatbot in the development console of Realtime Compute for Apache Flink?

Add a custom DingTalk chatbot and obtain the webhook URL of the chatbot. For more information, see the "Add a custom DingTalk chatbot and obtain the webhook URL" section of the Configure a DingTalk chatbot to send alert notifications topic.
Important
To ensure that you receive alerts from a DingTalk chatbot, select at least Custom Keywords in the Security Settings section of the Add Robot dialog box, and configure Alarm as a keyword.
Add a notification object.
1. In the left-side navigation pane of the Realtime Compute for Apache Flink console, choose O&M > Deployments. Find the deployment that you want to manage and click its name. On the page that appears, click the Alarm tab.
2. On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Custom Rule or choose Add Rule > Create Rule by Template > Add Rule Template.
3. In the Create Rule or Create Rule Template panel, click Notification object management.
In the Edit Contact Group dialog box, click the DingTalk tab. On the DingTalk tab, click Add DingTalk.
In the Add DingTalk dialog box, configure the Name and URL parameters and click Submit.
Go back to the Create Rule or Create Rule Template panel in Step 2. Select DingTalk for Notification and select the related DingTalk chatbot from the Notification object drop-down list.
For more information about how to configure other parameters for an alert rule, see Development console of Realtime Compute for Apache Flink in the "Configure alert rules for metrics" section of this topic.
Click OK.
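
Because the chatbot is configured with the custom keyword Alarm, DingTalk accepts a message only if the message text contains that keyword. To check a chatbot outside the console, you can build a text message in the format used by custom DingTalk chatbots and POST it as JSON to the webhook URL. The following Python sketch only builds and checks the payload and sends nothing. The function name `build_dingtalk_text` is hypothetical, and the local substring check is a simplification of the matching that DingTalk performs.

```python
import json


def build_dingtalk_text(message: str, keyword: str = "Alarm") -> str:
    """Build the JSON body of a DingTalk text message.

    Raises ValueError if the message does not contain the custom keyword,
    because DingTalk would reject such a message.
    """
    if keyword not in message:
        raise ValueError(f"message must contain the custom keyword {keyword!r}")
    return json.dumps({"msgtype": "text", "text": {"content": message}})


body = build_dingtalk_text("Alarm: test message from a local check")
```

You can then send `body` to the webhook URL of the chatbot with any HTTP client and the `Content-Type: application/json` request header.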

How do I add a webhook in the development console of Realtime Compute for Apache Flink?

In the Create Rule Template or Create Rule panel, click Notification object management.
In the Edit Contact Group dialog box, click the Webhook tab. On the Webhook tab, click Add Webhook.

In the Add Webhook dialog box, configure the parameters. The following table describes the parameters.

Parameter	Description
Name	Required. The name of the webhook that you want to add.
URL	Required. The webhook URL.
Headers	Optional. The request headers that store cookies and tokens. The format is key: value. Note: Make sure that a space exists after the colon (:) between the key and the value.
Params	Optional. The request parameters that are in the key: value format. Note: Make sure that a space exists after the colon (:) between the key and the value.
Body	Required. The request body that is used to store the POST request parameters and parameter values. You can use the $content placeholder in the request body. $content represents the actual alert message.

Click OK.
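
The key: value format of the Headers and Params parameters and the $content placeholder of the Body parameter can be illustrated with a short Python sketch. It only mimics the documented formats locally and does not show how the console processes a webhook. The function names `parse_pairs` and `render_body` are hypothetical, and putting one pair on each line is an assumption.

```python
from string import Template


def parse_pairs(text: str) -> dict:
    """Parse lines in the 'key: value' format (a space must follow the colon)."""
    pairs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        key, separator, value = line.partition(": ")
        if not separator:
            raise ValueError(f"expected 'key: value' with a space after the colon: {line!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def render_body(body_template: str, alert_message: str) -> str:
    """Replace the $content placeholder with the alert message.

    The message is inserted as-is; it is not JSON-escaped in this sketch.
    """
    return Template(body_template).safe_substitute(content=alert_message)
```

For example, `render_body('{"text": "$content"}', "Job failed")` returns `{"text": "Job failed"}`, and `parse_pairs("Authorization:Bearer abc")` raises an error because the space after the colon is missing.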

References

Realtime Compute for Apache Flink allows you to use CloudMonitor or Managed Service for Prometheus of ARMS to implement deployment monitoring and alerting. CloudMonitor is free of charge. For more information about the differences of features and costs between CloudMonitor and Managed Service for Prometheus of ARMS, see Comparison between CloudMonitor and Managed Service for Prometheus of ARMS.
ARMS supports escalation policies and schedule management for alert notifications. For more information, see Configure an escalation policy and Use Cases.
CloudMonitor allows you to receive alert notifications by using DingTalk groups or Lark groups. For more information, see Alert notification methods.
For more information about the metrics supported by Realtime Compute for Apache Flink, see Metrics.
If you no longer require Managed Service for Prometheus of ARMS for a workspace or a metric for a deployment in Realtime Compute for Apache Flink, you can disable Managed Service for Prometheus of ARMS for the workspace or discard the metric for the deployment. This helps reduce costs. You can restore the metric that you discard based on your business requirements. For more information, see Discard or restore metrics.