O&M Dashboard

The O&M Dashboard page in Operation Center displays the O&M stability assessment, key O&M metrics, and scheduling resource usage of auto triggered nodes, as well as the running details of manually triggered nodes. The page also displays information about data synchronization nodes in Data Integration. This helps you quickly understand the overall status of the nodes in your workspace, identify and handle exceptions at the earliest opportunity, and improve O&M efficiency.

Usage notes

The O&M Dashboard page allows you to view the overall O&M information about your auto triggered nodes, manually triggered nodes, and data synchronization nodes in Data Integration from the following perspectives. For more information about O&M on each type of node, see View O&M information about auto triggered nodes, View O&M information about manually triggered nodes, and View O&M information about data synchronization nodes in Data Integration.

Specified workspace: You can view O&M information about a specified workspace, including the overall O&M information about the auto triggered nodes and manually triggered nodes in the workspace, and the O&M information about data synchronization nodes in the workspace.
All workspaces: You can view the overall O&M information about all workspaces within your current account. You cannot separately view the O&M information about data synchronization nodes in Data Integration.

Limits

The development environment of Operation Center in workspaces in standard mode does not support the O&M Dashboard feature. For information about workspaces in standard mode, see Differences between workspaces in basic mode and workspaces in standard mode.
Note
You can switch between the production environment and the development environment in the top navigation bar of the Operation Center page.
Auto Triggered Node: This tab displays only O&M information about auto triggered nodes and instances.
Manually Triggered Node: This tab displays only O&M information about manually triggered workflows and manually triggered nodes and instances in the workflows.
Data Integration: This tab displays only O&M information about batch synchronization nodes and real-time synchronization nodes in Data Integration.

Go to the O&M Dashboard page

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

View O&M information about auto triggered nodes

On the Auto Triggered Node tab, you can view the O&M stability assessment information about auto triggered nodes, the items that need to be focused on, the distribution of auto triggered nodes in different O&M states, the completion status of auto triggered instances, and the resource usage of different resource groups for scheduling.

View information in the O&M Stability Assessment section

In the O&M Stability Assessment section, the O&M stability of your workspace is assessed based on the overall running details of nodes in your workspace.

Assessment scope	Single workspace	All your workspaces
Illustration
Stability description	The health status for O&M stability can be excellent, good, medium, or poor. If high-risk or low-risk items are displayed, the health status of the workspace is poor. You must handle the risky items and optimize the performance of the workspace at the earliest opportunity.	You can select All My Workspaces in the top navigation bar of the Operation Center page to view, for all workspaces to which your account is added, the O&M stability information, the number of auto triggered instances, and the completion status of auto triggered instances. You can also click View Details in the Actions column of a workspace to view the O&M stability information about that workspace.

View information in the Focus On section

The Focus On section displays the O&M exceptions from the workspace and individual perspectives based on exception statistics of intelligent baselines and auto triggered nodes. You can view the overall information in your workspace or view only the information about nodes of which you are the owner to identify and handle exceptions at the earliest opportunity and ensure that your business is not affected.

Exception type	Description	References	Illustration
Baseline in Overtime	Counts the number of baseline instances that are in the overtime state on the current day. If a node in a baseline is still running when the committed completion time of the baseline arrives, an instance that is generated for the node enters the overtime state.	Manage baseline instances
Baseline in Alert	Counts the number of baseline instances that are in the alert state on the current day. You can specify an alert margin threshold to ensure that important data is generated as expected in scenarios in which dependencies between nodes in the baseline are complex. If the alert margin threshold is exceeded, nodes may fail to finish running as expected and exceptions may occur.	Configure an appropriate committed point in time and an appropriate alert margin threshold for a baseline
Error-related Events	Counts the number of error-related events that are generated on the current day. An error-related event is generated if a node in a baseline fails. In this case, the running of descendant nodes of the node may be blocked. You must handle the error at the earliest opportunity to prevent the node from affecting the running of its descendant nodes.	Manage events
Slowdown Events	Counts the number of slowdown events that are generated on the current day. A slowdown event is generated if the running duration of a node in a baseline is significantly longer than the average running duration of the node in the historical periods of time.	Manage events
Isolated Nodes	Counts the number of isolated nodes on the current day. If an auto triggered node does not have an ancestor node, the auto triggered node becomes an isolated node. In this case, the node cannot be automatically scheduled to run.	Scenario: Isolated node
Frozen Nodes	Counts the number of auto triggered nodes that are frozen on the current day. If an auto triggered node is frozen, instances that are generated for the node are also frozen. Frozen instances are not automatically scheduled, and the descendant instances of the frozen instances are blocked from running.	Node freezing and unfreezing
Expired Nodes	Counts the number of auto triggered nodes for which the effective period of scheduling expires. The system generates instances for an auto triggered node and runs the instances within the effective period of scheduling of the node. If the effective period of scheduling expires, the system does not generate or schedule auto triggered instances of the node.	None
Modified Nodes	Counts the number of auto triggered nodes whose configurations are modified on the current day. The modifications include code changes, scheduling configuration modifications, node status changes, and node ownership changes. The statistics on the following nodes are collected: nodes whose configurations are modified on the DataStudio page and that are deployed to the production environment after configuration modification and nodes whose configurations are modified in the production environment. Note If you select Mine in the upper-right corner of the Focus On section, only the number of modified nodes of which you are the owner is counted.	None
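
The Isolated Nodes rule in the preceding table can be illustrated with a short Python sketch. This is only an illustration under assumed inputs, not a DataWorks API: the ancestors mapping, the node names, and the optional roots argument (nodes that are expected to have no ancestor) are all hypothetical. The sketch treats a node as isolated when it has no ancestor node.

```python
from typing import Dict, Iterable, List, Set


def find_isolated_nodes(
    ancestors: Dict[str, Set[str]],
    roots: Iterable[str] = (),
) -> List[str]:
    """Return the nodes that have no ancestor node, sorted by name.

    ancestors maps each node name to the set of its direct ancestors.
    roots lists nodes that are expected to have no ancestor (an assumption
    made for this sketch, not something the dashboard exposes).
    """
    expected_roots = set(roots)
    return sorted(
        node
        for node, parents in ancestors.items()
        if not parents and node not in expected_roots
    )


# Hypothetical dependency data for illustration only.
sample = {
    "workspace_root": set(),
    "ods_orders": {"workspace_root"},
    "dwd_orders": {"ods_orders"},
    "orphan_report": set(),
}
isolated = find_isolated_nodes(sample, roots=["workspace_root"])
print(isolated)
```

In this sample, only orphan_report is reported, because it has no ancestor and is not listed as an expected root.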

View O&M information about auto triggered nodes and auto triggered instances

The following table lists the sections in which you can view O&M information about auto triggered nodes and auto triggered instances.

Section	Description	Illustration
Instances Status	Statistical scope: This section displays the distribution of auto triggered instances by status based on a specific data timestamp. You can view the distribution of auto triggered instances in the current workspace or the distribution of auto triggered instances of which you are the owner. The statistics in this section are updated when you load the page. Method to view information: You can click a sector in the donut chart to view the number and proportion of auto triggered instances in a specific state. Instances in the following states need attention: Failed: An auto triggered instance in this state fails to run. As a result, the running of its descendant instances may be blocked. Frozen: A frozen auto triggered instance is not automatically scheduled, and the running of its descendant instances is blocked. Slow-running: An auto triggered instance is considered slow-running if it is running and its running duration is at least 15 minutes longer than its average running duration during the last 10 days. If fewer than four historical auto triggered instances exist and the running duration of an auto triggered instance exceeds 30 minutes, the instance is also considered slow-running. Note: Only statistics on normal nodes are collected. Statistics on dry-run nodes and frozen nodes are not collected.
Instances Completion Status	Statistical scope: This section displays the completion status of auto triggered instances between 00:00 and 23:00 of the current day, including the number of auto triggered instances that are successfully run or are not run on the current day, the number of auto triggered instances that were successfully run or were not run on the previous day, the historical average number of auto triggered instances that are successfully run or are not run, and the fluctuation rate of these numbers. The statistics in this section are updated when you load the page. Display pattern: The statistics are displayed in a line chart. If the three lines deviate significantly from one another, an exception occurred during the corresponding period of time, and you must perform further checks and analysis. Node type: You can select a node type based on your business requirements. Historical Average: This metric presents the completion status of auto triggered instances that are successfully run in the previous 10 days.
Trend of Nodes and Instances	Statistical scope: This section displays the changing trends of the numbers of auto triggered nodes and auto triggered instances in the production environment within a specific period of time. You can specify a period of time within the previous 12 months in the upper-right corner of this section.
Distribution of Auto Triggered Nodes	Statistical scope: This section displays the number and proportion of auto triggered nodes counted by node type and scheduling cycle. The statistics in this section are updated when you load the page. Display pattern: The statistics are displayed in a donut chart. The types of nodes that can be displayed in the donut chart are limited. If the types of nodes that need to be displayed exceed the upper limit, the statistics are displayed after merging. Note If you select All My Workspaces in the top navigation bar of the Operation Center page, you can view the distribution of auto triggered nodes by workspace in this section.
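
The slow-running rule described for the Instances Status section can be sketched as follows. This Python snippet is a minimal illustration, not the way DataWorks implements the check. The function name, constants, and inputs are hypothetical. It also assumes that the 30-minute rule replaces the average-based rule when fewer than four historical runs exist, which the description does not state explicitly.

```python
from statistics import mean
from typing import Sequence

SLOW_MARGIN_MIN = 15     # minutes above the historical average
MIN_HISTORY = 4          # fewer historical runs than this uses the fallback rule
FALLBACK_LIMIT_MIN = 30  # fallback threshold in minutes


def is_slow_running(current_min: float, history_min: Sequence[float]) -> bool:
    """Apply the slow-running rule to an instance that is still running.

    current_min is the running duration so far, in minutes.
    history_min holds the running durations (in minutes) of the runs of
    the same instance during the last 10 days.
    """
    if len(history_min) < MIN_HISTORY:
        # Assumption: with too little history, only the 30-minute rule applies.
        return current_min > FALLBACK_LIMIT_MIN
    return current_min - mean(history_min) >= SLOW_MARGIN_MIN


# Hypothetical durations for illustration only.
print(is_slow_running(50, [30, 32, 28, 30]))  # 20 minutes above the average
print(is_slow_running(40, [30, 32, 28, 30]))  # only 10 minutes above the average
print(is_slow_running(31, [5, 5]))            # fallback rule, more than 30 minutes
```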

View information in the Resource Usage in Resource Group for Scheduling section

This section displays the resource usage of a resource group for scheduling and the changing trend of the number of instances that are run on the resource group over a specific period of time. The chart in this section shows the percentage of resources used by the instances that are run on the specified resource group.

Note

This section displays statistics for a maximum of seven days.
If the resource usage of a resource group exceeds 80%, we recommend that you scale out the resource group to prevent insufficient resources from affecting the running of nodes.
The resource usage and the number of instances that are run on the resource group are collected at the resource group level. For example, if multiple workspaces share the exclusive resource group for scheduling that you use, this section displays the resource usage and the changing trend of the number of instances that are run on the resource group across all of those workspaces.
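
The 80% recommendation in the preceding note can be expressed as a simple check. The following Python sketch is illustrative only: the function name, the usage_by_group mapping, and the group names are hypothetical, and usage is assumed to be given as a fraction between 0 and 1 for each resource group.

```python
from typing import Dict, List

SCALE_OUT_THRESHOLD = 0.80  # 80% usage, as recommended in the note


def groups_to_scale_out(usage_by_group: Dict[str, float]) -> List[str]:
    """Return resource groups whose usage exceeds the threshold, highest first."""
    over = [
        (name, usage)
        for name, usage in usage_by_group.items()
        if usage > SCALE_OUT_THRESHOLD
    ]
    over.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in over]


# Hypothetical usage values for illustration only.
sample_usage = {"group_a": 0.92, "group_b": 0.80, "group_c": 0.85}
candidates = groups_to_scale_out(sample_usage)
print(candidates)
```

A group at exactly 80% is not flagged in this sketch, because the note refers to usage that exceeds 80%.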

View the ranking of auto triggered instances on the previous day and the ranking of auto triggered instances with the highest error rate in the recent month

Ranking of Instances on Previous Day
This section ranks auto triggered instances based on their running duration, time spent in waiting for resources, and slow running duration on the previous day. Only the top 30 auto triggered instances are displayed. You can identify a time-consuming node based on the ranking, click the ID of the instance that is generated for the node to go to the instance details page, and then perform diagnostics on the instance to view the running situation of the instance.
Note
Slow Running: The difference between the running duration of an instance on the previous day and the average running duration of the instance over a historical period is collected. Instances are sorted by the difference in descending order.
Ranking of Auto Triggered Node Instances with Highest Error Rate in Recent Month
This section ranks nodes on which errors occurred within the recent month and displays the top 30 nodes. You can identify a node with a high error rate in the recent month, view the running details of the node, and then identify the cause of the error.
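
The Slow Running ranking described in the preceding note can be sketched as follows. This Python snippet is a minimal illustration under assumed inputs: the function name, the two duration mappings, and the instance IDs are hypothetical. It computes the difference between the running duration on the previous day and the historical average, sorts the instances by that difference in descending order, and keeps the top 30.

```python
from typing import Dict, List, Tuple

TOP_N = 30  # the dashboard displays only the top 30 instances


def rank_slow_running(
    previous_day_min: Dict[str, float],
    historical_avg_min: Dict[str, float],
    top_n: int = TOP_N,
) -> List[Tuple[str, float]]:
    """Rank instances by (previous-day duration - historical average), descending."""
    diffs = [
        (instance_id, previous_day_min[instance_id] - historical_avg_min[instance_id])
        for instance_id in previous_day_min
        if instance_id in historical_avg_min  # skip instances without history
    ]
    diffs.sort(key=lambda item: item[1], reverse=True)
    return diffs[:top_n]


# Hypothetical durations in minutes for illustration only.
previous_day = {"i1": 40.0, "i2": 25.0, "i3": 90.0}
historical_avg = {"i1": 30.0, "i2": 24.0, "i3": 50.0}
ranking = rank_slow_running(previous_day, historical_avg)
print(ranking)
```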

View O&M information about manually triggered nodes

On the Manually Triggered Node tab, you can view the O&M information about manually triggered workflows and instances in the workflows.

Overview

This section displays the number of manually triggered workflows that are run on a specified date, the number of instances generated for manually triggered nodes in those workflows, and the proportion of these instances that are successfully run.

View information in the Business Process Instance State Distribution and Workflow Ranking sections

Section	Description	Illustration
Business Process Instance State Distribution	In this section, a donut chart displays the distribution, by state, of the instances that are generated for manually triggered nodes in manually triggered workflows. You can click a sector to go to the details page of the instances in a specific state. On the details page, you can view details of the instances and handle exceptions that occur on the instances. Pay special attention to failed instances. This section displays statistics for a maximum of seven days. After you click Mine in this section, the distribution of instances that belong to your account and are in specific states is displayed.
Workflow Ranking	This section displays the ranking of the top 30 workflows with the longest running durations and highest failure rates on a specific date. Only the top 30 workflows are displayed. You can quickly find a manually triggered workflow that is time-consuming or has a high failure rate based on the ranking, and click the workflow ID to go to the details page of the workflow. On the details page, you can perform diagnostics on a specific instance in the directed acyclic graph (DAG) of the workflow to obtain the running status of the instances in the workflow.

View information in the Internal Task Distribution and Internal task leaderboard sections

Section	Description	Illustration
Internal Task Distribution	In this section, a donut chart displays the distribution of the number of nodes in Operation Center in real time. You can view the statistics from the Node Type or Owner dimension.
Internal task leaderboard	This section displays the ranking of the top 30 nodes with the longest running durations and highest failure rates in manually triggered workflows on a specific date. Only the top 30 nodes are displayed. You can quickly find a node that is time-consuming or has a high failure rate based on the ranking, and click the node ID to go to the details page of the manually triggered workflow to which the instance generated for the node belongs. On the details page, you can perform diagnostics on the instance in the DAG of the workflow to obtain the running status of the instance.

View O&M information about data synchronization nodes in Data Integration

On the Data Integration tab, you can view an overview of the data synchronization nodes in Data Integration and the resource usage of resource groups for Data Integration on the previous day and the current day.

View the resource usage of resource groups for Data Integration

The Status of Resource Group for Data Integration section displays the resource usage of the resource groups for Data Integration that are used by all data synchronization nodes in the current workspace. The information includes the number of nodes that are running on each resource group, the number of nodes that are waiting for resources in each resource group, and the resource usage and expiration time of each resource group. Based on these numbers and the resource usage, you can determine whether to scale a resource group for Data Integration in or out and allocate resources appropriately.

Note

For information about the operations that you can perform on an exclusive resource group for Data Integration, see Exclusive resource groups for Data Integration.
For information about the operations that you can perform on a serverless resource group, see the topics in the Use serverless resource groups directory.
The Data Integration tab of the O&M Dashboard page collects O&M statistics only on exclusive resource groups for Data Integration.

View the distribution of data synchronization nodes created in Data Integration by status

The Running Status Distribution section displays the distribution of data synchronization nodes by status in the current workspace in a donut chart. You can click a sector to go to the details page of the nodes in a specific state. On the details page, you can view details of the nodes and handle exceptions that occur on the nodes. Pay attention to nodes in the Abnormal and Running Failed states, because nodes in these states block the running of their descendant nodes.

View the statistics on batch synchronization nodes

The following table lists the sections in which you can view the statistics on batch synchronization nodes.

Section	Description	Illustration
Data Synchronization Progress	This section displays information about the data that is involved in batch synchronization within a specified period of time. The metrics include Total Amount of Data, Total Internet Traffic, and Total Data Records.
Statistics on Amount of Synchronized Data	This section displays the curves of the data that is read from or written to different data sources within a specified period of time. In this section, you can identify the nodes of a specific type of compute engine that synchronize a large amount of data and allocate additional resources to these nodes.
Latest Top 10 Tasks	This section displays the latest 10 instances that failed to run and the latest 10 instances that are successfully run. The statistics provide you with an overview of the latest instance status. You can quickly identify the cause of an instance failure and fix the error based on the error message.
Running Details of Synchronization Task	This section allows you to specify filter conditions to search for nodes. The filter conditions include Committed At, Task Status, and Node Name. You can click the ID of a node to view the details of the node.

View the statistics on real-time synchronization nodes

The following table lists the sections in which you can view the statistics on real-time synchronization nodes.

Section	Description	Illustration
Overview	This section displays the total data transmission speed and total recording speed of all real-time synchronization nodes in the current workspace.
Top 10 Tasks with Highest Latency	This section displays the top 10 nodes that have the highest latency. In this section, you can quickly identify nodes that have high latency and optimize the performance of the nodes at the earliest opportunity.
Alert Information	This section displays information about the latest alerts. This section allows you to quickly identify exceptions and handle the exceptions at the earliest opportunity.
Failover Information	This section displays information about failovers within a specified period of time. This section provides you with an overview of failovers. For information about failovers, see Manage real-time synchronization tasks.