
Getting started with Operation Center

Updated at: 2025-01-07 06:12

After a node is committed and deployed in the production environment, you can view the node and perform O&M operations on the node in Operation Center. For example, you can test the node and backfill data for the node. This topic describes the basic operations that you can perform on an auto triggered node in Operation Center. You can check whether the configurations of the node meet your requirements, backfill data in a historical period of time for the node, and configure an alert rule for the node to ensure that the node can be scheduled as expected in the future.

Prerequisites

A node named result_table is created and deployed by performing the operations described in Data development: Developers.

Note

This topic uses the result_table node to describe O&M operations. You can perform O&M operations on your node in the same manner.

Background information

In Operation Center, you can perform O&M operations on different types of nodes, such as auto triggered nodes, manually triggered nodes, and real-time synchronization nodes. You can also use different monitoring methods to monitor various objects such as nodes and resources that are used by the nodes. This helps you identify and handle exceptions at the earliest opportunity based on alerts and ensures efficient and stable data generation.

This topic describes only the basic operations that you can perform in Operation Center. You can also perform advanced O&M operations based on your business requirements.

For more information about Operation Center, see Overview.

Go to the Operation Center page

Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Development and Governance > Operation Center in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

Procedure

Phase 1: Test and verify the node

  1. Step 1: View the configurations of the node

    After you commit and deploy the node in the production environment, we recommend that you go to the Operation Center page to check whether the configurations of the node meet your requirements. The configurations include the scheduling parameters and the resource group for scheduling. If the configurations do not meet your requirements, modify the configurations and deploy the node again.

  2. Step 2: Test the node

    Check whether the node is run as expected in the production environment by using the smoke testing feature. If an error occurs during the node execution, handle the error at the earliest opportunity to ensure that the node is run as expected.

  3. Step 3: Backfill data in a historical period of time for the node

    You can backfill data in a historical period of time for the node.

  4. Step 4: View the auto triggered node instances generated for the node

    After you commit and deploy the node on a day, auto triggered node instances are generated for the node based on the scheduling cycle that you specified for the node. If you set the Instance Generation Mode parameter to Next Day for the node, the auto triggered node instances start to be scheduled on the next day. If you set the Instance Generation Mode parameter to Immediately After Deployment, the auto triggered node instances start to be scheduled on the current day. You can view the auto triggered node instances generated for the node and the status of the instances to check whether the node is scheduled as expected.

  5. Step 5: View data write results

    After you test the node or backfill data for the node, you can view data write results.

Phase 2: Monitor the node

  1. Step 6: Create a custom alert rule

    You can use the intelligent monitoring feature to configure an alert rule for the node based on your business requirements. The alert rule monitors the scheduling status of the node and helps ensure that the node is scheduled as expected.

  2. Step 7: Create a baseline (advanced feature)

    To ensure that the node with a higher priority generates data at the specified time, you can create and configure a baseline and add the node to the baseline. This way, if the system detects that the node may fail to finish running before the specified time, the system sends you a notification that describes the exception about the node. This helps you identify and handle the exception at the earliest opportunity.

  3. Step 8: Create a custom alert rule for a resource group and associate an automated O&M rule with the custom alert rule

    You can create a custom alert rule for an exclusive resource group. When you configure the custom alert rule, you can specify the alert conditions such as the resource usage of the exclusive resource group and the maximum number of instances that are waiting for resources in the resource group. If the custom alert rule is triggered, the system performs O&M operations based on the automated O&M rule that you associate with the custom alert rule.

Step 1: View the configurations of the node

After you commit and deploy the node in the production environment, we recommend that you go to the Operation Center page to check whether the configurations of the node meet your requirements. The configurations include the scheduling parameters and dependencies.

  1. Go to the Operation Center page.

  2. Find the desired node.

    1. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes.

    2. On the page that appears, search for the node.

  3. View the node details.

    1. Click the node name. The directed acyclic graph (DAG) of the node appears.

    2. Click Show Details in the lower-right corner of the DAG of the node to view the node details.

Note

In this example, you can find the deployed result_table node in the node list and check whether the scheduling parameters and the resource group for scheduling are correctly configured for the node.
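If you prefer to script this check, the DataWorks OpenAPI exposes node metadata through the GetNode operation. The following Python sketch uses the generic CommonRequest interface of the Alibaba Cloud core SDK (aliyun-python-sdk-core). The credentials, region, and node ID are placeholders, and you should verify the operation's parameters against the DataWorks API reference for your SDK version.

    # A minimal sketch: query a deployed node's scheduling configuration through
    # the DataWorks OpenAPI (version 2020-05-18). The credentials, region, and
    # node ID below are placeholders for illustration.
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-shanghai')

    request = CommonRequest()
    request.set_domain('dataworks.cn-shanghai.aliyuncs.com')
    request.set_version('2020-05-18')
    request.set_action_name('GetNode')
    request.set_method('POST')
    request.add_query_param('NodeId', '700000000001')  # hypothetical node ID of result_table
    request.add_query_param('ProjectEnv', 'PROD')      # query the production environment

    # The JSON response includes fields such as the cron expression and the
    # resource group, which you can compare with your expected settings.
    print(client.do_action_with_exception(request))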

Step 2: Test the node

You can use the smoke testing feature to check whether the node is run as expected in the production environment. The code of the node is run during the test.

  1. Open the Test dialog box.

    You can use one of the following methods to open the Test dialog box:

    • Method 1: In the node list, find the desired node and click Test in the Actions column.

    • Method 2: In the DAG of the node, right-click the node name and select Test.

  2. In the Test dialog box, configure the data timestamp and the time at which the node is run and click OK.

    When you test the node, a test instance is generated for the node. You can view the status of the test instance on the Test Instances page.

    Note

In this example, the smoke testing feature is used to check whether the result_table node is run as expected in the production environment. You can test the node and view the execution status of the test instance generated for the node by performing the operations shown in the following figure.
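Smoke testing can also be triggered outside the console through the RunSmokeTest operation of the DataWorks OpenAPI. The following Python sketch is a minimal example under assumptions: placeholder credentials, the cn-shanghai region, and a hypothetical node ID. Verify the parameter names against the current API reference.

    # A minimal sketch: trigger smoke testing for a deployed node through the
    # DataWorks OpenAPI. Credentials, region, and IDs are placeholders.
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-shanghai')

    request = CommonRequest()
    request.set_domain('dataworks.cn-shanghai.aliyuncs.com')
    request.set_version('2020-05-18')
    request.set_action_name('RunSmokeTest')
    request.set_method('POST')
    request.add_query_param('ProjectEnv', 'PROD')           # run the test in production
    request.add_query_param('NodeId', '700000000001')       # hypothetical node ID
    request.add_query_param('Bizdate', '2024-09-18')        # data timestamp for the test
    request.add_query_param('Name', 'smoke_test_result_table')  # hypothetical test name

    # The response identifies the generated test run; you can view the test
    # instance on the Test Instances page in Operation Center.
    print(client.do_action_with_exception(request))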

Step 3: Backfill data in a historical period of time for the node

After you develop, commit, and deploy the node in the production environment, the node is scheduled based on the scheduling settings. If you want to calculate data in a historical period of time for the node again, you can use the data backfill feature to backfill the data for the node.

  1. Go to the Backfill Data panel.

    You can use one of the following methods to go to the Backfill Data panel:

    • Method 1: In the node list, find the desired node and click Backfill Data in the Actions column.

    • Method 2: In the DAG of the node, right-click the node name and select Run.

  2. Select a mode in which you want to backfill data for the node.

    You can select a data backfill mode based on your business requirements.

    • Manually Select

      Description: Select one or more nodes as root nodes. Then, manually select the specific descendant nodes of the root nodes for which you want to backfill data.

      Note
      • The original plans of backfilling data for the current node, backfilling data for the current node and its descendant nodes, and backfilling data in advanced mode are compatible with this method.
      • You can select up to 500 root nodes and up to 2,000 nodes in total. The total number includes the root nodes and their descendant nodes.

      Scenarios:
      • Backfill data for the current node and its descendant nodes at a time.
      • Backfill data for multiple nodes that may not have dependencies with each other at a time.

    • Select by Link

      Description: Select a start node as the root node and one or more end nodes. The system then automatically determines that all nodes on the links from the start node to the end nodes require data backfilling.

      Scenario: Perform end-to-end data backfilling for nodes for which complex dependencies are configured.

    • Select by Workspace

      Description: Select a node as the root node, and determine the nodes for which you want to backfill data based on the workspaces to which the descendant nodes of the root node belong.

      Note
      • The original plan of backfilling data for massive nodes is compatible with this method. You can select up to 20,000 nodes.
      • You cannot configure a node blacklist.

      Scenario: The descendant nodes of the current node belong to different workspaces, and you want to backfill data for these descendant nodes.

    • Specify Task and All Descendant Tasks

      Description: Select a root node. The system then automatically determines that the root node and all its descendant nodes require data backfilling.

      Important: You can view the nodes that are triggered to run only after the data backfill task starts running. Proceed with caution.

      Scenario: Backfill data for a root node and all its descendant nodes.

  3. Configure the data backfill parameters.

    For example, you can configure the data timestamp and the node for which you want to backfill data based on your business requirements. The data backfill parameters vary based on the data backfill mode. For more information, see Backfill data and view data backfill instances (new version).

In this example, the data backfill mode Backfill Data for Current Node is selected. Data generated in the time period from 00:00 to 01:00 every day from 2024-09-17 to 2024-09-19 is backfilled for the result_table node. You can backfill data for the node by performing the operations shown in the following figure.

Note

When the data backfill task runs, the scheduling system replaces the variables in the code of the node with actual values based on the scheduling parameters and the data timestamp that you specified.

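The console steps above can also be scripted. The DataWorks OpenAPI provides the RunCycleDagNodes operation for creating data backfill tasks. The following Python sketch is a hedged example: the node ID, task name, and date format are placeholder assumptions, so confirm the parameter set and value formats against the API reference before use.

    # A minimal sketch: start a data backfill task for a node through the
    # DataWorks OpenAPI. IDs, dates, and credentials are placeholders.
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-shanghai')

    request = CommonRequest()
    request.set_domain('dataworks.cn-shanghai.aliyuncs.com')
    request.set_version('2020-05-18')
    request.set_action_name('RunCycleDagNodes')
    request.set_method('POST')
    request.add_query_param('ProjectEnv', 'PROD')
    request.add_query_param('RootNodeId', '700000000001')        # hypothetical node ID
    request.add_query_param('StartBizDate', '2024-09-17 00:00:00')  # first data timestamp
    request.add_query_param('EndBizDate', '2024-09-19 00:00:00')    # last data timestamp
    request.add_query_param('Name', 'backfill_result_table')    # hypothetical task name
    request.add_query_param('Parallelism', 'false')             # run data timestamps serially

    # The response contains the IDs of the generated data backfill instances.
    print(client.do_action_with_exception(request))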

Step 4: View the auto triggered node instances generated for the node

After you commit and deploy the node on a day, auto triggered node instances are generated for the node based on the scheduling cycle that you configured for the node. If you set the Instance Generation Mode parameter to Next Day for the node, the auto triggered node instances start to be scheduled on the next day. If you set the Instance Generation Mode parameter to Immediately After Deployment, the auto triggered node instances start to be scheduled on the current day. You can view the auto triggered node instances generated for the node to check whether the node is scheduled as expected.

  1. Go to the Auto Triggered Instances page.

    In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Instances.

  2. View the auto triggered node instances generated for the node.

    Check whether the auto triggered node instances are generated for the node based on the scheduling settings and check whether the instances are run as expected. For more information about auto triggered node instances, see View auto triggered instances.

    If an auto triggered node instance generated for the node is in the Pending (Ancestor) state, you can troubleshoot the issue by performing the following operations:

    1. Use the upstream analysis feature provided in the DAG of the instance to quickly identify ancestor instances that block the running of the current instance.

    2. Use the intelligent diagnosis feature to diagnose failure causes or related issues of the ancestor instances. The intelligent diagnosis feature can also be used to quickly troubleshoot issues if dependencies between the current instance and ancestor instances are complex. This improves O&M efficiency.

In this example, you can view the status of the auto triggered node instances generated for the result_table node on September 19, 2024. The result_table node is scheduled by hour.
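Instance status can also be checked programmatically with the ListInstances operation of the DataWorks OpenAPI. A minimal sketch follows, with placeholder credentials, project ID, and node ID; verify the parameter names and status values against the API reference for your SDK version.

    # A minimal sketch: list the auto triggered instances generated for a node
    # on a given data timestamp. IDs and credentials are placeholders.
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-shanghai')

    request = CommonRequest()
    request.set_domain('dataworks.cn-shanghai.aliyuncs.com')
    request.set_version('2020-05-18')
    request.set_action_name('ListInstances')
    request.set_method('POST')
    request.add_query_param('ProjectEnv', 'PROD')
    request.add_query_param('ProjectId', '12345')          # hypothetical DataWorks project ID
    request.add_query_param('NodeId', '700000000001')      # hypothetical node ID
    request.add_query_param('Bizdate', '2024-09-19')       # data timestamp to inspect

    # Each returned instance carries a status field (for example, SUCCESS or
    # FAILURE) that tells you whether the node ran as expected.
    print(client.do_action_with_exception(request))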

Step 5: View data write results

After you test the node or backfill data for the node, you can use one of the following methods to view data write results:

  • View data write results in Data Map.

    You can go to the homepage of Data Map, search for the desired table, and go to the details page of the table to check whether the data written to the table is correct. For more information about how to search for a table, see Search for tables. For more information about how to view the details of a table, see View the details of a table.

  • Create an ad hoc query in the Ad Hoc Query pane of the DataStudio page to view data write results.

    If you want to query data or run SQL code only in the development environment, which is the DataStudio page, you can create an ad hoc query. For example, you can check whether the running result of test code is consistent with the expected result and whether the code is valid. This way, you do not need to deploy the code to the production environment or perform operations on compute engines in the production environment.

Note
  • By default, a RAM user does not have the required permissions to query MaxCompute tables in the production environment. If you want to query a MaxCompute table in the production environment as a RAM user, you can go to the details page of the table in Data Map to request the query permissions. For more information, see Request permissions on tables.

  • When the node is run on the DataStudio page, data is written to a project in the development environment. After the node is deployed in the production environment, data is written to a project in the production environment. When you query table data, confirm the environment of the project to which the table belongs. You can go to the Computing Resource page in DataStudio to view the information about a project.

  • MaxCompute allows you to access tables across projects. For example, you can access tables across MaxCompute projects that are associated with your workspace, and you can access tables in a project in the production environment from the development environment. Some other types of compute engines do not allow you to access tables across projects. The features of a compute engine determine whether you can access tables across projects.

In this example, the result_table node is in the workspace that corresponds to the MaxCompute project named mc_test_project in the production environment. You can create an ad hoc query node of the ODPS SQL type and execute SQL statements to query the partition data in the mc_test_project.result_table table.
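Besides the console, you can run the same verification query from any Python environment with PyODPS. The following is a minimal sketch, assuming the mc_test_project project, a hypothetical partition column named dt, and placeholder credentials and endpoint.

    # A minimal sketch: verify data written to the production table with PyODPS.
    # The credentials, endpoint, and partition column name are placeholders; a
    # RAM user needs query permissions on the production table first.
    from odps import ODPS

    o = ODPS('<access_key_id>', '<access_key_secret>',
             project='mc_test_project',
             endpoint='<maxcompute_endpoint>')

    # dt is a hypothetical partition column used for illustration.
    sql = "SELECT * FROM result_table WHERE dt = '20240919' LIMIT 10;"
    with o.execute_sql(sql).open_reader() as reader:
        for record in reader:
            print(record)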

Step 6: Create a custom alert rule

After the node is tested and verified, you can create a custom alert rule for the node to monitor the status of the node. If an exception occurs while the node is running, the system sends you an alert notification based on the alert configurations. This helps you identify and handle the exception at the earliest opportunity and ensures that the node can be scheduled as expected in the future.

  1. Go to the Operation Center page.

  2. In the left-side navigation pane, choose Alarm > Rule Management.

  3. Create a custom alert rule.

    1. On the page that appears, click Create Custom Rule.

    2. Configure the parameters for the rule.

      You can configure the custom alert rule based on your business requirements. For more information, see Create a custom alert rule.

      In this example, a custom alert rule is configured for the result_table node. An alert notification is sent if the node fails to run. You can configure the custom alert rule based on your business requirements by configuring the parameters shown in the following figure. The Test rules custom alert rule is triggered if the result_table node fails to run. An alert notification is sent to the node owner by text message. The alert notification can be sent a maximum of three times at an interval of 30 minutes.

      Note

      You must configure the information about the alert contact in advance. For more information, see Configure and view alert contacts.
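Custom alert rules can also be managed through the OpenAPI, where they are called reminds. The following Python sketch uses the CreateRemind operation to reproduce the rule configured above; the node ID is a placeholder, and the exact parameter set may differ by SDK version, so check the API reference.

    # A minimal sketch: create a custom alert rule (a "remind" in API terms)
    # that notifies the node owner by SMS when the node fails to run.
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-shanghai')

    request = CommonRequest()
    request.set_domain('dataworks.cn-shanghai.aliyuncs.com')
    request.set_version('2020-05-18')
    request.set_action_name('CreateRemind')
    request.set_method('POST')
    request.add_query_param('RemindName', 'Test rules')   # rule name used in this example
    request.add_query_param('RemindUnit', 'NODE')         # monitor individual nodes
    request.add_query_param('NodeIds', '700000000001')    # hypothetical node ID
    request.add_query_param('RemindType', 'ERROR')        # trigger when the node fails
    request.add_query_param('AlertUnit', 'OWNER')         # notify the node owner
    request.add_query_param('AlertMethods', 'SMS')        # send text messages

    print(client.do_action_with_exception(request))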

Step 7: Create a baseline (advanced feature)

To ensure that the node generates data at the specified time, you can create and configure a baseline and add the node to the baseline. Then, you can configure the priority and committed completion time for the baseline. DataWorks estimates the completion time of the node based on how the node ran in a historical period of time. Nodes in a baseline with a higher priority can preferentially use scheduling resources. If the system detects that the node may fail to finish running before the committed completion time, the system sends you an alert notification. You can troubleshoot the issue based on the alert.

  1. Go to the Operation Center page.

  2. In the left-side navigation pane, click Smart Baseline.

  3. Create a baseline.

    1. On the Baselines tab, click Create Baseline.

    2. Configure the parameters for the baseline.

      You can configure the baseline based on your business requirements. For more information, see the Create a baseline section of the "Manage baselines" topic.

      In this example, an hour-level baseline is configured and the result_table node is added to the baseline. The baseline can monitor data generation of the node each hour. You can configure the baseline by configuring the parameters shown in the following figure. Descriptions of some parameters:

      • Priority: A larger value indicates a higher priority. A node with a higher priority in a baseline can preferentially use scheduling resources when resources are insufficient.

      • Estimated Finish Time: The system estimates the time at which a node finishes running based on the completion time of the node in a historical period of time.

      • Committed Finish Time: You can specify the latest time at which a node must generate data. You can configure this parameter based on your business requirements and the completion time of the node in a historical period of time.

      • Alert Margin Threshold: You can configure this parameter based on the Committed Finish Time parameter. The margin reserves time for you to handle node exceptions so that the node can still finish running by the committed completion time.

        Note

        The time difference between the alert time and committed completion time must be at least 5 minutes.

      If the instances generated for the result_table node cannot finish running by 30 minutes past each hour, the Test Baselines baseline alert is triggered, and an alert notification is sent to the node owner by text message. The alert notification can be sent a maximum of three times at an interval of 30 minutes.

Step 8: Create a custom alert rule for a resource group and associate an automated O&M rule with the custom alert rule

If you use an exclusive resource group to run nodes, you can create a custom alert rule for the resource group and associate an automated O&M rule with the custom alert rule to enable automated O&M based on your business requirements. When you configure the custom alert rule, you can specify the alert conditions such as the resource usage of the exclusive resource group and the maximum number of instances that are waiting for resources in the resource group. If the custom alert rule is triggered, the system performs O&M operations based on the automated O&M rule that you associate with the custom alert rule.

The automated O&M feature works by associating an automated O&M rule with a custom alert rule that is configured for an exclusive resource group. You can specify the alert conditions for nodes that are run on the exclusive resource group in the custom alert rule and configure the automated O&M rule based on your business logic. If an auto triggered node instance that meets the filter conditions specified in the automated O&M rule hits the alert conditions, the custom alert rule is triggered and automated O&M operations are performed.

Note
  • Only exclusive resource groups for scheduling support the automated O&M feature.

  • To prevent slow node running due to insufficient resources, you can run your nodes on an exclusive resource group for scheduling. For information about how to change the resource group used by nodes, see General reference: Change the resource groups used by tasks.

  1. Go to the Operation Center page.

  2. Create a custom alert rule for a resource group.

    1. In the left-side navigation pane, choose Alarm > Rule Management.

    2. Create and configure a custom alert rule for a resource group.

      The alert rule configurations for a resource group are similar to those for a node. The only difference is that you need to set the Object Type parameter to Exclusive Resource Group for Scheduling. For more information, see Create a custom alert rule.

      In this example, a custom alert rule is configured for the Exclusive_Scheduling_Resource resource group to monitor the resource usage in the resource group. The following figure shows the parameters that you need to configure.

      Note

      This topic provides only a configuration example for you. You can configure a custom alert rule for your resource group based on your business requirements.

      The Resource group monitoring rules alert rule is triggered when the resource usage of the Exclusive_Scheduling_Resource resource group exceeds 90% for 10 minutes. The system sends an alert notification to a specified alert contact by text message. The alert notification can be sent a maximum of three times.

  3. Configure an automated O&M rule based on the custom alert rule that is configured for the resource group.

    1. In the left-side navigation pane, choose O&M Assistant > Automatic.

    2. On the Rules tab of the page that appears, click Create Rule.

    3. Configure the parameters for the rule.

      You can configure the automated O&M rule based on your business requirements. For more information, see Automated O&M.

      In this example, an automated O&M rule named Automatic_test is created and associated with the Resource group monitoring rules custom alert rule that is configured for the exclusive resource group for scheduling named Exclusive_Scheduling_Resource. If the custom alert rule is triggered, DataWorks performs O&M operations on the instances that meet the filter conditions specified in the automated O&M rule. The following figure shows the parameters that you need to configure. Descriptions of some parameters:

      • Associated Monitoring Rule: You can associate the current automated O&M rule only with a custom alert rule that is configured for an exclusive resource group for scheduling. You must create a custom alert rule and set the Object Type parameter to Exclusive Resource Group for Scheduling in advance.

      • O&M Operation: You can set this parameter only to Terminate Running Instance. After the automated O&M rule is triggered, node instances that meet the filter conditions are terminated.

      In this example, when the resource usage in the Exclusive_Scheduling_Resource resource group is greater than 90% for 10 minutes, DataWorks terminates the instances whose priority is 1 and scheduling cycle is hour or minute from all auto triggered node instances, test instances, and data backfill instances that are run on the resource group in a specified workspace.
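For reference, the Terminate Running Instance action that an automated O&M rule performs corresponds to stopping an instance, which you can also do manually through the StopInstance operation of the DataWorks OpenAPI. A minimal sketch with placeholder IDs follows; verify the parameters against the API reference.

    # A minimal sketch: manually terminate a running instance, the same O&M
    # operation that an automated O&M rule can perform. IDs are placeholders.
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-shanghai')

    request = CommonRequest()
    request.set_domain('dataworks.cn-shanghai.aliyuncs.com')
    request.set_version('2020-05-18')
    request.set_action_name('StopInstance')
    request.set_method('POST')
    request.add_query_param('InstanceId', '800000000001')  # hypothetical instance ID
    request.add_query_param('ProjectEnv', 'PROD')

    print(client.do_action_with_exception(request))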

Manage and control O&M operations (advanced feature)

In Operation Center, you can perform operations on a node. For example, you can freeze, unfreeze, backfill data for, or undeploy a node. The different types of operations can be considered as extension point events. You can use extension point events with extensions to customize the processing logic for and O&M operations on nodes. For more information, see Extension overview and Trigger event checking in Operation Center.

What to do next

You can configure data quality monitoring rules for the table data that is generated by the node to ensure that the data output meets your expectations. For more information, see Data Quality.
