Automated governance of materialized views based on intelligent recommendations -

DataWorks supports automated governance of materialized views based on intelligent recommendations. This is an intelligent and automated solution for frequent big data computing tasks that contain a large number of similar subqueries. If you enable the intelligent recommendation feature on materialized views, DataWorks can automatically identify and classify similar subqueries in MaxCompute and generate recommendations for creating materialized views. You can create a materialized view with a few clicks based on your business requirements. This significantly improves computing efficiency and saves computing resources.

Features

The feature of automated governance of materialized views in DataWorks relies on the materialized view feature of MaxCompute. You can use the feature of automated governance of materialized views to quickly scan and identify a large number of similar subqueries that exist in MaxCompute SQL statements in DataWorks scheduling scenarios. DataWorks then uses similar subqueries to recommend the view generation SQL statements that can be used to efficiently create materialized views.

After a view is created, DataWorks synchronously creates a node that generates the view data and deploys the node to the production environment for scheduling. Nodes that originally have similar subqueries in DataWorks are uniformly specified as the descendant nodes of the node that generates the view data. DataWorks preferentially schedules the node that generates the view data based on the scheduling dependencies. When the descendant nodes use the public subquery, the nodes query data from the materialized view that has generated data. This reduces the repetitive calculation of data for governance purposes.

The feature of automated governance of materialized views in DataWorks is available in the following regions: China (Hangzhou), China (Shanghai), China (Shenzhen), China (Beijing), and China (Chengdu).

Prerequisites

A MaxCompute data source is added. For more information, see Add a MaxCompute data source.
The intelligent analysis feature on materialized views is enabled in MaxCompute. For more information, see Recommendations and management of materialized views.

Use process

Enable the intelligent recommendation feature on materialized views.
After you enable the intelligent recommendation feature on materialized views for a MaxCompute project that is associated with DataWorks Data Studio, the recommendations for creating materialized views are generated and displayed on the Materialized View Recommendation tab if the following conditions are met:
- An auto triggered task is run for at least three consecutive days.
- The number of rows of input data in a public subquery is greater than 1,000,000.
- The public subquery contains operators that are related to data reorganization, such as join or aggregation operators.
View the recommendations and create a materialized view.
You can check whether the recommendations are valid and create and manage a materialized view.
If you configure the parameters in the Create Materialized View Refresh Node and Add Materialized View Refresh Node as Ancestor Node sections when you create a materialized view, the created materialized view takes effect in scheduling scenarios in DataWorks.
A materialized view refresh node is automatically added as an ancestor node of nodes that have similar computing logic. When an ancestor node or the node that generates the source table generates new output, the materialized view is refreshed to produce the latest data for descendant nodes to query.
Manage materialized views.
You can view the usage of the materialized views in the current workspace and analyze or delete the materialized views whose output data does not meet expectations.

Procedure

Step 1: Enable or disable the intelligent recommendation feature on materialized views

Note

Only users that are assigned the Workspace Administrator role can enable or disable the intelligent recommendation feature on materialized views.

Go to the Data Governance Center page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Governance Center. On the page that appears, click Go to Data Governance Center.
In the top navigation bar of the Data Governance Center page, click Workbench. In the left-side navigation pane, click Materialized View.
On the Materialized View page, select a workspace from the Workspace drop-down list and click Settings for Materialized View Recommendation. In the Settings for Materialized View Recommendation dialog box, turn on the switch in the Intelligent Materialized View Recommendation column for the projects for which you want to enable the intelligent recommendation feature on materialized views.

Step 2: View the recommended materialized views

The materialized view recommendation module can automatically identify materialized views that can be created and the associated computing tasks. The related owners can determine whether the materialized views are valid and manage their creation process.

Important

To ensure accurate recommendations, the system generates and displays relevant recommendations on the Materialized View Recommendation tab only when the following conditions are met after you enable the intelligent recommendation feature on materialized views:

An auto triggered task is run for at least three consecutive days.
The number of rows of input data in a public subquery is greater than 1,000,000.
The public subquery contains operators that are related to data reorganization, such as join or aggregation operators.

Select a workspace from the Workspace drop-down list in the top navigation bar and click the Materialized View Recommendation tab.
Modify the filter conditions, such as Project and Analysis Time Interval, to check whether recommendations for creating materialized views are generated.
If recommendations for creating materialized views are available in the time period specified by the Analysis Time Interval parameter, you can view the recommendations.
Parameters:
- Materialized View Recommendation: the identifier of a materialized view recommendation task. The identifier includes the summary information about the public subquery. Format of an identifier: Aggregate: xx; Inputs: xx, xx;.
  - Aggregate: indicates whether the recommended public subquery contains the aggregate operation.
  - Inputs: lists the names of all source tables used by the public subquery.
- Number of Possibly Affected SQL Statements: the number of jobs that use the public subquery.
- Possibly Affected Node: the number of scheduling nodes that use the public subquery.
- Recommendation Rating: the extent of recommendation, which is determined by the repetition of subqueries, subquery complexity, and the number of input data records.
Click the identifier of a materialized view recommendation task in the Materialized View Recommendation column to view the details of the public subquery. The details include the source table information, the public subquery, jobs, and scheduling tasks.

Step 3: Create a materialized view

On the Materialized View Recommendation tab, determine whether to create a materialized view based on your business requirements and recommendation details. To create a materialized view, find the recommendation based on which you want to create a materialized view and click Create Materialized View in the Actions column.

On the Create Materialized View page, configure the parameters for the materialized view.

In the Create Materialized View section, create a materialized view refresh node and generate a materialized view based on the public subquery of input tasks.

Parameter	Description
Input Table	The input tables, which are automatically obtained and do not need to be modified.
Create Materialized View or Not	The first time you create a materialized view, the default value Create Materialized View is used. You do not need to change the value. Note If a materialized view for which the same input tables are selected exists, you can set the parameter to Select Existing Materialized View based on your business requirements.
Materialized View Name	The name of the materialized view. You can specify a name based on your business requirements.
Lifecycle	The lifecycle of the materialized view. Valid values: 1 Day, 7 Days, and Custom Days.
Refresh Type	The refresh mode of the materialized view. Refresh at Fixed Interval If the table from which you want to read data is the output of a task in DataWorks, Time Management Cron Mode is selected by default. Otherwise, Refresh at Fixed Interval is selected. Time Management Cron Mode The materialized view is refreshed at the most recent output time of the node that generates the table from which you want to read data. Do Not Refresh
Preview Statement	The SQL statements of the materialized view.

Optional. In the Create Materialized View Refresh Node section, create a materialized view refresh node. When the node that generates the source table generates new output, the materialized view refresh node dynamically refreshes the materialized view to materialize the latest data.
Configure the Materialized View Refresh Node Name and Node Running Timeout Period parameters.
Optional. In the Add Materialized View Refresh Node as Ancestor Node section, add the materialized view refresh node as an ancestor node of nodes that have similar computing logic to increase the query hit rate of incremental data.
The computing tasks displayed in this section include tasks in the current workspace and tasks in other workspaces. You can click View Details to view the details of the computing tasks.

Click Create and Execute. DataWorks starts to create a materialized view. You can view the status of the materialized view in the Operation Status column on the Materialized View Recommendation tab, or click View Details in the Actions column to view the creation progress.
- Valid values of the Operation Status parameter:
  - To Be Created: The materialized view is not created by using DataWorks.
  - Created: The materialized view is created.
  - Creating: The process of creating the materialized view by using DataWorks is initiated but the creation is not finished.
  - Create Failed: The process of creating the materialized view by using DataWorks is initiated but the creation failed.
- You can click View Details to view the details of the creation process of the materialized view.

If you configure the parameters in the Create Materialized View Refresh Node and Add Materialized View Refresh Node as Ancestor Node sections when you create a materialized view, a materialized view refresh node is added as an ancestor node of nodes that have similar computing logic. The owner of the materialized view refresh node is the account that is used to create the materialized view.

Step 4: Manage materialized views

You can view the usage of the materialized views in the current workspace and analyze or delete the materialized views whose output data does not meet expectations.

Select a workspace from the Workspace drop-down list in the top navigation bar and click the Materialized View Management tab.
Modify the filter conditions, such as Project, to view the materialized views that are created by using the recommendations.
Important
If your MaxCompute project uses the pay-as-you-go billing method for its quota, the total saved compute capacity is calculated by using the following formula: Amount of input data × SQL complexity. The fee for MaxCompute SQL jobs is calculated by using the following formula: Compute capacity × Unit price. For more information, see Computing pricing.
- Click the name of a materialized view in the Materialized View column to view the details of the materialized view.
- Click Delete in the Actions column of a materialized view to delete the materialized view.

Configuration example

Note

The test data used in this example is from a MaxCompute public dataset. For more information, see TPC-DS data. You can use the tables related to your business data for configuration.

Prepare data

Add a MaxCompute data source and associate the data source with DataWorks Data Studio.
Create a zero load node named Materialized_View_Test in Data Studio and specify the zero load node as the root node of descendant nodes.

In this example, create four ODPS nodes in Data Studio.

The dws_customer_info_1 and dws_customer_info_2 nodes use the public subquery and obtain data from the ods_mc_tpc_customer and ods_mc_tpc_customer_address nodes.

ods_mc_tpc_customer

The parameter name of a scheduling parameter is bday and the value of the scheduling parameter is $[yyyymmdd].

CREATE TABLE IF NOT EXISTS ods_mc_tpc_customer
(
    c_customer_sk           BIGINT NOT NULL
    ,c_customer_id          CHAR(16) NOT NULL
    ,c_current_cdemo_sk     BIGINT
    ,c_current_hdemo_sk     BIGINT
    ,c_current_addr_sk      BIGINT
    ,c_first_shipto_date_sk BIGINT
    ,c_first_sales_date_sk  BIGINT
    ,c_salutation           CHAR(10)
    ,c_first_name           CHAR(20)
    ,c_last_name            CHAR(30)
    ,c_preferred_cust_flag  CHAR(1)
    ,c_birth_day            BIGINT
    ,c_birth_month          BIGINT
    ,c_birth_year           BIGINT
    ,c_birth_country        VARCHAR(20)
    ,c_login                CHAR(13)
    ,c_email_address        CHAR(50)
    ,c_last_review_date_sk  CHAR(10)
)
COMMENT 'TABLE COMMENT'
PARTITIONED BY (ds STRING COMMENT 'Partition')
LIFECYCLE 90;

SET odps.namespace.schema=TRUE;
INSERT OVERWRITE TABLE ods_mc_tpc_customer PARTITION(ds='${bday}')
select * from BIGDATA_PUBLIC_DATASET.TPCDS_10T.customer;

ods_mc_tpc_customer_address

The parameter name of a scheduling parameter is bday and the value of the scheduling parameter is $[yyyymmdd].

CREATE TABLE IF NOT EXISTS ods_mc_tpc_customer_address
(
    ca_address_sk           BIGINT NOT NULL
    ,ca_address_id          CHAR(16) NOT NULL
    ,ca_street_number       CHAR(10)
    ,ca_street_name         VARCHAR(60)
    ,ca_street_type         CHAR(15)
    ,ca_suite_number        CHAR(10)
    ,ca_city                VARCHAR(60)
    ,ca_county              VARCHAR(30)
    ,ca_state               CHAR(2)
    ,ca_zip                 CHAR(10)
    ,ca_country             VARCHAR(20)
    ,ca_gmt_offset          DECIMAL(5,2)
    ,ca_location_type       CHAR(20)
)
COMMENT 'TABLE COMMENT'
PARTITIONED BY (ds STRING COMMENT 'Partition')
LIFECYCLE 90;

SET odps.namespace.schema=TRUE;
INSERT OVERWRITE TABLE ods_mc_tpc_customer_address PARTITION(ds='${bday}')
select * from BIGDATA_PUBLIC_DATASET.TPCDS_10T.customer_address;

dws_customer_info_1

The parameter name of a scheduling parameter is bday and the value of the scheduling parameter is $[yyyymmdd].

CREATE TABLE IF NOT EXISTS dws_customer_info_1
(
    c_customer_sk           BIGINT NOT NULL
    ,c_customer_id          CHAR(16) NOT NULL
    ,c_first_name           CHAR(20)
    ,c_last_name            CHAR(30)
    ,c_email_address        CHAR(50)
    ,ca_suite_number        CHAR(10)
    ,ca_city                VARCHAR(60)
    ,ca_county              VARCHAR(30)
    ,ca_state               CHAR(2)
)
COMMENT 'TABLE COMMENT'
PARTITIONED BY 
(
    ds                      STRING COMMENT 'Partition'
)
LIFECYCLE 90
;

INSERT OVERWRITE TABLE dws_customer_info_1 PARTITION (ds = '${bday}')
SELECT  t02.c_customer_sk
        ,t02.c_customer_id
        ,t02.c_first_name
        ,t02.c_last_name
        ,t02.c_email_address
        ,t03.ca_suite_number
        ,t03.ca_city
        ,t03.ca_county
        ,t03.ca_state
FROM    (
            SELECT  *
            FROM    ods_mc_tpc_customer_address 
            WHERE   ds = '${bday}'
        ) t03
JOIN    (
            SELECT  *
            FROM    ods_mc_tpc_customer
            WHERE   ds = '${bday}'
        ) t02
ON      t03.ca_address_sk = t02.c_current_addr_sk
;

dws_customer_info_2

The parameter name of a scheduling parameter is bday and the value of the scheduling parameter is $[yyyymmdd].

CREATE TABLE IF NOT EXISTS dws_customer_info_2
(
    c_customer_sk           BIGINT NOT NULL
    ,c_customer_id          CHAR(16) NOT NULL
    ,c_first_name           CHAR(20)
    ,c_last_name            CHAR(30)
    ,c_email_address        CHAR(50)
    ,ca_suite_number        CHAR(10)
    ,ca_city                VARCHAR(60)
    ,ca_county              VARCHAR(30)
    ,ca_state               CHAR(2)
)
COMMENT 'TABLE COMMENT'
PARTITIONED BY 
(
    ds                      STRING COMMENT 'Partition'
)
LIFECYCLE 90
;

INSERT OVERWRITE TABLE dws_customer_info_2 PARTITION (ds = '${bday}')
SELECT  t02.c_customer_sk
        ,t02.c_customer_id
        ,t02.c_first_name
        ,t02.c_last_name
        ,t02.c_email_address
        ,t03.ca_suite_number
        ,t03.ca_city
        ,t03.ca_county
        ,t03.ca_state
FROM    (
            SELECT  *
            FROM    ods_mc_tpc_customer_address 
            WHERE   ds = '${bday}'
        ) t03
JOIN    (
            SELECT  *
            FROM    ods_mc_tpc_customer
            WHERE   ds = '${bday}'
        ) t02
ON      t03.ca_address_sk = t02.c_current_addr_sk
;

Save, commit, and then deploy the nodes to the production environment. You can view the dependencies of the nodes in Operation Center.

Enable materialized views

Enable the intelligent analysis feature on materialized views in MaxCompute. For more information, see Recommendations and management of materialized views.
Enable the intelligent recommendation feature on materialized views in DataWorks Data Governance Center. For more information, see Step 1: Enable or disable the intelligent recommendation feature on materialized views.
After the auto triggered tasks are run for more than three days, you can view the recommendations for creating materialized views in DataWorks Data Governance Center.
Note
The test data used in this example is 10-TB TPC-DS datasets, which meet the requirements for generating materialized view recommendations. If you do not view the generated recommendations on the Materialized View Recommendation tab, check whether your data meets the requirements.
- An auto triggered task is run for at least three consecutive days.
- The number of rows of input data in a public subquery is greater than 1,000,000.
- The public subquery contains operators that are related to data reorganization, such as join or aggregation operators.

Create a materialized view

On the Materialized View Recommendation tab, find a recommendation and click Create Materialized View in the Actions column. For more information, see Step 3: Create a materialized view.
Note
The parameters in the Create Materialized View Refresh Node and Add Materialized View Refresh Node as Ancestor Node sections are configured by default and do not need to be modified.
Click Create and Execute. On the Materialized View Recommendation tab, you can view the creation details.
After the materialized view is created, you can see that the materialized view is used in scheduling nodes when the nodes are run for the second time. For duplicate subqueries, the nodes query data from the related materialized view refresh node.

References

For information about how to create a materialized view in MaxCompute, see Materialized view operations.
For information about how to enable the intelligent analysis feature on materialized views in MaxCompute, see Recommendations and management of materialized views.