DataWorks supports automated governance of materialized views based on intelligent recommendations. This is an intelligent and automated solution for frequent big data computing tasks that contain a large number of similar subqueries. If you enable the intelligent recommendation feature on materialized views, DataWorks can automatically identify and classify similar subqueries in MaxCompute and generate recommendations for creating materialized views. You can create a materialized view with a few clicks based on your business requirements. This significantly improves computing efficiency and saves computing resources.
Feature description
The feature of automated governance of materialized views in DataWorks relies on the materialized view feature of MaxCompute. You can use the feature of automated governance of materialized views to quickly scan and identify a large number of similar subqueries that exist in MaxCompute SQL statements in DataWorks scheduling scenarios. DataWorks then uses similar subqueries to recommend the view generation SQL statements that can be used to efficiently create materialized views.
After a view is created, DataWorks synchronously creates a node that generates the view data and deploys the node to the production environment for scheduling. Nodes that originally have similar subqueries in DataWorks are uniformly specified as the descendant nodes of the node that generates the view data. DataWorks preferentially schedules the node that generates the view data based on the scheduling dependencies. When the descendant nodes use the public subquery, the nodes query data from the materialized view that has generated data. This reduces the repetitive calculation of data for governance purposes.
Prerequisites
A MaxCompute data source is added. For more information, see Add a MaxCompute data source.
The intelligent analysis feature on materialized views is enabled in MaxCompute. For more information, see Recommendations and management of materialized views.
Use process
Enable the intelligent recommendation feature on materialized views.
After you enable the intelligent recommendation feature on materialized views for a MaxCompute project that is associated with DataWorks DataStudio, the recommendations for creating materialized views are generated and displayed on the Materialized View Recommendation tab if the following conditions are met:
An auto triggered task is run for at least three consecutive days.
The number of rows of input data in a public subquery is greater than 1,000,000.
The public subquery contains operators that are related to data reorganization, such as join or aggregation operators.
View the recommendations and create a materialized view.
You can check whether the recommendations are valid and create and manage a materialized view.
If you configure the parameters in the Create Materialized View Refresh Node and Add Materialized View Refresh Node as Ancestor Node sections when you create a materialized view, the created materialized view takes effect in scheduling scenarios in DataWorks.
A materialized view refresh node is automatically added as an ancestor node of nodes that have similar computing logic. When an ancestor node or the node that generates the source table generates new output, the materialized view is refreshed to produce the latest data for descendant nodes to query.
Manage materialized views.
You can view the usage of the materialized views in the current workspace and analyze or delete the materialized views whose output data does not meet expectations.
Procedure
Step 1: Enable or disable the intelligent recommendation feature on materialized views
Only users that are assigned the Workspace Administrator role can enable or disable the intelligent recommendation feature on materialized views.
Go to the Data Governance Center page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, click Go to Data Governance Center.
In the top navigation bar of the Data Governance Center page, click Workbench. In the left-side navigation pane, click Materialized View.
On the Materialized View page, select a workspace from the Workspace drop-down list and click Settings for Materialized View Recommendation. In the Settings for Materialized View Recommendation dialog box, turn on the switch in the Intelligent Materialized View Recommendation column for the projects for which you want to enable the intelligent recommendation feature on materialized views.
Step 2: View the recommended materialized views
The materialized view recommendation module can automatically identify materialized views that can be created and the associated computing tasks. The related owners can determine whether the materialized views are valid and manage their creation process.
To ensure accurate recommendations, the system generates and displays relevant recommendations on the Materialized View Recommendation tab only when the following conditions are met after you enable the intelligent recommendation feature on materialized views:
An auto triggered task is run for at least three consecutive days.
The number of rows of input data in a public subquery is greater than 1,000,000.
The public subquery contains operators that are related to data reorganization, such as join or aggregation operators.
Select a workspace from the Workspace drop-down list in the top navigation bar and click the Materialized View Recommendation tab.
Modify the filter conditions, such as Project and Analysis Time Interval, to check whether recommendations for creating materialized views are generated.
If recommendations for creating materialized views are available in the time period specified by the Analysis Time Interval parameter, you can view the recommendations.
Parameters:
Materialized View Recommendation: the identifier of a materialized view recommendation task. The identifier includes the summary information about the public subquery. Format of an identifier:
Aggregate: xx; Inputs: xx, xx;
.Aggregate: indicates whether the recommended public subquery contains the aggregate operation.
Inputs: lists the names of all source tables used by the public subquery.
Number of Possibly Affected SQL Statements: the number of jobs that use the public subquery.
Possibly Affected Node: the number of scheduling nodes that use the public subquery.
Recommendation Rating: the extent of recommendation, which is determined by the repetition of subqueries, subquery complexity, and the number of input data records.
Click the identifier of a materialized view recommendation task in the Materialized View Recommendation column to view the details of the public subquery. The details include the source table information, the public subquery, jobs, and scheduling tasks.
Step 3: Create a materialized view
On the Materialized View Recommendation tab, determine whether to create a materialized view based on your business requirements and recommendation details. To create a materialized view, find the recommendation based on which you want to create a materialized view and click Create Materialized View in the Actions column.
On the Create Materialized View page, configure the parameters for the materialized view.
In the Create Materialized View section, create a materialized view refresh node and generate a materialized view based on the public subquery of input tasks.
Parameter
Description
Input Table
The input tables, which are automatically obtained and do not need to be modified.
Create Materialized View or Not
The first time you create a materialized view, the default value Create Materialized View is used. You do not need to change the value.
NoteIf a materialized view for which the same input tables are selected exists, you can set the parameter to Select Existing Materialized View based on your business requirements.
Materialized View Name
The name of the materialized view. You can specify a name based on your business requirements.
Lifecycle
The lifecycle of the materialized view. Valid values: 1 Day, 7 Days, and Custom Days.
Refresh Type
The refresh mode of the materialized view.
Refresh at Fixed Interval
If the table from which you want to read data is the output of a task in DataWorks, Time Management Cron Mode is selected by default. Otherwise, Refresh at Fixed Interval is selected.
Time Management Cron Mode
The materialized view is refreshed at the most recent output time of the node that generates the table from which you want to read data.
Do Not Refresh
Preview Statement
The SQL statements of the materialized view.
Optional. In the Create Materialized View Refresh Node section, create a materialized view refresh node. When the node that generates the source table generates new output, the materialized view refresh node dynamically refreshes the materialized view to materialize the latest data.
Configure the Materialized View Refresh Node Name and Node Running Timeout Period parameters.
Optional. In the Add Materialized View Refresh Node as Ancestor Node section, add the materialized view refresh node as an ancestor node of nodes that have similar computing logic to increase the query hit rate of incremental data.
The computing tasks displayed in this section include tasks in the current workspace and tasks in other workspaces. You can click View Details to view the details of the computing tasks.
Click Create and Execute. DataWorks starts to create a materialized view. You can view the status of the materialized view in the Operation Status column on the Materialized View Recommendation tab, or click View Details in the Actions column to view the creation progress.
Valid values of the Operation Status parameter:
To Be Created: The materialized view is not created by using DataWorks.
Created: The materialized view is created.
Creating: The process of creating the materialized view by using DataWorks is initiated but the creation is not finished.
Create Failed: The process of creating the materialized view by using DataWorks is initiated but the creation failed.
You can click View Details to view the details of the creation process of the materialized view.
If you configure the parameters in the Create Materialized View Refresh Node and Add Materialized View Refresh Node as Ancestor Node sections when you create a materialized view, a materialized view refresh node is added as an ancestor node of nodes that have similar computing logic. The owner of the materialized view refresh node is the account that is used to create the materialized view.
Step 4: Manage materialized views
You can view the usage of the materialized views in the current workspace and analyze or delete the materialized views whose output data does not meet expectations.
Select a workspace from the Workspace drop-down list in the top navigation bar and click the Materialized View Management tab.
Modify the filter conditions, such as Project, to view the materialized views that are created by using the recommendations.
ImportantIf your MaxCompute project uses the pay-as-you-go billing method for its quota, the total saved compute capacity is calculated by using the following formula:
Amount of input data × SQL complexity
. The fee for MaxCompute SQL jobs is calculated by using the following formula:Compute capacity × Unit price
. For more information, see Computing pricing.Click the name of a materialized view in the Materialized View column to view the details of the materialized view.
Click Delete in the Actions column of a materialized view to delete the materialized view.
Configuration example
The test data used in this example is from a MaxCompute public dataset. For more information, see TPC-DS data. You can use the tables related to your business data for configuration.
Prepare data
Add a MaxCompute data source and associate the data source with DataWorks DataStudio.
Create a zero load node named
Materialized_View_Test
in DataStudio and specify the zero load node as the root node of descendant nodes.In this example, create four ODPS nodes in DataStudio.
The
dws_customer_info_1
anddws_customer_info_2
nodes use the public subquery and obtain data from theods_mc_tpc_customer
andods_mc_tpc_customer_address
nodes.ods_mc_tpc_customer
The parameter name of a scheduling parameter is
bday
and the value of the scheduling parameter is$[yyyymmdd]
.CREATE TABLE IF NOT EXISTS ods_mc_tpc_customer ( c_customer_sk BIGINT NOT NULL ,c_customer_id CHAR(16) NOT NULL ,c_current_cdemo_sk BIGINT ,c_current_hdemo_sk BIGINT ,c_current_addr_sk BIGINT ,c_first_shipto_date_sk BIGINT ,c_first_sales_date_sk BIGINT ,c_salutation CHAR(10) ,c_first_name CHAR(20) ,c_last_name CHAR(30) ,c_preferred_cust_flag CHAR(1) ,c_birth_day BIGINT ,c_birth_month BIGINT ,c_birth_year BIGINT ,c_birth_country VARCHAR(20) ,c_login CHAR(13) ,c_email_address CHAR(50) ,c_last_review_date_sk CHAR(10) ) COMMENT 'TABLE COMMENT' PARTITIONED BY (ds STRING COMMENT 'Partition') LIFECYCLE 90; SET odps.namespace.schema=TRUE; INSERT OVERWRITE TABLE ods_mc_tpc_customer PARTITION(ds='${bday}') select * from BIGDATA_PUBLIC_DATASET.TPCDS_10T.customer;
ods_mc_tpc_customer_address
The parameter name of a scheduling parameter is
bday
and the value of the scheduling parameter is$[yyyymmdd]
.CREATE TABLE IF NOT EXISTS ods_mc_tpc_customer_address ( ca_address_sk BIGINT NOT NULL ,ca_address_id CHAR(16) NOT NULL ,ca_street_number CHAR(10) ,ca_street_name VARCHAR(60) ,ca_street_type CHAR(15) ,ca_suite_number CHAR(10) ,ca_city VARCHAR(60) ,ca_county VARCHAR(30) ,ca_state CHAR(2) ,ca_zip CHAR(10) ,ca_country VARCHAR(20) ,ca_gmt_offset DECIMAL(5,2) ,ca_location_type CHAR(20) ) COMMENT 'TABLE COMMENT' PARTITIONED BY (ds STRING COMMENT 'Partition') LIFECYCLE 90; SET odps.namespace.schema=TRUE; INSERT OVERWRITE TABLE ods_mc_tpc_customer_address PARTITION(ds='${bday}') select * from BIGDATA_PUBLIC_DATASET.TPCDS_10T.customer_address;
dws_customer_info_1
The parameter name of a scheduling parameter is
bday
and the value of the scheduling parameter is$[yyyymmdd]
.CREATE TABLE IF NOT EXISTS dws_customer_info_1 ( c_customer_sk BIGINT NOT NULL ,c_customer_id CHAR(16) NOT NULL ,c_first_name CHAR(20) ,c_last_name CHAR(30) ,c_email_address CHAR(50) ,ca_suite_number CHAR(10) ,ca_city VARCHAR(60) ,ca_county VARCHAR(30) ,ca_state CHAR(2) ) COMMENT 'TABLE COMMENT' PARTITIONED BY ( ds STRING COMMENT 'Partition' ) LIFECYCLE 90 ; INSERT OVERWRITE TABLE dws_customer_info_1 PARTITION (ds = '${bday}') SELECT t02.c_customer_sk ,t02.c_customer_id ,t02.c_first_name ,t02.c_last_name ,t02.c_email_address ,t03.ca_suite_number ,t03.ca_city ,t03.ca_county ,t03.ca_state FROM ( SELECT * FROM ods_mc_tpc_customer_address WHERE ds = '${bday}' ) t03 JOIN ( SELECT * FROM ods_mc_tpc_customer WHERE ds = '${bday}' ) t02 ON t03.ca_address_sk = t02.c_current_addr_sk ;
dws_customer_info_2
The parameter name of a scheduling parameter is
bday
and the value of the scheduling parameter is$[yyyymmdd]
.CREATE TABLE IF NOT EXISTS dws_customer_info_2 ( c_customer_sk BIGINT NOT NULL ,c_customer_id CHAR(16) NOT NULL ,c_first_name CHAR(20) ,c_last_name CHAR(30) ,c_email_address CHAR(50) ,ca_suite_number CHAR(10) ,ca_city VARCHAR(60) ,ca_county VARCHAR(30) ,ca_state CHAR(2) ) COMMENT 'TABLE COMMENT' PARTITIONED BY ( ds STRING COMMENT 'Partition' ) LIFECYCLE 90 ; INSERT OVERWRITE TABLE dws_customer_info_2 PARTITION (ds = '${bday}') SELECT t02.c_customer_sk ,t02.c_customer_id ,t02.c_first_name ,t02.c_last_name ,t02.c_email_address ,t03.ca_suite_number ,t03.ca_city ,t03.ca_county ,t03.ca_state FROM ( SELECT * FROM ods_mc_tpc_customer_address WHERE ds = '${bday}' ) t03 JOIN ( SELECT * FROM ods_mc_tpc_customer WHERE ds = '${bday}' ) t02 ON t03.ca_address_sk = t02.c_current_addr_sk ;
Save, commit, and then deploy the nodes to the production environment. You can view the dependencies of the nodes in Operation Center.
Enable materialized views
Enable the intelligent analysis feature on materialized views in MaxCompute. For more information, see Recommendations and management of materialized views.
Enable the intelligent recommendation feature on materialized views in DataWorks Data Governance Center. For more information, see Step 1: Enable or disable the intelligent recommendation feature on materialized views.
After the auto triggered tasks are run for more than three days, you can view the recommendations for creating materialized views in DataWorks Data Governance Center.
NoteThe test data used in this example is 10-TB TPC-DS datasets, which meet the requirements for generating materialized view recommendations. If you do not view the generated recommendations on the Materialized View Recommendation tab, check whether your data meets the requirements.
An auto triggered task is run for at least three consecutive days.
The number of rows of input data in a public subquery is greater than 1,000,000.
The public subquery contains operators that are related to data reorganization, such as join or aggregation operators.
Create a materialized view
On the Materialized View Recommendation tab, find a recommendation and click Create Materialized View in the Actions column. For more information, see Step 3: Create a materialized view.
NoteThe parameters in the Create Materialized View Refresh Node and Add Materialized View Refresh Node as Ancestor Node sections are configured by default and do not need to be modified.
Click Create and Execute. On the Materialized View Recommendation tab, you can view the creation details.
After the materialized view is created, you can see that the materialized view is used in scheduling nodes when the nodes are run for the second time. For duplicate subqueries, the nodes query data from the related materialized view refresh node.
References
For information about how to create a materialized view in MaxCompute, see Materialized view operations.
For information about how to enable the intelligent analysis feature on materialized views in MaxCompute, see Recommendations and management of materialized views.