During MaxCompute development, data developers need to calculate the costs of accounts in MaxCompute projects and the durations of jobs to help plan and adjust jobs. This topic describes how to use the MaxCompute metadata service Information Schema to collect statistics on accounts with top N costs and top N time-consuming jobs and push the information to customer groups by using DingTalk.
Background information
In most cases, data developers use MaxCompute based on a DataWorks workspace in standard mode. In this mode, Information Schema records the same Alibaba Cloud account as the execution account of most jobs, and RAM users are recorded as the execution accounts of only a small number of jobs. Data developers therefore need a way to collect statistics on the costs of each account and on time-consuming jobs. To resolve this issue, MaxCompute provides the following solutions:
To collect statistics on the costs of an account, you can query the usage details on the Bill Details page in the Billing Management console. However, you cannot identify the RAM users to which the usage details belong. The TASKS_HISTORY view of the Information Schema service records details of completed jobs in a MaxCompute project and retains the data of the previous 14 days. You can back up data in the TASKS_HISTORY view to the specified MaxCompute project and collect statistics on accounts with top N costs based on the data.
You can collect statistics on top N time-consuming jobs based on the data in the TASKS_HISTORY view.
For more information about the features and limits of the Information Schema service, see Project-level Information Schema (To be phased out).
To collect statistics on accounts with top N costs and top N time-consuming jobs of MaxCompute projects, perform the following steps:
Step 1: Obtain the Information Schema service
Since March 1, 2024, MaxCompute no longer automatically provides the project-level Information Schema service for new projects. By default, no project-level Information Schema package is provided for projects created on or after March 1, 2024. If you want to query metadata, you can use tenant-level Information Schema to obtain more comprehensive information. For more information about how to use tenant-level Information Schema, see Tenant-level Information Schema.
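For reference, tenant-level Information Schema is queried under the SYSTEM_CATALOG project. The following is a minimal sketch, assuming your account has been granted access to tenant-level Information Schema; the partition value is a placeholder that you must replace with a real date:
-- Query completed jobs from the tenant-level TASKS_HISTORY view.
select inst_id, task_type, owner_name, start_time, end_time
from system_catalog.information_schema.tasks_history
where ds = '20240101'
limit 10;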
Before you use Information Schema in an existing project, you must obtain the permissions to access the project metadata. To obtain the permissions, install the Information Schema permission package as the project owner or a RAM user that is assigned the Super_Administrator role. For more information, see Assign a role to a user. You can use one of the following methods to install the permission package:
Log on to the MaxCompute client and run the following command:
install package Information_Schema.systables;
Log on to the DataWorks console and go to the Ad Hoc Query page. For more information about ad hoc queries, see Use the ad-hoc query feature to execute SQL statements (optional). Run the following command:
install package Information_Schema.systables;
To collect metadata of multiple MaxCompute projects, you must install the Information Schema permission package for each MaxCompute project. Then, you must insert the backup metadata of each MaxCompute project into the same table for centralized statistical analysis.
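After you install the permission package, you can run a test query to confirm that the metadata is accessible. The following is a minimal sketch; the partition value is a placeholder and must fall within the 14-day retention window:
-- Confirm that project-level metadata is accessible.
select inst_id, task_type, owner_name, start_time, end_time
from information_schema.tasks_history
where ds = '20240101'
limit 10;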
(Optional) Step 2: Grant permissions to users other than the project owner
The views provided by Information Schema contain all the user data at the project level. By default, the owner of a project can view the user data of this project. Other users or roles in the project must be granted the required permissions to view the data. For more information, see Cross-project resource access based on packages.
Syntax of the statements that are used to grant permissions to users or roles:
grant <actions> on package Information_Schema.systables to user <user_name>;
grant <actions> on package Information_Schema.systables to role <role_name>;
actions: the permissions that you want to grant. Set the value to Read.
user_name: an Alibaba Cloud account or RAM user that is added to the project.
You can run the list users; command on the MaxCompute client to obtain user accounts.
role_name: a role that is added to the project.
You can run the list roles; command on the MaxCompute client to query role names.
Example:
grant read on package Information_Schema.systables to user RAM$Bob@aliyun.com:user01;
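Similarly, you can grant the permission to a role and then check the grants of a user. The following is a sketch in which dev_role is a hypothetical role name:
grant read on package Information_Schema.systables to role dev_role;
show grants for RAM$Bob@aliyun.com:user01;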
Step 3: Download and back up metadata
Create a metadata backup table in the MaxCompute project and write metadata to the backup table at regular intervals. The following example describes the procedure on the MaxCompute client.
Log on to the MaxCompute client and run the following command to create a metadata backup table.
-- project_name specifies the name of the MaxCompute project.
create table if not exists <project_name>.information_history
(
    task_catalog     STRING
    ,task_schema     STRING
    ,task_name       STRING
    ,task_type       STRING
    ,inst_id         STRING
    ,`status`        STRING
    ,owner_id        STRING
    ,owner_name      STRING
    ,result          STRING
    ,start_time      DATETIME
    ,end_time        DATETIME
    ,input_records   BIGINT
    ,output_records  BIGINT
    ,input_bytes     BIGINT
    ,output_bytes    BIGINT
    ,input_tables    STRING
    ,output_tables   STRING
    ,operation_text  STRING
    ,signature       STRING
    ,complexity      DOUBLE
    ,cost_cpu        DOUBLE
    ,cost_mem        DOUBLE
    ,settings        STRING
    ,ds              STRING
);
On the DataStudio page in the DataWorks console, create an ODPS SQL node named information_history and configure periodic scheduling to write data to the backup table information_history at regular intervals. Then, click the icon in the upper-left corner to save the configuration.
For more information about how to create an ODPS SQL node, see Develop a MaxCompute SQL task.
The following sample code provides an example of the ODPS SQL node.
-- project_name specifies the name of the MaxCompute project.
use <project_name>;
insert into table <project_name>.information_history
select * from information_schema.tasks_history where ds = '${datetime1}';
${datetime1} is a scheduling parameter of DataWorks. To configure this parameter, click the Properties tab on the right side of the ODPS SQL node. In the General section of the Properties panel, add the datetime1 parameter and assign the ${yyyymmdd} value to it.
Note: If you want to analyze the metadata of multiple MaxCompute projects at the same time, you can create multiple ODPS SQL nodes and write the metadata of these MaxCompute projects to the same backup table, as shown in the sketch below.
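The following is a minimal sketch of the node code for two hypothetical projects, prj_a and prj_b, that share the backup table in prj_a:
-- Node 1: back up the metadata of prj_a.
use prj_a;
insert into table prj_a.information_history
select * from information_schema.tasks_history where ds = '${datetime1}';

-- Node 2: back up the metadata of prj_b into the same table.
use prj_b;
insert into table prj_a.information_history
select * from information_schema.tasks_history where ds = '${datetime1}';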
Step 4: Create tables and a DataWorks node to collect statistics on accounts with top N costs and time-consuming jobs
The settings field in the TASKS_HISTORY view records information that is scheduled by the upper-layer application or specified by users. The information is saved in the JSON format and includes fields such as USERAGENT, BIZID, SKYNET_ID, and SKYNET_NODENAME. You can identify the RAM user that submitted a job based on the settings field. Therefore, you can collect statistics on accounts with top N costs and top N time-consuming jobs based on the backup table. To collect statistics, perform the following steps:
Log on to the MaxCompute client. Create a table named user_ram to record the names and IDs of RAM users.
Sample statement:
create table if not exists <project_name>.user_ram
(
    user_id    STRING
    ,user_name STRING
);
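The user_ram table must be populated manually with the IDs and names of your RAM users. To find the IDs that actually appear in your jobs, you can first inspect the backup table. The following is a minimal sketch; the partition value, IDs, and names are hypothetical placeholders:
-- List the distinct owner IDs and settings-based owners in the backup table.
select distinct owner_id
       ,get_json_object(settings, '$.SKYNET_ONDUTY') as onduty_user
from <project_name>.information_history
where ds = '20240101';

-- Load the ID-to-name mapping. Replace the values with your own RAM user IDs and names.
insert into table <project_name>.user_ram values
    ('111222333444555666', 'user01')
    ,('222333444555666777', 'user02');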
Create a table named cost_topn to record the details of the accounts with top N costs.
Sample statement:
create table if not exists <project_name>.cost_topn
(
    cost_sum    DECIMAL(38,5)
    ,task_owner STRING
)
partitioned by (ds STRING);
Create a table named time_topn to record the details of top N time-consuming jobs.
Sample statement:
create table if not exists <project_name>.time_topn
(
    inst_id     STRING
    ,cost_time  BIGINT
    ,task_owner STRING
)
partitioned by (ds STRING);
On the DataStudio page in the DataWorks console, create an ODPS SQL node named topn and configure periodic scheduling to write statistical data to the cost_topn and time_topn tables at regular intervals. Click the icon in the upper-left corner to save the configuration.
For more information about how to create an ODPS SQL node, see Develop a MaxCompute SQL task.
The following sample code provides an example of the ODPS SQL node.
-- Enable the MaxCompute V2.0 data type edition. For more information, see the documentation of the MaxCompute V2.0 data type edition.
set odps.sql.decimal.odps2=true;
-- Write statistics to the cost_topn and time_topn tables. user_id specifies the account ID. You can view the account ID on the Security Settings page.
insert into table <project_name>.cost_topn partition (ds = '${datetime1}')
select  nvl(cost_sum, 0) cost_sum
        ,case when a.task_owner = '<user_id>'
                   or a.task_owner = '<user_id>'
                   or a.task_owner = '<user_id>'
              then b.user_name
              else a.task_owner
         end task_owner
from (
        select  inst_id
                ,owner_name
                ,task_type
                ,a.input_bytes
                ,a.cost_cpu
                ,a.status
                ,case when a.task_type = 'SQL'   then cast(a.input_bytes/1024/1024/1024 * a.complexity * 0.3 as DECIMAL(18,5))
                      when a.task_type = 'SQLRT' then cast(a.input_bytes/1024/1024/1024 * a.complexity * 0.3 as DECIMAL(18,5))
                      when a.task_type = 'CUPID' and a.status = 'Terminated' then cast(a.cost_cpu/100/3600 * 0.66 as DECIMAL(18,5))
                      else 0
                 end cost_sum
                ,a.settings
                ,get_json_object(a.settings, "$.SKYNET_ONDUTY") owner
                ,case when get_json_object(a.settings, "$.SKYNET_ONDUTY") is null then owner_name
                      else get_json_object(a.settings, "$.SKYNET_ONDUTY")
                 end task_owner
        from    <project_name>.information_history a
        where   ds = '${datetime1}'
     ) a
left join <project_name>.user_ram b
on a.task_owner = b.user_id;

insert into table <project_name>.time_topn partition (ds = '${datetime1}')
select  inst_id
        ,cost_time
        ,case when a.task_owner = '<user_id>'
                   or a.task_owner = '<user_id>'
                   or a.task_owner = '<user_id>'
              then b.user_name
              else a.task_owner
         end task_owner
from (
        select  inst_id
                ,task_type
                ,status
                ,datediff(a.end_time, a.start_time, 'ss') as cost_time
                ,case when get_json_object(a.settings, "$.SKYNET_ONDUTY") is null then owner_name
                      else get_json_object(a.settings, "$.SKYNET_ONDUTY")
                 end task_owner
        from    <project_name>.information_history a
        where   ds = '${datetime1}'
     ) a
left join <project_name>.user_ram b
on a.task_owner = b.user_id;
Note: In the preceding sample code, task_type = 'SQL' indicates an SQL job, task_type = 'SQLRT' indicates a MaxCompute Query Acceleration (MCQA) job, and task_type = 'CUPID' indicates a Spark job. If you want to collect statistics on other billable jobs, such as MapReduce and Mars jobs, you can add code lines based on the billing formula. For more information about billing, see Computing pricing.
${datetime1} is a scheduling parameter of DataWorks. To configure this parameter, click the Properties tab on the right side of the ODPS SQL node. In the General section of the Properties panel, add the datetime1 parameter and assign the ${yyyymmdd} value to it.
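After the topn node runs, you can rank the detail tables with ad hoc queries. The following is a minimal sketch, using a placeholder partition value and top 10 as an example value of N:
-- Top 10 accounts by estimated cost for a given day.
select task_owner, sum(cost_sum) as total_cost
from <project_name>.cost_topn
where ds = '20240101'
group by task_owner
order by total_cost desc
limit 10;

-- Top 10 time-consuming jobs for the same day.
select inst_id, task_owner, cost_time
from <project_name>.time_topn
where ds = '20240101'
order by cost_time desc
limit 10;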
Step 5: Configure the scheduling properties of the upstream and downstream nodes and run the current node
Draw lines in the canvas of the workflow to configure scheduling dependencies between the information_history, topn, and dingsend nodes. In the Properties panel of the workflow, configure the re-run attribute and upstream nodes of each node. After the configuration is complete, right-click a node and select Run Current Node.
For more information about how to configure dependencies, see Configure same-cycle scheduling dependencies.
For more information about how to configure the upstream and downstream nodes, see Configure input and output parameters.
Technical support
If you have questions or suggestions about MaxCompute, search for the DingTalk group ID 11782920 and join the DingTalk group for technical support.