By Haiqing
The tenant-level Information Schema of MaxCompute provides project metadata and historical data from the perspective of tenants. You can pull a type of metadata from all your projects in the same metadata center at a time for statistical analysis. We are launching a series of articles about metadata usage practices.
This article focuses on job accounting through the TASKS_HISTORY view of metadata.
If you haven't used the tenant-level Information Schema, read the background information, feature introduction, fees, usage limits, and precautions written in Tenant-level Information Schema in advance to avoid unnecessary problems.
The current billing mode of MaxCompute is pay-as-you-go, and you want to evaluate whether it is suitable to switch to a subscription-based model and determine the appropriate number of subscription CUs (compute units) to purchase.
This scenario aims to save costs. Under the pay-as-you-go billing mode of MaxCompute, SQL jobs are not billed by CU, making it difficult to calculate the final cost directly. Instead, you can refer to and evaluate the CU consumption distribution of past jobs using the tenant-level SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view.
The approach is as follows:
The relevant code is as follows:
SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT task_catalog
,SUBSTR(end_time,1,13) end_time
,SUM(cost_cpu / 100 / 3600) cuh
FROM SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE ds <= xxx -- Enter the time period required for statistical analysis. Data in no more than the last 15 days is available. Note that if running data on the day consumes a large amount of CUs, you must be cautious when using pay-as-you-go computing resources.
GROUP BY task_catalog
,SUBSTR(end_time,1,13)
;
If you perform SQL analysis in the MaxCompute console, you can analyze the results as follows:
As shown in the preceding figure, CU consumption is unevenly distributed. Only one hour per day consumes large CUs. According to this order, not all projects are suitable for subscription CUs. However, you can analyze and evaluate some project data by project.
If you use MaxCompute on a pay-as-you-go basis and have cost verification and audit requirements or job-based cost optimization requirements, you can use the SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view to count related costs.
Count the trend of the number of jobs of different types in each project:
SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT ds
,task_catalog
,task_type
,COUNT(inst_id) inst_id_cnt
FROM SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE ds <= xxx -- Enter the time period required for statistical analysis. Data in no more than the last 15 days is available. Note that if running data on the day consumes a large amount of CUs, you must be cautious when using pay-as-you-go computing resources.
GROUP BY ds
,task_catalog
,task_type
;
Valid values:
• SQL: SQL jobs
• CUPID: Spark or Mars jobs
• SQLCost: SQL cost estimation jobs
• SQLRT: MaxCompute Query Acceleration (MCQA) SQL jobs
• LOT: MapReduce jobs
• PS: Parameter Server of PAI
• AlgoTask: Machine Learning Platform for AI (PAI) jobs
If you perform SQL analysis in the MaxCompute console, you can analyze the results as follows:
View who is accessing or writing data in a table.
SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT task_catalog
,task_type
,inst_id
,owner_name
,start_time
,end_time
,input_tables
,output_tables
,operation_text
,ext_platform_id
,ext_node_id
,ext_node_onduty
FROM SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE ds =xxxx -- Enter the time period required for statistical analysis. Data in no more than the last 15 days is available. Note that if running data on the day consumes a large amount of CUs, you must be cautious when using pay-as-you-go computing resources.
AND input_tables LIKE 'dwd_github_events_odps'
;
• task_catalog
: The project associated with the job.
• owner_name
: Identifies the actual executor of the MaxCompute job. In some scenarios, a business may be associated with a single account, obscuring the true developer; for instance, DataWorks might designate an account to run MaxCompute jobs in a production environment.
• - operation_text
: The SQL statement.
• ext_platform_id
: Indicates the origin platform of the job when initiated from upstream sources. Currently, some jobs from DataWorks and Dataphin include this information. It's unattainable for others not explicitly passed.
• ext_node_id
: The specific name of the job initiated upstream.
• ext_node_onduty
: The uid of the upstream job owner, which can be linked with the SYSTEM_CATALOG.INFORMATION_SCHEMA.USERS table to find the user name.
The scenarios outlined are just a few examples. More information can be accessed through the SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view to address varying business needs.
Practical Use of MaxCompute Metadata: Data Permission Statistics
137 posts | 19 followers
FollowAlibaba Cloud MaxCompute - March 14, 2024
Alibaba Cloud MaxCompute - March 3, 2020
Alibaba Cloud MaxCompute - March 7, 2024
Alibaba Cloud MaxCompute - March 7, 2024
Alibaba Cloud MaxCompute - April 26, 2020
Alibaba Cloud MaxCompute - October 12, 2018
137 posts | 19 followers
FollowAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreConduct large-scale data warehousing with MaxCompute
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreTransform your business into a customer-centric brand while keeping marketing campaigns cost effective.
Learn MoreMore Posts by Alibaba Cloud MaxCompute