×
Community Blog Practical Use of MaxCompute Metadata: Data Download Audit

Practical Use of MaxCompute Metadata: Data Download Audit

This article mainly introduces MaxCompute's tenant-level Information Schema and focuses on statistical analysis through the TUNNELS_HISTORY view of metadata.

By Haiqing

The tenant-level Information Schema of MaxCompute provides project metadata and historical data from the perspective of tenants. You can pull a type of metadata from all your projects in the same metadata center at a time for statistical analysis. We are launching a series of articles about metadata usage practices.

This article focuses on statistical analysis through the TUNNELS_HISTORY view of metadata. If you haven't used the tenant-level Information Schema, read the background information, feature introduction, fees, usage limits, and precautions written in Tenant-level Information Schema in advance to avoid unnecessary problems.

View Information Generating Internet Download Fees

Scenario

Upon reviewing the bill details in the user center, it was discovered that several projects incurred unexpected Internet download fees. In order to address this, it is necessary to examine the specific download requests for governance.

Solution

The primary reason for MaxCompute generating Internet download fees is the successful initiation of data download requests through the Tunnel Download channel over the Internet. You can download the consumption details through the user center bill details for analysis. You can find the specific project and consumption date, and then download the daily usage information based on the usage details, filter the data that is classified into DownloadEx and InstanceDownloadEx and upload them to the MaxCompute table, and query the data with the tenant-level Information Schema view SYSTEM_CATALOG.INFORMATION_SCHEMA.TUNNELS_HISTORY. By doing so, you can view more detailed download request information.

Step 1: Download usage details.

• Select MaxCompute (Pay-As-You-Go).

• Select metering specifications based on the default calculation quota of the project. For Pay-As-You-Go, select MaxCompute (Pay-As-You-Go); for Subscription, select ODPSDATAPLUS.

• The usage time should be the date on which the consumption was generated.

Step 2: Filter the metering information downloaded from the Internet and upload it to the MaxCompuet table. (The filtered data is classified into DownloadEx and InstanceDownloadEx). The statements for building the table are as follows:

CREATE TABLE IF NOT EXISTS mc_tunnel_metering 
(
    projectid STRING COMMENT ' Project ID'
    ,metering_id STRING COMMENT ' Metering information ID'
    ,type STRING COMMENT ' Data classification, including DownloadEx、InstanceDownloadEx'
    ,starttime STRING COMMENT ' Start time'
    ,endtime STRING COMMENT ' End time'
    ,download BIGINT COMMENT ' DownloadEx(Byte)'
    ,Region STRING COMMENT ' Region'
);

Step 3: Associate the SYSTEM_CATALOG.INFORMATION_SCHEMA.TUNNELS_HISTORY view to query more tunnel download request information.

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
WITH a AS 
(
    SELECT  tunnel_catalog
            ,request_id
            ,object_type
            ,object_name
            ,partition_spec
            ,owner_id
            ,owner_name
            ,start_time
            ,end_time
            ,client_ip
            ,user_agent
            ,`columns`
    FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TUNNELS_HISTORY
    WHERE   ds IN ('xxx','xxx')-- corresponding to consumption date.
)
,b AS 
(
    SELECT  metering_id
            ,download
            ,region
    FROM    mc_tunnel_metering
    WHERE   download > 0
)
SELECT  a.tunnel_catalog
        ,a.request_id
        ,b.metering_id
        ,(
                    b.download / 1024 / 1024 / 1024
        ) download_gb
        ,a.object_type
        ,a.object_name
        ,a.partition_spec
        ,a.owner_id
        ,a.owner_name
        ,a.start_time
        ,a.end_time
        ,a.client_ip
        ,a.user_agent
        ,a.`columns`
        ,b.region
FROM    a
JOIN    b
ON      a.request_id = b.metering_id
;

You can use the following results:

owner_name: the account that initiated the request. You can find the owner of the account.

tunnel catalog/object_name/partition_spec: project/table name or selected instance ID/table partition. You can locate the specific downloaded data.

client_ip: the IP address of the client. You can check the IP address.

Since you will not be charged when you download data over the VPC, you can use the preceding information to find the service that you want to request over the VPC. For more information about the endpoint information in different regions and different Internet connection methods, see Endpoint.

Count Data Outflow

Scenario

Calculate the daily inbound and outbound data volume of each project through the tunnel channel.

Solution

Use the SYSTEM_CATALOG.INFORMATION_SCHEMA.TUNNELS_HISTORY view of the tenant-level Information Schema to perform the count. The SQL statements are as follows:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.

SELECT  ds
        ,tunnel_catalog
        ,operate_type
        ,SUM(data_size) / 1024 / 1024 / 1024 s_data_size_gb
        ,SUM(length) s_length
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TUNNELS_HISTORY
WHERE   ds IN ('xxx','xxx')
GROUP BY ds
         ,tunnel_catalog
         ,operate_type
;

Note that the operate_type value is of the DOWNLOADINSTANCELOG type, primarily due to the use of Tunnel Instance Download when executing various SQL, spark, pyodps, and other tasks that necessitate data echo. If further details are required, additional information such as user_agent can be viewed.

Summary

The above outlines two common scenarios. In practice, you can use the SYSTEM_CATALOG.INFORMATION_SCHEMA.CATALOGS view to access more information to address your business requirements.

0 1 0
Share on

Alibaba Cloud MaxCompute

137 posts | 20 followers

You may also like

Comments

Alibaba Cloud MaxCompute

137 posts | 20 followers

Related Products