This topic was translated by AI and is currently in queue for revision by our editors. Alibaba Cloud does not guarantee the accuracy of AI-translated content. Request expedited revision

Clustering optimization recommendations

Updated at: 2025-03-17 18:20

MaxCompute analyzes recent read and write patterns of tables to generate clustering recommendations, enhancing job performance and reducing CU consumption. You can evaluate the benefits and details of the recommendations before deciding to implement them.

For more information on optimizing jobs through clustering tables, see Hash Clustering.

Limits

  • This feature is currently available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), and China (Chengdu).

  • Clustering optimization recommendations do not support projects using the three-layer model.

  • While clustering optimization statistics aim to encompass the entire job run history, they exclude multiple Fuxi Job jobs, potentially omitting some jobs.

View clustering optimization recommendations

You can view recommended tables for all or specific projects in the current region on the Clustering Optimization tab. This includes estimated benefits and recommendation details to help you decide whether to adopt the suggestions. Follow these steps:

  1. Log on to the MaxCompute console and select a region in the upper left corner.

  2. In the left-side navigation pane, select Intelligent Optimization > Data Layout Optimization.

  3. On the Clustering Optimization tab, select Estimated Benefits to view recommended clustering tables. You can also use the following filters:

    Parameter

    Description

    Parameter

    Description

    Project Name

    Select the MaxCompute project name from the drop-down list. The default Is Not To Select Any Project.

    Table Name

    Enter the table name. Fuzzy search is supported. Multiple table names are separated by a half-width comma (,).

    Recommendation Generation Date

    The date the recommendation was generated. The default is the previous day.

    • Estimated benefit indicators.

      Indicator

      Description

      Indicator

      Description

      Estimated Beneficial Jobs/day

      The estimated number of jobs that can benefit each day after converting the recommended table to a clustering table.

      Estimated Shuffle Volume Savings/day

      The estimated amount of Shuffle volume that can be saved each day after converting the recommended table to a clustering table.

      Note

      Savings in Shuffle volume can effectively reduce the CU hours consumed by jobs. Typically, saving 1TB of Shuffle volume can save 2 to 4 CU hours per day.

    • Optimization recommendation list.

      Refer to the suggested parameter values in the list and view the recommendations for more details on table optimization.

      Column Name

      Description

      Column Name

      Description

      Project

      The project where the table with recommended clustering property modifications is located.

      Table Name

      The name of the table with recommended clustering property modifications.

      Recommended Clustering Type

      The recommended clustering type for the table. Currently, only Hash Clustering is recommended.

      Recommended Clusterkey

      The recommended ClusterKey for the table, mainly related to Shuffle Removal and point query data filtering.

      Recommended Sortkey

      The recommended SortKey for the table. Mainly related to data filtering and storage compression rate.

      Recommended Bucket Number

      The recommended number of Buckets for the table. Mainly related to the parallelism of writing tables and reading tables after Shuffle Removal.

      Recommendation Index

      From 1 star to 5 stars, the more stars, the more recommended it is to modify the clustering properties of the table. The calculation method of the star rating is as follows:

      Note
      • If no write job records are counted on the day, it is impossible to estimate the daily write increase cost, resulting in a deduction.

      • When there are many partitions that can be optimized, Cluster optimization needs to be effective after rewriting or actively rewriting many existing partitions with new data. This situation will result in a deduction.

      Estimated Daily Shuffle Volume Savings

      The estimated Shuffle volume savings per day after converting the table to a clustering table.

      Observation Interval

      The number of days the same optimization recommendation appears within a period.

      Operation

      Click View Recommendation to enter the Optimization Recommendation Table Details page, which includes the following content:

      • Optimization Recommendation

      • Performance Status

      • Estimated Benefit Overview

        • Estimated Shuffle Volume Savings/day

        • Beneficial Read Table Jobs

        • Full Write Table Jobs

        • Full Read Table Jobs

Apply clustering optimization recommendations

Directly apply clustering recommendations to the original table

  • Interface Operation

    For partitioned tables, you can directly apply the recommendations to convert the original table into the recommended clustering table with one click. Here's how:

    1. In the Data Layout Optimization interface, select Estimated Benefits and click the Operation column of the target table to View Details. Enter the Optimization Recommendation Table Details page.

    2. Click Apply Recommendation in the upper right corner to complete the conversion.

  • Operate through SQL commands as follows:

    -- Change the table to a Hash Clustering table
    ALTER TABLE <table_name> [CLUSTERED BY (<col_name> [, <col_name>, ...])
                           [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                           INTO <number_of_buckets> BUCKETS];

After the conversion, check the optimized clustering table and run read table jobs to ensure all jobs function as expected. If issues arise, perform a rollback operation promptly as follows.

-- Change the Hash Clustering table to a non-Hash Clustering table
ALTER TABLE <table_name> NOT CLUSTERED;
Important
  • After converting to a clustered table, you cannot perform incremental write operations such as INSERT INTO or Tunnel upload.

  • Clustering recommendations cannot be directly applied to non-partitioned tables. For the specific application of these recommendations to non-partitioned tables, see Apply clustering recommendations to new tables.

  • Modifying clustering properties will result in increased latency and CU consumption for write table jobs. Conversely, CU consumption for read table jobs will decrease, leading to overall CU savings.

  • In certain scenarios, it is advisable to rewrite partitions and verify the benefits for downstream jobs that can be optimized. For more information, see (Recommended) Rewrite partition data.

(Recommended) Rewrite partition data

In the following scenarios, it is necessary to rewrite partitions and verify the benefits for downstream optimizable jobs:

  • Tables used by high-priority and latency-sensitive jobs.

  • Tables with large partition data, where a single write exceeds 10TB.

  • Tables where downstream jobs read multiple partitions and need simultaneous rewriting to Cluster partitions for optimization.

After modifying clustering properties, they only affect new partition data. To optimize existing partitions, you must rewrite them. Here's how:

  • For large tables in jobs with daily full incremental merge writes, rewrite the last day's partition to mitigate increased costs and slower speeds when the full incremental job runs the next day.

    -- Assume the partition column is ds, and the daily new partitions are 20241015, 20241016. The new partition data is derived from merging incremental data from the previous day's partition
    INSERT OVERWRITE TABLE <table_name> PARTITION(ds) 
    SELECT * FROM <table_name> WHERE ds = max_pt('<table_name>');
  • When optimizable jobs read multiple existing partitions, rewrite those within the read range to apply Cluster optimization sooner.

    -- Assume the partition column is ds, and the daily new partitions are 20241015, 20241016, ..., with the read range starting from 20241015
    INSERT OVERWRITE TABLE <table_name> PARTITION(ds) 
    SELECT * FROM <table_name> WHERE ds >='20241015';

After rewriting, test run optimizable jobs to verify the effectiveness of the optimization.

Apply clustering recommendations to new tables

For non-partitioned tables, you cannot directly modify the original table's clustering properties. Instead, manually apply them by creating a new Cluster table. Follow these steps:

  1. Create a new Cluster table.

    -- Confirm the creation statement of the existing table
    SHOW CREATE TABLE <orignal_table>;
    -- Insert CLUSTER information in the appropriate position according to the CREATE TABLE syntax to create a new Cluster table
    CREATE TABLE <new_table> [CLUSTERED BY (<col_name> [, <col_name>, ...])
                              [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                              INTO <number_of_buckets> BUCKETS];
  2. Inject data into the new table.

    INSERT OVERWRITE TABLE <new_table>
    SELECT * FROM <orignal_table>;
  3. Rename the original table to a backup table.

    ALTER TABLE <orignal_table> RENAME TO <orignal_table_backup>;
  4. Rename the new table to the original table.

    ALTER TABLE <new_table> RENAME TO <orignal_table>;

The new table will have the same name as the original and will be clustered according to the new properties.

Clustering recommendation rollback

After applying the new clustering properties, ensure all jobs run as expected. If issues arise, perform a rollback promptly as follows:

  1. Delete the new table (named after the original).

    DROP TABLE IF EXISTS <orignal_table>;
  2. Rename the backup table to the original table.

    ALTER TABLE <orignal_table_backup> RENAME TO <orignal_table>;

View clustering optimization benefits

View the benefits of clustering optimization in the Clustering Optimization tab under Actual Benefits. Follow these steps:

  1. Log on to the MaxCompute console and select a region in the upper left corner.

  2. In the left-side navigation pane, select Intelligent Optimization > Data Layout Optimization.

  3. On the Clustering Optimization tab, select Actual Benefits and choose Benefit Statistics Interval to view a summary and details of the benefits from clustering tables with modified properties.

    • Explanation of benefit indicators.

      Indicator

      Description

      Indicator

      Description

      Beneficial Jobs

      The number of times the recently modified clustering table was read within the benefit statistics interval.

      Saved CU Hours

      The savings in CU hours consumed by all jobs reading the recently modified clustering table within the benefit statistics interval compared to before the table was modified to a clustering table.

      Saved Shuffle Volume Consumption

      The savings in Shuffle volume consumed by all jobs reading the recently modified clustering table within the benefit statistics interval compared to before the table was modified to a clustering table.

      Note

      Clustering optimization benefits are calculated based on the average consumption before modification and the average daily consumption after modification of jobs with the same Signature. The statistics cover Cluster tables modified according to recommendations within 365 days.

    • Optimized list.

      Column Name

      Description

      Column Name

      Description

      Project

      The project where the clustering table with modified clustering properties is located.

      Table Name

      The name of the table with modified clustering properties.

      Clustering Property Modification Time

      The date of the most recent modification of the table's clustering properties.

      Beneficial Jobs

      The number of times the table was read within the benefit statistics interval after modifying the clustering properties.

      Saved Computation Time

      The savings in computation time for jobs reading the table within the benefit statistics interval compared to before.

      Saved CU Hours

      The savings in CU hours consumed by jobs reading the table within the benefit statistics interval compared to before.

      Saved Shuffle Volume

      The savings in Shuffle volume consumed by jobs reading the table within the benefit statistics interval compared to before.

      Operation

      Click View Recommendation to enter the Optimized Table Details page, which includes the following content:

      • Performance Status

        • Clustering Type

        • ClusterKey

        • SortKey

        • Bucket Number

        • Partition Value

      • Benefit Overview

        • Beneficial Jobs

        • Saved CU Hours

        • Saved Shuffle Volume Consumption

      • Beneficial read table jobs list

        • Signature

        • Saved Computation Time

        • Saved CU Hours

        • Saved Shuffle Volume

    Important
    • After converting a table to a clustered table, the daily job benefit statistics are updated on the following day (T+1). You can view the real-time effects of optimization through job operation and maintenance tools or LogView.

    • Statistics on the benefits of clustering optimization are derived from historical job run records with the same Signature. These statistics are subject to various factors, such as daily job performance fluctuations. If the benefits fall short of expectations, please review the execution details of jobs on different dates before and after optimization to identify influencing factors.

    • Please note that clustering optimization benefit statistics serve as a guide only. The final CU savings should be calculated based on your bill.

  • On this page (1)
  • Limits
  • View clustering optimization recommendations
  • Apply clustering optimization recommendations
  • Directly apply clustering recommendations to the original table
  • Apply clustering recommendations to new tables
  • View clustering optimization benefits
Feedback
phone Contact Us

Chat now with Alibaba Cloud Customer Service to assist you in finding the right products and services to meet your needs.

alicare alicarealicarealicare