When you use MaxCompute, a large amount of data may be accumulated and occupies a large number of storage resources. If a large amount of data still exists after you perform data governance such as reducing the data storage lifecycle and deleting unnecessary data, you can reorder data to save storage space when your storage budget is limited but computing resources are sufficient. This topic describes how to save storage space by reordering data. In this topic, tables in MaxCompute public datasets are used.
Background information
Data reordering allows you to sort data records based on field values and data characteristics to increase the compression rate. MaxCompute supports the following sorting methods for data reordering:
ORDER BY: sorts all data in the same reducer. For more information, see ORDER BY (order_condition).
DISTRIBUTE BY: controls how the output of a mapper is distributed to reducers. DISTRIBUTE BY ensures that the data records whose values of a specified field are the same are distributed to the same reducer. For more information, see DISTRIBUTE BY hash partition (distribute_condition).
SORT BY: sorts specific data. Data is sorted before it enters a reducer. SORT BY only ensures that data in a reducer is sorted based on a specific field. For more information, see SORT BY (sort_condition).
CLUSTER BY: sorts a specific field in ascending order. CLUSTER BY takes the similar effect as the DISTRIBUTE BY clause used together with the SORT BY clause.
ORDER BY is used for global sorting. This causes a high computing performance overhead. In addition, data cannot be processed in distributed manner when data is sorted in only one reducer. To resolve this issue, you can use the DISTRIBUTE BY clause together with the SORT BY clause or use the CLUSTER BY clause to reorder data.
Precautions
Data reordering occupies and consumes computing resources. If your instance uses the pay-as-you-go billing method, you are charged for the computing resources. You can determine whether to optimize storage by reordering data based on your business requirements.
Before you use the
DISTRIBUTE BY clause together with the SORT BYclause or use theCLUSTER BYclause to sort data in a table, check the distribution of values in the table. To improve the storage optimization effect, we recommend that you reorder data based on columns that contain a large amount of data.The storage optimization effect is better when data reordering is performed on a table that contains a large amount of data.
Example
In this example, the customer table in the TPC-DS public dataset of MaxCompute contains 1 TB of data. The DISTRIBUTE BY clause together with the SORT BY clause and the CLUSTER BY clause are separately used to sort data in the customer table. After data is reordered, the data in the customer table is written to a new data table. This example compares the data sizes of the new table and the original table to check the storage performance effect. For more information about the TPC-DS public dataset of MaxCompute, see TPC-DS data. The test procedure consists of the following steps:
Create tables to which you want to write reordered data.
Execute the following statement to create tables named
storage_test1andstorage_test2:create table if not exists <storage_test1|storage_test2>( c_customer_sk bigint, c_customer_id char(16), c_current_cdemo_sk bigint, c_current_hdemo_sk bigint, c_current_addr_sk bigint, c_first_shipto_date_sk bigint, c_first_sales_date_sk bigint, c_salutation char(10), c_first_name char(20), c_last_name char(30), c_preferred_cust_flag char(1), c_birth_day bigint, c_birth_month bigint, c_birth_year bigint, c_birth_country varchar(20), c_login char(13), c_email_address char(50), c_last_review_date_sk char(10) );Reorder data in the customer table, and then write the data to the new tables.
Use the following sorting methods to separately sort data in the
customertable in the TPC-DS public dataset, and then write the data to the new tables.NoteBefore you reorder data, check the distribution of values in the table. The data reordering effect is better for columns that contain a large amount of data. In this example, the
c_email_addressfield in thecustomertable contains a large amount of data.Use the
DISTRIBUTE BY clause together with the SORT BYclause to reorder data. Sample code:SET odps.namespace.schema = true; insert overwrite table storage_test1 select * from bigdata_public_dataset.tpcds_1t.customer distribute by c_email_address sort by c_email_address;Use the
CLUSTER BYclause to reorder data. Sample code:NoteAfter you use the
CLUSTER BYclause to reorder data, data may fail to be inserted in real time in specific scenarios.SET odps.namespace.schema = true; insert overwrite table storage_test2 select * from bigdata_public_dataset.tpcds_1t.customer cluster by c_email_address;
Check the data reordering effect.
Run the following DESC commands to view the sizes of the
bigdata_public_dataset.tpcds_1t.customertable and thestorage_test1andstorage_test2tables. For more information about the DESC commands, see DESC TABLE/VIEW.SET odps.namespace.schema = true; desc bigdata_public_dataset.tpcds_1t.customer;The returned table size is 469939587 bytes.
desc storage_test1;The returned table size is 449472614 bytes.
desc storage_test2;The returned table size is 449565021 bytes.
The returned results indicate that the size of the stored data is reduced by approximately 4.355% after data is reordered by using the
DISTRIBUTE BY clause together with the SORT BYclause. The storage optimization rate is calculated by using the following formula:(469939587 - 449472614)/469939587 × 100%.The returned results indicate that the size of the stored data is reduced by approximately 4.336% after data is reordered by using the
CLUSTER BYclause. The storage optimization rate is calculated by using the following formula:(469939587 - 449565021)/469939587 × 100%.