Merges all data files of a Delta table based on a specific policy to improve query efficiency.
Description
Delta tables support near-real-time incremental data writing and time travel queries. Frequent writes generate a large number of small files, so an efficient compaction policy is required to merge small files and deduplicate data. The policy must resolve the low I/O throughput caused by many small files and relieve pressure on the storage system, while also preventing severe write amplification and conflict-caused failures when compaction runs frequently.
The following data merging methods are supported:
Clustering: Only delta files are merged into a larger file; the data content is not changed. Delta files contain the data involved in committed operations. The system periodically performs clustering based on factors such as the size and number of new files, so no manual operation is required. This method improves I/O throughput and stability in scenarios with many small files.
Compaction: All data files are merged based on a specific policy to generate a batch of base files. For data rows that have the same primary key, only the latest state is stored; historical states and system column information are not retained. This helps improve query efficiency.
Syntax
alter table <table_name>
    [partition (<partition_key> = '<partition_value>' [, ...])]
compact major;
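For reference, a minimal sketch of the two forms of the statement. The table names delta_orders and delta_events and the ds partition are hypothetical; substitute your own Delta tables and partition keys.
-- Compact all data files of a non-partitioned Delta table.
alter table delta_orders compact major;
-- Compact only one partition of a partitioned Delta table.
alter table delta_events partition (ds = '20240101') compact major;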
Precautions
If you want to manually trigger compaction, you must configure the following setting:
set odps.merge.task.mode=service;
The compaction operation improves query efficiency but increases the storage cost of data files because new files are generated during compaction. You can configure the trigger frequency based on your business requirements.
The compaction operation consumes computing resources for data reading and rewriting. If you use the pay-as-you-go billing method, you are charged for the compaction operation based on the following formula: Amount of data scanned × 1 × Unit price. If you use the subscription billing method, the compaction operation uses the quota for subscription computing resources.
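The following sketch shows how these precautions can combine in practice; the table delta_events and its ds partition are hypothetical. Compacting one partition at a time limits the amount of data scanned, and therefore the compute cost, of a manually triggered run.
set odps.merge.task.mode=service;
-- Compact a single partition to limit the data scanned per run.
alter table delta_events partition (ds = '20240101') compact major;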
Example
-- Create a table.
create table mf_dt (pk bigint not null primary key, val bigint not null)
partitioned by (dd string, hh string)
tblproperties ("transactional"="true");
-- Insert data into the table.
insert into table mf_dt partition(dd='01', hh='01') values (1, 1), (2, 2);
insert into table mf_dt partition(dd='01', hh='01') values (2, 20), (3, 3);
insert into table mf_dt partition(dd='01', hh='01') values (3, 30), (4, 4);
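-- pk=2 and pk=3 were written twice. After the major compaction below, only the latest value
-- for each primary key (val=20 for pk=2, val=30 for pk=3) is retained in the base files.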
-- Perform the compaction operation. After the compaction operation is complete, you can continue to query historical data.
set odps.merge.task.mode=service;
alter table mf_dt partition(dd='01', hh='01') compact major;
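-- Query the table as of its latest commit timestamp.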
select * from mf_dt timestamp as of get_latest_timestamp('mf_dt') where dd='01' and hh='01';
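-- Query the table as of an earlier commit (the second most recent timestamp) to confirm that
-- historical data is still queryable after compaction.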
select * from mf_dt timestamp as of get_latest_timestamp('mf_dt', 2) where dd='01' and hh='01';