You can clean up expired data to release storage space, optimize resource usage, and improve system operating efficiency. This topic describes how to clean up expired data in Apache Paimon tables by adjusting the expiration time of snapshot files, configuring the expiration time of partitions, and deleting discarded files.
Precautions
Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.5 or later supports Apache Paimon tables.
Adjust the expiration time of snapshot files
A snapshot can be used to restore historical data. To ensure accurate data restoration, the historical data files that are associated with a snapshot file cannot be deleted before the snapshot expires.
As snapshot files continue to be generated, the storage space that is occupied by historical data files gradually increases. Therefore, snapshot files that are no longer used must be cleaned up to release the storage space that is occupied by the associated historical data files.
The following table describes the parameters that determine when a snapshot file expires. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information about how to modify the parameters, see Modify the schema of an Apache Paimon table.
Parameter | Description | Data type | Default value |
snapshot.num-retained.min | The minimum number of snapshot files that can be saved. | INTEGER | 10 |
snapshot.num-retained.max | The maximum number of snapshot files that can be saved. | INTEGER | 2147483647 |
snapshot.time-retained | The maximum retention period of a snapshot file. | DURATION | 1h |
If the number of current snapshot files is greater than the value of the snapshot.num-retained.min parameter and the earliest snapshot file has been retained for a period of time that is longer than the value of the snapshot.time-retained parameter, snapshot cleanup is triggered. If the number of current snapshot files is greater than the value of the snapshot.num-retained.max parameter, snapshot cleanup is also triggered.
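As an illustration, the following statements sketch how the snapshot parameters might be adjusted. The catalog, database, and table names (mycat, mydb, mytbl, source_tbl) and the parameter values are placeholders assumed for this example, not recommended settings.

```sql
-- Assumed example values: keep at least 10 and at most 100 snapshot files,
-- and retain each snapshot file for up to 2 hours.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'snapshot.num-retained.min' = '10',
  'snapshot.num-retained.max' = '100',
  'snapshot.time-retained' = '2h'
);

-- Alternatively, override a parameter with a SQL hint in the draft that writes data.
-- source_tbl is a hypothetical source table.
INSERT INTO `mycat`.`mydb`.`mytbl` /*+ OPTIONS('snapshot.time-retained' = '2h') */
SELECT * FROM `mycat`.`mydb`.`source_tbl`;
```

With these assumed values, a snapshot file that is older than 2 hours is cleaned up only while more than 10 snapshot files exist, and cleanup is also triggered whenever more than 100 snapshot files exist.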
Configure the expiration time of partitions
If your business requires only the data from the most recent period of time, you can partition the data by time and configure the partition expiration time. This way, the system can automatically delete historical partitions to release storage space.
Data files in a partition are completely deleted only when the snapshot files that contain the related partition expiration events expire.
The expiration of a partition is determined by the three parameters that are described in the following table. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information, see Modify the schema of an Apache Paimon table.
Parameter | Description | Remarks |
partition.expiration-time | The validity period of a partition. | The value of this parameter is a period of time. |
partition.timestamp-pattern | The pattern that is used to convert a partition value into a time string. | Each partition key column in the value of this parameter is represented by a dollar sign ($) and a column name. |
partition.timestamp-formatter | The pattern that is used to convert a time string into a timestamp. | |
If the period of time for which a partition exists exceeds the value of the partition.expiration-time parameter, the partition is deleted. The period of time for which a partition exists is obtained based on the difference between the current system time and the timestamp of the converted partition value. A partition value is converted into a timestamp based on the following logic:
The partition.timestamp-pattern parameter specifies a pattern to convert a partition value into a time string.
The partition.timestamp-formatter parameter specifies a pattern to convert a time string into a timestamp.
Examples:
If a partition contains one partition key column named dt, you can configure 'partition.timestamp-pattern' = '$dt' to convert the value of the dt=20240308 partition into the 20240308 time string. You can also configure 'partition.timestamp-formatter' = 'yyyyMMdd' to convert the time string into a timestamp.
If a partition contains three partition key columns named year, month, and day, you can configure 'partition.timestamp-pattern' = '$year-$month-$day' to convert the value of the year=2023,month=04,day=21 partition into the 2023-04-21 time string. In this case, you do not need to configure the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd.
If a partition contains four partition key columns named year, month, day, and hour, you can configure 'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00' to convert the value of the year=2023,month=04,day=21,hour=17 partition into the 2023-04-21 17:00:00 time string. In this case, you do not need to configure the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd HH:mm:ss.
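For reference, the first example above could be applied with a statement along the following lines. This is a minimal sketch: the table name mycat.mydb.mytbl and the 7d validity period are assumptions made for illustration, and the table is assumed to be partitioned by a dt column with values such as 20240308.

```sql
-- Hypothetical table partitioned by dt (for example, dt=20240308).
-- '7d' is an assumed validity period, not a recommended value.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'partition.expiration-time' = '7d',
  'partition.timestamp-pattern' = '$dt',
  'partition.timestamp-formatter' = 'yyyyMMdd'
);
```

Under these assumed settings, the dt=20240308 partition is converted into the timestamp for March 8, 2024, and the partition becomes eligible for deletion once the current system time is more than 7 days later than that timestamp.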
Delete a discarded file
Uncommitted temporary files may remain in the directories of Apache Paimon tables, for example, after a deployment reports an error or restarts. These discarded files are not deleted when snapshots expire. Perform the following steps to delete the discarded files:
Log on to the Realtime Compute for Apache Flink console and create a script. For more information about how to create a script, see Create a script.
In the script editor, enter the following SQL statement:
CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');
<catalog-name> indicates the name of the Apache Paimon catalog.
<database-name> indicates the name of the database in which the Apache Paimon table resides.
<table-name> indicates the name of the Apache Paimon table.
By default, only discarded files that are retained for more than one day can be deleted. You can also configure a time parameter to specify the latest creation time of discarded files that can be deleted. The following sample code shows that discarded files created no later than 12:00:00 on October 31, 2023 are deleted from the mycat.mydb.mytbl table.
CALL `mycat`.sys.remove_orphan_files('mydb.mytbl', '2023-10-31 12:00:00');
Select the SQL code that you entered, and click Run in the upper-left corner of the script editor.
After the discarded files are deleted, the Results tab in the lower part of the script editor displays the total number of deleted files.
References
For more information about the common optimization methods of Apache Paimon primary key tables, see Optimize the performance of Apache Paimon tables.
The consumption of an Apache Paimon table depends on snapshot files. If the maximum retention period of snapshots is excessively short or the consumption efficiency of a deployment is low, the snapshot files of the Apache Paimon table that is being consumed may be deleted due to expiration. In this case, the "File xxx not found, Possible causes" error message appears. For more information about the solution to this issue, see What do I do if the error message "File xxx not found, Possible causes" appears when deployments are read from an Apache Paimon table?