You can clean up expired data to release storage space, optimize resource usage, and improve system operating efficiency. This topic describes how to clean up expired data in Apache Paimon tables by modifying the expiration time of savepoint files, configuring the expiration time of partitions, and deleting discarded files.
Usage notes
Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.5 or later supports Apache Paimon tables.
Modify the expiration time of savepoint files
A savepoint can be used to restore historical data. To ensure accurate data restoration, the historical data files that are associated with a savepoint file cannot be deleted before the savepoint expires.
As savepoint files continue to be generated, the storage space that is occupied by historical data files gradually increases. Therefore, savepoint files that are no longer used must be cleaned up to release the storage space that is occupied by the associated historical data files.
The following table describes the parameters used to determine the expiration time of a savepoint file. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information about how to modify the parameters, see the "Modify the schema of an Apache Paimon table" section of the Manage Apache Paimon catalogs topic.
Parameter | Description | Data Type | Default value |
snapshot.num-retained.min | The minimum number of savepoint files that can be saved. | Integer | 10 |
snapshot.num-retained.max | The maximum number of savepoint files that can be saved. | Integer | 2147483647 |
snapshot.time-retained | The maximum retention period of a savepoint file. | Duration | 1h |
Savepoint cleanup is triggered if the number of current savepoint files is greater than the value of the snapshot.num-retained.min parameter and the earliest savepoint file has been retained for longer than the value of the snapshot.time-retained parameter. Savepoint cleanup is also triggered if the number of current savepoint files is greater than the value of the snapshot.num-retained.max parameter.
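As a minimal sketch of the two modification methods mentioned above, the following statements adjust the retention parameters on the sample table mycat.mydb.mytbl. The values 2h, 5, and 100 are illustrative assumptions, not recommendations, and my_source is a hypothetical source table that is not part of this topic.

```sql
-- Method 1: persist the new retention settings in the table schema.
-- The values below are illustrative assumptions only.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'snapshot.time-retained' = '2h',
  'snapshot.num-retained.min' = '5',
  'snapshot.num-retained.max' = '100'
);

-- Method 2: override a parameter with a SQL hint in the draft that writes data.
-- `my_source` is a hypothetical source table used only for illustration.
INSERT INTO `mycat`.`mydb`.`mytbl` /*+ OPTIONS('snapshot.time-retained' = '2h') */
SELECT * FROM `my_source`;
```

The ALTER TABLE statement changes the table schema, whereas the hint is written only in the draft that contains it. A longer retention period keeps more historical data files on storage, so the value is a trade-off between how far back data can be restored and how much storage space is used.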
Configure the expiration time of partitions
If your business requires data in the most recent period of time, you can partition the data by time and configure the partition expiration time. This way, the system can automatically delete historical partitions to release storage space.
Data files in a partition are completely deleted only when the savepoint files that contain the related partition expiration events expire.
The expiration of a partition is determined by the three parameters that are described in the following table. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information about how to modify the parameters, see the "Modify the schema of an Apache Paimon table" section of the Manage Apache Paimon catalogs topic.
Parameter | Description | Remarks |
partition.expiration-time | The validity period of a partition. | The value of this parameter is a period of time. |
partition.timestamp-pattern | The pattern that is used to convert a partition value into a time string. | Each partition key column in the value of this parameter is represented by a dollar sign ($) and a column name. |
partition.timestamp-formatter | The pattern that is used to convert a time string into a timestamp. | If this parameter is not specified, the default patterns yyyy-MM-dd and yyyy-MM-dd HH:mm:ss are used. |
If the period of time for which a partition exists exceeds the value of the partition.expiration-time parameter, the partition is deleted. The period of time for which a partition exists is obtained based on the difference between the current system time and the timestamp of the converted partition value. A partition value is converted into a timestamp based on the following logic:
The partition.timestamp-pattern parameter specifies a pattern to convert a partition value into a time string.
The partition.timestamp-formatter parameter specifies a pattern to convert a time string into a timestamp.
Examples:
If a partition contains one partition key column named dt, you can specify 'partition.timestamp-pattern' = '$dt' to convert the value of the dt=20240308 partition into the 20240308 time string. You can also specify 'partition.timestamp-formatter' = 'yyyyMMdd' to convert the time string into a timestamp.
If a partition contains three partition key columns named year, month, and day, you can specify 'partition.timestamp-pattern' = '$year-$month-$day' to convert the value of the year=2023,month=04,day=21 partition into the 2023-04-21 time string. In this case, you do not need to specify the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd.
If a partition contains four partition key columns named year, month, day, and hour, you can specify 'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00' to convert the value of the year=2023,month=04,day=21,hour=17 partition into the 2023-04-21 17:00:00 time string. In this case, you do not need to specify the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd HH:mm:ss.
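The first example above can be sketched as a complete statement. This sketch assumes that the sample table mycat.mydb.mytbl is partitioned by a single column named dt with values such as 20240308. The 7d validity period is an illustrative assumption.

```sql
-- Assumption: `mytbl` is partitioned by a column named dt (for example, dt=20240308).
-- The 7d validity period is illustrative only.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'partition.expiration-time' = '7d',
  -- Converts the partition value dt=20240308 into the time string 20240308.
  'partition.timestamp-pattern' = '$dt',
  -- Converts the time string 20240308 into a timestamp.
  'partition.timestamp-formatter' = 'yyyyMMdd'
);
```

With these settings, the dt=20240308 partition is converted into the timestamp 2024-03-08 00:00:00. The partition becomes eligible for deletion when the current system time is more than seven days later than that timestamp.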
Delete discarded files
Uncommitted temporary files may remain in the directories of Apache Paimon tables, for example, after a deployment fails or restarts. These discarded files are not deleted when savepoints expire. You must perform the following steps to delete the discarded files:
Log on to the Realtime Compute for Apache Flink console and create a script. For more information about how to create a script, see Create a script.
In the script editor, enter the following SQL statement:
CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');
<catalog-name> indicates the name of the Apache Paimon catalog.
<database-name> indicates the name of the database in which the Apache Paimon table resides.
<table-name> indicates the name of the Apache Paimon table.
By default, only discarded files that have been retained for more than one day can be deleted. You can also specify a time parameter to set the latest creation time of discarded files that can be deleted. The following sample code shows how to delete the discarded files created no later than 12:00:00 on October 31, 2023 from the mycat.mydb.mytbl table:
CALL `mycat`.sys.remove_orphan_files('mydb.mytbl', '2023-10-31 12:00:00');
Select the SQL code that you entered, and click Run in the upper-left corner of the script editor.
After the discarded files are deleted, the Results tab in the lower part of the script editor displays the total number of deleted files.
References
For more information about the common optimization methods of Apache Paimon primary key tables, see Performance optimization.
The consumption of an Apache Paimon table depends on savepoint files. If the maximum retention period of savepoints is excessively short or the consumption efficiency of a deployment is low, the savepoint files of the Apache Paimon table that is being consumed may be deleted due to expiration. In this case, the "File xxx not found, Possible causes" error message appears. For more information about how to troubleshoot this issue, see the "What do I do if the "File xxx not found, Possible causes" error message appears when deployments are read from an Apache Paimon table?" section of the FAQ about upstream and downstream storage topic.