Realtime Compute for Apache Flink: Clean up expired data

Last Updated: Oct 09, 2024

You can clean up expired data to release storage space, optimize resource usage, and improve system operating efficiency. This topic describes how to clean up expired data in Apache Paimon tables by modifying the expiration time of savepoint files, configuring the expiration time of partitions, and deleting discarded files.

Usage notes

Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.5 or later supports Apache Paimon tables.

Modify the expiration time of savepoint files

Important

A savepoint can be used to restore historical data. To ensure accurate data restoration, the historical data files that are associated with a savepoint file cannot be deleted before the savepoint expires.

As savepoint files continue to be generated, the storage space that is occupied by historical data files gradually increases. Therefore, savepoint files that are no longer used must be cleaned up to release the storage space that is occupied by the associated historical data files.

The following table describes the parameters used to determine the expiration time of a savepoint file. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information about how to modify the parameters, see the "Modify the schema of an Apache Paimon table" section of the Manage Apache Paimon catalogs topic.

| Parameter | Description | Data type | Default value |
| --- | --- | --- | --- |
| snapshot.num-retained.min | The minimum number of savepoint files that can be saved. | Integer | 10 |
| snapshot.num-retained.max | The maximum number of savepoint files that can be saved. | Integer | 2147483647 |
| snapshot.time-retained | The maximum retention period of a savepoint file. | Duration | 1h |

Savepoint cleanup is triggered in either of the following cases: the number of current savepoint files is greater than the value of the snapshot.num-retained.min parameter and the earliest savepoint file has been retained for longer than the value of the snapshot.time-retained parameter, or the number of current savepoint files is greater than the value of the snapshot.num-retained.max parameter.
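The following statements are a minimal sketch of the two ways to change these parameters. The table name mycat.mydb.mytbl is reused from the sample later in this topic, the source table mycat.mydb.src is hypothetical, and all values are illustrative assumptions rather than recommendations.

```sql
-- Option 1: change the parameters on the table itself.
-- Illustrative values: keep at least 10 and at most 100 savepoint files,
-- and retain each file for up to 2 hours.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'snapshot.num-retained.min' = '10',
  'snapshot.num-retained.max' = '100',
  'snapshot.time-retained' = '2h'
);

-- Option 2: use a SQL hint in the draft that writes to the table.
-- `mycat`.`mydb`.`src` is a hypothetical source table.
INSERT INTO `mycat`.`mydb`.`mytbl` /*+ OPTIONS('snapshot.time-retained' = '2h') */
SELECT * FROM `mycat`.`mydb`.`src`;
```

A longer retention period gives slow consumers more time to read older savepoint files, at the cost of more storage space for the associated historical data files.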

Configure the expiration time of partitions

If your business requires data in the most recent period of time, you can partition the data by time and configure the partition expiration time. This way, the system can automatically delete historical partitions to release storage space.

Important

Data files in a partition are completely deleted only when the savepoint files that contain the related partition expiration events expire.

The expiration of a partition is determined by the three parameters that are described in the following table. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information about how to modify the parameters, see the "Modify the schema of an Apache Paimon table" section of the Manage Apache Paimon catalogs topic.

| Parameter | Description | Remarks |
| --- | --- | --- |
| partition.expiration-time | The validity period of a partition. | The value of this parameter is a period of time. Examples: 12h and 7d. |
| partition.timestamp-pattern | The pattern that is used to convert a partition value into a time string. | Each partition key column in the value of this parameter is represented by a dollar sign ($) and a column name. |
| partition.timestamp-formatter | The pattern that is used to convert a time string into a timestamp. | If this parameter is not specified, the yyyy-MM-dd HH:mm:ss or yyyy-MM-dd pattern is used by default. All patterns that are compatible with DateTimeFormatter of Java can be used. |

If the period of time for which a partition exists exceeds the value of the partition.expiration-time parameter, the partition is deleted. The period of time for which a partition exists is obtained based on the difference between the current system time and the timestamp of the converted partition value. A partition value is converted into a timestamp based on the following logic:

  1. The partition.timestamp-pattern parameter specifies a pattern to convert a partition value into a time string.

  2. The partition.timestamp-formatter parameter specifies a pattern to convert a time string into a timestamp.

Examples:

  • If a partition contains one partition key column named dt, you can specify 'partition.timestamp-pattern' = '$dt' to convert the value of the dt=20240308 partition into the 20240308 time string. You must also specify 'partition.timestamp-formatter' = 'yyyyMMdd' to convert the time string into a timestamp, because 20240308 does not match either default pattern.

  • If a partition contains three partition key columns named year, month, and day, you can specify 'partition.timestamp-pattern' = '$year-$month-$day' to convert the value of the year=2023,month=04,day=21 partition into the 2023-04-21 time string. In this case, you do not need to specify the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd.

  • If a partition contains four partition key columns named year, month, day, and hour, you can specify 'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00' to convert the value of the year=2023,month=04,day=21,hour=17 partition into the 2023-04-21 17:00:00 time string. In this case, you do not need to specify the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd HH:mm:ss.
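The first and third examples above can be written as ALTER TABLE statements, as sketched below. The table names are hypothetical: mycat.mydb.mytbl is assumed to be partitioned by dt, and mycat.mydb.mytbl_hourly is assumed to be partitioned by year, month, day, and hour. The expiration times 7d and 12h are the illustrative values from the parameter table, not recommendations.

```sql
-- Single partition key column dt with values such as 20240308.
-- The formatter is required because 20240308 is not in a default pattern.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'partition.expiration-time' = '7d',
  'partition.timestamp-pattern' = '$dt',
  'partition.timestamp-formatter' = 'yyyyMMdd'
);

-- Four partition key columns: year, month, day, and hour.
-- The pattern produces strings such as 2023-04-21 17:00:00, which match
-- the default yyyy-MM-dd HH:mm:ss pattern, so no formatter is specified.
ALTER TABLE `mycat`.`mydb`.`mytbl_hourly` SET (
  'partition.expiration-time' = '12h',
  'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00'
);
```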

Delete a discarded file

Uncommitted temporary files may remain in the directories of Apache Paimon tables, for example, after a deployment reports an error and restarts. These discarded files are not removed when savepoint files expire. To delete them, perform the following steps:

  1. Log on to the Realtime Compute for Apache Flink console and create a script. For more information about how to create a script, see Create a script.

  2. In the script editor, enter the following SQL statement:

    CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');

    <catalog-name> indicates the name of the Apache Paimon catalog. <database-name> indicates the name of the database in which the Apache Paimon table resides. <table-name> indicates the name of the Apache Paimon table.

    By default, only discarded files that have been retained for more than one day are deleted. To change this, pass a second argument that specifies the latest creation time of the discarded files to delete. The following sample code deletes the discarded files that were created no later than 12:00:00 on October 31, 2023 from the mycat.mydb.mytbl table:

    CALL `mycat`.sys.remove_orphan_files('mydb.mytbl', '2023-10-31 12:00:00');

  3. Select the SQL code that you entered, and click Run in the upper-left corner of the script editor.

    After the discarded files are deleted, the Results tab in the lower part of the script editor displays the total number of deleted files.

References

  • For more information about the common optimization methods of Apache Paimon primary key tables, see Performance optimization.

  • The consumption of an Apache Paimon table depends on savepoint files. If the maximum retention period of savepoint files is too short or a deployment consumes data too slowly, the savepoint files that the deployment is reading may expire and be deleted. In this case, the "File xxx not found, Possible causes" error message appears. For more information about how to troubleshoot this issue, see the "What do I do if the "File xxx not found, Possible causes" error message appears when deployments are read from an Apache Paimon table?" section of the FAQ about upstream and downstream storage topic.