Realtime Compute for Apache Flink: Clean up expired data

Last Updated: Jul 18, 2024

You can clean up expired data to release storage space, optimize resource usage, and improve system operating efficiency. This topic describes how to clean up expired data in Apache Paimon tables by adjusting the expiration time of snapshot files, configuring the expiration time of partitions, and deleting discarded files.

Precautions

Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.5 or later supports Apache Paimon tables.

Adjust the expiration time of snapshot files

Important

A snapshot can be used to restore historical data. To ensure accurate data restoration, the historical data files that are associated with a snapshot file cannot be deleted before the snapshot expires.

As snapshot files continue to be generated, the storage space that is occupied by historical data files gradually increases. Therefore, snapshot files that are no longer used must be cleaned up to release the storage space that is occupied by the associated historical data files.

The following table describes the parameters used to determine the expiration time of a snapshot file. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information about how to modify the parameters, see Modify the schema of an Apache Paimon table.

Parameter | Description | Data type | Default value
snapshot.num-retained.min | The minimum number of snapshot files that can be saved. | INTEGER | 10
snapshot.num-retained.max | The maximum number of snapshot files that can be saved. | INTEGER | 2147483647
snapshot.time-retained | The maximum retention period of a snapshot file. | DURATION | 1h

Snapshot cleanup is triggered in either of the following cases:

  • The number of current snapshot files is greater than the value of the snapshot.num-retained.min parameter, and the earliest snapshot file has been retained for longer than the value of the snapshot.time-retained parameter.

  • The number of current snapshot files is greater than the value of the snapshot.num-retained.max parameter.
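As a minimal sketch of the two ways to modify these parameters, the following statements first change the retention settings with ALTER TABLE and then override one setting with a SQL hint in a statement that writes data. The table mycat.mydb.mytbl, the source table source_tbl, and the values 2h and 5 are assumptions chosen for illustration, not recommended settings.

```sql
-- Change the snapshot retention settings of the table.
-- mycat.mydb.mytbl and the values below are illustrative assumptions.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'snapshot.time-retained' = '2h',
  'snapshot.num-retained.min' = '5'
);

-- Alternatively, pass the setting as a SQL hint in the draft that writes the data.
-- source_tbl is a hypothetical source table.
INSERT INTO `mycat`.`mydb`.`mytbl` /*+ OPTIONS('snapshot.time-retained' = '2h') */
SELECT * FROM source_tbl;
```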

Configure the expiration time of partitions

If your business requires data in the most recent period of time, you can partition the data by time and configure the partition expiration time. This way, the system can automatically delete historical partitions to release storage space.

Important

Data files in a partition are completely deleted only when the snapshot files that contain the related partition expiration events expire.

The expiration of a partition is determined by the three parameters that are described in the following table. You can execute the ALTER TABLE statement to modify the parameters. You can also use SQL hints to modify the parameters in a draft that is run for data writing. For more information, see Modify the schema of an Apache Paimon table.

Parameter | Description | Remarks
partition.expiration-time | The validity period of a partition. | The value of this parameter is a period of time, such as 12h or 7d.
partition.timestamp-pattern | The pattern that is used to convert a partition value into a time string. | Each partition key column in the value of this parameter is represented by a dollar sign ($) and a column name.
partition.timestamp-formatter | The pattern that is used to convert a time string into a timestamp. | If this parameter is not configured, the yyyy-MM-dd HH:mm:ss or yyyy-MM-dd pattern is used by default. All patterns that are compatible with DateTimeFormatter of Java can be used.

If the period of time for which a partition exists exceeds the value of the partition.expiration-time parameter, the partition is deleted. The period of time for which a partition exists is obtained based on the difference between the current system time and the timestamp of the converted partition value. A partition value is converted into a timestamp based on the following logic:

  1. The partition.timestamp-pattern parameter specifies a pattern to convert a partition value into a time string.

  2. The partition.timestamp-formatter parameter specifies a pattern to convert a time string into a timestamp.

Examples:

  • If a partition contains one partition key column named dt, you can configure 'partition.timestamp-pattern' = '$dt' to convert the value of the dt=20240308 partition into the 20240308 time string. Because this time string does not match either default pattern, you must also configure 'partition.timestamp-formatter' = 'yyyyMMdd' to convert the time string into a timestamp.

  • If a partition contains three partition key columns named year, month, and day, you can configure 'partition.timestamp-pattern' = '$year-$month-$day' to convert the value of the year=2023,month=04,day=21 partition into the 2023-04-21 time string. In this case, you do not need to configure the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd.

  • If a partition contains four partition key columns named year, month, day, and hour, you can configure 'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00' to convert the value of the year=2023,month=04,day=21,hour=17 partition into the 2023-04-21 17:00:00 time string. In this case, you do not need to configure the partition.timestamp-formatter parameter because the time string is in the default pattern yyyy-MM-dd HH:mm:ss.
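The first two examples above could be applied as sketched below. The names and values are assumptions for illustration: mycat.mydb.mytbl is assumed to be an existing Apache Paimon table partitioned by dt, the orders_by_day table and its columns are hypothetical, and the 7d validity period is arbitrary.

```sql
-- Existing table partitioned by dt, with partition values such as 20240308.
-- The formatter is required because 20240308 does not match a default pattern.
ALTER TABLE `mycat`.`mydb`.`mytbl` SET (
  'partition.expiration-time' = '7d',
  'partition.timestamp-pattern' = '$dt',
  'partition.timestamp-formatter' = 'yyyyMMdd'
);

-- Hypothetical new table partitioned by year, month, and day.
-- The formatter is omitted because $year-$month-$day yields the default yyyy-MM-dd pattern.
-- YEAR, MONTH, and DAY are reserved keywords in Flink SQL, so the column names are quoted.
CREATE TABLE `mycat`.`mydb`.`orders_by_day` (
  order_id BIGINT,
  amount DECIMAL(10, 2),
  `year` STRING,
  `month` STRING,
  `day` STRING
) PARTITIONED BY (`year`, `month`, `day`) WITH (
  'partition.expiration-time' = '7d',
  'partition.timestamp-pattern' = '$year-$month-$day'
);
```

Under the rule described above, with these settings the dt=20240308 partition would be treated as expired once the current system time is more than 7 days later than the timestamp converted from 20240308.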

Delete a discarded file

Uncommitted temporary files may remain in the directories of Apache Paimon tables, for example, after a deployment reports an error and restarts. These discarded files are not deleted when snapshots expire. To delete them, perform the following steps:

  1. Log on to the Realtime Compute for Apache Flink console and create a script. For more information about how to create a script, see Create a script.

  2. In the script editor, enter the following SQL statement:

    CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');

    <catalog-name> indicates the name of the Apache Paimon catalog. <database-name> indicates the name of the database in which the Apache Paimon table resides. <table-name> indicates the name of the Apache Paimon table.

    By default, only discarded files that are retained for more than one day can be deleted. You can also configure a time parameter to specify the latest creation time of discarded files that can be deleted. The following sample code shows that discarded files created no later than 12:00:00 on October 31, 2023 are deleted from the mycat.mydb.mytbl table.

    CALL `mycat`.sys.remove_orphan_files('mydb.mytbl', '2023-10-31 12:00:00');

  3. Select the SQL code that you entered, and click Run in the upper-left corner of the script editor.

    After the discarded files are deleted, the Results tab in the lower part of the script editor displays the total number of deleted files.
