
AnalyticDB: Configure regular vacuum tasks

Last Updated: Sep 04, 2024

By default, the auto-vacuum feature is enabled for AnalyticDB for PostgreSQL instances. You can configure an appropriate vacuum frequency, use automatic cleanup tools, and periodically check the disk usage to ensure a tidy and efficient system. In most cases, you do not need to configure regular vacuum tasks. If your business involves a large number of update and delete operations or you want to clean up dirty data on a regular basis, you can configure vacuum tasks. This topic describes how to configure regular vacuum tasks in different scenarios.

Background information

When you add, remove, modify, or query data, temporary files and log files are generated and occupy disk storage. If you do not clean up the files at the earliest opportunity, the system may not run as expected due to insufficient disk storage. Therefore, you must configure regular vacuum tasks to ensure system stability.

Configure vacuum tasks

Vacuum tasks involve the following aspects:

  1. Delete temporary files: During database operations, temporary files, such as transaction logs and backup files, may be generated. These files must be deleted at the earliest opportunity if they are no longer needed.

  2. Remove data marked for deletion: When you remove data from a database, the data is marked for deletion but is not removed from disks. You must remove the data from disks on a regular basis.

  3. Optimize table schemas: You can optimize the schemas of large tables to save disk storage. For example, use partitioned tables and indexes to improve data access efficiency and reduce disk usage. A sketch is provided after this list.

  4. Periodically back up data: You can back up data on a regular basis to prevent data loss and remove earlier backup files to release disk storage.
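
The following is a minimal sketch of the schema optimization described in item 3. The sales table, its columns, the distribution key, the monthly partition granularity, and the index name are hypothetical examples and must be adapted to your own workload.

    -- Hypothetical example: distribute by id and partition by month so that
    -- time-range queries scan fewer partitions and expired partitions can
    -- later be dropped as a whole to release their disk storage.
    CREATE TABLE sales (
        id        bigint,
        sale_date date,
        amount    numeric
    )
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (sale_date)
    (
        START (date '2024-01-01') INCLUSIVE
        END   (date '2025-01-01') EXCLUSIVE
        EVERY (INTERVAL '1 month')
    );

    -- An index on the partition column speeds up selective time-range queries.
    CREATE INDEX idx_sales_sale_date ON sales (sale_date);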

The following section describes two scenarios in which you can configure vacuum tasks:

Clean up dirty data without locking tables

When you perform update or delete operations on a table, you can lock only the part of the table that you want to modify instead of the entire table. This method causes dirty data to be generated but improves the concurrency performance of the system and reduces the response time.

  • SQL statement: Connect to each database as the database owner and execute the VACUUM statement.

  • Frequency:

    • If you perform real-time updates by executing a large number of statements, such as INSERT VALUES, UPDATE, and DELETE, we recommend that you execute the VACUUM statement once a day or at least once a week.

    • If you perform batch update operations once a day, we recommend that you execute the VACUUM statement once a week or at least once a month.

  • System impacts: Tables are not locked and can be read from and written to as expected. However, the VACUUM operation increases CPU utilization and I/O usage, which may affect query performance.

  • Execution methods:

    • You can use the pg_cron extension to create a scheduled task to execute the VACUUM statement. For more information, see pg_cron. An example is provided after the shell script below.

    • You can run the following Linux shell script as a scheduled crontab task:

    #!/bin/bash
    export PGHOST=myinst.gpdb.rds.tbsite.net
    export PGPORT=3432
    export PGUSER=myuser
    export PGPASSWORD=mypass
    #do not echo command, just get a list of db
    dblist=`psql -d postgres -c "copy (select datname from pg_stat_database) to stdout"`
    for db in $dblist ; do
        #skip the system databases
        if [[ $db == template0 ]] ||  [[ $db == template1 ]] || [[ $db == postgres ]] || [[ $db == gpdb ]] ; then
            continue
        fi
        echo processing $db
        #vacuum all tables (catalog tables/user tables)
        psql -d $db -e -a -c "VACUUM;"
    done
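
For the pg_cron method mentioned above, the following is a minimal sketch. It assumes that the pg_cron extension is already created and that the job runs in the database that you want to vacuum. The job name and schedule are examples only, and the named-job forms of cron.schedule and cron.unschedule require a pg_cron version that supports job names.

    -- Schedule a database-wide VACUUM every day at 03:00.
    SELECT cron.schedule('daily-vacuum', '0 3 * * *', 'VACUUM');

    -- Check the registered jobs, or remove the job when it is no longer needed.
    SELECT * FROM cron.job;
    SELECT cron.unschedule('daily-vacuum');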

Clean up dirty data within a maintenance window

You can clean up dirty data and release disk storage within a maintenance window.

  • SQL statements: Connect to each database as the database owner. The database owner must have full permissions on all objects.

    1. Execute the REINDEX SYSTEM <database name> statement.

    2. Execute the VACUUM FULL <table name> statement on each data table. Additionally, execute the REINDEX TABLE <table name> statement on column-oriented tables.

    3. If you create and delete tables and indexes at a high frequency, the system tables, such as pg_class, pg_attribute, and pg_index, grow quickly. In this case, we recommend that you also execute the VACUUM FULL <table name> statement on these system tables on a regular basis. Note: We recommend that you do not access the database when you execute the VACUUM FULL <table name> statement.

  • Frequency: Execute the preceding statements at least once a week. If most of your data is updated every day, you must execute the preceding statements once a day.

  • System impacts: The system locks the tables on which the VACUUM FULL or REINDEX statement is being executed. The tables cannot be read or written to during the statement execution. This increases CPU utilization and I/O usage.

  • Execution methods:

    • You can use the pg_cron extension to create a scheduled task to execute the VACUUM FULL and REINDEX statements. For more information, see pg_cron. An example is provided after the shell script below.

    • You can run the following Linux shell script as a scheduled crontab task:

#!/bin/bash
export PGHOST=myinst.gpdb.rds.tbsite.net
export PGPORT=3432
export PGUSER=myuser
export PGPASSWORD=mypass
#do not echo command, just get a list of db
dblist=`psql -d postgres -c "copy (select datname from pg_stat_database) to stdout"`
for db in $dblist ; do
    #skip system databases
    if [[ $db == template0 ]] ||  [[ $db == template1 ]] || [[ $db == postgres ]] || [[ $db == gpdb ]] ; then
        continue
    fi
    echo processing db "$db"
    #do a normal vacuum
    psql -d $db -e -a -c "VACUUM;"
    #reindex system tables first
    psql -d $db -e -a -c "REINDEX SYSTEM $db;"
    #use a temp file to store the table list, which could be very large
    cp /dev/null tables.txt
    #select only regular user tables, excluding partitions of parent tables
    psql -d $db -c "copy (select '\"'||tables.schemaname||'\".' || '\"'||tables.tablename||'\"' from (select nspname as schemaname, relname as tablename from pg_catalog.pg_class, pg_catalog.pg_namespace, pg_catalog.pg_roles where pg_class.relnamespace = pg_namespace.oid and pg_namespace.nspowner = pg_roles.oid and pg_class.relkind='r' and (pg_namespace.nspname = 'public' or pg_roles.rolsuper = 'false' ) ) as tables(schemaname, tablename) left join pg_catalog.pg_partitions on pg_partitions.partitionschemaname=tables.schemaname and pg_partitions.partitiontablename=tables.tablename where pg_partitions.partitiontablename is null) to stdout;" > tables.txt
    while read line; do
        #some table names may contain the $ sign, so escape it
        line=`echo $line |sed 's/\\\$/\\\\\\\$/g'`
        echo processing table "$line"
        #vacuum full this table, which will lock the table
        psql -d $db -e -a -c "VACUUM FULL $line;"
        #reindex the table to reclaim index space
        psql -d $db -e -a -c "REINDEX TABLE $line;"
    done <tables.txt
done
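
For the pg_cron method mentioned above, the following is a minimal sketch. It assumes that the jobs run in the database that you want to maintain (depending on your pg_cron version, you may need cron.schedule_in_database to target a specific database). The job names, schedules, the mydb database name, and the public.orders table name are examples only. Each statement is scheduled as a separate job because VACUUM FULL cannot run inside a multi-statement transaction.

-- Reindex the system catalogs of the mydb database every Sunday at 02:00.
SELECT cron.schedule('weekly-reindex-system', '0 2 * * 0', 'REINDEX SYSTEM mydb');
-- Rebuild one large table and its indexes during the same maintenance window.
SELECT cron.schedule('weekly-vacuum-full-orders', '30 2 * * 0', 'VACUUM FULL public.orders');
SELECT cron.schedule('weekly-reindex-orders', '0 3 * * 0', 'REINDEX TABLE public.orders');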