Troubleshoot the issue that WAL logs are accumulated in PostgreSQL

Updated at: 2024-10-10 02:29

During peak hours, an excessively large number of WAL logs are generated in PostgreSQL. WAL is short for write-ahead logging. The checkpointer process deletes expired WAL logs at regular intervals. However, in production environments, WAL logs may fail to be deleted due to inappropriate operations. In this case, a large amount of storage is occupied. This topic describes how to troubleshoot the issue that WAL logs are accumulated in the primary database or read-only databases.

Background information

WAL is a key component of PostgreSQL to ensure data security and improve system reliability and performance. WAL helps prevent data loss and ensure that data can be restored in a reliable manner even if multiple faults occur.

Log accumulation in the primary database

If WAL logs are accumulated in the primary database, you can identify and troubleshoot the issue from the following aspects.

Inactive replication slots or unreported LSN from the consumer

  • A replication slot is a key tool in PostgreSQL to implement high availability and disaster recovery. You can use replication slots to prevent WAL logs from being deleted and prevent replication interruptions. However, if a replication slot is inactive, WAL logs associated with the slot are not deleted and keep accumulating.

  • If LSN is not reported by the consumer in a timely manner, the size of WAL logs keeps increasing.

You can use the pg_replication_slots system view to view the information about replication slots, such as LSN. For example, you can use the following SQL statements to query the size of WAL logs that are accumulated for a replication slot. The accumulation causes replication latency between the primary and secondary databases. In this case, the accumulated WAL logs must be deleted.

SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn)) AS delay_size
FROM pg_replication_slots;

If the size displayed in the query result is large or equal to the number of accumulated WAL logs, evaluate and delete the replication slot based on your business requirements.

Incorrect parameter settings

If the values of the wal_keep_segments, wal_keep_size, and max_wal_size parameters are excessively large, a large number of WAL logs are accumulated. Check and modify the values of these parameters based on your business requirements.

VACUUM storm and a large number of writes

In most cases, a VACUUM storm refers to a large number of automatic or manual VACUUM operations that are performed on a database at the same time. When these operations are performed, a large number of WAL logs may be generated. This causes a significant increase in I/O loads and affects the database performance. In addition, WAL logs may not be deleted in a timely manner. We recommend that you configure parameters in the VACUUM statement and schedule the execution of the VACUUM statements based on your business requirements.

The scenario in which a large number of writes occur is similar to a VACUUM storm. You must schedule data writes based on your business requirements.

Log accumulation in read-only databases

If WAL logs are accumulated in a read-only database, you can identify and troubleshoot the issue from the following aspects.

Replication latency

The following list describes the possible causes of the common issue that latency occurs when WAL logs are replayed on a read-only instance.

  • A long-running transaction exists on the read-only database and conflicts with the replay of the WAL logs. In this case, you must evaluate and change the values of the hot_standby_feedback and max_standby_streaming_delay parameters based on your business requirements.

    For example, if the hot_standby_feedback parameter is set to off and the max_standby_streaming_delay parameter is set to a large value, the execution of a long-running transaction on the read-only database may cause replay latency.

  • The specifications of the primary and read-only databases are different. If the computing and storage specifications of the read-only database are lower than those of the primary database, replication latency may occur. As a result, the WAL logs that are not replayed cannot be deleted. In this case, you must evaluate and select appropriate specifications for the read-only database based on your business requirements.

What to do next

For ApsaraDB RDS for PostgreSQL instances, if the issue that WAL logs are accumulated persists after you perform the preceding troubleshooting operations, contact the technical support of ApsaraDB RDS for PostgreSQL.

References

You can manually delete inactive replication slots to allow AliPG to automatically delete WAL logs. For more information, see Use the WAL log management feature.

  • On this page (1)
  • Background information
  • Log accumulation in the primary database
  • Inactive replication slots or unreported LSN from the consumer
  • Incorrect parameter settings
  • VACUUM storm and a large number of writes
  • Log accumulation in read-only databases
  • Replication latency
  • What to do next
  • References
Feedback
phone Contact Us

Chat now with Alibaba Cloud Customer Service to assist you in finding the right products and services to meet your needs.

alicare alicarealicarealicare