Configure oplog-related parameters - ApsaraDB for MongoDB

If oplog-related parameters are improperly configured for an ApsaraDB for MongoDB instance, synchronization between the primary and secondary nodes can become abnormal and data in the replica set instance may not be restorable to a previous point in time. This topic describes how to configure oplog-related parameters and explains the risks of improper oplog settings.

Oplog overview

Primary/secondary replication in an ApsaraDB for MongoDB replica set instance is implemented by using operations log (oplog) entries. The oplog, named local.oplog.rs, is a special capped collection that stores the modification operations performed on all documents in the databases. The oplog has the following basic characteristics:

In a replica set instance, write operations are completed only on the primary node, and the corresponding oplog entries are generated on that node. Secondary nodes asynchronously replicate the oplog entries and replay them locally to keep their data consistent with the primary node.
If an operation does not change any document or fails, no corresponding oplog entries are generated.
An oplog entry is identical across all nodes in a replica set instance. Replaying an entry does not change the entry itself.
Each operation in the oplog is idempotent. The result is the same regardless of whether an oplog entry is replayed once or multiple times.
Each oplog entry is associated with a point in time. Each operation in the oplog has a unique timestamp field (ts) that consists of a UNIX timestamp and a counter. This way, you can determine the order of any two oplog entries.
The oplog window is the time difference between the oldest and the latest oplog entries in the oplog. Primary/secondary replication depends on the oplog window: a secondary node can synchronize data as expected only if it can find the oplog entry that it needs in the oplog window of its synchronization source.
After a secondary node of a replica set instance is restarted or a node is added to the instance, the node also relies on the oplog to determine whether it can become a normal member of the instance. If the node cannot find the oplog entry that it needs in the oplog window of its synchronization source, the node enters the RECOVERING state and reports the "too stale to catch up" error.
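
The ordering rule for the ts field and the oplog window check described in this list can be sketched as follows. This is a simplified illustration in plain JavaScript, not the server's implementation: timestamps are modeled as plain objects with the hypothetical fields t (UNIX seconds) and i (counter), and compareOplogTs and canSyncFrom are illustrative helper names.

```javascript
// Compare two oplog timestamps modeled as { t: <UNIX seconds>, i: <counter> }.
// Returns a negative number if a is older, a positive number if a is newer, 0 if equal.
function compareOplogTs(a, b) {
  if (a.t !== b.t) return a.t - b.t; // different seconds: the UNIX timestamp decides
  return a.i - b.i;                  // same second: the counter decides
}

// Simplified window check (assumption): a secondary can keep syncing only if the
// last entry it applied is not older than the oldest entry kept by its sync source.
function canSyncFrom(lastAppliedTs, sourceOldestTs) {
  return compareOplogTs(lastAppliedTs, sourceOldestTs) >= 0;
}

console.log(canSyncFrom({ t: 1403560000, i: 3 }, { t: 1403550000, i: 1 }));
console.log(canSyncFrom({ t: 1403540000, i: 9 }, { t: 1403550000, i: 1 }));
```

In the second call the secondary has fallen out of the oplog window of its source, which corresponds to the "too stale to catch up" case.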

Oplog size

The default size of the oplog in an ApsaraDB for MongoDB replica set or sharded cluster instance is 10% of the disk storage of the instance. For example, if your replica set or sharded cluster instance has 500 GB of disk storage, the oplog size of the instance is 50 GB. The oplog size is automatically adjusted when the disk is expanded.

To adjust the oplog size, you can modify the value of the replication.oplogSizeMB parameter in the ApsaraDB for MongoDB console. The modification immediately takes effect without the need to restart your instance. For more information about how to modify configuration parameters, see Configure database parameters for an ApsaraDB for MongoDB instance.

You can use one of the following methods to view the actual oplog size:

View the value of the Disk Usage metric on the Monitoring Data page of an instance in the ApsaraDB for MongoDB console. For more information, see Basic monitoring.

Use mongo shell or mongosh to connect to an instance and then run the following command:

rs.printReplicationInfo()

The following result is returned:

configured oplog size:   192MB
log length start to end: 65422secs (18.17hrs)
oplog first event time:  Mon Jun 23 2014 17:47:18 GMT-0400 (EDT)
oplog last event time:   Tue Jun 24 2014 11:57:40 GMT-0400 (EDT)
now:                     Thu Jun 26 2014 14:24:39 GMT-0400 (EDT)

This result indicates that the oplog size is approximately 192 MB and the oplog window is approximately 18 hours.
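
If you prefer to work with numbers instead of the printed text, the same values can be summarized with a small helper. The following sketch assumes an input object with the fields logSizeMB and timeDiff (in seconds), which is our understanding of what db.getReplicationInfo() returns in the shell. Verify the field names against your shell version. summarizeOplog is an illustrative name.

```javascript
// Summarize oplog size and window from a replication-info document.
// Assumed input shape: { logSizeMB: <number>, timeDiff: <seconds between first and last entry> }.
function summarizeOplog(info) {
  const windowHours = info.timeDiff / 3600;
  return {
    configuredSizeMB: info.logSizeMB,
    windowHours: Math.round(windowHours * 100) / 100, // two decimal places
  };
}

// Values taken from the sample output above.
const summary = summarizeOplog({ logSizeMB: 192, timeDiff: 65422 });
console.log(summary);

// In mongosh, the call might look like this (assumption):
// summarizeOplog(db.getReplicationInfo())
```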

Minimum retention period of oplog entries

In MongoDB 4.4 and later, the storage.oplogMinRetentionHours configuration file option is supported. The option is used to specify the minimum retention period of oplog entries, which ensures a sufficient oplog window.

The default value of the option is 0, which indicates that the minimum retention period of oplog entries is not specified. In this case, oplog entries are deleted only after the specified oplog size is reached. If the option is configured, an oplog entry is deleted only when both of the following requirements are met:

The oplog size exceeds the value specified by the replication.oplogSizeMB parameter.
The timestamp of the oplog entry is older than the minimum retention period of oplog entries.

When the oplog size has not reached the value specified by the replication.oplogSizeMB parameter, for example when little data has been written to a newly initialized instance, the actual oplog window of the instance may be much longer than the specified minimum retention period. In this case, the oplog is limited only by the replication.oplogSizeMB parameter. After the oplog size reaches the value specified by the replication.oplogSizeMB parameter, deletion is governed by the minimum retention period of oplog entries. As a result, when a large number of oplog entries are generated within a short period of time, the total oplog size may be much larger than the value specified by the replication.oplogSizeMB parameter.
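
The deletion rule can be expressed as a small decision function. This is a minimal sketch of the rule as described in this section, not the storage engine's actual logic. The function name and all parameter names are hypothetical.

```javascript
// Decide whether the oldest oplog entry may be removed.
// oplogSizeMB: current oplog size; configuredSizeMB: replication.oplogSizeMB;
// entryAgeHours: age of the oldest entry; minRetentionHours: storage.oplogMinRetentionHours.
function canRemoveOldestEntry({ oplogSizeMB, configuredSizeMB, entryAgeHours, minRetentionHours }) {
  const overSize = oplogSizeMB > configuredSizeMB;
  if (minRetentionHours === 0) {
    return overSize; // default: only the size limit applies
  }
  // With a retention period configured, both conditions must hold.
  return overSize && entryAgeHours > minRetentionHours;
}

// A write burst: the oplog is over its configured size, but entries are still too young.
console.log(canRemoveOldestEntry({
  oplogSizeMB: 60000, configuredSizeMB: 51200, entryAgeHours: 2, minRetentionHours: 24,
}));
```

The example illustrates why the oplog can grow beyond replication.oplogSizeMB during a burst of writes: entries cannot be removed until they are older than the retention period.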

To adjust the minimum retention period of oplog entries, you can modify the value of the storage.oplogMinRetentionHours parameter in the ApsaraDB for MongoDB console. The modification immediately takes effect without the need to restart your instance. For more information about how to modify configuration parameters, see Configure database parameters for an ApsaraDB for MongoDB instance.

To view the retention period of oplog entries in an instance, you can view the value of the Retention Period of Oplogs metric on the Monitoring Data page of the instance in the ApsaraDB for MongoDB console. For more information, see Basic monitoring.

Log backup operations for ApsaraDB for MongoDB instances

Log backup operations for all ApsaraDB for MongoDB instances are performed based on oplog entries. During a log backup of an instance, the relevant control service processes continuously pull the latest oplog entries from the instance and upload the entries to Object Storage Service (OSS) in streaming mode to generate a series of log backup files. When the instance data is restored to a previous point in time, the log backup files are used to replay the pulled oplog entries.

In special cases, records may be missing from log backup files, which causes point-in-time restoration to fail. For more information, see Risk description.

Note

The missing records in log backup files do not have the same definition as the oplog hole mentioned in ApsaraDB for MongoDB documentation.

Best practices

Configure a proper oplog size or retention period

In most cases, you can use the default oplog size. However, we recommend that you increase the oplog size in the following scenarios:

Batch updates to documents at a high frequency
Each batch update operation is recorded as multiple single-document update operations, which generates a large number of oplog entries.
Repeated insert and delete operations
If an inserted document is retained for a period of time and then deleted, the disk space of the database to which the document belongs does not significantly increase. However, a large number of relevant oplog entries are generated in the database.
Large number of in-place updates to a document
If most operations for documents in your business are updates that do not increase the size of the documents, the updates generate a large number of oplog entries. However, the data volume of disks does not significantly change.

In the following scenarios, you can reduce the oplog size to make full use of the disk space:

Scenarios in which read operations are more frequently performed than write operations.
Scenarios in which cold data is stored.

Regardless of whether you specify the oplog size or the retention period for an ApsaraDB for MongoDB instance, we recommend that you keep the oplog window of the instance longer than 24 hours. In scenarios that require an initial sync, the oplog window must cover the time required for a node to complete full data synchronization. In most cases, this time depends on factors such as the overall data volume of the instance, the total number of databases and collections, and the instance type, so the oplog window may need to cover a longer period of time.
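
As a rough planning aid, the oplog size needed for a target window can be estimated from the oplog generation speed. The sketch below assumes a roughly constant generation speed, which real workloads rarely have, so treat the result as a lower bound. The function names and the example figures are hypothetical.

```javascript
// Target window: at least 24 hours, and at least the expected initial sync duration.
function targetWindowHours(expectedInitialSyncHours) {
  return Math.max(24, expectedInitialSyncHours);
}

// Estimated oplog size (GB) needed to keep the target window,
// assuming a constant oplog generation speed in GB per hour.
function requiredOplogSizeGB(generationGBPerHour, windowHours) {
  if (generationGBPerHour < 0 || windowHours < 0) {
    throw new RangeError("inputs must be non-negative");
  }
  return generationGBPerHour * windowHours;
}

// Example: 2 GB/h of oplog and an initial sync expected to take 30 hours.
const windowNeeded = targetWindowHours(30);
console.log(windowNeeded, requiredOplogSizeGB(2, windowNeeded));
```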

Focus on the replication latency of secondary nodes and configure alert rules for the latency

If replication latency occurs on secondary nodes and keeps increasing until it exceeds the oplog window, the nodes enter an abnormal state and cannot recover. Therefore, you must monitor the replication latency of secondary nodes in an ApsaraDB for MongoDB instance. If the replication latency keeps increasing, submit a ticket.

Replication latency on secondary nodes can be caused by the following factors:

Network latency, packet loss, or interruption.
Disk throughput bottlenecks on secondary nodes.
A write concern of {w:1} combined with heavy write workloads.
Primary/secondary replication blocked on secondary nodes due to specific kernel defects.
Other reasons that are not listed.

You can use one of the following methods to view the replication latency of secondary nodes:

View the value of the Primary/Secondary Replication Latency metric on the Monitoring Data page of an instance in the ApsaraDB for MongoDB console. For more information, see Basic monitoring.

Use mongo shell or mongosh to connect to an instance and then run the following command:

rs.printSecondaryReplicationInfo()

The following result is returned:

source: m1.example.net:27017
    syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)
    0 secs (0 hrs) behind the primary
source: m2.example.net:27017
    syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)
    0 secs (0 hrs) behind the primary

This result indicates that no replication latency exists in the two secondary nodes.
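
The same latency can be computed from the member list in the output of rs.status(). The sketch below assumes each member object carries the name, stateStr, and optimeDate fields, which is our understanding of the rs.status() output. replicationLagSeconds is an illustrative helper name, and the sample data is made up.

```javascript
// Compute how far each secondary is behind the primary, in seconds.
// Assumed member shape: { name: string, stateStr: string, optimeDate: Date }.
function replicationLagSeconds(members) {
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no primary in member list");
  return members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // Date subtraction yields milliseconds; clamp small negative skew to 0.
      lagSeconds: Math.max(0, (primary.optimeDate - m.optimeDate) / 1000),
    }));
}

// Made-up sample data for illustration.
const lag = replicationLagSeconds([
  { name: "m0.example.net:27017", stateStr: "PRIMARY", optimeDate: new Date("2014-04-10T14:27:47Z") },
  { name: "m1.example.net:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-04-10T14:27:47Z") },
  { name: "m2.example.net:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-04-10T14:27:32Z") },
]);
console.log(lag);

// In mongosh, the call might look like this (assumption):
// replicationLagSeconds(rs.status().members)
```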

You can use the Alert Rules module in the ApsaraDB for MongoDB console to create an alert rule related to Replication Latency. We recommend that you set the alert threshold to more than 10 seconds. For more information, see Configure threshold-triggered alert rules for an ApsaraDB for MongoDB instance.

Risks

The following reasons may cause missing records in log backup files:

Your instance runs a major version earlier than MongoDB 3.4

Periodic noop writes were introduced in MongoDB 3.4 to support the maxStalenessSeconds parameter of readPreference. For more information, see SERVER-23892. Periodic noop writes ensure that oplog entries continue to be generated even if no data is written. This way, you can determine how far secondary nodes lag behind the primary node in a replica set instance.

If your instance runs a major version earlier than MongoDB 3.4 and no data is written to the instance for a long period of time, oplog entries are no longer updated. Therefore, the log backup files of the instance cannot retrieve new oplog data, which causes missing records in the files. In this case, the instance data cannot be restored to a previous point in time.

A large amount of data is written to your instance for a short period of time and the oplog window of the instance has a short duration

Long-term log backup data in ApsaraDB for MongoDB indicates that when the oplog generation speed of an instance reaches approximately 250 GB/h to 330 GB/h, the log backup process may fail to collect the generated oplog entries in a timely manner, which causes missing records in the log backup files.

You can estimate the oplog generation speed based on the oplog size and the oplog window. For example, if the oplog size of an instance is 20 GB and the oplog window of the instance is 0.06 hour, the oplog generation speed is approximately 333.3 GB/h.
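
The estimate above is a simple division of the oplog size by the oplog window. A minimal sketch follows. The function and constant names are illustrative, and the 250 GB/h threshold is the lower bound of the range mentioned in this section, not a hard limit.

```javascript
// Lower bound of the speed range in which log backup may fall behind (from this topic).
const LOG_BACKUP_RISK_GB_PER_HOUR = 250;

// Estimate the oplog generation speed in GB/h from the oplog size and the oplog window.
function oplogGenerationSpeedGBPerHour(oplogSizeGB, oplogWindowHours) {
  if (oplogWindowHours <= 0) {
    throw new RangeError("oplogWindowHours must be greater than 0");
  }
  return oplogSizeGB / oplogWindowHours;
}

const speed = oplogGenerationSpeedGBPerHour(20, 0.06);
console.log(speed.toFixed(1));                     // "333.3"
console.log(speed >= LOG_BACKUP_RISK_GB_PER_HOUR); // true
```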

High workloads are generated in the following scenarios:

Data Transmission Service (DTS), mongoShake, or other data synchronization tools are synchronizing data.
A large number of INSERT or UPDATE operations are performed in batches within a short period of time.
A large amount of data is written to databases.
A stress test is performed.

To prevent missing records in log backup files due to a large amount of data written within a short period of time, we recommend that you take the following measures:

When you use synchronization tools, you must limit the data write speed. For example, you must limit the concurrency and batch size.
You must set the write concern to {w:"majority"} rather than {w:1}.
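
For custom import scripts, the two measures above can be combined by writing in small batches with a majority write concern and pausing between batches. The following is a sketch only: splitIntoBatches is an illustrative helper, and the collection name, batch size, and pause duration in the commented usage are placeholders that you need to tune for your workload.

```javascript
// Split an array of documents into batches of at most batchSize documents.
function splitIntoBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// In mongosh, usage might look like this (placeholder names and values):
// for (const batch of splitIntoBatches(docs, 500)) {
//   db.getCollection("events").insertMany(batch, { writeConcern: { w: "majority" } });
//   sleep(200); // pause between batches to limit the write speed
// }

console.log(splitIntoBatches([1, 2, 3, 4, 5], 2));
```

With {w:"majority"}, each batch waits until most nodes have acknowledged the write, which naturally slows the writer down when secondary nodes fall behind.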

If the oplog generation speed is always high in your workloads, we recommend that you use the following optimization measures:

Use sharded cluster instances or increase the number of shards to reduce the oplog generation speed on a single shard.
Increase the oplog size or the minimum retention period of oplog entries based on your business scenarios to reserve a longer buffer period for log backup. This way, the log backup process has more time to collect oplog entries before they are deleted.
