You can use the open source MongoShake tool developed by Alibaba Cloud to synchronize data between MongoDB databases. This tool can be used in scenarios such as data analysis, disaster recovery, and active-active replication. This topic describes how to configure MongoShake to synchronize data between ApsaraDB for MongoDB replica set instances in real time.
MongoShake overview
MongoShake is a general-purpose Platform as a Service (PaaS) tool, which is written in the Go language by Alibaba Cloud. MongoShake reads the oplogs of a MongoDB database and replicates data based on the oplogs to meet specific requirements.
MongoShake also allows you to subscribe to and consume MongoDB logs. You can connect to MongoShake by using multiple methods such as SDKs, Kafka, and MetaQ. MongoShake is suitable for scenarios such as log subscription, data synchronization across data centers, and asynchronous cache eviction.
For more information about MongoShake, visit the MongoShake homepage on GitHub.
Supported data sources
Source database | Destination database |
Self-managed MongoDB database hosted on ECS | Self-managed MongoDB database hosted on ECS |
Self-managed MongoDB database hosted on an on-premises machine | Self-managed MongoDB database hosted on an on-premises machine |
ApsaraDB for MongoDB instance | ApsaraDB for MongoDB instance |
MongoDB database on a third-party cloud | MongoDB database on a third-party cloud |
Usage notes
Do not perform data definition language (DDL) operations on the source database before full data synchronization is complete. Otherwise, data inconsistency may occur.
You cannot use MongoShake to synchronize data in the admin and local databases.
Required permissions on databases
Data source to be synchronized | Required permission |
Source ApsaraDB for MongoDB instance | readAnyDatabase permissions, read permissions on the local database, and read/write permissions on the mongoshake database Note The mongoshake database is created by MongoShake at the source when the incremental synchronization task starts. |
Destination ApsaraDB for MongoDB instance | readWriteAnyDatabase permission or readWrite permission on the destination database |
For more information about how to create and authorize MongoDB users, see Manage the permissions of MongoDB database users or db.createUser().
Preparations
For best synchronization performance, make sure that the source ApsaraDB for MongoDB replica set instance resides in a virtual private cloud (VPC). If the source instance resides in the classic network, switch the network type to VPC. For more information, see Switch the network type of an instance from classic network to VPC.
Create an ApsaraDB for MongoDB replica set instance as the synchronization destination. Select the same VPC as that used by the source ApsaraDB for MongoDB replica set instance to minimize network latency. For more information, see Create a replica set instance.
Create an Elastic Compute Service (ECS) instance to run MongoShake. Select the same VPC as that used by the source ApsaraDB for MongoDB instance to minimize network latency. For more information, see Create an ECS instance.
Add the private IP address of the ECS instance to the whitelists of the source and destination ApsaraDB for MongoDB instances. Make sure that the ECS instance can connect to the source and destination ApsaraDB for MongoDB instances. For more information, see Modify an IP address whitelist for an instance.
If the network type does not meet the preceding requirements, you can apply for public endpoints for the source and destination ApsaraDB for MongoDB instances. Then, add the public IP address of the ECS instance to the whitelists of the source and destination ApsaraDB for MongoDB instances. This way, you can synchronize data over the Internet. For more information, see Apply for a public endpoint and Modify an IP address whitelist for an instance.
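Before you install MongoShake, it can help to confirm that the ECS instance actually reaches every node of both instances. The following is a minimal sketch, assuming that bash and the coreutils timeout command are available on the ECS instance. uri_hosts and check_reachable are hypothetical helper names, not part of MongoShake.

```shell
# Print each host:port pair of a mongodb:// connection string URI on its own line.
uri_hosts() {
    # Strip the scheme and the credentials, drop any /database?options suffix,
    # then split the comma-separated host list.
    printf '%s\n' "$1" |
        sed -e 's#^mongodb://##' -e 's#^[^@]*@##' -e 's#[/?].*$##' |
        tr ',' '\n'
}

# Try to open a TCP connection to every host in the URI (3-second timeout each).
# The /dev/tcp pseudo-device is a bash feature, so bash is invoked explicitly.
check_reachable() {
    uri_hosts "$1" | while IFS=: read -r host port; do
        if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            echo "OK   $host:$port"
        else
            echo "FAIL $host:$port"
        fi
    done
}
```

For example, run check_reachable with the connection string URI of the source instance and then with that of the destination instance. A FAIL line usually points to a whitelist or network type problem.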
Procedure
In this example, the /test/mongoshake directory is used as the installation directory for MongoShake.
Log on to the ECS instance that you created to run MongoShake.
Note You can select a connection method based on your business scenario. For more information, see Methods for connecting to an ECS instance.
Run the following command to download the MongoShake package and rename the package mongoshake.tar.gz:
wget "http://docs-aliyun.cn-hangzhou.oss.aliyun-inc.com/assets/attach/196977/jp_ja/1608863913991/mongo-shake-v2.4.16.tar.gz" -O mongoshake.tar.gz
Note The download URL for MongoShake V2.4.16 is used in this example. To download the latest version of MongoShake, visit Releases.
Run the following command to decompress the MongoShake package to the /test/mongoshake directory:
tar zxvf mongoshake.tar.gz && mv mongo-shake-v2.4.16 /test/mongoshake && cd /test/mongoshake/mongo-shake-v2.4.16
Run the vi collector.conf command to modify the collector.conf configuration file of MongoShake. The following table describes the parameters that you must configure to synchronize data between ApsaraDB for MongoDB instances.
Parameter | Description | Example |
mongo_urls | The connection string URI of the source ApsaraDB for MongoDB instance. The database account is test and the database is admin. Note We recommend that you use a VPC endpoint to minimize network latency. For more information about the format of a connection string URI, see Connect to a replica set instance. | mongo_urls = mongodb://test:****@dds-bp19f409d7512****.mongodb.rds.aliyuncs.com:3717,dds-bp19f409d7512****.mongodb.rds.aliyuncs.com:3717 Note The password cannot contain at signs (@). Otherwise, the connection may fail. |
tunnel.address | The connection string URI of the destination ApsaraDB for MongoDB instance. The database account is test and the database is admin. Note We recommend that you use a VPC endpoint to minimize network latency. For more information about the format of a connection string URI, see Connect to a replica set instance. | tunnel.address = mongodb://test:****@dds-bp19f409d7512****.mongodb.rds.aliyuncs.com:3717,dds-bp19f409d7512****.mongodb.rds.aliyuncs.com:3717 Note The password cannot contain at signs (@). Otherwise, the connection may fail. |
sync_mode | The data synchronization method. Valid values: all (performs both full data synchronization and incremental data synchronization), full (performs only full data synchronization), and incr (performs only incremental data synchronization). Note Default value: incr. | sync_mode = all |
Note For more information about all parameters in the collector.conf configuration file, see the Appendix section of this topic.
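If you prefer not to edit the file interactively, the same parameters can be set from a script. The following is a minimal sketch, assuming that collector.conf uses one key = value pair per line as in the examples in this topic. set_conf and uri_has_single_at are hypothetical helper names, not part of MongoShake.

```shell
# Set "key = value" in a configuration file.
# Replaces the first line that starts with "key =", or appends a new line
# if the key is not present. Values are passed through the environment so
# that special characters in passwords are not interpreted by awk.
set_conf() {
    key=$1 value=$2 file=$3
    tmp=$(mktemp) || return 1
    K="$key" V="$value" awk '
        BEGIN { k = ENVIRON["K"]; v = ENVIRON["V"]; done = 0 }
        !done && index($0, k) == 1 && substr($0, length(k) + 1) ~ /^[ \t]*=/ {
            print k " = " v; done = 1; next
        }
        { print }
        END { if (!done) print k " = " v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Succeed only if the URI contains exactly one "@", which is the separator
# between the credentials and the host list. A second "@" means that the
# password itself contains an at sign.
uri_has_single_at() {
    case "$1" in
        *@*) ;;
        *) return 1 ;;
    esac
    case "${1#*@}" in
        *@*) return 1 ;;
    esac
    return 0
}
```

A typical call sequence is uri_has_single_at "$SRC_URI" && set_conf mongo_urls "$SRC_URI" collector.conf, followed by the same for tunnel.address and by set_conf sync_mode all collector.conf. SRC_URI is a placeholder variable that holds your own connection string URI.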
Run the following command to start the data synchronization task and generate the log information:
./collector.linux -conf=collector.conf -verbose
Check the log information. If the following log is displayed, the full data synchronization is complete and the incremental data synchronization starts.
[09:38:57 CST 2019/06/20] [INFO] (mongoshake/collector.(*ReplicationCoordinator).Run:80) finish full sync, start incr sync with timestamp: fullBeginTs[1560994443], fullFinishTs[1560994737]
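The two timestamps in this log entry are UNIX timestamps in seconds, so their difference is the duration of full data synchronization. The following is a minimal sketch that extracts the values from a log and computes the difference. full_sync_seconds is a hypothetical helper name, and the pattern assumes the log entry format shown above.

```shell
# Read log lines from standard input and print the duration of full data
# synchronization in seconds, based on the last "finish full sync" entry.
# Prints nothing if no such entry is found.
full_sync_seconds() {
    sed -n 's/.*finish full sync.*fullBeginTs\[\([0-9]*\)], fullFinishTs\[\([0-9]*\)].*/\1 \2/p' |
        tail -n 1 |
        awk 'NF == 2 { print $2 - $1 }'
}
```

With the default log.dir and log.file settings, the function can be run as full_sync_seconds < logs/collector.log. For the sample entry above, the result is 1560994737 - 1560994443 = 294 seconds.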
Monitor the MongoShake status
When the incremental data synchronization starts, you can open a command line window and run the following command to monitor MongoShake:
cd /test/mongoshake && ./mongoshake-stat --port=9100
mongoshake-stat is a Python script. Before you run the script, install Python 2.7. For more information, visit the Python official website.
The following figure shows sample monitoring information about MongoShake.
The following table describes the parameters that are included in the preceding monitoring information.
Parameter | Description |
logs_get/sec | The number of oplogs obtained per second. |
logs_repl/sec | The number of oplogs for replay operations performed per second. |
logs_success/sec | The number of oplogs for successful replay operations per second. |
lsn.time | The time when the last oplog was sent. |
lsn_ack.time | The time when the destination database acknowledges the write operation. |
lsn_ckpt.time | The time when the last checkpoint was generated. |
now.time | The current time. |
replset | The name of the replica set instance where the source database resides. |
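One rough way to read these values is to compare lsn_ack.time with now.time: the larger the gap, the further the destination lags behind. The following is a minimal sketch, assuming that both values are available as "YYYY-MM-DD HH:MM:SS" strings (the exact format in the monitoring output is an assumption) and that GNU date is installed. lag_seconds is a hypothetical helper name.

```shell
# Print the number of seconds between two timestamps.
# Both arguments are interpreted as UTC, so only their difference matters.
# Requires a date command that supports the -d option (for example, GNU date).
lag_seconds() {
    earlier=$(date -u -d "$1" +%s) || return 1
    later=$(date -u -d "$2" +%s) || return 1
    echo $((later - earlier))
}
```

For example, lag_seconds "<lsn_ack.time>" "<now.time>" prints the gap in seconds. A value that keeps growing suggests that replay on the destination cannot keep up with the source.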
Appendix
Table 1. All parameters in the collector.conf configuration file
Category | Parameter | Description | Example |
N/A | conf.version | The version of the configuration file. Do not change the value. | |
Global configuration options | id | The ID of the synchronization task. This value is customizable. The global configuration includes the log file name, the name of the database that stores the checkpoint information, and the name of the destination database. | |
Global configuration options | master_quorum | Specifies whether the MongoShake node is the active node in high availability scenarios. If you use the active MongoShake node and standby MongoShake node to synchronize data from the same database, set this parameter to Valid values: Note Default value: false. | |
Global configuration options | full_sync.http_port | The HTTP port used to view the status of full data synchronization in MongoShake over the Internet. Note Default value: 9101. | |
Global configuration options | incr_sync.http_port | The HTTP port used to view the status of incremental data synchronization in MongoShake over the Internet. Note Default value: 9100. | |
Global configuration options | system_profile_port | The profiling port used to view internal stack information. | |
Global configuration options | log.level | The level of the logs to be generated. Valid values: Default value: info. | |
Global configuration options | log.dir | The directory where the log file and PID file are stored. If you do not configure this parameter, the log file and PID file are stored in the logs directory in the working directory. Note This parameter must be set to an absolute path. | |
Global configuration options | log.file | The name of the log file. This value is customizable. Note Default value: collector.log. | |
Global configuration options | log.flush | Specifies whether to display every log entry on the screen. Valid values: Note Default value: false. | |
Global configuration options | sync_mode | The data synchronization method. Valid values: all, full, and incr. Note Default value: incr. | |
Global configuration options | mongo_urls | The connection string URI of the source ApsaraDB for MongoDB instance. The database account is test and the database is admin. Note | |
Global configuration options | mongo_cs_url | The endpoint of a ConfigServer node. If the source ApsaraDB for MongoDB instance is a sharded cluster instance, you must configure this parameter. For more information about how to apply for an endpoint for a ConfigServer node, see Apply for an endpoint for a shard or ConfigServer node in a sharded cluster instance. The database account is test and the database is admin. | |
Global configuration options | mongo_s_url | The endpoint of a mongos node. If the source ApsaraDB for MongoDB instance is a sharded cluster instance, you must configure this parameter. You must specify the endpoint of at least one mongos node. Separate the endpoints of multiple mongos nodes with commas (,). For more information about how to apply for an endpoint for a mongos node, see Apply for an endpoint for a shard or ConfigServer node in a sharded cluster instance. The database account is test and the database is admin. | |
Global configuration options | tunnel | The type of the tunnel used for synchronization. Valid values: Note Default value: direct. | |
Global configuration options | tunnel.address | The address used to connect to the destination ApsaraDB for MongoDB instance through the tunnel. The database account is test and the database is admin. | |
Global configuration options | tunnel.message | The type of the data to be written to the tunnel. This parameter is valid only when the tunnel parameter is set to Note Default value: raw. | |
Global configuration options | mongo_connect_mode | The type of the node from which MongoShake pulls data. This parameter is valid only when the tunnel parameter is set to Note Default value: secondaryPreferred. | |
Global configuration options | filter.namespace.black | The namespace blacklist for data synchronization. The specified namespaces are not synchronized to the destination database. Separate multiple namespaces with semicolons (;). Note A namespace is the standard name of a collection or index in ApsaraDB for MongoDB. It consists of a database name and a collection or index name. Example: | |
Global configuration options | filter.namespace.white | The namespace whitelist for data synchronization. Only the specified namespaces are synchronized to the destination database. Separate multiple namespaces with semicolons (;). | |
Global configuration options | filter.pass.special.db | The special database from which you want to synchronize data to the destination database. You can specify multiple special databases. By default, the data in special databases such as admin, local, mongoshake, config, and system.views is not synchronized. You can configure this parameter to synchronize data from special databases. Separate multiple database names with semicolons (;). | |
Global configuration options | filter.ddl_enable | Specifies whether to synchronize DDL operations. Valid values: Note If the source ApsaraDB for MongoDB instance is a sharded cluster instance, you cannot set this parameter to true. | |
Global configuration options | checkpoint.storage.url | The storage location of checkpoints, which are used for resumable transmission. If you do not configure this parameter, MongoShake writes checkpoints to the following databases based on the type of the source ApsaraDB for MongoDB instance: The database account is test and the database is admin. | |
Global configuration options | checkpoint.storage.db | The name of the database that stores checkpoints. Note Default value: mongoshake. | |
Global configuration options | checkpoint.storage.collection | The name of the collection that stores checkpoints. If you use the active MongoShake node and standby MongoShake node to synchronize data from the same database, you can change this collection name to avoid the conflict caused by duplicate collection names. Note Default value: ckpt_default. | |
Global configuration options | checkpoint.start_position | The start position for resumable transmission. If a checkpoint exists, this parameter is invalid. Specify a value for this parameter in the following format: Note Default value: 1970-01-01T00:00:00Z. | |
Global configuration options | transform.namespace | The rule for renaming the source database or collection in the destination database. For example, you change the database name and collection name from | |
Full data synchronization options | full_sync.reader.collection_parallel | The maximum number of collections that can be concurrently pulled by MongoShake at a time. | |
Full data synchronization options | full_sync.reader.write_document_parallel | The number of concurrent threads used by MongoShake to write a collection. | |
Full data synchronization options | full_sync.reader.document_batch_size | The number of documents to be written to the destination ApsaraDB for MongoDB instance at a time. For example, the value 128 indicates that 128 documents are written to the destination ApsaraDB for MongoDB instance at a time. | |
Full data synchronization options | full_sync.collection_exist_drop | Specifies whether to delete the collections in the destination database that have the same names as the source collections before synchronization. Valid values: | |
Full data synchronization options | full_sync.create_index | Specifies whether to create indexes after the synchronization is complete. Valid values: | |
Full data synchronization options | full_sync.executor.insert_on_dup_update | Specifies whether to change the | |
Full data synchronization options | full_sync.executor.filter.orphan_document | Specifies whether to filter out orphaned documents if the source ApsaraDB for MongoDB instance is a sharded cluster instance. Valid values: | |
Full data synchronization options | full_sync.executor.majority_enable | Specifies whether to enable the majority write feature in the destination ApsaraDB for MongoDB instance. Valid values: | |
Incremental data synchronization options | incr_sync.mongo_fetch_method | The method used to pull incremental data. Valid values: Default value: oplog. | |
Incremental data synchronization options | incr_sync.oplog.gids | The global ID used to implement two-way replication for ApsaraDB for MongoDB instances. | |
Incremental data synchronization options | incr_sync.shard_key | The method used to distribute concurrent requests to internal worker threads. Do not modify this parameter value. | |
Incremental data synchronization options | incr_sync.worker | The number of concurrent threads used to transmit oplogs. If your instance provides sufficient performance, you can increase the number of concurrent threads. Note If the source ApsaraDB for MongoDB instance is a sharded cluster instance, the number of concurrent threads must be equal to the number of shards. | |
Incremental data synchronization options | incr_sync.worker.oplog_compressor | Specifies whether to compress data to reduce network bandwidth usage. Valid values: Note This parameter is valid only when the tunnel parameter is not set to | |
Incremental data synchronization options | incr_sync.target_delay | The time delayed for synchronizing data between the source and destination ApsaraDB for MongoDB instances. By default, changes in the source database are synchronized to the destination database in real time. To avoid invalid operations, you can set this parameter to delay the synchronization. For example, if you set the Note The value 0 indicates that data is synchronized in real time. | |
Incremental data synchronization options | incr_sync.worker.batch_queue_size, incr_sync.adaptive.batching_max_size, incr_sync.fetcher.buffer_capacity | The parameters for configuring internal queues in MongoShake. Do not modify these parameters unless otherwise required. | |
Direct synchronization options (valid only when the tunnel parameter is set to | incr_sync.executor.upsert | Specifies whether to change the | |
Direct synchronization options (valid only when the tunnel parameter is set to | incr_sync.executor.insert_on_dup_update | Specifies whether to change the | |
Direct synchronization options (valid only when the tunnel parameter is set to | incr_sync.conflict_write_to | Specifies whether to record conflicting documents if write conflicts occur during the synchronization. Valid values: | |
Direct synchronization options (valid only when the tunnel parameter is set to | incr_sync.executor.majority_enable | Specifies whether to enable the majority write feature in the destination ApsaraDB for MongoDB instance. Valid values: Note The majority write feature may compromise performance. | |
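To show how entries from Table 1 look in the file, the following is a minimal sketch that writes a fragment containing only the default values stated in the table and reads one of them back. It assumes the key = value syntax used in the examples earlier in this topic. write_defaults and get_conf are hypothetical helper names, and the fragment is not a complete collector.conf file.

```shell
# Write a configuration fragment that spells out defaults documented in Table 1.
write_defaults() {
    cat > "$1" <<'EOF'
# Defaults taken from Table 1. Connection strings are intentionally omitted.
sync_mode = incr
tunnel = direct
mongo_connect_mode = secondaryPreferred
full_sync.http_port = 9101
incr_sync.http_port = 9100
log.level = info
log.file = collector.log
checkpoint.storage.db = mongoshake
checkpoint.storage.collection = ckpt_default
incr_sync.mongo_fetch_method = oplog
EOF
}

# Print the value of the first "key = value" line that matches the key.
get_conf() {
    sed -n "s/^$1[ ]*=[ ]*//p" "$2" | head -n 1
}
```

For example, after write_defaults sample.conf, the command get_conf tunnel sample.conf prints direct. The same get_conf call can be pointed at a real collector.conf file to check which value is currently in effect.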
FAQ
For more information, see FAQ.