DataWorks provides MongoDB Reader and MongoDB Writer for you to read data from and write data to MongoDB data sources. This topic describes the capabilities of synchronizing data from or to MongoDB data sources.
Supported MongoDB versions
Only MongoDB 4.x and 5.x data sources are supported.
Limits
DataWorks connects to a MongoDB database by using an account created for that database. If you use an ApsaraDB for MongoDB data source, you can use the root account that is automatically created for the related ApsaraDB for MongoDB instance. However, for security purposes, we recommend that you do not use the root account when you add a MongoDB data source.
If you want to add a MongoDB sharded cluster instance to DataWorks as a data source, you must configure the address of a mongos node in the instance when you add the data source. The address of a shard node in the cluster is not supported. If you configure the address of a shard node in the cluster when you add the data source, only data stored in the specified shard, instead of all data stored in the cluster, can be queried when the related synchronization task reads data. For information about mongos nodes and shard nodes, see mongos and Shards in the documentation for open source MongoDB.
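For example, the following addresses illustrate the difference; the hostnames and port are hypothetical:
mongos.example.com:3717     (address of a mongos node: supported)
shard01.example.com:3717    (address of a shard node: not supported)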
MongoDB clusters that contain primary and secondary nodes are not supported.
If the number of parallel threads specified for a synchronization task that uses MongoDB Reader is greater than 1, all _id fields in the collection that is specified for the task must be of the same data type, such as STRING or ObjectId. If the _id fields are of different data types, some data cannot be synchronized.
Note: If the number of parallel threads is greater than 1, the _id fields are used to split the synchronization task. In this case, _id fields of the COMBINE data type are not supported. If the _id fields are of different data types, you can use a single thread to synchronize data and leave the splitFactor parameter empty or set it to 1.
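For example, if the _id fields in a collection are of different data types, you can run the task with a single thread. The following snippet is a minimal sketch of the relevant setting in the code editor script, assuming the standard script format; all other required parameters are omitted:
"setting": {
    "speed": {
        "concurrent": 1
    }
}
You can then leave the splitFactor parameter in the reader configuration empty or set it to 1, as described above.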
Data Integration does not support the ARRAY data type, but MongoDB does, and MongoDB arrays support indexing. You can configure parameters to convert strings into MongoDB arrays, and MongoDB Writer then uses parallel threads to write the arrays to a MongoDB database.
You cannot access a self-managed MongoDB database over the Internet. You can access such a database only over an Alibaba Cloud internal network.
MongoDB clusters that are deployed based on Docker are not supported.
Data Integration does not allow you to specify the columns from which you want to read data in the configuration of the query parameter.
In a batch synchronization task that uses a MongoDB data source, if MongoDB cannot obtain field structures, the system automatically generates field mappings based on the following fields: col1, col2, col3, col4, col5, and col6.
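In this case, the generated mappings are conceptually equivalent to the following column configuration; the STRING type shown here is an assumption for illustration:
"column": [
    { "name": "col1", "type": "string" },
    { "name": "col2", "type": "string" },
    { "name": "col3", "type": "string" },
    { "name": "col4", "type": "string" },
    { "name": "col5", "type": "string" },
    { "name": "col6", "type": "string" }
]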
When a batch synchronization task that uses a MongoDB data source is run, the splitVector command is preferentially used to split the task. However, some MongoDB versions do not support the splitVector command. In this case, the error message no such cmd splitVector is reported. To resolve this issue, switch to the code editor from the top toolbar of the configuration tab of the task, and add the following configuration item in the parameter field to avoid using the splitVector command:
"useSplitVector" : false
Supported data types
MongoDB data types supported by MongoDB Reader
MongoDB Reader supports most MongoDB data types. Make sure that the data types of your database are supported.
For fields of the supported data types, take note of the following items:
For fields of basic data types, the synchronization task automatically reads data of the fields from the related paths based on the field names that are specified in the column parameter by using the code editor, and converts the data types of the fields based on the data type mappings. You do not need to configure the type attribute in the column parameter. For information about the column parameter, see the Appendix: Code and parameters section in this topic.
Data type | MongoDB Reader for batch data read | Description |
ObjectId | Supported | The ObjectId data type. |
Double | Supported | A 64-bit floating-point number. |
32-bit integer | Supported | A 32-bit integer. |
64-bit integer | Supported | A 64-bit integer. |
Decimal128 | Supported | A 128-bit decimal-based floating-point number. Note: If the type attribute is configured as an embedded data type or the COMBINE data type for a field, the field is processed as an object when it is converted into a JSON string. In this case, you must set the decimal128OutputType parameter to bigDecimal to convert the data type of the field into DECIMAL. |
String | Supported | The STRING data type. |
Boolean | Supported | The BOOLEAN data type. |
Timestamp | Supported | The TIMESTAMP data type. Note: The BSONTimestamp class is used to store timestamps. You do not need to consider the impact of different time zones. For information about the impact of time zones on data in MongoDB, see Time zone issues in MongoDB. |
Date | Supported | The DATE data type. |
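For example, the following minimal sketch of a reader parameter block reads fields of basic data types without the type attribute and sets the decimal128OutputType parameter as described in the Decimal128 note; the field names user_name and price are hypothetical:
"parameter": {
    "decimal128OutputType": "bigDecimal",
    "column": [
        { "name": "_id" },
        { "name": "user_name" },
        { "name": "price" }
    ]
}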
For fields of complex data types, you can configure the type attribute to specify how to process the fields.
Data type | MongoDB Reader for batch data read | Description |
Document | Supported | The embedded document data type. If the type attribute is not configured, fields of this data type are converted into JSON strings. If the type attribute is configured as DOCUMENT, fields are processed as an embedded data type, and MongoDB Reader reads data from the fields based on the paths of the fields. For more information, see the Example for using the DOCUMENT data type to recursively parse nested fields section in this topic. |
Array | Supported | The ARRAY data type. If the type attribute is configured as array.json or arrays, fields of this data type are converted into JSON strings. If the type attribute is configured as array or document.array, fields are concatenated into strings by using a delimiter. The delimiter is specified by the splitter attribute in the column parameter, and commas (,) are used by default. Important: Data Integration does not support the ARRAY data type, but MongoDB does, and MongoDB arrays support indexing. You can configure parameters to convert strings into MongoDB arrays, and MongoDB Writer then uses parallel threads to write the arrays to a MongoDB database. |
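For example, the following column configuration is a minimal sketch that converts one array field into a JSON string and concatenates another array field by using a custom delimiter; the field names tags and scores are hypothetical:
"column": [
    {
        "name": "tags",
        "type": "array.json"
    },
    {
        "name": "scores",
        "type": "array",
        "splitter": ";"
    }
]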
Supported special data type: COMBINE
Data type | MongoDB Reader for batch data read | Description |
Combine | Supported | The custom data type supported by Data Integration. If the type attribute is configured as combine, MongoDB Reader combines multiple fields in a MongoDB document into a single JSON string. For more information, see the Example for using the COMBINE data type section in this topic. |
Data type mappings based on which MongoDB Reader converts data types
The following table lists the data type mappings based on which MongoDB Reader converts data types.
Category | MongoDB data type |
LONG | INT, LONG, document.INT, and document.LONG |
DOUBLE | DOUBLE and document.DOUBLE |
STRING | STRING, ARRAY, document.STRING, document.ARRAY, and COMBINE |
DATE | DATE and document.DATE |
BOOLEAN | BOOLEAN and document.BOOLEAN |
BYTES | BYTES and document.BYTES |
Data type mappings based on which MongoDB Writer converts data types
Category | MongoDB data type |
Integer | INT and LONG |
Floating point | DOUBLE |
String | STRING and ARRAY |
Date and time | DATE |
Boolean | BOOL |
Binary | BYTES |
Example for using the COMBINE data type
When MongoDB Reader reads data from a MongoDB database, it can combine multiple fields in MongoDB documents into a JSON string. For example, doc1, doc2, and doc3 are three MongoDB documents that contain different fields. In the following representation, only the keys of the fields are shown, not the key-value pairs. The keys a and b are common fields in the three documents, and each key x_n represents a document-specific field.
doc1: a b x_1 x_2
doc2: a b x_2 x_3 x_4
doc3: a b x_5
To import the preceding MongoDB documents to MaxCompute, you must specify the fields that you want to retain, specify a name for each JSON string that is obtained, and specify the data type for each obtained JSON string as COMBINE in the configuration file. Make sure that the name of each obtained JSON string is different from the name of an existing field in the documents.
"column": [
{
"name": "a",
"type": "string",
},
{
"name": "b",
"type": "string",
},
{
"name": "doc",
"type": "combine",
}
]
The following table lists the output in MaxCompute.
odps_column1 | odps_column2 | odps_column3 |
a | b | {x_1,x_2} |
a | b | {x_2,x_3,x_4} |
a | b | {x_5} |
When you combine multiple fields in a MongoDB document into a JSON string of the COMBINE data type, the result that is exported to MaxCompute contains only the fields that are specific to the document. Common fields are automatically removed.
In the preceding example, a and b are common fields in the three documents. After the fields in the document doc1: a b x_1 x_2 are combined, the intermediate result is {a,b,x_1,x_2}. When the result is exported to MaxCompute, the common fields a and b are removed, and the final result is {x_1,x_2}.
Example for using the DOCUMENT data type to recursively parse nested fields
If fields in a MongoDB document are nested, you can set the data type of the nested fields that you want to synchronize to DOCUMENT. This way, the fields are recursively parsed when their values are written to the destination.
For example, the following MongoDB document contains the nested field a.b.c, whose value "this is value" needs to be synchronized to a destination:
{ "name": "name1", "a": { "b": { "c": "this is value" } } }
You can configure the following fields that you want to synchronize to a destination:
{"name":"_id","type":"string"} {"name":"name","type":"string"} {"name":"a.b.c","type":"document"}
After you configure the preceding fields and run the synchronization task, the value of the nested field a.b.c, which is this is value, is written to the field c in the destination.
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see the Appendix: Code and parameters section in this topic.
Configure a real-time synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Create a real-time synchronization task to synchronize incremental data from a single table and Configure a real-time synchronization task in DataStudio.
Configure a synchronization task to synchronize all data in a database
For information about how to configure a synchronization task to implement batch synchronization of all data in a database, one-time full synchronization and real-time incremental synchronization of data in a database, or real-time synchronization of data from tables in sharded databases, see Configure a synchronization task in Data Integration.
Best practices
FAQ
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.