DataWorks Data Integration provides MongoDB Writer, which lets you write data from other data sources to a MongoDB data source. This topic uses an example to show how to synchronize data from a MaxCompute data source to a MongoDB data source by using a batch synchronization task in Data Integration.
Prerequisites
DataWorks is activated and a MaxCompute data source is added to a DataWorks workspace.
An exclusive resource group for Data Integration is purchased and configured. The resource group is used to run the batch synchronization task in this topic. For more information, see Create and use an exclusive resource group for Data Integration.
Note: You can also use new-version resource groups. For more information, see Create and use a resource group of the new version.
Make preparations
In this example, you must prepare a MongoDB data collection and a MaxCompute table for data synchronization.
Prepare a MaxCompute table and construct data for the table.
Create a partitioned table named test_write_mongo. The partition field is pt.
CREATE TABLE IF NOT EXISTS test_write_mongo(
    id STRING,
    col_string STRING,
    col_int int,
    col_bigint bigint,
    col_decimal decimal,
    col_date DATETIME,
    col_boolean boolean,
    col_array string
) PARTITIONED BY (pt STRING) LIFECYCLE 10;
Add the value 20230215 for the partition field.
insert into test_write_mongo partition (pt='20230215')
values ('11','name11',1,111,1.22,cast('2023-02-15 15:01:01' as datetime),true,'1,2,3');
Check whether the partitioned table is correctly created.
SELECT * FROM test_write_mongo WHERE pt='20230215';
Prepare a MongoDB data collection to which you want to write the data read from the partitioned MaxCompute table.
In this example, ApsaraDB for MongoDB is used and a data collection named test_write_mongo is created.
db.createCollection('test_write_mongo')
Configure a batch synchronization task
Step 1: Add a MongoDB data source
Add a MongoDB data source and make sure that a network connection is established between the data source and the exclusive resource group for Data Integration. For more information, see Add a MongoDB data source.
Step 2: Create and configure a batch synchronization task
Create a batch synchronization task on the DataStudio page in the DataWorks console and configure its settings, such as the source and destination. This step describes only some of the items that you must configure. For the other items, retain the default values. For more information, see Configure a batch synchronization task by using the codeless UI.
Establish network connections between the data sources and the exclusive resource group for Data Integration.
Select the MongoDB data source that you added in Step 1, the MaxCompute data source that you added, and the exclusive resource group for Data Integration. Then, test the network connectivity between the data sources and the resource group.
Select the data sources.
Select the partitioned MaxCompute table and the MongoDB data collection that you prepared in the Make preparations section. The following table describes how to configure the key parameters for the batch synchronization task.
Parameter
Description
WriteMode(overwrite or not)
Specifies whether to overwrite existing data in the MongoDB data collection. If you set this parameter to Yes, you must configure the ReplaceKey parameter.
WriteMode(overwrite or not):
If you set the value to No, data is inserted into the MongoDB data collection as new data entries. The value No is the default value.
If you set the value to Yes, you must configure the ReplaceKey parameter. This setting ensures that an existing data entry is overwritten by the new data entry that has the same primary key value.
ReplaceKey: the primary key for each data record. Data is overwritten based on the primary key. You can specify only one primary key column. In most cases, the primary key in the MongoDB data collection is used.
Note: If you set the WriteMode(overwrite or not) parameter to Yes and specify a field other than the _id field as the primary key, an error similar to the following may occur when the task is run: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2". The reason is that the value of the _id field does not match the value of the replaceKey parameter for some of the data that is written to the destination MongoDB data collection. For more information, see Error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2".
Statement Run Before Writing
The statement that you want to execute before data synchronization. Configure the statement in JSON format with the type and json properties.
type: required. Valid values: remove and drop. The values must be in lowercase letters.
json:
If type is set to remove, this property is required. You must configure the property based on the syntax of the standard MongoDB query operations. For more information, see Query Documents.
If type is set to drop, this property is not required.
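The rules in the preceding table can be summarized as a small check. The following is only a sketch: the function checkWriterSettings and the property names overwrite, replaceKey, and preStatement are hypothetical names chosen for illustration and are not the actual configuration keys of MongoDB Writer. The query in the json value is also only a sample.

```javascript
// Hypothetical helper that applies the rules from the parameter table.
// The property names (overwrite, replaceKey, preStatement) are illustrative
// assumptions, not the writer's real configuration keys.
function checkWriterSettings(settings) {
  const problems = [];

  // Overwrite mode requires exactly one replace key column.
  if (settings.overwrite) {
    const key = settings.replaceKey;
    if (typeof key !== "string" || key.trim() === "") {
      problems.push("replaceKey is required when overwrite is enabled");
    } else if (key.includes(",")) {
      problems.push("replaceKey must name exactly one column");
    }
  }

  // The statement run before writing needs a lowercase type,
  // and a json query when the type is remove.
  const pre = settings.preStatement;
  if (pre !== undefined) {
    if (pre.type !== "remove" && pre.type !== "drop") {
      problems.push('type is required and must be "remove" or "drop" in lowercase');
    } else if (pre.type === "remove" && !pre.json) {
      problems.push('json is required when type is "remove"');
    }
  }
  return problems;
}

// A valid combination: overwrite by _id and remove matching documents first.
const valid = checkWriterSettings({
  overwrite: true,
  replaceKey: "_id",
  preStatement: { type: "remove", json: '{"col_int": 1}' },
});
console.log(valid); // prints []

// Two problems: no replaceKey, and the type is not lowercase.
const invalid = checkWriterSettings({
  overwrite: true,
  preStatement: { type: "Drop" },
});
console.log(invalid.length); // prints 2
```

The check returns a list of problems instead of throwing an error, so all violations in a configuration can be reported at once.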
Configure field mappings.
If a MongoDB data source is added, fields are mapped by row by default: each field in a row of the source is mapped to the field in the same row of the destination. You can also click the icon to manually edit fields in the source table. The following sample code provides an example on how to edit fields in the source table:
{"name":"id","type":"string"}
{"name":"col_string","type":"string"}
{"name":"col_int","type":"long"}
{"name":"col_bigint","type":"long"}
{"name":"col_decimal","type":"double"}
{"name":"col_date","type":"date"}
{"name":"col_boolean","type":"bool"}
{"name":"col_array","type":"array","splitter":","}
After you edit the fields, the new mappings between the source fields and destination fields are displayed on the configuration tab of the task.
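Before you paste edited field definitions into the task, you may want to check them locally. The following sketch assumes that each definition is a JSON object on its own line. The helper checkColumnDefinitions is a hypothetical name and is not part of DataWorks. It compares each type against the values listed in the appendix of this topic and checks that array fields have a splitter.

```javascript
// Values of the type parameter as listed in the appendix of this topic.
const SUPPORTED_TYPES = ["int", "long", "double", "string", "bool", "date", "array"];

// Hypothetical helper: parses one JSON object per line and reports problems.
function checkColumnDefinitions(text) {
  const problems = [];
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "");

  lines.forEach((line, i) => {
    let col;
    try {
      col = JSON.parse(line);
    } catch (err) {
      problems.push(`line ${i + 1}: not valid JSON`);
      return;
    }
    const type = String(col.type || "").toLowerCase();
    if (!col.name) {
      problems.push(`line ${i + 1}: name is missing`);
    }
    if (!SUPPORTED_TYPES.includes(type)) {
      problems.push(`line ${i + 1}: unsupported type "${col.type}"`);
    } else if (type === "array" && !col.splitter) {
      problems.push(`line ${i + 1}: array fields need a splitter`);
    }
  });
  return problems;
}

// Three of the definitions used in this example.
const definitions = [
  '{"name":"id","type":"string"}',
  '{"name":"col_int","type":"long"}',
  '{"name":"col_array","type":"array","splitter":","}',
].join("\n");

console.log(checkColumnDefinitions(definitions)); // prints []
```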
Step 3: Commit and deploy the batch synchronization task
If you use a workspace in standard mode and you want to periodically schedule the batch synchronization task in the production environment, you can commit and deploy the task to the production environment. For more information, see Deploy nodes.
Step 4: Run the batch synchronization task and view the synchronization result
After you complete the preceding configurations, you can run the batch synchronization task. After the task finishes running, you can view the data synchronized to the MongoDB data collection.
Appendix: Data type conversion during data synchronization
Values of the type parameter
The following data types are supported for the type parameter: INT, LONG, DOUBLE, STRING, BOOL, DATE, and ARRAY.
Data written to the MongoDB data collection when type is set to ARRAY
If you set the type parameter to ARRAY, you must configure the splitter property. This way, data can be written to the MongoDB data collection as arrays. Example:
The source data is a string: a,b,c.
You set the type parameter to ARRAY and the splitter property to a comma (,) for the batch synchronization task.
The data written to the destination is ["a","b","c"] when the task is run.
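The conversion in this example can be modeled as a plain string split. The following sketch only illustrates the documented result and is not the writer's actual implementation. The function toArrayValue is a hypothetical name, and keeping the array elements as strings is an assumption of the sketch.

```javascript
// Illustrative only: models the ARRAY conversion as a plain string split.
// Elements stay strings here, which is an assumption of this sketch.
function toArrayValue(source, splitter) {
  if (typeof splitter !== "string" || splitter === "") {
    throw new Error("splitter is required when type is ARRAY");
  }
  return source.split(splitter);
}

// The example from this appendix.
console.log(toArrayValue("a,b,c", ",")); // prints [ 'a', 'b', 'c' ]

// The col_array value of the sample row inserted into test_write_mongo.
console.log(toArrayValue("1,2,3", ",")); // prints [ '1', '2', '3' ]
```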