Use a batch synchronization task to write data to a MongoDB data source - DataWorks

DataWorks Data Integration provides MongoDB Writer, which allows you to write data from other data sources to a MongoDB data source. This topic shows how to use a batch synchronization task in Data Integration to synchronize data from a MaxCompute data source to a MongoDB data source.

Prerequisites

DataWorks is activated and a MaxCompute data source is added to a DataWorks workspace.
An exclusive resource group for Data Integration is purchased and configured. The resource group is used to run the batch synchronization task in this topic. For more information, see Create and use an exclusive resource group for Data Integration.
Note
You can also use new-version resource groups. For more information, see Create and use a resource group of the new version.

Make preparations

In this example, you must prepare a MongoDB data collection and a MaxCompute table for data synchronization.

Prepare a MaxCompute table and construct data for the table.

Create a partitioned table named test_write_mongo. The partition field is pt.

CREATE TABLE IF NOT EXISTS test_write_mongo(
    id STRING,
    col_string STRING,
    col_int INT,
    col_bigint BIGINT,
    col_decimal DECIMAL,
    col_date DATETIME,
    col_boolean BOOLEAN,
    col_array STRING
) PARTITIONED BY (pt STRING) LIFECYCLE 10;

Insert a sample row into the partition in which the value of pt is 20230215.

insert into test_write_mongo partition (pt='20230215')
values ('11','name11',1,111,1.22,cast('2023-02-15 15:01:01' as datetime),true,'1,2,3');

Check whether the partitioned table is correctly created.
```
SELECT * FROM test_write_mongo
WHERE pt='20230215';
```

Prepare a MongoDB data collection to which you want to write the data read from the partitioned MaxCompute table.
In this example, ApsaraDB for MongoDB is used and a data collection named test_write_mongo is created.
```
db.createCollection('test_write_mongo')
```

Configure a batch synchronization task

Step 1: Add a MongoDB data source

Add a MongoDB data source and make sure that a network connection is established between the data source and the exclusive resource group for Data Integration. For more information, see Add a MongoDB data source.

Step 2: Create and configure a batch synchronization task

Create a batch synchronization task on the DataStudio page in the DataWorks console, and configure the source, destination, and other items for the task. This step describes only the key items that you must configure. Retain the default values for the other items. For more information, see Configure a batch synchronization task by using the codeless UI.

Establish network connections between the data sources and the exclusive resource group for Data Integration.
Select the MongoDB data source that you added in Step 1, the MaxCompute data source that you added, and the exclusive resource group for Data Integration. Then, test the network connectivity between the data sources and the resource group.

Select the data sources.

Select the partitioned MaxCompute table and the MongoDB data collection that you prepared in the Make preparations section. The following table describes how to configure the key parameters for the batch synchronization task.

Parameter

Description

WriteMode(overwrite or not)

Specifies whether to overwrite existing data in the MongoDB data collection.
- If you set the value to No, data is inserted into the MongoDB data collection as new data entries. The value No is the default value.
- If you set the value to Yes, you must configure the ReplaceKey parameter. This setting ensures that an existing data entry is overwritten by the new data entry that has the same primary key value.
ReplaceKey: the primary key for each data record. Data is overwritten based on the primary key. You can specify only one primary key column. In most cases, the primary key in the MongoDB data collection is used.

Note

If you set the WriteMode(overwrite or not) parameter to Yes, and you specify a field other than the _id field as the primary key, an error that is similar to the following error may occur when the task is run: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2". The reason is that the value of the _id field does not match the value of the replaceKey parameter for some of the data that is written to the destination MongoDB data collection. For more information, see Error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2".
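
The error in the preceding note can be reproduced with a simplified model. The following JavaScript sketch runs in Node.js without a MongoDB connection. The function writeWithReplaceKey and the in-memory array are hypothetical stand-ins used only for illustration, not the MongoDB Writer implementation. The sketch assumes that overwriting means replacing the document whose replace key value matches the incoming record.

```javascript
// Simplified in-memory model of overwrite-by-key; NOT the actual MongoDB Writer code.
// writeWithReplaceKey is a hypothetical helper used only for illustration.
function writeWithReplaceKey(collection, record, replaceKey) {
  // Look for an existing document that has the same value in the replaceKey field.
  const index = collection.findIndex((doc) => doc[replaceKey] === record[replaceKey]);
  if (index === -1) {
    collection.push({ ...record }); // no match: the record is inserted as a new document
    return "inserted";
  }
  const existing = collection[index];
  // MongoDB does not allow a replacement to change _id, so a mismatch is an error.
  if ("_id" in record && record._id !== existing._id) {
    throw new Error(
      `the (immutable) field '_id' was found to have been altered to _id: "${record._id}"`
    );
  }
  collection[index] = { ...record, _id: existing._id };
  return "replaced";
}

// The existing document is matched on "id", but the incoming record carries another _id.
const target = [{ _id: "1", id: "11", col_string: "name11" }];
try {
  writeWithReplaceKey(target, { _id: "2", id: "11", col_string: "name11-new" }, "id");
} catch (err) {
  console.log(err.message);
}
```

In this model, using _id itself as the replace key avoids the mismatch, because a matched document always has the same _id value as the incoming record.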

Statement Run Before Writing

The statement that you want to execute before data synchronization. Configure the statement in JSON format by using the type and json properties.

type: required. Valid values: remove and drop. The values must be in lowercase letters.
json:
- If type is set to remove, this property is required. You must configure the property based on the syntax of the standard MongoDB query operations. For more information, see Query Documents.
- If type is set to drop, this property is not required.
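
The rules for the type and json properties can be checked before the task is run. The following JavaScript sketch is a minimal illustration: validatePreWriteStatement is a hypothetical helper that is not part of DataWorks, and the shape of the statement object, with json holding a MongoDB query filter written as a string, is an assumption.

```javascript
// Hypothetical validator for the rules above; not part of DataWorks.
function validatePreWriteStatement(statement) {
  const errors = [];
  // type is required and must be exactly "remove" or "drop" (lowercase).
  if (statement.type !== "remove" && statement.type !== "drop") {
    errors.push('type must be "remove" or "drop" in lowercase letters');
  }
  // json is required only when type is "remove".
  const hasJson =
    statement.json !== undefined && statement.json !== null && statement.json !== "";
  if (statement.type === "remove" && !hasJson) {
    errors.push('json is required when type is "remove"');
  }
  return errors;
}

// Assumed shape: json holds a MongoDB query filter written as a string.
console.log(validatePreWriteStatement({ type: "remove", json: '{"col_int": 1}' })); // no errors
console.log(validatePreWriteStatement({ type: "drop" })); // no errors
console.log(validatePreWriteStatement({ type: "Remove" })); // one error: type is not lowercase
```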

Configure field mappings.
When the destination is a MongoDB data source, fields in a row of the source are mapped to the fields in the same row of the destination by default. You can also click the icon to manually edit fields in the source table. The following sample code shows how to edit fields in the source table:
```
{"name":"id","type":"string"}
{"name":"col_string","type":"string"}
{"name":"col_int","type":"long"}
{"name":"col_bigint","type":"long"}
{"name":"col_decimal","type":"double"}
{"name":"col_date","type":"date"}
{"name":"col_boolean","type":"bool"}
{"name":"col_array","type":"array","splitter":","}
```
After you edit the fields, the new mappings between the source fields and destination fields are displayed on the configuration tab of the task.
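
Field definitions such as the preceding ones can be checked before you paste them into the editor. The following JavaScript sketch assumes one JSON object per line, compares type values case-insensitively against the data types listed in the appendix of this topic, and requires a splitter for array fields. The function parseFieldDefinitions is a hypothetical helper, not a DataWorks API.

```javascript
// Data types listed for the type parameter in the appendix of this topic.
const SUPPORTED_TYPES = ["int", "long", "double", "string", "bool", "date", "array"];

// Hypothetical helper: parses one JSON object per line and checks each field definition.
function parseFieldDefinitions(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const field = JSON.parse(line);
      // Assumption: type names are compared case-insensitively.
      const type = String(field.type).toLowerCase();
      if (!SUPPORTED_TYPES.includes(type)) {
        throw new Error(`unsupported type "${field.type}" for field "${field.name}"`);
      }
      if (type === "array" && !field.splitter) {
        throw new Error(`field "${field.name}" has type array but no splitter`);
      }
      return field;
    });
}

const fields = parseFieldDefinitions(`
{"name":"col_int","type":"long"}
{"name":"col_array","type":"array","splitter":","}
`);
console.log(fields.map((f) => f.name)); // [ 'col_int', 'col_array' ]
```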

Step 3: Commit and deploy the batch synchronization task

If you use a workspace in standard mode and you want to periodically schedule the batch synchronization task in the production environment, you can commit and deploy the task to the production environment. For more information, see Deploy nodes.

Step 4: Run the batch synchronization task and view the synchronization result

After you complete the preceding configurations, you can run the batch synchronization task. After the task finishes running, you can view the data that is synchronized to the MongoDB data collection.

Appendix: Data type conversion during data synchronization

Values of the type parameter

The following data types are supported for the type parameter: INT, LONG, DOUBLE, STRING, BOOL, DATE, and ARRAY.

Data written to the MongoDB data collection when type is set to ARRAY

If you set the type parameter to ARRAY, you must configure the splitter property. This way, data can be written to the MongoDB data collection as arrays. Example:

The source data is a string: a,b,c.
You set the type parameter to ARRAY and the splitter property to a comma (,) for the batch synchronization task.
The data written to the destination is ["a","b","c"] when the task is run.
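
The conversion in this example can be sketched in JavaScript. The function convertValue is a hypothetical model of how a source value could be shaped for each supported type. It is not the conversion code that Data Integration runs, and it assumes that source values arrive as strings.

```javascript
// Hypothetical converter that mimics how a source value could be shaped for each type.
function convertValue(value, field) {
  switch (field.type.toLowerCase()) {
    case "int":
    case "long":
      // JavaScript numbers lose precision above 2^53; acceptable for a sketch.
      return parseInt(value, 10);
    case "double":
      return parseFloat(value);
    case "bool":
      return value === true || value === "true";
    case "date":
      // Assumes a "YYYY-MM-DD hh:mm:ss" string, interpreted in the local time zone.
      return new Date(String(value).replace(" ", "T"));
    case "array":
      // The splitter property decides where the source string is cut.
      return String(value).split(field.splitter);
    default:
      return String(value);
  }
}

console.log(convertValue("a,b,c", { name: "col_array", type: "array", splitter: "," }));
// [ 'a', 'b', 'c' ]
```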