DataWorks is a platform as a service (PaaS) provided by Alibaba Cloud that supports multiple computing engines and storage engines. This topic describes how to use DataWorks to migrate offline data from ApsaraDB for MongoDB to LindormTable.

Background information

For more information about DataWorks, see What is DataWorks?.

Precautions

To migrate offline data from ApsaraDB for MongoDB to LindormTable, you must unnest the nested JSON fields in the offline data. You do not need to perform any other conversion on the data.
Note Perform the following steps if you want to process the data during the migration, for example, to perform MD5 hashing on the primary key:
  1. Use DataWorks to migrate the data from ApsaraDB for MongoDB to MaxCompute. MaxCompute is also known as ODPS.
  2. Execute SQL statements to process the data in MaxCompute, as shown in the sketch after this list.
  3. Use DataWorks to migrate the data from MaxCompute to LindormTable.
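
The SQL in step 2 might resemble the following minimal sketch. The table names mongo_stage and lindorm_stage are hypothetical MaxCompute staging tables that mirror the source and destination schemas, and only a subset of the columns is shown. MD5 is a built-in function of MaxCompute.

  -- A minimal sketch. mongo_stage and lindorm_stage are hypothetical
  -- MaxCompute staging tables that are used only in this example.
  INSERT OVERWRITE TABLE lindorm_stage
  SELECT
      MD5(title) AS title,   -- Perform MD5 hashing on the primary key.
      url,                   -- Copy the other columns unchanged (subset shown).
      likes
  FROM mongo_stage;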

Preparations

Before you migrate offline data from ApsaraDB for MongoDB to LindormTable, complete the following tasks:
  • Prepare the data to be migrated in ApsaraDB for MongoDB. The following sample document is used in this topic. A sample insert command is provided after this list:
    {
       "id" : ObjectId("624573dd7c0e2eea4cc8****"),
       "title" : "ApsaraDB for MongoDB tutorial",
       "description" : "ApsaraDB for MongoDB is a NoSQL database",
       "by" : "beginner tutorial",
       "url" : "http://www.runoob.com",
       "map" : {
            "a" : "mapa",
            "b" : "mapb"
        },
       "likes" : 100
    }
  • Create the destination table in LindormTable. The following statement is used in this topic:
    CREATE TABLE t1 (
      title varchar not null primary key,
      desc varchar,
      by varchar,
      url varchar,
      a varchar,
      b varchar,
      likes int);
  • Use the Data Integration service of DataWorks to configure a DataX task. For more information, see Use DataWorks to configure synchronization tasks in DataX.
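
For reference, you can insert the sample document into the testdatax collection from which the reader in the following procedure reads data. The following minimal sketch uses the mongo shell; the collection name matches the collectionName parameter in the sample code of the procedure:

  // A minimal sketch: insert the sample document into the testdatax collection.
  // The _id field is automatically generated by ApsaraDB for MongoDB.
  db.testdatax.insertOne({
      "title" : "ApsaraDB for MongoDB tutorial",
      "description" : "ApsaraDB for MongoDB is a NoSQL database",
      "by" : "beginner tutorial",
      "url" : "http://www.runoob.com",
      "map" : { "a" : "mapa", "b" : "mapb" },
      "likes" : 100
  })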

Procedure

  1. Add an ApsaraDB for MongoDB data source in the DataWorks console. For more information, see Add a MongoDB data source.
  2. Configure a batch synchronization node by using the code editor. For more information, see Configure a batch synchronization node by using the code editor.
    1. Create a workflow.
      1. Log on to the DataWorks console.
      2. In the left-side navigation pane, click Workspaces.
      3. In the top navigation bar, select the region where the workspace resides, find the workspace, and then click DataStudio.
      4. On the DataStudio page, move the pointer over the Create icon and select Create Workflow.
      5. In the Create Workflow dialog box, specify Workflow Name and Description.
        Note The name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
      6. Click Create.
    2. Create a batch synchronization node.
      1. Click the newly created workflow and right-click Data Integration.
      2. Choose Create Node > Offline synchronization.
      3. In the Create Node dialog box, specify Name and Path.
        Note The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
      4. Click Submit.
    3. Configure the reader and writer of the batch synchronization node.
      1. On the node configuration tab that appears, click the Conversion script icon in the top toolbar.
      2. In the Tips message, click OK to switch to the code editor.
      3. The code editor automatically generates basic reader and writer settings. Manually configure the data sources of the reader and the writer for the batch synchronization node and specify the information about the tables to be synchronized. The following sample code provides an example:
        Note
        • For more information about the parameters of the ApsaraDB for MongoDB reader, see MongoDB Reader.
        • For more information about the parameters of the Lindorm writer, see Lindorm Writer.
        {
            "type": "job",
            "version": "2.0",
            "steps": [
                {
                    "stepType": "mongodb",
                    "parameter": {
                        "datasource": "test_mongo",   //The name of the ApsaraDB for MongoDB data source. 
                        "column": [
                            {
                                "name": "title",
                                "type": "string"
                            },
                            {
                                "name": "description",
                                "type": "string"
                            },
                            {
                                "name": "by",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            },
                            {
                                "name": "map.a",
                                "type": "document.string"
                            },
                            {
                                "name": "map.b",
                                "type": "document.string"
                            },
                            {
                                "name": "likes",
                                "type": "int"
                            }
                        ],
                        "collectionName": "testdatax"
                    },
                    "name": "Reader",
                    "category": "reader"
                },
                {
                    "stepType": "lindorm",
                    "parameter": {
                        "configuration":  {
                            "lindorm.client.seedserver": "ld-xxxx-proxy-lindorm.lindorm.rds.aliyuncs.com:30020",
                            "lindorm.client.username": "root",
                            "lindorm.client.namespace": "test",
                            "lindorm.client.password": "root"
                        },
                        "nullMode": "skip",
                        "datasource": "",
                        "writeMode": "api",
                        "envType": 1,
                        "columns": [
                            "title",
                            "desc",
                            "by",
                            "url",
                            "a",
                            "b",
                            "likes"
                        ],
                        "dynamicColumn": "false",
                        "table": "t1",
                        "encoding": "utf8"
                    },
                    "name": "Writer",
                    "category": "writer"
                }
            ],
            "setting": {
                "executeMode": null,
                "errorLimit": {
                    "record": ""
                },
                "speed": {
                    "concurrent": 2,
                    "throttle": false
                }
            },
            "order": {
                "hops": [
                    {
                        "from": "Reader",
                        "to": "Writer"
                    }
                ]
            }
        }
      4. After you configure the batch synchronization node, save the configuration and click the Run icon in the upper-left corner of the code editor. On the Runtime Log tab, you can view the progress of the migration task.
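
After the migration task succeeds, you can verify the result in LindormTable. The following minimal check assumes that you connect to LindormTable by using a SQL client such as Lindorm-cli:

  -- Verify that the migrated rows exist in the destination table.
  SELECT * FROM t1 LIMIT 10;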