By Guiyuan
This article is compiled from the research: Principle Analysis and Application of Alibaba Cloud Realtime Compute for Apache Flink: Deep Exploration into MongoDB Schema Inference. The research is conducted by Guiyuan of the Alibaba Cloud Flink team. The content is mainly divided into the following four parts: an introduction to MongoDB, an overview of Flink MongoDB CDC, the schema inference mechanism of MongoDB catalogs, and real-time synchronization of data and schema changes with the CTAS and CDAS statements.
MongoDB is a document-oriented non-relational database that supports semi-structured data storage. It is also a distributed database that provides two cluster deployment modes: replica set and sharded cluster. MongoDB is highly available and horizontally scalable, making it suitable for large-scale data storage.
MongoDB uses a weakly structured storage model and supports flexible data structures and a wide range of data types. It is suitable for business scenarios such as JSON documents, tags, snapshots, geographic locations, and content storage. Its naturally distributed architecture provides an out-of-the-box sharding mechanism and automatic rebalancing, which suits large-scale data storage. In addition, MongoDB provides GridFS, a distributed grid file storage feature that is suitable for storing large files such as images, audio, and videos.
Flink CDC is a change data capture (CDC) technology based on database logs that provides integrated full and incremental reading. Combined with Flink's excellent pipeline capabilities and rich upstream and downstream ecosystem, Flink CDC captures a variety of data changes in real time, processes them, and outputs them to downstream systems. MongoDB is one of the supported databases; the following describes how MongoDB change data is captured.
MongoDB CDC Community Edition implements the MongoDB CDC table source based on Change Streams, which were introduced in MongoDB 3.6, by converting the change streams into a Flink upsert changelog. In MongoDB versions earlier than 6.0, change streams do not provide the pre-change state of a document or the full content of a deleted document by default, so only upsert semantics can be implemented. As a result, Flink has to add a ChangelogNormalize operator downstream to restore a complete changelog, which introduces extra state overhead.
The Pre- and Post-Image feature introduced in MongoDB 6.0 provides a more efficient solution: as long as changeStreamPreAndPostImages is enabled, MongoDB records the complete state of each document before and after every change in a special collection. MongoDB CDC can read these records and generate a complete changelog, which eliminates the dependency on the ChangelogNormalize node. Both the community edition and Realtime Compute for Apache Flink support this feature.
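As an illustration, the following is a minimal sketch of a MongoDB CDC source table that consumes the full changelog. The table name and the non-_id columns are hypothetical, the connection values are placeholders, and the 'scan.full-changelog' option assumes a MongoDB 6.0+ source with changeStreamPreAndPostImages enabled on the collection:
CREATE TABLE orders_source (
  _id STRING,
  order_id INT,
  price DECIMAL(10, 2),
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector' = 'mongodb-cdc',
  'hosts' = '<hosts>',
  'username' = '<username>',
  'password' = '<password>',
  'database' = '<dbName>',
  'collection' = 'orders',
  'scan.incremental.snapshot.enabled' = 'true',
  -- Read complete pre-/post-images so that no ChangelogNormalize node is needed.
  'scan.full-changelog' = 'true'
);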
MongoDB CDC Community Edition is powerful as an engine. However, from a product perspective, it still has a shortcoming: it does not support schema changes.
As a NoSQL database, MongoDB has no fixed schema requirement, and schema changes are common. However, MongoDB CDC Community Edition only supports a fixed schema and cannot handle schema changes. In addition, it requires users to manually define the table schema, which is inconvenient.
To address the preceding deficiencies, Realtime Compute for Apache Flink provides MongoDB catalogs to support schema inference for MongoDB without the need to manually define schemas. In addition, you can use the CTAS or CDAS statement to synchronize schema changes of upstream tables to downstream tables while synchronizing MongoDB data in real time. This improves the efficiency of creating tables and maintaining schema changes.
MongoDB schema inference is implemented through MongoDB catalogs. A MongoDB catalog infers the schema of a collection, so tables from the catalog can be used as Flink source tables, dimension tables, or result tables without manually writing DDL statements. Schema inference includes the following steps:
By default, a MongoDB catalog samples 100 documents from a collection to infer its schema. If the collection contains fewer than 100 documents, all of its documents are read.
The number of sampled documents can be changed with the max.fetch.records option of the MongoDB catalog.
In MongoDB, each document is stored as a BSON document. BSON is a superset of JSON and additionally supports types such as DateTime and Binary. When the schema of a single BSON document is parsed, each BSON type is mapped one-to-one to a Flink SQL type. A document type nested inside a BSON document is parsed as STRING by default.
To better resolve nested document types, MongoDB catalogs provide the scan.flatten-nested-columns.enable option, which recursively expands the fields of nested document types into top-level columns. Assume that the initial BSON document is as follows:
{
  "unnested": "value",
  "nested": {
    "col1": 99,
    "col2": true
  }
}
If scan.flatten-nested-columns.enable is left at its default value of false, the inferred schema contains two columns:
| Column name | Flink SQL data type |
| --- | --- |
| unnested | STRING |
| nested | STRING |
If you set scan.flatten-nested-columns.enable to true, the inferred schema contains three columns:
| Column name | Flink SQL data type |
| --- | --- |
| unnested | STRING |
| nested.col1 | INT |
| nested.col2 | BOOLEAN |
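Note that the flattened column names contain a dot, so they must be enclosed in backticks when referenced in Flink SQL. A minimal usage sketch, with the catalog, database, and collection names as placeholders:
SELECT `unnested`, `nested.col1`, `nested.col2`
FROM `<yourcatalogname>`.`<dbName>`.`<collectionName>`;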
In addition, MongoDB catalogs provide the configuration scan.primitive-as-string to map all BSON basic data types to STRING.
After a set of BSON documents is obtained, the MongoDB catalog parses them one by one and merges the parsed physical columns into the schema of the entire collection based on the following rules:
- Columns with the same name and the same type are merged directly.
- Columns that appear in only some of the documents are also added to the merged schema.
- Columns with the same name but different types are widened to the most specific common Flink SQL type that can represent both; for example, INT and STRING are merged into STRING.
For example, assume a collection contains the following three documents:
{
  "_id": {
    "$oid": "100000000000000000000101"
  },
  "name": "Alice",
  "age": 10,
  "phone": {
    "mother": "111",
    "father": "222"
  }
}
{
  "_id": {
    "$oid": "100000000000000000000102"
  },
  "name": "Bob",
  "age": 20,
  "phone": {
    "mother": "333",
    "father": "444"
  },
  "address": ["Shanghai"],
  "desc": 1024
}
{
  "_id": {
    "$oid": "100000000000000000000103"
  },
  "name": "John",
  "age": 30,
  "phone": {
    "mother": "555",
    "father": "666"
  },
  "address": ["Shanghai"],
  "desc": "test value"
}
In the above three BSON documents, the last two have address and desc fields that the first one does not. After schema merging, these two fields are included in the final schema. The desc field has different types in the last two documents: when each document is parsed individually, desc maps to INT in the second document and to STRING in the third. According to the type-merging rules above, the desc column is eventually inferred as STRING.
Therefore, the schema finally inferred by the MongoDB catalog for the collection is as follows:
| Column name | Flink SQL data type | Description |
| --- | --- | --- |
| _id | STRING NOT NULL | Primary key field |
| name | STRING | |
| age | INT | |
| phone | STRING | |
| address | STRING | |
| desc | STRING | Types merged into STRING |
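To check the schema that a MongoDB catalog has inferred, you can, for example, run a DESCRIBE statement against the table; the catalog, database, and collection names below are placeholders:
DESCRIBE `<yourcatalogname>`.`<dbName>`.`<collectionName>`;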
In MongoDB, each document has a special field, _id, which uniquely identifies the document within a collection. If it is not specified explicitly, this field is generated automatically when the document is created.
MongoDB catalogs use the _id column as the primary key and add the default primary key constraint to ensure that data is not duplicated.
When a table in a MongoDB catalog is used as a CDC source, schema changes, such as new fields being added or field types being changed, may occur in the collection. The connector must therefore handle schema evolution when processing the data.
In the MongoDB CDC connector, the schema inferred by the MongoDB catalog is used as the initial schema. When the connector reads a change record from the oplog, it parses the document and compares its schema with the current schema: if the two match, the data is emitted directly; if new fields or incompatible types are found, the schemas are merged according to the same rules used during schema inference, and the resulting schema change is sent downstream together with the data.
The CTAS statement allows you to synchronize full and incremental data from a source table to a result table. When you synchronize data, you can also synchronize schema changes from the source table to the result table in real time. The CDAS statement supports real-time data synchronization at the database level and synchronization of schema changes.
Before you use the CTAS or CDAS statement to synchronize data, you must create a MongoDB catalog:
CREATE CATALOG <yourcatalogname> WITH(
  'type'='mongodb',
  'default-database'='<dbName>',
  'hosts'='<hosts>',
  'scheme'='<scheme>',
  'username'='<username>',
  'password'='<password>',
  'connection.options'='<connectionOptions>',
  'max.fetch.records'='100',
  'scan.flatten-nested-columns.enable'='<flattenNestedColumns>',
  'scan.primitive-as-string'='<primitiveAsString>'
);
For example, the following statement creates a catalog named mongodb-catalog. The host, user name, and password below are hypothetical placeholders, and the optional connection.options parameter is omitted:
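CREATE CATALOG `mongodb-catalog` WITH(
  'type'='mongodb',
  'default-database'='database',
  'hosts'='dds-xxxxxxxx.mongodb.rds.aliyuncs.com:3717',
  'scheme'='mongodb',
  'username'='flink_user',
  'password'='******',
  'max.fetch.records'='100',
  'scan.flatten-nested-columns.enable'='false',
  'scan.primitive-as-string'='false'
);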
After you create a MongoDB catalog, you can use one of the following methods to synchronize data and schemas:
CREATE TABLE IF NOT EXISTS `${target_table_name}`
WITH(...)
AS TABLE `${mongodb_catalog}`.`${db_name}`.`${collection_name}`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;
Example (synchronizing several collections in one deployment by wrapping multiple CTAS statements in a STATEMENT SET):
BEGIN STATEMENT SET;
CREATE TABLE IF NOT EXISTS `some_catalog`.`some_database`.`some_table0`
AS TABLE `mongodb-catalog`.`database`.`collection0`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;
CREATE TABLE IF NOT EXISTS `some_catalog`.`some_database`.`some_table1`
AS TABLE `mongodb-catalog`.`database`.`collection1`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;
CREATE TABLE IF NOT EXISTS `some_catalog`.`some_database`.`some_table2`
AS TABLE `mongodb-catalog`.`database`.`collection2`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;
END;
Example (synchronizing an entire database with the CDAS statement):
CREATE DATABASE IF NOT EXISTS `some_catalog`.`some_database`
AS DATABASE `mongodb-catalog`.`database` INCLUDING TABLE 'table-name'
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;
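To synchronize every collection in the database rather than a single one, the CDAS statement can also be written with INCLUDING ALL TABLES. A sketch that reuses the placeholder names from the previous examples:
CREATE DATABASE IF NOT EXISTS `some_catalog`.`some_database`
AS DATABASE `mongodb-catalog`.`database` INCLUDING ALL TABLES
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;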
The following end-to-end example uses Realtime Compute for Apache Flink:
Assume that we need to synchronize the data and schemas of all collections in a single database in MongoDB to Hologres. New fields may appear in the data in MongoDB.
The MongoDB database is named guiyuan_cdas_test and contains two collections, test_coll_0 and test_coll_1. We want to synchronize the data to a Hologres database with the same name, guiyuan_cdas_test.
Both collections in MongoDB already contain some initial data.
After you create the MongoDB and Hologres catalogs, write a CDAS deployment on the SQL development page. The MongoDB catalog infers the schema of each collection, so you do not need to manually define the table DDL.
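The CDAS deployment can look roughly like the following sketch. The Hologres catalog name holo and the MongoDB catalog name mongodb-catalog are assumptions; the database name comes from this example:
CREATE DATABASE IF NOT EXISTS `holo`.`guiyuan_cdas_test`
AS DATABASE `mongodb-catalog`.`guiyuan_cdas_test` INCLUDING ALL TABLES
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;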
After the deployment is started, you can see that the guiyuan_cdas_test database has been automatically created in Hologres and that the initial data of the two tables has been synchronized.
Next, a document that contains a new address field is inserted into test_coll_0, and a document that contains a new phone field is inserted into test_coll_1.
Observing the Hologres tables, you can see that both the new data and the new columns have been synchronized to both tables.
Flink CDC implements the MongoDB CDC source based on MongoDB change streams and supports integrated full and incremental data synchronization for MongoDB. Realtime Compute for Apache Flink uses MongoDB catalogs to infer MongoDB schemas, and with the CTAS or CDAS statement it can synchronize schema changes while synchronizing data. When a schema changes, there is no need to modify the Flink deployment to propagate the change to the downstream storage, which greatly improves the flexibility and convenience of data integration.