Reduce Service Use by 80% by Migrating from MongoDB to Elasticsearch

This article discusses the why's and how's of migrating from MongoDB to Elasticsearch.

Released by ELK Geek

MongoDB and Elasticsearch are two popular databases and have been subject to many debates between supporters of the two technologies. However, this article only represents the author's individual experience rather than opinions of any group. This article covers the following two topics:

  • Why you should migrate from MongoDB to Elasticsearch
  • How to migrate from MongoDB to Elasticsearch

Background

MongoDB is positioned to compete with relational databases. However, in almost all projects, the data of core business systems is still stored in traditional relational databases rather than MongoDB.

1. Project Background

As a player in the logistics and express delivery industry, the company has an extensive and complex business system. Massive business data is produced every day by a large number of users. The business data undergoes a variety of status changes during its lifecycle. To facilitate log tracking and analysis, the system operation logging project was established, and based on the original average daily data volume, MongoDB was selected to store operation log data.

The operation logging system records the following two types of data:

1) Primary change data that describes who performed the operation, what the operation was, which system module it affected, when it was performed, which data ID was involved, and which trace ID was assigned.

{
  "dataId": 1, 
  "traceId": "abc",        
  "moduleCode": "crm_01",           
  "operateTime": "2019-11-11 12:12:12", 
  "operationId": 100,
  "operationName": "Zhang San",
  "departmentId": 1000,
  "departmentName": "Account Department",
  "operationContent": "Visit clients"
}

2) Secondary change data that describes the actual values before and after a change. Changes to multiple fields in one row of data result in multiple data entries. Therefore, a large number of such data entries are recorded.

[
  {
    "dataId": 1,
    "traceId": "abc",
    "moduleCode": "crm_01",
    "operateTime": "2019-11-11 12:12:12",
    "operationId": 100,
    "operationName": "Zhang San",
    "departmentId": 1000,
    "departmentName": "Account Department",
    "operationContent": "Visit clients",
    
    "beforeValue": "20",
    "afterValue": "30",
    "columnName": "customerType"
  },
  {
    "dataId": 1,
    "traceId": "abc",
    "moduleCode": "crm_01",
    "operateTime": "2019-11-11 12:12:12",
    "operationId": 100,
    "operationName": "Zhang San",
    "departmentId": 1000,
    "departmentName": "Account Department",
    "operationContent": "Visit clients",
    
    "beforeValue": "2019-11-02",
    "afterValue": "2019-11-10",
    "columnName": "lastVisitDate"
  }
]

2. Project Architecture

The project architecture is as follows:

  1. When you add or edit data in the business system, an operation log record is generated and sent to a Kafka cluster, using the dataId field as the key.
  2. The data you add or edit is stored in a MySQL database.
  3. A Canal cluster subscribes to the MySQL cluster and configures the databases and tables monitored according to the modules of the business system.
  4. The Canal cluster sends the modified business data to the Kafka cluster, using the dataId field as the key.
  5. The operation log system obtains primary and secondary record data from the Kafka cluster.
  6. The operation log system writes the data to MongoDB and requires reverse query capabilities.

Figure: Workflow of the operation logging system

MongoDB Architecture

Cluster architecture:

1) Each server is configured with 8 cores, 32 GB of memory, and a 500 GB solid state drive (SSD).

2) Three router servers are deployed.

3) Three configuration servers are deployed.

4) Nine shard servers are deployed.

5) Three shards are designed for primary operation records.

6) Three shards are designed for secondary operation records.

Issues

Fans of MongoDB may suspect that we were not using MongoDB properly, our O&M capabilities were insufficient, or we were under the influence of an Elasticsearch expert. In fact, we shifted from MongoDB to Elasticsearch based on actual scenario requirements rather than any technical bias. The reasons are described in the following sections.

1. Search and Query

1) MongoDB uses B-Tree as its index structure. A compound index follows the leftmost-prefix rule: it is effective only when the query fields match the leading fields of the index in order. This is beneficial for some applications, but can also be fatal in the complex business scenarios we are faced with today.

2) Queries for operation log records in the business system involve many filter criteria that can be arbitrarily combined. Neither MongoDB nor any relational database supports this efficiently. To support it, you would have to create a large number of compound B-Tree indexes, which is not practical.

3) In addition, primary and secondary records contain a lot of character-type data. Therefore, both exact query and full-text search are required to query this kind of data. In these respects, MongoDB provides inadequate functions and poor performance, leading to frequent timeouts in business system queries. By contrast, Elasticsearch is a very suitable solution.
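
To illustrate the point about arbitrarily combined filters, the following is a minimal sketch of how a backend might assemble one Elasticsearch bool query from whichever criteria a user supplies. The function name build_log_query, its parameters, and the assumption that operationContent is mapped as an analyzed text field while the other fields are exact-match fields are illustrative and not taken from the original project.

```python
from typing import Any, Dict, List, Optional


def build_log_query(
    module_code: Optional[str] = None,
    department_id: Optional[int] = None,
    operation_id: Optional[int] = None,
    time_from: Optional[str] = None,
    time_to: Optional[str] = None,
    content_keywords: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a bool query from any combination of optional criteria."""
    filters: List[Dict[str, Any]] = []  # exact matches, not scored
    must: List[Dict[str, Any]] = []     # full-text matches, scored

    if module_code is not None:
        filters.append({"term": {"moduleCode": module_code}})
    if department_id is not None:
        filters.append({"term": {"departmentId": department_id}})
    if operation_id is not None:
        filters.append({"term": {"operationId": operation_id}})
    if time_from is not None or time_to is not None:
        bounds: Dict[str, str] = {}
        if time_from is not None:
            bounds["gte"] = time_from
        if time_to is not None:
            bounds["lte"] = time_to
        filters.append({"range": {"operateTime": bounds}})
    if content_keywords:
        # Assumes operationContent is an analyzed text field (hypothetical mapping).
        must.append({"match": {"operationContent": content_keywords}})

    bool_query: Dict[str, Any] = {}
    if filters:
        bool_query["filter"] = filters
    if must:
        bool_query["must"] = must
    if not bool_query:
        return {"query": {"match_all": {}}}
    return {"query": {"bool": bool_query}}
```

Each criterion adds one clause, so no per-combination index has to be designed in advance. The resulting dictionary would be sent as the body of a search request.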

2. Technology Stack Maturity

1) In terms of sharding and replica implementation, you have to bind collection data in MongoDB to specific instances during design. This means you have to finalize the allocation of nodes for shards and replica sets during cluster configuration. This is essentially the same as sharding traditional relational databases. In fact, this mode is widely used by the clusters of many data products, such as Redis-cluster and ClickHouse. By contrast, Elasticsearch clusters are not bound to shards or replica sets and can be adjusted as needed. In addition, Elasticsearch makes it easy to adopt different performance configurations for different nodes.

2) Operation logs accumulate rapidly, with over 10 million new entries every day. As a result, you have to scale out your servers at short intervals, and this process is much more complicated with MongoDB than with Elasticsearch.

3) Each MongoDB collection contains more than 1 billion data records. At this scale, even a simple query in MongoDB performs worse than a query served by inverted indexes in Elasticsearch.

4) The company has different levels of experience with the Elasticsearch and MongoDB technology stacks. Elasticsearch is widely used in many projects, including core projects, so the company is accustomed to the technologies and O&M of Elasticsearch. By contrast, MongoDB has found little use outside core business scenarios, yet no one wants to risk using MongoDB in core projects, which leaves it in an awkward position.

3. Same Document Formats

MongoDB and Elasticsearch are both document-oriented databases. MongoDB's BSON (Binary JSON) is similar to JavaScript Object Notation (JSON), and the _id field in MongoDB (an ObjectId by default) plays the same role as the _id field in Elasticsearch. Therefore, the data model essentially remains unchanged when primary and secondary data is migrated to the Elasticsearch platform.

Migration Solution

Migration between heterogeneous data systems involves two parts:

1) Migrate the application system at the upper layer. This involves shifting from MongoDB-oriented syntax rules to Elasticsearch-oriented ones.

2) Migrate data at the lower layer from MongoDB to Elasticsearch.

1. Evaluate Elastic Capacity

The original MongoDB cluster consisted of 15 servers, of which nine were data servers. How many servers are needed after migration to the Elasticsearch cluster? A simple estimate works as follows. Assume that a MongoDB collection in the production environment contains 1 billion data entries. First, synchronize 1 million data entries from MongoDB to Elasticsearch in the test environment. If the 1 million data entries occupy 1 GB of disk space, the 1 billion entries in the production environment require about 1 TB. Then, add some redundancy based on expected business growth. According to this preliminary evaluation, the Elasticsearch cluster needs three servers, each configured with 8 cores, 16 GB of memory, and a 2 TB hard disk drive (HDD). Consequently, the number of servers is reduced from 15 to 3, significantly reducing costs.
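
The estimate described above is a linear extrapolation from a test sample, which can be sketched as follows. The function name estimate_disk_gb and the growth_factor parameter are hypothetical, and the figures in the usage lines are illustrative rather than measurements from the project.

```python
def estimate_disk_gb(
    sample_docs: int,
    sample_size_gb: float,
    production_docs: int,
    growth_factor: float = 1.0,
) -> float:
    """Extrapolate production disk usage from a test-environment sample.

    Assumes disk usage grows linearly with the number of documents, which
    is only a rough approximation for a real index.
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    base_gb = sample_size_gb * production_docs / sample_docs
    return base_gb * growth_factor


# Hypothetical sample: 1 million documents occupying 1 GB in the test cluster.
base = estimate_disk_gb(1_000_000, 1.0, 1_000_000_000)
# The same estimate with 50% headroom for expected business growth.
with_headroom = estimate_disk_gb(1_000_000, 1.0, 1_000_000_000, growth_factor=1.5)
```

Replica copies are not included here. If replicas are enabled after migration, the result has to be multiplied by the number of copies.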

2. Set Elastic Index Rules

System operation logs are time-series data that requires no modification once fully written. Operation log records are queried most frequently in the month when they are generated and are seldom queried afterward. Based on this evaluation, indexes for core data are created on a monthly basis. You must specify an operation time period when you make a service query so that the backend can determine which indexes to query. Elasticsearch search APIs can query multiple indexes in one request, so a query that spans several months is merged into a single search. For non-core data, it is enough to create indexes on a yearly basis.
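
As a minimal sketch of how the backend could map a query time period to monthly indexes, the function below lists one index name per month in the range. The naming pattern prefix-YYYY.MM and the prefix operation_log are assumptions made for illustration; the article does not state the real index names.

```python
from datetime import date
from typing import List


def monthly_indices(prefix: str, start: date, end: date) -> List[str]:
    """Return one index name per calendar month between start and end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    names: List[str] = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# A query from 2019-11-20 to 2020-01-05 touches three monthly indexes.
targets = ",".join(monthly_indices("operation_log", date(2019, 11, 20), date(2020, 1, 5)))
```

The comma-separated string in targets can be used as the index part of a search request path, so that one request covers every month in the period.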

3. Design Core Implementation Logic

Elasticsearch is not a relational database and does not have a transaction mechanism. All the data of the operation log system is sourced from the Kafka cluster and is consumed sequentially, but primary and secondary data can arrive in either order. Therefore, pay special attention to the following two scenarios:

  • When primary data reaches the operation log system before secondary data, the primary record is merged with the Binlog field data when the secondary data is written.
  • When secondary data reaches the operation log system before primary data, the relevant fields in the secondary index are updated once the primary data arrives.

In Elasticsearch, index data is updated according to a near-real-time refresh mechanism. Therefore, data cannot be queried through search APIs immediately after it is submitted. In this case, how can we update primary record data to secondary records? In addition, the same data ID or trace ID may be used in multiple primary records due to a lack of standardization across business departments.

Primary data is correlated to secondary data by the dataId and traceId fields. Because newly written data is not yet searchable, an update based on the _update_by_query API can miss records and produce incorrect results if primary data and secondary data arrive at the operation log system at the same time. In addition, primary data and secondary data may be correlated to each other on a many-to-many basis, and therefore the dataId and traceId fields are not the unique identifiers of a record.

In fact, Elasticsearch is also a NoSQL database that can serve key-value lookups by document ID. Therefore, you can create an Elasticsearch index to serve as an intermediate cache that holds primary data or secondary data, whichever arrives first. The _id of each document in this index is composed of the dataId and traceId fields. This allows you to find the ID of the primary or secondary data record by using an intermediate ID. The index data model is structured roughly as follows, where the detailId field is an array of the _id values of the records in the secondary index.

{
  "dataId": 1,
  "traceId": "abc",
  "moduleCode": "crm_01",
  "operationId": 100,
  "operationName": "Zhang San",
  "departmentId": 1000,
  "departmentName": "Account Department",
  "operationContent": "Visit clients",
  "detailId": [
    1,
    2,
    3,
    4,
    5,
    6
  ]
}
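
The "whichever arrives first" logic can be sketched with an in-memory dictionary standing in for the intermediate index. This is a simplified illustration under stated assumptions, not the project's implementation: the class name Correlator, the cache_id format, and the merge rule are hypothetical, and the many-to-many case and cache cleanup are ignored.

```python
from typing import Dict, List, Optional


def cache_id(data_id: int, trace_id: str) -> str:
    """Compose the intermediate _id from dataId and traceId (format is assumed)."""
    return f"{data_id}_{trace_id}"


class Correlator:
    """Pairs primary and secondary records regardless of arrival order."""

    def __init__(self) -> None:
        # Stands in for the intermediate Elasticsearch index, keyed by _id.
        self.cache: Dict[str, dict] = {}

    def _entry(self, record: dict) -> dict:
        key = cache_id(record["dataId"], record["traceId"])
        return self.cache.setdefault(key, {"primary": None, "pending": []})

    def on_primary(self, primary: dict) -> List[dict]:
        """Store the primary record and complete any secondary records waiting for it."""
        entry = self._entry(primary)
        entry["primary"] = primary
        ready = [{**primary, **detail} for detail in entry["pending"]]
        entry["pending"] = []
        return ready

    def on_secondary(self, detail: dict) -> Optional[dict]:
        """Return the completed record, or None if the primary has not arrived yet."""
        entry = self._entry(detail)
        if entry["primary"] is None:
            entry["pending"].append(detail)
            return None
        return {**entry["primary"], **detail}
```

Records returned by on_primary and on_secondary are the ones that would be written to the secondary index. In the approach described above, the lookup by key corresponds to fetching a document from the intermediate index by its _id instead of searching for it.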

As mentioned above, primary and secondary records for the same dataId are sent to Kafka with the same key and therefore land in the same Kafka partition. This allows you to process them in batches by calling the following core Elasticsearch APIs:

# Query records in secondary indexes in bulk
_mget 
# Insert in bulk
_bulk
# Delete intermediate temporary indexes in bulk
_delete_by_query 
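
For reference, the request bodies for two of these APIs can be built as shown in the sketch below. The helper names and the index name in the usage lines are hypothetical, and sending the request over HTTP is left out.

```python
import json
from typing import Dict, Iterable, List, Tuple


def mget_body(ids: Iterable[str]) -> Dict[str, List[str]]:
    """Body for a _mget request against a single index."""
    return {"ids": list(ids)}


def bulk_index_body(index: str, docs: Iterable[Tuple[str, dict]]) -> str:
    """Body for a _bulk request: one action line and one source line per document.

    The bulk API expects newline-delimited JSON that ends with a newline.
    """
    lines: List[str] = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Hypothetical index name and documents.
body = bulk_index_body(
    "operation_log_detail-2019.11",
    [("1", {"dataId": 1, "traceId": "abc", "columnName": "customerType"})],
)
```

Batching documents this way means that one round trip writes many records, which matters when tens of millions of log entries arrive every day.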

Migration Procedure

1. Migrate Data

Here, we use DataX as the data synchronization tool based on the following factors:

  • Historical data: Operation log records are historical data that rarely need further modification once generated, making them similar to offline data.
  • Non-continuous migration: When the project is completed, the original MongoDB cluster is completely terminated and no secondary migration is required.
  • Data volume: The original MongoDB operation logs amount to billions of entries.

Therefore, the migration must be processed at an appropriate speed. An excessively fast migration leads to performance problems for the MongoDB cluster, while an excessively slow migration prolongs the project and increases O&M costs and complexity. If this is not an issue, you can select Hadoop as an intermediate platform for migration.

  • Scenario-based modifications of the DataX source code: DataX allows you to modify the source code to suit different scenarios, such as date type conversion and generation or mapping of primary index key _id fields. It also supports repeated synchronization.
  • Multi-instance and multi-thread operations in parallel: Synchronization of primary data and synchronization of secondary data are both deployed on multiple instances, and each instance is configured with multiple channels.

2. Configure Migration Indexes

Temporarily apply the following index settings during the migration, and then revert them once the data is synchronized:

  "index.number_of_replicas": 0,
  "index.refresh_interval": "30s",
  "index.translog.flush_threshold_size": "1024M",
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"

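The apply-then-revert step can be sketched as two request bodies for the index settings update API. The setting keys and values are the ones listed above; the constant and function names are hypothetical. Sending null for a setting resets it to its default value, so if an index used non-default values before the migration, those values would need to be restored explicitly instead.

```python
import json
from typing import Any, Dict

# Settings applied to the target index while data is being synchronized.
BULK_LOAD_SETTINGS: Dict[str, Any] = {
    "index.number_of_replicas": 0,
    "index.refresh_interval": "30s",
    "index.translog.flush_threshold_size": "1024M",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s",
}


def revert_body(applied: Dict[str, Any]) -> Dict[str, None]:
    """Build a body that resets every applied setting to its default (JSON null)."""
    return {key: None for key in applied}


apply_json = json.dumps(BULK_LOAD_SETTINGS, indent=2)
revert_json = json.dumps(revert_body(BULK_LOAD_SETTINGS), indent=2)
```

Both JSON strings are intended as bodies of a settings update request on the index being loaded: apply_json before the synchronization starts and revert_json after it finishes.
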
3. Migrate Applications

The operation log project is built with Spring Boot, with the following custom configuration items added:

# Write flag mongodb in applications
writeflag.mongodb: true
# Write flag elasticsearch in applications
writeflag.elasticsearch: true

Project-based modifications:

  • When the project goes online for the first time, set the two write flags to true to enable double-write for MongoDB and Elasticsearch.
  • Two different read interfaces are provided for flexible frontend switching.
  • When data migration is completed and verification shows no difference between the two stores, change the flag values so that data is written only to Elasticsearch.
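
The double-write switch can be sketched as follows. The original project is a Spring Boot application, so this Python version only illustrates the control flow; the WriteFlags class and the writer callables are hypothetical stand-ins for the real MongoDB and Elasticsearch clients.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WriteFlags:
    """Mirrors the writeflag.mongodb and writeflag.elasticsearch settings."""
    mongodb: bool = True
    elasticsearch: bool = True


def write_log(
    record: dict,
    flags: WriteFlags,
    mongo_writer: Callable[[dict], None],
    es_writer: Callable[[dict], None],
) -> List[str]:
    """Write the record to every enabled store and report which ones were used."""
    written: List[str] = []
    if flags.mongodb:
        mongo_writer(record)
        written.append("mongodb")
    if flags.elasticsearch:
        es_writer(record)
        written.append("elasticsearch")
    return written


# During migration both flags are on; after cutover only Elasticsearch is written.
mongo_store: List[dict] = []
es_store: List[dict] = []
during = write_log({"dataId": 1}, WriteFlags(), mongo_store.append, es_store.append)
after = write_log({"dataId": 2}, WriteFlags(mongodb=False), mongo_store.append, es_store.append)
```

Keeping the flags in configuration means the cutover is a configuration change, not a code change, and it can be rolled back the same way.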

Summary

1. Benefits of Migration

By replacing a MongoDB storage database with Elasticsearch, we can use three Elasticsearch servers to do the work of 15 MongoDB servers, significantly reducing corporate costs each month. In addition, the query performance improved more than 10 times over, and the system provides better support for a variety of queries. This is a great help to the business department, O&M team, and company leaders.

2. Lessons Learned

The whole project took several months, and many colleagues participated in its design, R&D, data migration, testing, data verification, and stress testing. This technical solution was not designed all at once. Instead, it is the result of trial and error. Elasticsearch has many excellent technical features. Only flexible use can maximize its power.

About the Author

Li Meng is an Elastic Stack user and a certified Elasticsearch engineer. Since his first explorations into Elasticsearch in 2012, he has gained in-depth experience in the development, architecture, and operation and maintenance (O&M) of the Elastic Stack and has carried out a variety of large and mid-sized projects. He provides enterprises with Elastic Stack consulting, training, and tuning services. He has years of practical experience and is an expert in various technical fields, such as big data, machine learning, and system architecture.

Declaration: This article is reproduced with authorization from Li Meng, the original author. The author reserves the right to hold users legally liable in the case of unauthorized use.
