Troubleshoot high memory usage issues on an ApsaraDB for MongoDB instance

Last Updated: Oct 29, 2024

Memory usage is a key metric to monitor an ApsaraDB for MongoDB instance. This topic describes how to view the memory usage details of an ApsaraDB for MongoDB instance and how to troubleshoot high memory usage issues on the instance.

Background information

An ApsaraDB for MongoDB process loads its binary file and dependent system libraries into memory. It also allocates and releases memory for client connection management, request processing, and the storage engine. By default, ApsaraDB for MongoDB uses TCMalloc from Google as the memory allocator. Most memory is consumed by the WiredTiger storage engine, client connections, and request processing.

View memory usage details

For a sharded cluster instance, the memory usage of each shard node can be analyzed in the same way as that of a replica set instance. The ConfigServer nodes store only configuration metadata. The memory usage of mongos nodes is affected by the size of aggregated result sets, the number of connections, and the size of metadata.

For a replica set instance, you can use the following methods to view the memory usage:

  • View memory usage in monitoring charts

    An ApsaraDB for MongoDB replica set instance contains multiple node roles, and each role can correspond to one or more physical nodes. The roles include a primary node that supports read and write operations, one or more high-availability secondary nodes, a hidden node, and optional read-only nodes.

    On the Monitoring Data page of an ApsaraDB for MongoDB replica set instance in the ApsaraDB for MongoDB console, you can view the memory usage of the instance in monitoring charts.

  • View memory usage by running commands

    To view and analyze the memory usage on an ApsaraDB for MongoDB replica set instance, run the db.serverStatus().mem command in the mongo shell. Sample result:

    { "bits" : 64, "resident" : 13116, "virtual" : 20706, "supported" : true }
    // resident indicates the physical memory that is consumed by mongos nodes. Unit: MB. 
    // virtual indicates the virtual memory that is consumed by mongos nodes. Unit: MB.
    Note

    For more information about the serverStatus command, see serverStatus.

Common causes

High memory usage of the WiredTiger storage engine

The WiredTiger storage engine consumes the largest amount of memory of an ApsaraDB for MongoDB instance. To ensure compatibility and security, ApsaraDB for MongoDB sets the CacheSize parameter to 60% of the actual memory of an instance. For more information, see Specifications.

If the size of cached data exceeds 95% of the specified cache size, memory pressure becomes high. In this case, the threads that normally handle user requests also start to evict pages as a protective measure. As a result, user requests are blocked. For more information, see Eviction parameters.

You can use the following methods to view the memory usage of the WiredTiger storage engine:

  • Run the db.serverStatus().wiredTiger.cache command in the mongo shell. The value of the bytes currently in the cache field indicates the size of data in the cache (a calculation sketch follows this list). Sample output:

    {
       ......
       "bytes belonging to page images in the cache":6511653424,
       "bytes belonging to the cache overflow table in the cache":65289,
       "bytes currently in the cache":8563140208,
       "bytes dirty in the cache cumulative":NumberLong("369249096605399"),
       ......
    }
  • On the Dashboard page of an instance in the DAS console, view the percentage of dirty data in the WiredTiger cache. For more information, see Performance trends.

  • Use the mongostat tool of ApsaraDB for MongoDB to view the percentage of dirty data in the WiredTiger cache. For more information, see mongostat.
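
To put the numbers from the first method in context, the following sketch computes the cache usage and dirty ratios against the configured cache size. It is written for the legacy mongo shell. The tracked dirty bytes in the cache and maximum bytes configured fields are standard WiredTiger cache statistics, but their availability may vary slightly across MongoDB versions:

    // Estimate WiredTiger cache usage from serverStatus(). The 95% and 20% thresholds
    // correspond to the eviction_trigger and eviction_dirty_trigger parameters listed
    // in the Eviction parameters table at the end of this topic.
    var c = db.serverStatus().wiredTiger.cache;
    var used  = c["bytes currently in the cache"];
    var dirty = c["tracked dirty bytes in the cache"];
    var max   = c["maximum bytes configured"];
    print("cache used : " + (100 * used / max).toFixed(1) + "% (user threads start to evict at 95%)");
    print("cache dirty: " + (100 * dirty / max).toFixed(1) + "% (user threads start to evict at 20%)");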

High memory usage of connections and requests

If a large number of connections are established to the instance, memory may be consumed for the following reasons:

  • A dedicated thread handles the requests of each connection in the background. Each thread can occupy up to 1 MB of stack space, although in most cases a thread occupies only dozens to hundreds of KB.

  • Each TCP connection has read and write buffers at the kernel layer. The buffer size is determined by TCP kernel parameters such as tcp_rmem and tcp_wmem. You do not need to specify the buffer size. However, a large number of concurrent connections occupy a larger amount of socket cache space and result in higher memory usage of TCP.

  • Each request has a unique context. Multiple temporary buffers may be allocated for request packets, response packets, and sort operations. These buffers are released at the end of each request, first to the TCMalloc cache and then gradually to the operating system.

    In most cases, memory usage is high because TCMalloc fails to promptly release the memory consumed by requests. Requests may consume up to dozens of GB of memory before the memory is released to the operating system. To query the size of memory that TCMalloc has not released to the operating system, run the db.serverStatus().tcmalloc command. The TCMalloc cache size is the sum of the values of the pageheap_free_bytes and total_free_bytes fields (a calculation sketch follows the note below). Sample output:

    {
       "generic":{
               "current_allocated_bytes":NumberLong("9641570544"),
               "heap_size":NumberLong("19458379776")
       },
       "tcmalloc":{
               "pageheap_free_bytes":NumberLong("3048677376"),
               "pageheap_unmapped_bytes":NumberLong("544994184"),
               "current_total_thread_cache_bytes":95717224,
               "total_free_byte":NumberLong(1318185960),
    ......
       }
    }
    Note

    For more information about TCMalloc, see tcmalloc.
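
The following sketch sums the two counters described above to estimate how much memory TCMalloc is still holding. It is written for the legacy mongo shell; in mongosh, the NumberLong values may need an explicit conversion such as .toNumber():

    // Estimate the memory that TCMalloc caches instead of returning to the operating system.
    // pageheap_unmapped_bytes has already been returned to the OS and is not counted.
    var t = db.serverStatus().tcmalloc.tcmalloc;
    var cachedBytes = t.pageheap_free_bytes + t.total_free_bytes;
    print("TCMalloc cache: " + (cachedBytes / 1024 / 1024).toFixed(0) + " MB");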

High memory usage of metadata

Metadata includes the databases, collections, and indexes in an ApsaraDB for MongoDB instance. Pay attention to the memory that is consumed when an instance contains a large number of collections and indexes. Full logical backups can open a large number of file handles, especially on instances that run versions earlier than MongoDB 4.0. If the file handles are not promptly returned to the operating system, memory usage can increase rapidly. In addition, file handles may not be released after a large number of collections are deleted, which can cause memory leaks.
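
The following sketch gives a rough idea of how much metadata an instance carries by counting the collections and indexes in each database. It is only an illustration and may be slow on instances that contain a very large number of collections:

    // Count the collections and indexes in each database to estimate metadata size.
    db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
        var database = db.getSiblingDB(d.name);
        // Only regular collections; views have no indexes of their own.
        var collections = database.getCollectionInfos({ type: "collection" });
        var indexCount = 0;
        collections.forEach(function (info) {
            indexCount += database.getCollection(info.name).getIndexes().length;
        });
        print(d.name + ": " + collections.length + " collections, " + indexCount + " indexes");
    });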

High memory usage of index creation

During normal data writes, secondary nodes maintain a buffer of approximately 256 MB for data replay. After the primary node creates indexes, secondary nodes may consume additional memory to replay the index builds. On instances that run versions earlier than MongoDB 4.2, indexes are created in the background on the primary node, and serial replay of an index build on a secondary node may consume up to 500 MB of memory. On instances that run MongoDB 4.2 or later, background index creation is no longer available and secondary nodes replay index builds in parallel, which consumes more memory. Out-of-memory (OOM) errors may occur on the instance if multiple indexes are created at the same time.
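
One way to limit the memory that secondary nodes consume for replay is to create indexes one at a time rather than submitting several builds at once. The collection and field names below are hypothetical, and the background option takes effect only on versions earlier than MongoDB 4.2 (it is ignored on later versions):

    // Hypothetical collection and field names: build indexes one at a time so that
    // secondary nodes do not have to replay several index builds concurrently.
    db.orders.createIndex({ customerId: 1 }, { background: true }); // background applies before MongoDB 4.2
    // Wait for this build to finish and replicate before starting the next one.
    db.orders.createIndex({ createdAt: -1 }, { background: true });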

High memory usage of PlanCache

If a request has a large number of execution plans, the plan cache may consume a large amount of memory. On instances that run later MongoDB versions, run the db.serverStatus().metrics.query.planCacheTotalSizeEstimateBytes command to view the memory usage of the plan cache. For more information, see Secondary node memory arise while balancer doing work.
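
The following sketch shows the command mentioned above together with a way to clear the plan cache of a single collection when it grows too large. The collection name is hypothetical:

    // Instance-wide estimate of plan cache memory (available in later MongoDB versions).
    db.serverStatus().metrics.query.planCacheTotalSizeEstimateBytes;

    // Clear the plan cache of one collection (collection name is hypothetical) so that
    // cached execution plans are regenerated on the next query.
    db.orders.getPlanCache().clear();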

Solutions

The goal of memory optimization is to seek a balance between resource consumption and performance instead of minimizing memory usage. Ideally, the memory remains sufficient and stable and system performance is not affected. You cannot change the value of the CacheSize parameter specified by ApsaraDB for MongoDB. We recommend that you use the following methods to optimize memory usage:

  • Control the number of concurrent connections. Based on performance test results, 100 persistent connections are sufficient for a database. By default, a MongoDB driver establishes a connection pool that contains a maximum of 100 connections with the backend. If a large number of clients connect to the instance, reduce the connection pool size of each client (see the connection string example after this list). We recommend that you keep the total number of persistent connections to a database below 1,000. Otherwise, memory and thread context-switching overheads increase and request latency rises.

  • Reduce the memory overhead of a single request. For example, you can create indexes to reduce full collection scans and avoid in-memory sorting.

  • If the number of connections is appropriate but the memory usage continues to increase, we recommend that you upgrade the memory configurations. Otherwise, system performance may sharply decline due to OOM errors and extensive cache clearing.

  • Accelerate the memory release of TCMalloc. If the memory usage of your instance exceeds 80%, you can adjust TCMalloc parameters in the ApsaraDB for MongoDB console to optimize memory usage. We recommend that you enable the tcmallocAggressiveMemoryDecommit parameter first. This parameter has been verified in practice and is effective at resolving memory issues. If you do not achieve the expected effect after adjusting this parameter, gradually increase the value of the tcmallocReleaseRate parameter. For example, if the initial value is 1, increase it to 3 first and then to 5.

    Important

    We recommend that you adjust the values of the tcmallocAggressiveMemoryDecommit and tcmallocReleaseRate parameters during off-peak hours because the adjustment may cause performance degradation. If your business is affected, roll back the parameter values in a timely manner.

  • In scenarios where memory leaks may occur, contact Alibaba Cloud technical support.
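
As referenced in the first recommendation, the per-client pool size can be lowered through the connection string. The user name, password, host names, port, database, replica set name, and the maxPoolSize value below are placeholders; choose a pool size so that the total number of connections from all clients stays below the recommended limit:

    mongodb://<user>:<password>@<host1>:<port>,<host2>:<port>/<database>?replicaSet=<replSetName>&maxPoolSize=10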

References

Eviction parameters

Parameter              | Default value (% of cache size) | Description
---------------------- | ------------------------------- | -----------
eviction_target        | 80                              | When the used cache size exceeds this percentage of the configured cache size, background eviction threads start to evict clean pages.
eviction_trigger       | 95                              | When the used cache size exceeds this percentage, the threads that handle user requests also start to evict clean pages.
eviction_dirty_target  | 5                               | When the size of dirty data in the cache exceeds this percentage, background eviction threads start to evict dirty pages.
eviction_dirty_trigger | 20                              | When the size of dirty data in the cache exceeds this percentage, the threads that handle user requests also start to evict dirty pages.