(This is Part 2 of a 2-part series. See Part 1 for the earlier sections.)
There are two ways to avoid this problem:
The first solution is to add a random offset when setting a key's expiration time, so that keys do not all expire at the same moment. In Python, with a connected client named redis:
import random
# Spread expirations: expire at a random point within 5 minutes
# (300 seconds) after the planned expiration time.
redis.expireat(key, expire_time + random.randint(0, 300))
The second solution is to enable the lazy-free mechanism, available in Redis 4.0 and later:
# Free the memory of expired keys in a background thread.
lazyfree-lazy-expire yes
At the O&M level, you need to monitor Redis's runtime status. Running the INFO command on Redis returns all of the instance's runtime metrics.
Here, we need to focus on expired_keys, which is the cumulative number of expired keys deleted over the lifetime of the instance.
You should monitor this metric: when it spikes in a short period, raise an alert in time, then compare the spike with the time at which the business application reported slowdowns. If the two match, you can confirm that the latency was indeed caused by keys expiring en masse.
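For example, with a Python client such as redis-py, you could poll the metric and watch for sudden jumps (a minimal sketch; the 10-second interval and delta-based reporting are assumptions to tune for your own alerting setup):
import time
import redis

r = redis.Redis()
# expired_keys is cumulative since startup, so report the delta;
# a sudden jump points to keys expiring en masse.
prev = r.info(section="stats")["expired_keys"]
while True:
    time.sleep(10)  # polling interval: an assumption, tune as needed
    cur = r.info(section="stats")["expired_keys"]
    print(f"expired_keys +{cur - prev} in the last 10s")
    prev = cur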
When a slave re-establishes synchronization, Redis first checks whether a partial resynchronization is possible. For example, the replication link may be temporarily disconnected due to a fault; when synchronization is re-established after recovery, Redis preferentially attempts to synchronize only the missing data, avoiding the resource cost of a full synchronization. If the conditions for partial resynchronization are not met, Redis falls back to full synchronization.
This judgment is based on the replication backlog buffer maintained on the master. If this buffer is configured too small, the writes the master receives while replication is disconnected will likely overwrite older data in the buffer. The offset from which the slave needs to resume can then no longer be found in the master's buffer, and full synchronization is triggered.
How do you avoid this situation? The solution is to increase the size of the replication backlog buffer, repl-backlog-size. The default size of this buffer is 1MB; if the instance handles a large write volume, you can increase this configuration.
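For example (the 64mb value below is illustrative; size the backlog to cover the master's write volume during the longest disconnection you expect to ride out):
# Default is 1mb; a larger backlog lets a briefly disconnected slave
# resume with a partial resync instead of a full one.
repl-backlog-size 64mb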
If you want to bind CPUs, the optimized solution is not to bind the Redis process to a single logical core but to multiple logical cores. Moreover, the bound logical cores should preferably belong to the same physical core, so they can share the L1/L2 cache.
Even if we bind Redis to multiple logical cores, this only alleviates, to a certain extent, the competition for CPU resources among the main thread, sub-processes, and background threads.
Since these sub-processes and background threads still switch among the bound logical cores, some performance loss remains.
You may have wondered whether we could instead bind the main thread, sub-processes, and background threads each to a fixed CPU core, preventing them from switching back and forth, so that the CPU resources they use do not affect each other.
The Redis developers thought of this plan too.
Redis introduced this capability in version 6.0. Through the following configuration items, we can bind the main thread, background threads, the background RDB process, and the AOF rewrite process to fixed CPU logical cores.
Bind CPU Cores before Redis 6.0
taskset -c 0 ./redis-server
Bind CPU Cores in Redis 6.0 and Later
# Redis Server and I/O threads are bound to CPU cores 0,2,4,6.
server_cpulist 0-7:2
# Bind the background child thread to CPU cores 1,3.
bio_cpulist 1,3
# Bind the background AOF rewrite process to CPU cores 8,9,10, and 11.
aof_rewrite_cpulist 8-11
# Bind the background RDB process to CPU cores 1,10,11.
bgsave_cpulist 1,10-11
If you are using Redis 6.0 or later, you can use the configurations above to improve Redis performance.
Reminder: Redis performance is generally good enough as it is. Unless you have particularly stringent performance requirements, we do not recommend binding CPUs.
To check whether Swap has occurred, first obtain the process ID of the Redis instance:
$ redis-cli info | grep process_id
process_id:5332
Then, go to the process directory in the /proc directory of the machine where Redis is located.
$ cd /proc/5332
Finally, run the following command to view the Swap usage of the Redis process. Only part of the output is shown here:
$ cat smaps | egrep '^(Swap|Size)'
Size: 584 kB
Swap: 0 kB
Size: 4 kB
Swap: 4 kB
Size: 4 kB
Swap: 0 kB
Size: 462044 kB
Swap: 462008 kB
Size: 21392 kB
Swap: 0 kB
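Each Size/Swap pair describes one memory region; in the output above, the 462044 kB region is almost entirely swapped out. To total the Swap usage across all regions, a small script helps (a minimal sketch; pid 5332 is the example above):
import sys

# Sum the Swap fields in /proc/<pid>/smaps to get the total amount of
# the process's memory that is currently swapped out.
pid = sys.argv[1] if len(sys.argv) > 1 else "5332"
total_kb = 0
with open(f"/proc/{pid}/smaps") as f:
    for line in f:
        if line.startswith("Swap:"):
            total_kb += int(line.split()[1])
print(f"total swapped: {total_kb} kB")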
Once memory swapping occurs, the most direct solution is to increase the machine's memory. If the instance belongs to a Redis sharded cluster, you can increase the number of instances in the cluster to spread the data volume across instances and reduce the amount of memory each instance requires.
If memory huge pages are used, then even when a client requests a modification of only 100B of data, Redis needs to copy an entire 2MB huge page; with the conventional memory page mechanism, only 4KB is copied. Comparing the two, a huge page is 512 times the size of a regular page, so when clients modify or write large amounts of new data, the huge page mechanism leads to a large number of copies, which interferes with Redis's normal memory access and ultimately slows performance.
First of all, we need to check whether memory huge pages are enabled. Run the following command on the machine where the Redis instance runs:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
If the result is always, the memory huge page mechanism is enabled; if the result is never, it is disabled.
In an actual production environment, I do not suggest using the memory huge page mechanism. Disabling it is simple; just execute the following command:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Note that this setting does not persist across reboots; to make it permanent, add the command to a startup script such as /etc/rc.local.
The advantage of the operating system's huge page mechanism is that it can, to a certain extent, reduce the number of memory-allocation requests an application makes.
However, Redis is sensitive to performance and latency, so we want each of Redis's memory allocations to take as little time as possible. Therefore, I do not recommend enabling this mechanism on machines running Redis.
Supported versions: Redis 4.0+
127.0.0.1:7000> LLEN mylist
(integer) 2000000
127.0.0.1:7000> UNLINK mylist
(integer) 1
127.0.0.1:7000> SLOWLOG get
1) 1) (integer) 1
2) (integer) 1505465188
3) (integer) 30
4) 1) "UNLINK"
2) "mylist"
5) "127.0.0.1:17015"
6) ""
Note: In contrast, the DEL command is still a synchronous, blocking delete operation.
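From a client, the same idea applies; for example, with redis-py (a sketch, assuming a client connected to the instance on port 7000 from the example above):
import redis

r = redis.Redis(port=7000)
# UNLINK returns immediately; the memory of the big key is released
# in a background thread (Redis 4.0+). DEL would block instead.
r.unlink("mylist")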
127.0.0.1:7000> DBSIZE
(integer) 1812295
127.0.0.1:7000> flushall // Synchronously flushes the instance's data. Deleting the 1.8 million keys takes about 1,020 milliseconds.
OK
(1.02s)
127.0.0.1:7000> DBSIZE
(integer) 1812637
127.0.0.1:7000> flushall async // Asynchronously flushes the instance's data. For the same ~1.8 million keys, the command returns in about 9 milliseconds.
OK
127.0.0.1:7000> SLOWLOG get
1) 1) (integer) 2996109
2) (integer) 1505465989
3) (integer) 9274 // The instruction takes 9.2 milliseconds to run.
4) 1) "flushall"
2) "async"
5) "127.0.0.1:20110"
6) ""
Lazy-free can also be applied to passive deletion. There are currently four scenarios, each controlled by a corresponding configuration parameter; all are disabled by default:
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
slave-lazy-flush no
lazyfree-lazy-eviction controls whether to use the lazy-free mechanism when Redis memory usage reaches maxmemory and an eviction policy is set. If lazy free is enabled in this scenario, the memory of evicted keys may not be released promptly, so Redis can temporarily exceed the maxmemory limit. Test this against your workload before using it; we do not recommend setting it to yes in a production environment.
lazyfree-lazy-expire controls whether to use the lazy-free mechanism when a key with a TTL expires. We recommend enabling it in this scenario, since the speed of expired-key deletion is adjusted adaptively.
Some commands implicitly perform a DEL when processing an existing key. For example, the RENAME command first deletes the target key if it already exists. If that target key is a bigkey, the implicit deletion blocks the server. The lazyfree-lazy-server-del parameter solves exactly this type of problem, and we recommend enabling it.
slave-lazy-flush applies to full synchronization on the slave: before loading the master's RDB file, the slave runs FLUSHALL to clear its own data, and this parameter determines whether that flush uses the asynchronous mechanism. If little memory changes during the flush, we recommend enabling it. This shortens full synchronization, which in turn reduces memory growth on the master caused by the swelling of the replication output buffer.
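Putting the recommendations above together, a hedged starting point for production might look like this (adjust to your workload):
# Lazy-free settings reflecting the recommendations above.
lazyfree-lazy-eviction no      # keep off: evicted memory may lag behind maxmemory
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
slave-lazy-flush yes           # if memory changes little during full sync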
Lazy free exposes only one monitoring metric: lazyfree_pending_objects, the number of objects Redis has queued for lazy free and that are waiting to be reclaimed. It does not reflect the element count of a single big key or the amount of memory waiting to be reclaimed. Still, the value has some reference value: it can be used to monitor the efficiency of Redis's lazy-free thread and the backlog of pending objects. For example, in the flushall async scenario a small backlog can build up.
# info memory
# Memory
lazyfree_pending_objects:0
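As with expired_keys, this metric can be polled from a client (a minimal sketch with redis-py):
import redis

r = redis.Redis()
# Number of objects queued for background freeing; a persistently
# high value means the lazy-free thread is falling behind.
print(r.info(section="memory")["lazyfree_pending_objects"])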
Note: Internally, the UNLINK command (unlinkCommand()) and the DEL command (delCommand()) call the same function, delGenericCommand(), to delete a key. Its lazy argument indicates whether the key should be lazy-freed; if so, dbAsyncDelete() is called.
However, lazy free is not necessarily performed for every UNLINK command. Redis estimates the cost of releasing the key and performs lazy free only when the cost is greater than LAZYFREE_THRESHOLD (64).
The cost is computed by lazyfreeGetFreeEffort(): for collection-type keys with the corresponding encoding, the cost is the number of elements in the key; otherwise, the cost is 1.
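The decision therefore looks roughly like this (an illustrative Python sketch, not Redis source):
LAZYFREE_THRESHOLD = 64

def should_lazy_free(value) -> bool:
    # For collection values the effort is the element count;
    # for simple values (e.g. a string) it is 1.
    effort = len(value) if isinstance(value, (list, set, dict)) else 1
    return effort > LAZYFREE_THRESHOLD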
Redis provides a configuration item that tells the background thread not to flush to disk (not to trigger the fsync system call) while the child process is performing an AOF rewrite.
This is equivalent to temporarily setting appendfsync to none during the AOF rewrite. The configuration is listed below:
# During the AOF rewrite, the AOF background sub-thread does not flush the disk.
# This is equivalent to temporarily setting appendfsync to none during this period.
no-appendfsync-on-rewrite yes
Note that if this option is enabled and the instance goes down during an AOF rewrite, more data will be lost. You need to weigh performance against data safety.
If disk resources are occupied by other applications, the fix is relatively simple: identify which application is writing heavily to the disk and migrate it to another machine, so it no longer affects Redis.
If you have high requirements for both Redis performance and data safety, we recommend optimizing at the hardware level: switch to SSDs to improve the disk's I/O capability and ensure sufficient disk resources during AOF rewrites. Also, run Redis on a dedicated machine whenever possible.
In most cases, you must restart a Redis instance to release Swap. To avoid impacting your business, first perform a master-replica switchover, release the Swap on the original master node and restart it, and then switch back after data synchronization is complete.
The preventive measure is to monitor the memory and Swap usage of the Redis machine, and alert when memory runs low or Swap is used, so the problem can be handled in time.