By Yuxun
Community Redis cluster is a cluster architecture of acentric node P2P, and internally it adopts the gossip protocol to transmit and maintain the topological structure of the cluster and the cluster metadata. Visit the following link for the official Redis cluster tutorial
Failover is the fault tolerance mechanism provided by Redis cluster, and one of the most core functions of the cluster. Failover supports two modes:
Failure failover is manifested in the process of a slave assuming control of a master after a master fragment failure.
All of the fragments in a cluster are transmitted by the gossip protocol. The probe steps are:
(1) The non-traversal cluster nodes in cron send pings. They randomly select the node with the oldest pong_recv from five nodes and send the ping from pong_recv in the re-traversed nodes to the node of timeout/2.
(2) As for the time during which it has retraversed each node and times out after sending the ping packet and no pong packet has been received, the timeout will set the corresponding fragment to a pfail state, and in the course of the gossip packet with other nodes each node will carry a packet marked as pfail state.
(3) After each normal fragment has received a ping packet, the master fragment in the statistics cluster sets the failure nodes as pfail, and if more than one-half of the nodes are set as pfail the node setting will be fail state. If this fragment belongs to the slave nodes of a failure node, it will broadcast of its own accord the failure node as a fail state.
The core flow path is explained by the 3 node clusters in the figure below:
In the cron function, a slave node acquiring a master node state is fail, and it initiates of its own accord one failover operation; this operation is not executed immediately, but rather several constraints have been designed:
(1) Expired time out is not executed. How to judge whether or not it is sufficiently expired? data_age = current time point - time point of the last loss of contact with the master - time out time
If the data_age is greater than the ping interval time from master to slave + time out time * cluster_slave_validity_factor, then it will be regarded as expired. cluster_slave_validity_factor is one setting, and the smaller the cluster_slave_validity_factor is set the harder it is for it to trigger failover.
(2) The time failover_auth_time of delayed execution is calculated, and failover_auth_time = current time + 500 ms + random value of 0-500 ms + rank of current salve * 1s. The rank is calculated based on the already synchronized offset. The larger the delay of offset synchronization, the greater the rank will be, and the longer the said slave will postpone the time for triggering failover. Simultaneous failure of several slaves is thereby avoided. Failover will only be executed if the current time reaches the time of failover_auth_time.
(1) It automatically increments currentEpoch, and the reassignment gives failover_auth_epoch
(2) It initiates a failover vote to another master fragment, and awaits the voting results
(3) After the other master fragment receives a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST request, it will judge whether or not it accords with the following circumstances:
(4) The other masters respond CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK, and the slave end does the statistics after it receives this
(5) When it is judged that statistics exceed one-half of the master responses in cron, it starts executing failover
(6) It marks its own node as master
(7) Clean up duplicate links
(8) It resets the cluster topological structure information
(9) It broadcasts to all of the nodes in the cluster
The core flow path is explained by the 3 node clusters in the figure below:
Artificial failover supports failover in three modes: default, force, takeover
(1) CLUSTERMSG_TYPE_MFSTART is sent to the master
(2) After the master receives it, it sets clients_pause_end_time = current time + 5s * 2 and clients_paused = 1, and client suspends all requests; newly established requests will be added to the block client list.
(3) The master carries the information of repl_offset in the ping packet
(4) The slave examines the repl_offset of the master, and confirms that synchronization is already finished.
(5) It is set to mf_can_start = 1, and the normal failover flow path is started in cron. It is not necessary to delay execution like the failure failover setting, but rather the operation is executed immediately, and it is not necessary to consider whether or not the master is in a fail state during the voting of other masters.
The state of master-slave synchronization is ignored, mf_can_start = 1 is set, and it is marked failover and is started.
Steps 6-9 of failure failover are executed directly, master-slave synchronization is ignored, and voting of the other masters of the cluster is ignored.
ApsaraDB for Redis is a stable, reliable and scalable database service with superb performance. It is structured on Apsara Distributed File System and full SSD high-performance storage, and supports master-slave and cluster-based high-availability architectures. It offers a full range of database solutions including disaster recovery switchover, failover, online expansion, and performance optimization. We welcome everyone to purchase and use ApsaraDB for Redis.
What Does the Future Hold for Next-Generation Cloud Database Technology in the Cloud Native Era?
Alibaba Cloud Native Community - December 26, 2022
ApsaraDB - April 17, 2019
Alibaba Clouder - January 21, 2021
Apache Flink Community China - September 27, 2020
Alibaba Clouder - December 19, 2018
Alibaba Clouder - March 19, 2020
A key value database service that offers in-memory caching and high-speed access to applications hosted on the cloud
Learn MoreApsaraDB Dedicated Cluster provided by Alibaba Cloud is a dedicated service for managing databases on the cloud.
Learn MoreAn on-demand database hosting service for MySQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreMore Posts by ApsaraDB