This topic lists the disadvantages of the scaling solutions available for open source Redis clusters and ApsaraDB for Redis cluster instances and describes the imperceptible scaling solution available for Tair cluster instances.
Open source Redis clusters often need to scale data nodes elastically and migrate data between shards. However, the common scaling solutions have several issues: key-based migration is slow, commands that involve multiple keys may fail during migration, Lua scripts are not migrated, migration of large keys may cause latency or even trigger a high availability switchover, and migration failures and rollbacks are complex to handle.
In response to these issues, the Tair team developed a new architecture based on slot replication to ensure imperceptible data migration. This architecture also optimizes the thread scheduling algorithm within instances so that instances can be managed efficiently and accurately. This is the principle of imperceptible scaling.
Common scaling solutions for open source Redis clusters and their disadvantages
Elastic scaling of open source Redis clusters
Open source Redis clusters use a gossip protocol to propagate cluster state, such as slot ownership, and migrate data by key. A slot is the smallest unit of migration, and a slot is moved by traversing the keys in the slot and migrating them one at a time. This scaling solution has the following issues:
Low stability
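Commands that involve multiple keys in the same slot may fail during migration. Because data is migrated by key, some of the keys may already be on the destination node while others are still on the source node.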
Lua scripts cannot be replicated at the same time when data is migrated. As a result, Lua scripts may be lost after the data migration is complete.
When data is replicated, migration of large keys may cause latency or even errors that may trigger high availability switchover.
High O&M difficulty
If an error occurs during data migration, you must manually restore the database data. This process is difficult, time-consuming, and error-prone.
It takes a long time to scale clusters because migrating data by key is time-consuming. This often forces businesses to perform scaling activities during off-peak hours, which can impact normal business operations.
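The multi-key failure described above can be illustrated with a small simulation. This is a minimal sketch, not Redis code: `Node`, `migrate_one_key`, `mget`, and `demo` are hypothetical names, and the error handling is only loosely modeled on the TRYAGAIN error that open source Redis clusters return when the keys of a multi-key command are split between the source and destination nodes.

```python
from dataclasses import dataclass, field


class TryAgain(Exception):
    """Stands in for the TRYAGAIN error of an open source Redis cluster."""


@dataclass
class Node:
    """A toy in-memory shard (hypothetical; not a real Redis node)."""
    name: str
    data: dict = field(default_factory=dict)


def migrate_one_key(source, target, key):
    """Move a single key, as key-based migration does for each key in a slot."""
    target.data[key] = source.data.pop(key)


def mget(node, keys):
    """Serve a multi-key read on a node whose slot is being migrated."""
    present = [k for k in keys if k in node.data]
    # Some keys have left the node and some have not: the command cannot run.
    if 0 < len(present) < len(keys):
        raise TryAgain("keys are split between source and destination")
    # If no key is present, a real cluster would redirect the client;
    # this toy model simply returns None for each missing key.
    return [node.data.get(k) for k in keys]


def demo():
    source = Node("source", {"{user:1}:name": "alice",
                             "{user:1}:email": "a@example.com"})
    target = Node("target")
    keys = list(source.data)

    before = mget(source, keys)               # all keys still on the source
    migrate_one_key(source, target, keys[0])  # slot is now half migrated
    try:
        mget(source, keys)
        during = "ok"
    except TryAgain:
        during = "TRYAGAIN"
    migrate_one_key(source, target, keys[1])  # slot is fully migrated
    after = mget(target, keys)
    return before, during, after
```

Calling `demo()` shows that the same multi-key read works before and after the migration but fails while the slot is only partially moved. The longer the key-by-key migration takes, the longer this window stays open.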
Scaling based on data synchronization and migration components
This solution relies on middleware components rather than open source Redis clusters to migrate data. For example, to perform scaling operations, you can create a cluster, use a middleware component to migrate data to the cluster, and then use a load balancer component to switch access paths. This scaling solution has the following issues:
It takes a long time to synchronize full data.
Costs are high because you must create two sets of resources to perform scaling operations.
Clients are disconnected when a load balancer component performs a switchover. It takes a long time for the switchover to take effect, and the service may be unavailable for up to 10 seconds.
Imperceptible scaling solution for ApsaraDB for Redis cluster instances and its disadvantages
The imperceptible scaling solution available for ApsaraDB for Redis cluster instances can address the preceding issues of open source Redis clusters. However, this solution also has the following issues:
Data migration performed during scaling operations affects instances, and instances may become read-only for a few seconds.
Note: Data migration needs may arise when you perform scaling operations. During data migration, when the proportion of synchronized incremental data to the total amount of incremental data to be synchronized is less than a threshold, the instance becomes read-only. After the remaining incremental data is synchronized, clients can re-connect to the instance by using a new endpoint. If an update request is sent to the instance while the instance is in the read-only state, the request is rejected and a read-only error is returned. Read-only errors cannot be smoothly handled, which may affect your business.
Data migration performed during scaling operations competes for resources with regular operations. This often forces businesses to perform scaling activities during off-peak hours, which can impact the flexibility of scaling operations.
Imperceptible scaling solution for Tair cluster instances
Tair cluster instances provide an imperceptible scaling solution built on a new architecture that uses centralized control components to efficiently manage cluster operations.
This solution is available for Tair cloud disk-based instances, including DRAM-based instances and persistent memory-optimized instances. ApsaraDB for Redis Community Edition cloud disk-based instances also use this solution by default.
This solution has the following benefits:
Imperceptible scaling
While a Tair cluster instance is being scaled, your clients are not affected, your business is not interrupted, and the instance does not remain in the read-only state. You can scale a Tair cluster instance at any time.
The key to imperceptible scaling is reducing the read-only time period on instances during data migration. Tair cluster instances dynamically estimate the amount of time required to migrate the remaining incremental data and keep the read-only time period within milliseconds. In theory, clients do not perceive the read-only state because this period is far shorter than the TCP retransmission time, which is measured in hundreds of milliseconds. While an instance is in the read-only state, write requests for the keys being migrated are cached on the instance instead of being written. After the data migration is complete, clients receive redirection messages. At the same time, the management system and the database engine work together to update instance information as soon as the data migration is complete. This process ensures that scaling operations are imperceptible to clients.
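The idea of estimating the remaining migration time before entering the read-only state can be sketched as follows. This is a minimal model under stated assumptions, not the Tair implementation: `MigrationState`, `estimated_pause_ms`, `catch_up`, the 5 ms budget, and the constant rates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MigrationState:
    """Hypothetical bookkeeping for one slot migration."""
    remaining_bytes: float  # incremental data not yet replicated
    sync_rate: float        # replication throughput, bytes per ms
    write_rate: float       # new incremental data from clients, bytes per ms


def estimated_pause_ms(state):
    """Time needed to drain the remaining data once writes are paused."""
    return state.remaining_bytes / state.sync_rate


def catch_up(state, budget_ms=5.0, step_ms=1.0):
    """Keep replicating with writes enabled until the pause fits the budget.

    Returns (elapsed_ms, pause_ms): how long the catch-up phase ran and how
    long the final read-only window is estimated to last.
    """
    if state.sync_rate <= state.write_rate:
        raise ValueError("replication cannot catch up with incoming writes")
    elapsed_ms = 0.0
    while estimated_pause_ms(state) > budget_ms:
        # While writes are still accepted, the backlog shrinks only by the
        # difference between the replication rate and the write rate.
        net = (state.sync_rate - state.write_rate) * step_ms
        state.remaining_bytes = max(0.0, state.remaining_bytes - net)
        elapsed_ms += step_ms
    return elapsed_ms, estimated_pause_ms(state)
```

In this model the instance accepts writes for the whole catch-up phase and becomes read-only only when the estimated drain time is within the budget, which is what keeps the read-only window in the millisecond range.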
Smooth scaling
Tair optimizes the thread scheduling algorithm within cluster instances to implement fine-grained management of data migration tasks. This improves thread execution efficiency from 10% to a maximum of 80%. You can specify a custom efficiency value within this range. This way, the data migration speed is maximized without impacting your business. In addition, Tair cluster instances support fine-grained scaling without increasing the response time (RT), which prevents high availability switchovers caused by network jitter. This ensures high data reliability.
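One way to picture a configurable efficiency value is as a cap on the share of each scheduling tick that migration work may use. The sketch below is an interpretation, not the Tair scheduler: it assumes the 10% to 80% range is a time share, and `plan_tick`, `MIN_RATIO`, and `MAX_RATIO` are hypothetical names.

```python
MIN_RATIO = 0.10  # assumed lower bound of the configurable range
MAX_RATIO = 0.80  # assumed upper bound of the configurable range


def plan_tick(tick_us, ratio, client_work_us):
    """Split one scheduling tick between client requests and migration.

    Client requests are served first. Migration then uses the idle time
    that is left, up to ratio * tick_us. Returns (client_us, migration_us).
    """
    if not MIN_RATIO <= ratio <= MAX_RATIO:
        raise ValueError("ratio must be between 0.10 and 0.80")
    migration_cap_us = round(tick_us * ratio)
    client_us = min(client_work_us, tick_us)
    migration_us = min(migration_cap_us, tick_us - client_us)
    return client_us, migration_us
```

For example, with a 1,000 µs tick and a ratio of 0.8, migration gets 800 µs when clients need only 100 µs, but only 400 µs when clients need 600 µs. Giving client requests priority is the design choice that keeps migration from increasing the response time in this model.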
Efficient and easy O&M
Tair cluster instances can address the scaling issues of open source Redis clusters by using the following methods:
Pre-backup in the background: Tair instances can perform a pre-backup in the background. The source instance remains fully functional and holds the complete dataset until the migration is complete. This prevents latency caused by large key migration.
Rollback with a few clicks: You can roll back instances with a few clicks if exceptions occur during scaling.
Data migration by slot: Data can be migrated by slot. This ensures that commands that involve multiple keys in the same slot can run as expected.
Lua script replication: During data migration, Lua scripts can be replicated to prevent Lua script loss.
Horizontal scaling: Up to 256 shards can be added to or removed from a single instance.
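Migration by slot keeps multi-key commands working because keys that share a slot always move together. The sketch below shows how open source Redis clusters map a key to one of 16,384 slots (CRC16 of the key, or of its hash tag, modulo 16384). The function names `hash_tag` and `key_slot` are chosen for illustration, and the sketch assumes that Tair cluster instances follow the same slot mapping.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in a Redis cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed.

    If the key contains '{...}' with at least one character between the
    first '{' and the first '}' after it, only that substring is hashed.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # non-empty tag
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Compute the cluster hash slot of a key."""
    # binascii.crc_hqx with an initial value of 0 is CRC16-CCITT (XMODEM),
    # the CRC16 variant specified for Redis cluster key hashing.
    return binascii.crc_hqx(hash_tag(key.encode()), 0) % SLOT_COUNT
```

Keys that share a hash tag, such as `{user:1}:name` and `{user:1}:email`, map to the same slot, so a command such as MGET on both keys targets a single slot that is migrated as one unit.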
Cost-effectiveness
Compared with solutions that require a middleware component, this solution reduces costs because you do not need to create two sets of resources.