Design a disaster recovery solution based on Kubernetes container clusters - Container Service for Kubernetes

When you design a system architecture, you must consider the threats that information systems and infrastructure may encounter, such as hardware failures, software crashes, misoperations, attacks, and natural disasters. A sound disaster recovery (DR) solution ensures that your business can recover quickly in any of these situations. This topic describes how to design a resilient DR architecture and solution based on Kubernetes clusters and the network, database, middleware, and observability services of Alibaba Cloud. The Kubernetes clusters can be Container Service for Kubernetes (ACK) clusters provided by Alibaba Cloud, clusters provided by another cloud service provider, or clusters that you deploy and manage in your own data centers.

DR objectives

Recovery time objective (RTO): The maximum acceptable duration between service interruption and service recovery. RTO represents the longest service interruption that can be tolerated.
Recovery point objective (RPO): The maximum acceptable time between the last data recovery point and the service interruption. RPO represents the maximum amount of data that can be lost or must be reconstructed.

A smaller RTO or RPO means shorter service downtime or less data loss, but also higher resource costs and more complicated O&M. Therefore, you must set an appropriate RTO and RPO based on your budget.
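As a rough illustration of how a backup schedule maps onto these objectives, the following Python sketch estimates the worst-case data loss window and the outage duration, and then compares them with the targets. The function names and sample figures are hypothetical. The model assumes that each backup captures the state at the moment it starts and that the recovery steps run one after another.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Worst-case data loss window for periodic backups.

    Assumption: a backup captures the state at its start time. A disaster
    just before the running backup completes falls back to the previous
    backup, so one interval plus one backup run of data is lost.
    """
    return backup_interval + backup_duration


def estimated_rto(detection: timedelta, restore: timedelta, traffic_switch: timedelta) -> timedelta:
    """Outage duration when detection, restore, and traffic switch run sequentially."""
    return detection + restore + traffic_switch


def meets_objective(estimate: timedelta, objective: timedelta) -> bool:
    """True if the estimated value does not exceed the objective."""
    return estimate <= objective


# Hypothetical figures: daily backups that take one hour, checked against a 4-hour RPO.
rpo = worst_case_rpo(timedelta(hours=24), timedelta(hours=1))
rpo_ok = meets_objective(rpo, timedelta(hours=4))
```

With these sample figures, the daily schedule does not meet a 4-hour RPO. The backup interval would have to shrink, or a strategy with continuous replication would be needed.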

DR strategies

Overview

The preceding figure shows three common DR strategies: Backup-Restore, Active-Standby, and Active-Active. The strategies differ in cost and benefits. You can choose a strategy based on your business importance, data loss tolerance, and budget.

Backup-Restore

As shown in the preceding figure, in Backup-Restore mode, applications and data are backed up regularly. When a disaster occurs, the backup data is used to restore the applications in another location, and business traffic is switched to that location.

Because data is not backed up in real time, the data written after the last backup is lost. Restoration may also take a long time, depending on the size of the data to be restored.

Active-Standby

As shown in the preceding figure, in Active-Standby mode, business is handled mostly in the primary location while the secondary location runs fewer instances to save cost. Test traffic is sent periodically to verify system effectiveness.

In the event of a fault or disaster, a primary/secondary switchover is performed, in which more instances are started in the secondary location to handle the traffic diverted to it.

Active-Active

As shown in the preceding figure, in Active-Active mode, the same number of instances are running to concurrently handle the same amount of traffic in both locations.

If a fault or disaster occurs in one location, business traffic is switched over to the other location, which is still running normally.

DR scope

Across zones (multi-AZ)

In Alibaba Cloud, a region typically contains multiple availability zones (AZs) that run on separate power and network resources. You can design a cross-AZ DR solution if you want to protect your business against small-scale disasters such as power or network outages. Inter-AZ communication has low latency. Therefore, cross-AZ DR is suitable for applications such as databases, caches, and message processors.

For more information about regions and zones, see Regions and zones.

Across regions (multi-region)

Some disasters affect an entire region, crippling multiple or even all AZs in it. In this case, you can develop a cross-region DR solution. However, cross-region communication is high in latency, and therefore the DR solution is more complicated and expensive than cross-AZ ones.

Design principle

When you design a multi-AZ or multi-region DR solution, you must confirm whether or not the stateful applications and dependent cloud services support cross-AZ or cross-region DR. Examples of stateful applications: database, cache, and message processing applications.

Solutions and examples

Backup-Restore

Cross-AZ and cross-region backup and recovery on Alibaba Cloud

The following figure shows the architecture of the solution:

The following items describe the solution:

Applications in Container Service for Kubernetes (ACK) are backed up by using the backup center of ACK One. The applications include stateless applications and stateful applications. For stateful applications, you can also back up the related storage data when you back up the application YAML data.
The backup center of ACK One integrates cloud services such as Elastic Compute Service (ECS) (snapshots), File Storage NAS (NAS), Object Storage Service (OSS) (buckets and objects), and Cloud Backup. These services support the one-click backup of application YAML data, persistent volumes (PVs) of cloud disk volumes, and PVs of file systems.
Backup data of applications and storage can be restored to ACK clusters in any supported region and AZ.
Alibaba Cloud databases, such as ApsaraDB RDS for MySQL, can also be backed up and restored. For more information, see Backup and restoration and Migrate data between ApsaraDB RDS instances.

Data backup and restoration for hybrid clouds

The following figure shows the architecture of the solution:

The following items describe the solution:

Kubernetes clusters deployed in data centers or on other cloud platforms can be connected to the ACK console by registering the clusters with ACK One. For more information, see Overview of registered clusters.
Then, the backup center of ACK One can be used to back up the applications in the registered clusters. The applications include stateless applications and stateful applications. For stateful applications, you can also back up the related storage data when you back up the application YAML data.
Backed-up application data (Deployment/StatefulSet) and storage data (PV/PVC) can be restored to ACK clusters in any region and AZ.

Pros and cons

A backup-restore solution is easier and lower in cost to implement than the other solutions. However, the RTO and RPO can be high, and the time to recover applications can be long depending on the amount of data and complexity of the applications. To lower the RTO and RPO, you can use a combination of full backup and incremental backups that is supported by the backup center of ACK One.
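To make the combination of full and incremental backups concrete, the following Python sketch picks the backups needed to restore to a given point in time: the most recent full backup plus every later incremental backup. This is a generic illustration, not the internal logic of the backup center of ACK One. The `Backup` type and the `restore_chain` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, in order, to restore the state at `target`."""
    # Only backups taken at or before the target time are usable.
    candidates = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    # Find the most recent full backup among the candidates.
    last_full = None
    for index, backup in enumerate(candidates):
        if backup.kind == "full":
            last_full = index
    if last_full is None:
        raise ValueError("no full backup at or before the target time")
    # Everything after the last full backup is incremental by construction.
    return candidates[last_full:]
```

Frequent incremental backups shorten the interval between recovery points and therefore lower the RPO. A longer chain takes more time to replay, so a recent full backup helps keep the RTO down.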

Backup-restore solutions are usually used as the last line of protection against disasters, hence their importance. During system O&M, you must ensure the regular backup of data and the usability of data backups.

Backup-restore solutions can also be used to migrate applications across clusters. The following items describe the scenarios:

Migrate applications from your data center to Alibaba Cloud ACK clusters. For more information, see Migrate applications from external Kubernetes clusters to ACK clusters.
Migrate applications to new-version clusters. This applies when the current cluster version is outdated and version updates pose risks, in which case you can create new-version clusters first and then migrate the applications. For more information, see Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.
Adjust account permissions or re-arrange organizations. For more information, see Best practice for using an ACK One Fleet instance to manage multiple clusters across platforms or across accounts and Migrate applications across clusters in different regions.

Multi-cluster Services (MCS)

During application migration, a large number of applications may need to be migrated in batches. In addition, these applications may need to communicate with each other. In this case, you can use the MCS feature of ACK One to implement cross-cluster access as long as the clusters are interconnected.

As shown in the following figure, MCS can inject the Kubernetes Service (including the endpoints) of Application2 from Cluster1 into Cluster2. This way, Application1 from Cluster2 can access Application2 that belongs to Cluster1.

You can use a leased line and the registered cluster feature of ACK One to register Kubernetes clusters from your data center or another cloud platform with ACK and then use the MCS feature of ACK One on the registered clusters.

DR solutions for multiple AZs in a single region

DNS-based traffic distribution

The following figure shows the architecture of the solution:

When the system is running normally:

When a disaster occurs:

AZ1 becomes unavailable. A primary-secondary switchover is performed. GTM diverts traffic to AZ2, in which the cluster has been automatically scaled out. At the same time, a high-availability (HA) multi-AZ switchover is performed for middleware and database services.

The following items describe the solution:

Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
DNS resolution and load distribution are implemented based on GTM. System health is monitored and DR can be automatically triggered upon disasters.
Layer-7 traffic management is implemented in each AZ by using an Ingress.
The secondary cluster runs the same application versions as the primary cluster, but has fewer nodes and application replicas to save costs.
When the primary system becomes unavailable, GTM switches traffic to the secondary system based on DNS resolution.
As traffic increases, the Horizontal Pod Autoscaling (HPA) feature scales out application replicas in the secondary cluster, which in turn triggers the autoscaler to scale out cluster nodes.
Middleware services, such as messaging and cache services, support HA multi-AZ switchover. For more information, see Instance specifications (ApsaraMQ for RocketMQ), Instance editions (ApsaraMQ for Kafka), and Disaster recovery (Tair).
For more information about cross-AZ DR of Alibaba Cloud ApsaraDB RDS, see Build a high availability architecture.
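The scale-out of the secondary cluster can be reasoned about with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The following Python sketch applies that formula with minimum and maximum bounds. It omits the tolerance band and stabilization behavior of the real controller, and the sample numbers are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified HPA calculation: ceil(current * observed / target), then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical failover: the secondary cluster runs 2 replicas with a 60% CPU
# target, and the diverted traffic pushes average utilization to 240%.
after_failover = desired_replicas(2, 240, 60, min_replicas=2, max_replicas=20)
```

If the resulting replicas do not fit on the existing nodes, the pending pods are what prompts the node autoscaler to add capacity. The maximum replica count and the node pool limits of the secondary cluster therefore need to cover the full production load.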

Important

This solution is based on DNS traffic forwarding. Due to DNS caching, some traffic is still routed to the primary system when a disaster occurs, which may cause loss of some data.
This solution requires you to configure and maintain Layer-7 Ingress rules in two ACK clusters, which can be costly.
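The effect of DNS caching can be estimated with simple arithmetic. The following Python sketch computes an approximate upper bound on how long clients may keep reaching the failed primary system: the time needed to detect the failure plus the TTL of the cached DNS record. The parameter names are hypothetical and do not correspond to GTM settings. The estimate assumes that resolvers and clients honor the TTL.

```python
def stale_traffic_window_seconds(
    check_interval_s: int,
    failures_to_trip: int,
    dns_ttl_s: int,
) -> int:
    """Approximate upper bound on the time traffic may still reach the failed site."""
    # The health check must fail several times in a row before failover triggers.
    detection_s = check_interval_s * failures_to_trip
    # A record cached just before the switch stays valid for one full TTL.
    return detection_s + dns_ttl_s


# Hypothetical settings: 15-second checks, 3 consecutive failures, 60-second TTL.
window = stale_traffic_window_seconds(15, 3, 60)
```

A shorter TTL narrows the window at the cost of more DNS queries. Clients that ignore the TTL can extend the window beyond this estimate.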

Multi-cluster gateway based on ACK One

The following figure shows the architecture of the solution:

When the system is running normally:

When a disaster occurs:

AZ1 becomes unavailable. A primary-secondary switchover is performed. The multi-cluster gateway (cloud-native MSE gateway) automatically switches traffic to ACK Cluster2 in AZ2. Application instances are automatically scaled out in ACK Cluster2.

The following items describe the solution:

Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
Standard Kubernetes Ingress rules are defined in YAML format by using a multi-cluster gateway of ACK One. This helps implement Layer-7 traffic governance and primary-secondary traffic distribution. Cross-AZ HA is implemented for the multi-cluster gateway.
The secondary cluster runs the same application versions as the primary cluster, but has fewer nodes and application replicas to save costs.
You can use the gateway to send test traffic with specific HTTP headers to the secondary cluster to check the cluster status.
When the primary system becomes unavailable, the gateway diverts business traffic to the secondary system.
As traffic increases, the HPA feature scales out application replicas in the secondary cluster, which in turn triggers the autoscaler to scale out cluster nodes.
For more information about cross-AZ DR of Alibaba Cloud ApsaraDB RDS, see Build a high availability architecture.

Note

This solution is based on HTTP Layer-7 traffic forwarding and supports Layer-7 health checks. Compared with DNS-based traffic distribution, this solution features much less traffic loss during primary/secondary switchovers.

Traffic governance based on Ingress rules is supported on the gateway side. Compared with DNS-based traffic distribution, this solution features a simpler system architecture and lower maintenance cost based on the combination of Layer-4 load balancing (between the primary and secondary systems) and Layer-7 Ingress gateways.
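The routing behavior described above, with header-based test traffic and health-based failover, can be sketched as a small decision function in Python. This only illustrates the logic and is not the configuration or API of the multi-cluster gateway. The header name `x-dr-test` and the backend labels are hypothetical.

```python
from typing import Mapping

TEST_HEADER = "x-dr-test"  # hypothetical header that marks test traffic


def pick_backend(headers: Mapping[str, str], primary_healthy: bool) -> str:
    """Choose the cluster that should serve a request."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Test traffic always goes to the secondary cluster to verify it.
    if normalized.get(TEST_HEADER) == "standby":
        return "standby"
    # Business traffic uses the primary cluster until its health check fails.
    return "primary" if primary_healthy else "standby"
```

Because the decision is made per request at Layer 7, a failed health check takes effect on the next request rather than after a DNS cache expires.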

Cross-AZ Active-Active

The preceding two solutions are both in Active-Standby mode and are for use in single-region multi-AZ DR. As a matter of fact, the same architecture can be used in Active-Active mode based on DNS and ACK One multi-cluster gateways. In addition, you can configure custom ratios for traffic distribution, such as 40% and 60%, and configure automatic traffic switchover in disasters.

In Active-Active mode, you must determine the number of application replicas in each cluster based on the configured ratio of traffic distribution. In addition, you must configure the auto-scaling capability in the clusters to adapt to cluster traffic changes during switchovers.
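The relationship between the traffic ratio and the replica count can be worked out as follows. The Python sketch below sizes each cluster from its share of traffic and shows what a single surviving cluster needs after a switchover. The function names, the per-replica capacity, and the traffic figures are hypothetical.

```python
import math
from typing import Dict


def replicas_per_cluster(
    total_rps: int,
    rps_per_replica: int,
    weights: Dict[str, int],
) -> Dict[str, int]:
    """Replicas each cluster needs for its weighted share of the traffic."""
    total_weight = sum(weights.values())
    return {
        name: math.ceil(total_rps * weight / total_weight / rps_per_replica)
        for name, weight in weights.items()
    }


def replicas_after_failover(total_rps: int, rps_per_replica: int) -> int:
    """Replicas a single surviving cluster needs to carry all traffic."""
    return math.ceil(total_rps / rps_per_replica)


# Hypothetical load: 1,000 requests per second, 50 per replica, split 40/60.
normal = replicas_per_cluster(1000, 50, {"cluster1": 40, "cluster2": 60})
surviving = replicas_after_failover(1000, 50)
```

The gap between the normal replica count of a cluster and the failover figure is the headroom that auto scaling must be able to provide during a switchover.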

Pros and cons

Single-region multi-AZ solutions are cost-efficient. You can use multi-AZ deployment and HA for cloud services, such as cloud native gateway, container, middleware, and database services, for minimized business changes and rapid switchover.

These solutions are used to protect business against single-AZ disasters. If a region-level disaster occurs, these solutions are unable to effectively protect your business.

Cloud + IDC DR solutions for a single region

The architecture of this solution is similar to that of the single-region multi-AZ DR solution. The following items describe this solution:

A leased line connection is established between a VPC and a data center to provide management and data channels.
The cluster from the data center is connected to ACK One by using the registered cluster feature. This allows you to use the powerful observability and security capabilities of Alibaba Cloud to centrally manage your on-premises cluster and ACK cluster.
Applications are deployed in both clusters by using ACK One GitOps. Continuously consistent deployment is implemented based on Git repositories.

DNS-based traffic distribution (single-region on-premises and off-premises active-active)

The following figure shows the architecture of the solution:

Solution based on ACK One multi-cluster gateways (single-region on-premises and off-premises active-active)

The following figure shows the architecture of the solution:

Multi-region DR solutions

Single-region DR solutions are insufficient if your business is highly important, large in scale, and has a large number of users from many regions. In this case, you may require a multi-region DR solution. In such a solution, business systems are separately deployed in multiple regions to ensure that each region can independently provide services and handle faults.

The following content provides two multi-region DR solutions:

Based on a single ACK One Fleet instance
Based on multiple ACK One Fleet instances

Based on a single ACK One Fleet instance

The following figure shows the architecture of the solution:

The following items describe the solution:

Use Anti-DDoS Proxy and WAF to protect services against volumetric attacks and web application attacks, such as SQL injection attacks, cross-site scripting (XSS) attacks, and command injection attacks.
For more information, see GTM works with WAF, GA, and SLB and Protect a website service by using Anti-DDoS Pro or Anti-DDoS Premium and WAF.
User requests are routed to the nearby region by using GTM.
Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
A multi-region HA solution is adopted for caches. For more information, see Overview.
A cross-region HA solution is adopted for databases. For more information, see Multi-zone deployment architecture.
Multi-AZ DR solutions can be adopted within each region.
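Routing requests to a nearby region while skipping a failed one can be sketched as follows. This Python example uses measured latency as a stand-in for "nearby" and does not model the actual routing policies of GTM. The function name is hypothetical and the region IDs are only sample values.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the healthy region with the lowest latency for a client."""
    candidates = {
        region: latency
        for region, latency in latency_ms.items()
        if region in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Sample latencies from one client to two regions.
latencies = {"cn-hangzhou": 12.0, "cn-beijing": 35.0}
```

When the nearest region fails its health check, the same function falls back to the next closest region. Each region therefore needs enough capacity, or enough scaling headroom, to absorb traffic from its neighbors.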

Based on multiple ACK One Fleet instances

The following figure shows the architecture of the solution:

The following items describe the solution:

Use Anti-DDoS Proxy and WAF to protect services against volumetric attacks and web application attacks, such as SQL injection attacks, cross-site scripting (XSS) attacks, and command injection attacks.
For more information, see GTM works with WAF, GA, and SLB and Protect a website service by using Anti-DDoS Pro or Anti-DDoS Premium and WAF.
User requests are routed to the nearby region by using GTM.
Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
A multi-region HA solution is adopted for caches. For more information, see Overview.
A cross-region HA solution is adopted for databases. For more information, see Multi-zone deployment architecture.
Multi-AZ DR solutions can be adopted within each region.

Cross-region unit-based multi-active solution

Compared with the preceding cross-region DR solution, this solution requires you to design rules to shard applications and data. This enables units to provide complete services based on data shards. This solution can securely isolate business and allows you to separately scale out business in different units to serve large user groups.

In this solution, business is divided into subunits and central units. A central unit with user data manages multiple subunits with sharded data. This solution requires the business system to support custom traffic distribution, data splitting, and unit interaction, and is highly complex to implement.
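A sharding rule of the kind this solution requires can be as simple as a stable hash of the user ID. The following Python sketch maps each user to one unit so that all requests and data for that user stay in the same unit. The function name and unit names are hypothetical. A production rule would also need to handle data that must stay in the central unit and the rebalancing of users when units are added.

```python
import hashlib
from typing import List


def unit_for_user(user_id: str, units: List[str]) -> str:
    """Map a user to a unit with a stable hash so routing is deterministic."""
    if not units:
        raise ValueError("at least one unit is required")
    # SHA-256 gives the same result on every process and host, unlike hash().
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


# Hypothetical units deployed in different regions.
UNITS = ["unit-a", "unit-b", "unit-c"]
assigned = unit_for_user("user-1001", UNITS)
```

Plain modulo hashing reassigns most users when the number of units changes. Lookup tables or consistent hashing are common alternatives when units are expected to be added over time.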

The following figure shows the architecture of the solution: