Disaster Recovery Architecture and Solution Based on Kubernetes Clusters

This article explains how to design a disaster recovery architecture and build a resilient system using Kubernetes clusters along with Alibaba Cloud services.

By Yu Zhuang

When designing a system architecture, we must assume that any component or piece of infrastructure can fail at any time, whether because of a natural disaster, a power outage, a network outage, or a system change. To overcome these challenges, it is essential to design an appropriate disaster recovery architecture.

This article explains how to design a disaster recovery architecture and build a resilient system using Kubernetes clusters (including ACK clusters, clusters on third-party clouds, and on-premises IDC Kubernetes clusters) along with Alibaba Cloud services (network, database, middleware, and observability).

Disaster Recovery Target

Recovery time objective (RTO):

The maximum acceptable time delay between service interruption and service recovery. It determines the permissible duration of service downtime.

Recovery point objective (RPO):

The maximum acceptable amount of time between the last data recovery point and the service interruption. It determines how much data loss, or how much data that must be reconstructed, is acceptable.

Lower RTO and RPO values mean less downtime and less data loss, but they also mean higher resource costs and higher O&M complexity. Therefore, you need to specify appropriate RTO and RPO values based on the importance of the workload.
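To make the two definitions concrete, the following minimal sketch measures the RPO and RTO actually achieved in an incident and compares them with the targets. The timestamps, the 6-hour backup interval, and the function names are illustrative assumptions, not values from any Alibaba Cloud product.

```python
from datetime import datetime, timedelta


def achieved_rpo(recovery_points, failure_time):
    """Return the data-loss window: failure time minus the last recovery point before it."""
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return failure_time - max(usable)


def achieved_rto(failure_time, restored_time):
    """Return the downtime: from service interruption to service recovery."""
    return restored_time - failure_time


def meets_targets(rpo, rto, rpo_target, rto_target):
    """True only if both the data-loss window and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident: backups every 6 hours, failure at 14:30, service back at 15:15.
backups = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12, 18)]
failure = datetime(2024, 1, 1, 14, 30)
restored = datetime(2024, 1, 1, 15, 15)

rpo = achieved_rpo(backups, failure)
rto = achieved_rto(failure, restored)
print("achieved RPO:", rpo, "achieved RTO:", rto)
print("within targets:", meets_targets(rpo, rto, timedelta(hours=6), timedelta(hours=1)))
```

With periodic backups, the worst-case RPO is roughly the backup interval, which is why tightening the RPO target usually means more frequent (and more expensive) recovery points.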

Disaster Recovery Strategy

_1

In the preceding figure, three common disaster recovery strategies are described: backup-restore, active-standby, and active-active. Different strategies have different benefits and costs. You must analyze the importance, risks, and costs of your business to select an appropriate disaster recovery strategy.

Backup-Restore

Back up applications and data while the system is running. In the event of a disaster, restore the applications and data in another location and switch the business traffic to that location. Because data cannot be backed up in real time, data written after the last backup may be lost. In addition, if the data volume is large, the restore may take a long time.

_2

Active-Standby

In active-standby mode, the active location handles all business traffic. The standby location can run fewer application instances to save costs, and it periodically receives test traffic to verify that the system still works. In the event of a disaster, perform an active-standby database switchover, scale out the application instances in the standby location, and switch the business traffic.

_3

Active-Active

In active-active mode, both locations run the same number of application instances and process business traffic at the same time. In the event of a disaster, perform an active-standby database switchover and switch the business traffic to the remaining location.

_4

Disaster Recovery Scope

Multi-AZ

An Alibaba Cloud region [11] contains multiple availability zones (AZs). An AZ is a physical area that has independent power and network connections. You can use multiple AZs to design disaster recovery strategies for local interruptions such as power or network outages. Because network latency between AZs is low, it is easier to implement disaster recovery solutions for the data layer, including databases, caches, and messages.

Multi-Region

To cope with larger-scale disasters that may affect multiple AZs in the same region, you can use multiple regions to design disaster recovery strategies. However, the complexity and implementation cost of such a solution are high because of the greater network latency between regions.

When you select a multi-AZ or multi-region disaster recovery solution, you must consider whether stateful applications and dependent cloud services, such as databases, caches, and messages, support multi-region or multi-AZ disaster recovery.

Examples

Backup-Restore

Backup-Restore of Cross-AZ and Cross-Region Public Clouds

  1. You can use the ACK One backup center [3] to back up applications in ACK clusters, including stateless and stateful applications. For stateful applications, you can back up related storage data while backing up the application YAML files.
  2. The ACK One backup center integrates with disk snapshots [12], Apsara File Storage NAS [13], Object Storage Service (OSS) [14], and Cloud Backup [15] to support one-click backup of application YAML files, cloud disk PVs, and file system PVs.
  3. After the backup, you can restore applications and storage data to an ACK cluster in any region and AZ at any time.
  4. For backup-restore of Alibaba Cloud database services, refer to the documentation of the corresponding database products, such as Backup and restoration of ApsaraDB RDS for MySQL databases [16] and Migrate data between ApsaraDB RDS for MySQL instances [17].

_5

Backup-Restore of Hybrid Clouds

  1. You can use a registered cluster [4] of ACK One to connect a self-built on-premises IDC cluster or non-Alibaba Cloud Kubernetes cluster to the ACK console.
  2. After you register a cluster with ACK One, you can use the ACK One backup center to back up applications in your self-built on-premises IDC clusters or non-Alibaba Cloud Kubernetes clusters, including stateless and stateful applications. For stateful applications, you can back up related storage data while backing up application YAML files.
  3. After the backup, you can restore applications (Deployment/StatefulSet) and data (PV/PVC) to an ACK cluster in any region and AZ at any time.

_6

Summary

The implementation cost of the backup-restore solution is low, but RTO and RPO are relatively long, depending on the size of the data volume and the complexity of the application. The backup center can provide full backup and incremental backup capabilities to reduce RTO and RPO.

Backup-restore is an important disaster recovery solution. During system O&M, you must ensure the timeliness and recoverability of backups.
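The interplay of full and incremental backups can be illustrated with a short sketch. It selects the restore chain for a given failure time: the most recent full backup plus every incremental taken after it. The `Backup` type and the hour-based timestamps are hypothetical, and this is a generic illustration rather than the backup center's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    taken_at: int  # hypothetical timestamp, in hours since an arbitrary start
    kind: str      # "full" or "incremental"


def restore_chain(backups, failure_time):
    """Return the latest full backup before the failure plus the incrementals that follow it."""
    usable = sorted(
        (b for b in backups if b.taken_at <= failure_time),
        key=lambda b: b.taken_at,
    )
    last_full_index = None
    for index, backup in enumerate(usable):
        if backup.kind == "full":
            last_full_index = index
    if last_full_index is None:
        raise ValueError("no full backup exists before the failure")
    # Everything after the last full backup is, by construction, incremental.
    return usable[last_full_index:]


history = [
    Backup(0, "full"), Backup(6, "incremental"), Backup(12, "incremental"),
    Backup(24, "full"), Backup(30, "incremental"), Backup(36, "incremental"),
]
chain = restore_chain(history, failure_time=33)
print([f"{b.kind}@{b.taken_at}h" for b in chain])
print("data-loss window (hours):", 33 - chain[-1].taken_at)
```

Frequent incrementals shrink the data-loss window (RPO), while a recent full backup keeps the chain short, which shortens the restore (RTO).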

In addition, many users choose to use the backup-restore feature to migrate applications across clusters in the following scenarios:

  1. Migrate applications from on-premises IDC clusters to ACK clusters. For more information, see Migrate applications from external Kubernetes clusters to ACK clusters [18].
  2. If a cluster runs an old Kubernetes version and an in-place upgrade carries stability risks, you can first create a cluster that runs the new version and then use backup-restore to migrate the applications to it. For more information, see Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version [19].
  3. When you consolidate cloud accounts or reorganize, you may need to manage multiple clusters across accounts [20] and migrate applications across clusters [21].

Multi-cluster Services

When many applications must be migrated, you typically migrate them in batches, yet the applications still need to call one another while the migration is in progress. In this case, you can use the ACK One Fleet multi-cluster Services [5] to implement cross-cluster access to application Kubernetes Services. As shown in the following figure, ACK One Fleet multi-cluster Services can inject the Kubernetes Services (including endpoints) of Application 2 on Cluster 1 into Cluster 2, so that Application 1 on Cluster 2 can access Application 2 on Cluster 1.

_7

With a leased line, you can register clusters with ACK One so that on-premises IDC clusters and non-Alibaba Cloud Kubernetes clusters can also use ACK One Fleet multi-cluster Services.
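As a rough sketch of what exporting a Service to another cluster involves, the snippet below builds `ServiceExport` and `ServiceImport` manifests following the upstream Kubernetes Multi-Cluster Services API (`multicluster.x-k8s.io/v1alpha1`). That ACK One Fleet accepts exactly these objects is an assumption here, and the Service name, namespace, and port are hypothetical; check the multi-cluster Services documentation [5] for the supported resources.

```python
import json

# Upstream Kubernetes Multi-Cluster Services API group (assumed to apply; see lead-in).
MCS_API_VERSION = "multicluster.x-k8s.io/v1alpha1"


def service_export(name, namespace):
    """Manifest that marks an existing Service in the provider cluster as exported."""
    return {
        "apiVersion": MCS_API_VERSION,
        "kind": "ServiceExport",
        "metadata": {"name": name, "namespace": namespace},
    }


def service_import(name, namespace, port, protocol="TCP"):
    """Manifest that makes the exported Service reachable in the consumer cluster."""
    return {
        "apiVersion": MCS_API_VERSION,
        "kind": "ServiceImport",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "ClusterSetIP",
            "ports": [{"port": port, "protocol": protocol}],
        },
    }


# Hypothetical names: Application 2 lives in Cluster 1 and is consumed from Cluster 2.
exported = service_export("application-2", "demo")
imported = service_import("application-2", "demo", port=8080)
print(json.dumps([exported, imported], indent=2))
```

The JSON output can be applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.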

Single-region Multi-AZ Disaster Recovery

Based on DNS Traffic Distribution

  1. Use ACK One GitOps application distribution [6] to deploy applications in two ACK clusters, so that both clusters are continuously and consistently deployed from the same Git repositories.
  2. Use the Global Traffic Manager (GTM) [22] to perform DNS resolution to distribute loads, monitor system health status, and automatically trigger switchover in the event of disasters.
  3. In each AZ, the ACK Ingress [7] implements Layer 7 traffic management.
  4. The standby and active clusters have the same application version, but the standby cluster has fewer nodes and fewer application replicas, which helps to save costs.
  5. When the active system is unavailable, GTM performs DNS resolution to resolve the service domain name to the standby system to complete the active-standby switchover.
  6. As traffic increases, ACK HPA [8] scales out application replicas in the standby cluster, which in turn triggers ACK Cluster Autoscaler [9] to scale out cluster nodes.
  7. For more information about cross-AZ disaster recovery of Alibaba Cloud middleware (messages and caches), see the relevant documentation, such as Instance specifications of ApsaraMQ for RocketMQ [23], Instance specifications of ApsaraMQ for Kafka [24], and Tair disaster recovery [25].
  8. For more information about cross-AZ disaster recovery of ApsaraDB RDS, refer to the related documentation, such as Build a high availability architecture of ApsaraDB RDS for MySQL databases [26].
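Step 6 relies on the standby cluster being able to grow from its small footprint to full capacity. A minimal sketch of the autoscaling rule is shown below as a standard Kubernetes `autoscaling/v2` HorizontalPodAutoscaler manifest built in Python. The Deployment name, namespace, replica bounds, and CPU threshold are hypothetical values chosen for illustration.

```python
import json


def hpa_manifest(deployment, namespace, min_replicas, max_replicas, cpu_percent):
    """Build an autoscaling/v2 HPA that scales a Deployment on average CPU utilization."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("expected 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # Keep the standby small in normal operation to save costs...
            "minReplicas": min_replicas,
            # ...but allow it to reach the capacity of the active cluster after a switchover.
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": cpu_percent},
                    },
                }
            ],
        },
    }


# Hypothetical standby sizing: 2 replicas normally, up to 20 after failover.
standby_hpa = hpa_manifest("web", "demo", min_replicas=2, max_replicas=20, cpu_percent=60)
print(json.dumps(standby_hpa, indent=2))
```

The important design point is `maxReplicas`: if it is lower than what the full production load requires, the standby cluster cannot absorb the traffic no matter how quickly DNS switches over.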

Notes:

  1. This solution is based on DNS traffic forwarding. Because DNS records are cached, some requests are still routed to the failed active system for a while after a disaster occurs, which causes service losses.
  2. You need to configure and maintain Layer 7 Ingress rules in the two clusters separately, which increases maintenance costs.

When the system runs as normal:

_8

When a disaster occurs and AZ 1 becomes unavailable, the system switches from active to standby: GTM switches traffic to AZ 2, the application instances of ACK Cluster 2 are automatically scaled out, and the middleware and databases perform a high-availability switchover across AZs.

_9
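How long requests keep reaching the failed site under a DNS-based switchover can be estimated with simple arithmetic. The sketch below assumes a health checker that declares failure after a number of consecutive failed probes and resolvers that honor the record TTL. The parameter names and values are illustrative assumptions, not GTM defaults.

```python
def worst_case_dns_failover_seconds(check_interval_s, failures_to_trip, dns_ttl_s):
    """Upper bound on how long a TTL-honoring client may keep using the failed address."""
    detection_s = check_interval_s * failures_to_trip  # time to declare the site down
    return detection_s + dns_ttl_s                     # plus the longest-lived cached answer


def fraction_still_on_failed_site(seconds_since_switch, dns_ttl_s):
    """Share of clients with a stale cache, assuming cache expiry times are spread uniformly."""
    remaining = 1 - seconds_since_switch / dns_ttl_s
    return max(0.0, remaining)


# Hypothetical settings: probe every 60 s, 3 failures to trip, 600 s record TTL.
print(worst_case_dns_failover_seconds(60, 3, 600))  # 780
print(fraction_still_on_failed_site(150, 600))      # 0.75
print(fraction_still_on_failed_site(900, 600))      # 0.0
```

Lowering the TTL shortens this window at the cost of more DNS queries. Resolvers or clients that ignore TTLs can exceed the bound.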

Based on ACK One Multi-cluster Gateways

  1. Use ACK One GitOps application distribution to deploy applications in two ACK clusters, so that both clusters are continuously and consistently deployed from the same Git repositories.
  2. Use ACK One multi-cluster gateway [10] to define standard Kubernetes Ingress rules in YAML format to implement Layer 7 traffic governance and distribute traffic in active/standby mode. Multi-cluster gateways are highly available across AZs.
  3. The standby cluster and the active cluster have the same application version, but the standby cluster has fewer nodes and fewer application replicas, which helps to save costs. You can send test traffic that carries a specific HTTP header, and the multi-cluster gateway forwards it to the standby cluster to verify that the standby cluster works.
  4. When the active system is unavailable, the multi-cluster gateway of ACK One automatically switches the business traffic to the standby system.
  5. As traffic increases, ACK HPA in the standby cluster scales out application replicas, which in turn triggers ACK Cluster Autoscaler to scale out cluster nodes.
  6. For more information about cross-AZ disaster recovery for Alibaba Cloud database services, refer to related documentation, such as Build a high availability architecture of ApsaraDB RDS for MySQL databases.

Notes:

  1. This solution forwards HTTP traffic at Layer 7 and uses Layer 7 health checks. Compared with the DNS solution, it greatly reduces the loss of business traffic during an active-standby switchover.
  2. Traffic governance is defined in one place, at the gateway, through Ingress rules. Compared with the DNS solution, the Layer 4 Server Load Balancer (SLB) and the Layer 7 Ingress gateways are combined into one component, which reduces system complexity and maintenance costs.
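The routing behavior described above (test traffic identified by a header goes to the standby cluster, and business traffic fails over when the active cluster is unhealthy) can be sketched as a small decision function. This is an illustration of the logic only, not the gateway's implementation. The header name `x-dr-test` and the cluster labels are hypothetical.

```python
def pick_cluster(headers, active_healthy, test_header="x-dr-test"):
    """Decide which cluster receives a request in active/standby mode."""
    # HTTP header names are case-insensitive, so normalize before matching.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Test traffic is always sent to the standby cluster to verify that it works.
    if normalized.get(test_header) == "standby":
        return "standby"

    # Business traffic uses the active cluster unless its health check fails.
    return "active" if active_healthy else "standby"


print(pick_cluster({"Host": "shop.example.com"}, active_healthy=True))   # active
print(pick_cluster({"X-DR-Test": "standby"}, active_healthy=True))       # standby
print(pick_cluster({"Host": "shop.example.com"}, active_healthy=False))  # standby
```

Because the decision is made per request at the gateway, a switchover takes effect as soon as the health check fails, without waiting for client-side DNS caches to expire.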

When the system runs as normal:

_10

When a disaster occurs and AZ 1 becomes unavailable, the system switches from active to standby. The multi-cluster gateway (MSE cloud-native gateway) automatically switches traffic to ACK Cluster 2 in AZ 2, and the application instances there are automatically scaled out.

_11

Cross-AZ Active-active

The preceding two scenarios use active/standby mode as an example to describe the system architecture. With the same architecture, DNS traffic distribution and ACK One multi-cluster gateways also support active-active scenarios: you can configure the traffic distribution ratio (for example, 50%:50%), and automatic failover is supported. In active-active scenarios, the number of application replicas in each cluster must be determined based on the traffic distribution ratio, and auto-scaling must be configured in each cluster to absorb the additional traffic during a switchover.
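The sizing rule can be worked through with a short calculation. The sketch below derives per-cluster replica counts from the traffic ratio and the replica count a surviving cluster needs after failover. The request rate and per-replica capacity are hypothetical figures.

```python
import math


def replicas_per_cluster(total_rps, rps_per_replica, weights):
    """Replicas each cluster needs in normal operation, given its share of the traffic."""
    total_weight = sum(weights.values())
    return {
        cluster: math.ceil(total_rps * weight / total_weight / rps_per_replica)
        for cluster, weight in weights.items()
    }


def replicas_after_failover(total_rps, rps_per_replica):
    """Replicas the single surviving cluster needs to carry all of the traffic."""
    return math.ceil(total_rps / rps_per_replica)


# Hypothetical load: 1000 requests/s in total, each replica handles 50 requests/s.
normal = replicas_per_cluster(1000, 50, {"cluster-az1": 50, "cluster-az2": 50})
failover = replicas_after_failover(1000, 50)
print(normal)    # {'cluster-az1': 10, 'cluster-az2': 10}
print(failover)  # 20
```

In this example each cluster runs 10 replicas in normal operation, but the auto-scaling upper bound in each cluster (for example, the HPA `maxReplicas`) and the node pool limits must allow at least 20.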

Summary

The single-region multi-AZ solution is cost-effective. It uses cloud products (including gateways, containers, middleware, and databases) that already support multi-AZ deployment and multi-AZ high availability, so you can achieve disaster recovery quickly without major changes to your business systems. However, this solution only handles the failure of a single AZ, not the failure of an entire region.

Disaster Recovery Solution: Single-region Cloud + IDC

The architecture of this solution is similar to that of the single-region multi-AZ disaster recovery solution. The key points are as follows:

  1. Establish a leased line connection between the VPC and the IDC to connect the control and data channels.
  2. Register a cluster with ACK One to connect an on-premises IDC cluster and use Alibaba Cloud's powerful observability and security capabilities to manage the on-premises cluster and the ACK cluster in a unified manner.
  3. Use ACK One GitOps application distribution to deploy applications in the two clusters, so that both clusters are continuously and consistently deployed from the same Git repositories.
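As an illustration of item 3, the sketch below generates one Argo CD-style `Application` manifest per target cluster, all pointing at the same Git path and revision. It assumes that the GitOps service accepts Argo CD `argoproj.io/v1alpha1` Application objects. The repository URL, path, and cluster API endpoints are hypothetical placeholders.

```python
import json


def gitops_application(app_name, cluster_name, cluster_server, repo_url, path, revision):
    """Build an Argo CD-style Application that syncs one Git path to one cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{app_name}-{cluster_name}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Same repository, path, and revision for every cluster keeps them consistent.
            "source": {"repoURL": repo_url, "path": path, "targetRevision": revision},
            "destination": {"server": cluster_server, "namespace": app_name},
            # Automated sync re-applies the Git state whenever a cluster drifts from it.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# Hypothetical targets: one cluster on the cloud and one registered IDC cluster.
clusters = {
    "cloud": "https://cloud-cluster.example.invalid:6443",
    "idc": "https://idc-cluster.example.invalid:6443",
}
applications = [
    gitops_application("web", name, server,
                       "https://git.example.invalid/team/web.git", "manifests", "main")
    for name, server in clusters.items()
]
print(json.dumps(applications, indent=2))
```

Generating both objects from one function makes the only per-cluster difference explicit: the destination. Everything that defines the application version comes from Git.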

Based on DNS Traffic Distribution (Active-active: Active on the cloud and active in the data center in a single region)

_12

Based on ACK One Multi-cluster Gateways (Active-active: Active on the cloud and active in the data center in a single region)

_13

Cross-region Disaster Recovery

If the business is large and serves many users across a wide geographic area, disaster recovery solutions in a single region cannot meet its high availability requirements. In this case, a multi-region disaster recovery solution is required. Deploying the business systems independently in multiple regions ensures that the systems in each region form a self-contained closed loop that provides complete service capabilities.

  1. Global Traffic Manager (GTM) allows users to access the nearest region.
  2. Use ACK One GitOps application distribution to deploy applications in two ACK clusters, so that both clusters are continuously and consistently deployed from the same Git repositories.
  3. For multi-region high-availability caching solutions, refer to Alibaba Cloud product documentation, such as Global Distributed Cache for Tair [27].
  4. For more information about cross-region high-availability solutions for databases, see Alibaba Cloud database service documentation, such as Global database networks (GDN) of PolarDB for MySQL [28].
  5. Within a region, a single-region multi-AZ disaster recovery solution can be used.

_14
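The combination of nearest-region access and cross-region failover can be sketched as follows: each user is sent to the healthy region with the lowest latency. This is a simplified illustration of the routing idea, not GTM's actual algorithm. The region names and latency figures are hypothetical.

```python
def pick_region(latency_ms, healthy):
    """Return the healthy region with the lowest latency for a given user."""
    candidates = {
        region: latency
        for region, latency in latency_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        raise RuntimeError("no healthy region is available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from one user's location.
latency = {"region-a": 12, "region-b": 85}

print(pick_region(latency, {"region-a": True, "region-b": True}))   # region-a
print(pick_region(latency, {"region-a": False, "region-b": True}))  # region-b
```

Failover to a farther region only works if that region can serve the user's data, which is why the cache and database replication in items 3 and 4 are part of the same design.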

Unit-based Multi-active Deployment

Unlike the previous solution, multi-region unit-based deployment requires you to design sharding rules that partition applications and data, so that each unit provides complete service capabilities for its own data shards. This isolates faults between units and lets the system scale horizontally to serve a large user base. Generally, units are divided into a central unit that holds all user data and multiple sub-units that each hold the data of their own shards. This method requires support from the business system, custom traffic distribution rules, data splitting, and inter-unit coordination, so it is highly complex.

_15
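A sharding rule can be as simple as a stable hash of the user ID. The sketch below maps each user to one sub-unit so that all requests and data for that user stay in the same unit. The unit names are hypothetical, and real systems often use more elaborate rules (for example, routing tables that allow shards to be moved).

```python
import hashlib

# Hypothetical sub-units; a central unit with all user data would sit alongside them.
UNITS = ["unit-1", "unit-2", "unit-3"]


def unit_for_user(user_id, units):
    """Map a user ID to a unit with a stable hash, so the mapping never changes between calls."""
    if not units:
        raise ValueError("at least one unit is required")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


for user in ("user-1001", "user-1002", "user-1003"):
    print(user, "->", unit_for_user(user, UNITS))
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the built-in function is randomized per process for strings, which would route the same user to different units on different gateway instances. Note that a plain modulo rule reshuffles most users when the number of units changes, which is one reason unit-based designs need careful data-splitting and coordination.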

Summary

Various disaster events can affect the availability of your business. However, you can mitigate or eliminate their impact by leveraging the disaster recovery capabilities offered by Alibaba Cloud products. First, understand your business's availability requirements and select an appropriate disaster recovery strategy. Then, use Alibaba Cloud products, including containers (Container Service for Kubernetes, or ACK [1], and Distributed Cloud Container Platform for Kubernetes, or ACK One [2]), messages, caches, and databases, to design a disaster recovery architecture that meets the RTO and RPO your business requires.

References

[1] Container Service for Kubernetes (ACK)
https://www.alibabacloud.com/product/kubernetes
[2] Distributed Cloud Container Platform for Kubernetes (ACK One)
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/product-overview/ack-one-overview
[3] ACK One backup center
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/backup-center-overview
[4] ACK One registered clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/overview-9
[5] ACK One Fleet multi-cluster Services
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/mcs-overview
[6] ACK One GitOps application distribution
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/gitops-overview
[7] ACK Ingress
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/ingress-overview
[8] ACK HPA
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/horizontal-pod-autoscaling
[9] ACK Cluster Autoscaler
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/auto-scaling-of-nodes
[10] ACK One multi-cluster gateways
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/multi-cluster-gateway-overview
[11] Region
https://www.alibabacloud.com/help/en/cloud-migration-guide-for-beginners/latest/regions-and-zones#concept-z04-bg5-j8w
[12] Disk snapshots
https://www.alibabacloud.com/help/en/ecs/user-guide/copy-a-snapshot
[13] Apsara File Storage NAS
https://www.alibabacloud.com/help/en/nas/product-overview/what-is-nas
[14] Object Storage Service (OSS)
https://www.alibabacloud.com/help/en/oss/product-overview/what-is-oss
[15] Cloud Backup
https://www.alibabacloud.com/help/en/cloud-backup/product-overview/what-is-hbr
[16] Backup and restoration of ApsaraDB RDS for MySQL databases
https://www.alibabacloud.com/help/en/flink/developer-reference/log-service-connector
[17] Migrate data between ApsaraDB RDS for MySQL instances
https://www.alibabacloud.com/help/en/rds/apsaradb-rds-for-mysql/migrate-data-between-apsaradb-rds-for-mysql-instances
[18] Migrate applications from external Kubernetes clusters to ACK clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/migrate-applications-from-self-managed-kubernetes-clusters-to-ack-clusters
[19] Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/use-backup-center-to-migrate-applications-from-clusters-running-lower-kubernetes-versions
[20] Manage multiple clusters across accounts
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/use-cases/use-ack-one-to-manage-clusters-across-cloud-platforms-and-alibaba-cloud-accounts
[21] Migrate applications across clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/migrate-applications-across-clusters-in-different-regions
[22] Global Traffic Manager (GTM)
https://www.alibabacloud.com/help/en/global-traffic-manager/latest/what-is-gtm
[23] Instance specifications of ApsaraMQ for RocketMQ
https://www.alibabacloud.com/help/en/apsaramq-for-rocketmq/cloud-message-queue-rocketmq-5-x-series/product-overview/instance-specifications
[24] Instance specifications of ApsaraMQ for Kafka
https://www.alibabacloud.com/help/en/apsaramq-for-kafka/cloud-message-queue-for-kafka/product-overview/instance-editions
[25] Tair disaster recovery
https://www.alibabacloud.com/help/en/tair/product-overview/disaster-recovery
[26] Build a high availability architecture of ApsaraDB RDS for MySQL databases
https://www.alibabacloud.com/help/en/rds/apsaradb-rds-for-mysql/build-a-high-availability-architecture
[27] Global Distributed Cache for Tair
https://www.alibabacloud.com/help/en/tair/user-guide/overview-of-global-distributed-cache-for-tair
[28] Global database networks (GDN) of PolarDB for MySQL
https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-49
