By Yu Zhuang
When designing a system architecture, we must assume that any component or piece of infrastructure can fail at any time because of events such as natural disasters, power outages, network outages, and system changes. To withstand these events, it is essential to design an appropriate disaster recovery architecture.
This article explains how to design a disaster recovery architecture and build a resilient system using Kubernetes clusters (including ACK clusters, clusters on third-party clouds, and on-premises IDC Kubernetes clusters) along with Alibaba Cloud services (network, database, middleware, and observability).
Recovery time objective (RTO):
The maximum acceptable time delay between service interruption and service recovery. It determines the permissible duration of service downtime.
Recovery point objective (RPO):
The maximum acceptable time since the last data recovery point. It determines how much data can acceptably be lost or must be reconstructed.
Lower RTO and RPO values mean less downtime and less data loss, but also higher resource costs and higher O&M complexity. Therefore, you need to set the RTO and RPO according to the importance of the workload.
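As a rough illustration of how RTO and RPO targets turn into checks on a backup plan, here is a minimal Python sketch. The class and function names are hypothetical, and the model is a simplifying assumption: with periodic backups, the worst-case data loss is taken to be one backup interval plus the time a backup takes to complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    """Hypothetical container for the targets of one workload."""
    rto_minutes: float  # maximum acceptable downtime
    rpo_minutes: float  # maximum acceptable window of lost data


def worst_case_data_loss(backup_interval_minutes: float,
                         backup_duration_minutes: float = 0.0) -> float:
    """Assumed model: a failure just before the next backup completes
    loses one full interval plus the time that backup needed."""
    return backup_interval_minutes + backup_duration_minutes


def meets_objective(objective: RecoveryObjective,
                    backup_interval_minutes: float,
                    restore_minutes: float,
                    backup_duration_minutes: float = 0.0) -> bool:
    """True if both the data-loss window and the restore time fit the targets."""
    loss = worst_case_data_loss(backup_interval_minutes, backup_duration_minutes)
    return loss <= objective.rpo_minutes and restore_minutes <= objective.rto_minutes


if __name__ == "__main__":
    critical = RecoveryObjective(rto_minutes=60, rpo_minutes=30)
    # Backups every 15 minutes that take 5 minutes, restore measured at 45 minutes.
    print(meets_objective(critical, backup_interval_minutes=15,
                          restore_minutes=45, backup_duration_minutes=5))
```

A check like this is only as good as its inputs, so the restore time should come from an actual restore drill rather than an estimate.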
In the preceding figure, three common disaster recovery strategies are described: backup-restore, active-standby, and active-active. Different strategies have different benefits and costs. You must analyze the importance, risks, and costs of your business to select an appropriate disaster recovery strategy.
Back up applications and data while the system is running. In the event of a disaster, restore the applications and data in another location, and switch the business traffic. Data cannot be backed up in real time, so some data may be lost during restoration. In addition, if the data volume is large, the restoration may take a long time.
In active-standby mode, the active location handles all business traffic. The standby location can run fewer application instances to save costs, and periodically receives test traffic to verify that the system still works. In the event of a disaster, perform an active-standby database switchover, scale out the number of application instances, and switch the business traffic.
In active-active mode, two locations enable the same number of application instances and process business traffic at the same time. In the event of a disaster, perform an active-standby database switchover and switch the business traffic.
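The failover steps of the three strategies described above can be put side by side in a small Python sketch. The step strings paraphrase the descriptions in this article; they are labels for comparison, not calls to any Alibaba Cloud API, and the names `FAILOVER_STEPS` and `failover_steps` are hypothetical.

```python
from typing import Dict, List

# Step names paraphrase the strategy descriptions; they are labels, not API calls.
FAILOVER_STEPS: Dict[str, List[str]] = {
    "backup-restore": [
        "restore applications and data in another location",
        "switch business traffic",
    ],
    "active-standby": [
        "perform active-standby database switchover",
        "scale out application instances in the standby location",
        "switch business traffic",
    ],
    "active-active": [
        "perform active-standby database switchover",
        "switch business traffic",
    ],
}


def failover_steps(strategy: str) -> List[str]:
    """Return a copy of the ordered failover steps for a strategy."""
    try:
        return list(FAILOVER_STEPS[strategy])
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None


if __name__ == "__main__":
    for name in FAILOVER_STEPS:
        print(name, "->", failover_steps(name))
```

Laid out this way, the difference is easy to see: active-active skips the scale-out step because both locations already run at full capacity, while backup-restore must first rebuild applications and data.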
An Alibaba Cloud region [11] contains multiple availability zones (AZs). An AZ is a physical area that has independent power and network connections. You can use multiple AZs to design disaster recovery strategies for local interruptions such as power or network outages. Because network latency between AZs is low, it is easier to implement disaster recovery for the data layer, including databases, caches, and messages.
To cope with larger-scale failures that may affect multiple AZs in the same region, you can use multiple regions to design disaster recovery policies. However, the complexity and implementation cost of such a solution are higher because of the greater network latency between regions.
When you select a multi-AZ or multi-region disaster recovery solution, you must consider whether stateful applications and dependent cloud services, such as databases, caches, and messages, support multi-region or multi-AZ disaster recovery.
The implementation cost of the backup-restore solution is low, but RTO and RPO are relatively long, depending on the size of the data volume and the complexity of the application. The backup center can provide full backup and incremental backup capabilities to reduce RTO and RPO.
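To illustrate why combining full and incremental backups can shorten the data-loss window, here is a minimal Python sketch that selects which backups to replay for a given recovery time. It follows the usual full-plus-incremental convention (latest full backup, then every later incremental up to the target) and is not the backup center's actual behavior or API. The `Backup` type and `restore_chain` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: int  # minutes since an arbitrary start time
    kind: str      # "full" or "incremental"


def restore_chain(backups: List[Backup], target_time: int) -> List[Backup]:
    """Return the latest full backup at or before target_time,
    followed by every later incremental up to target_time."""
    usable = sorted((b for b in backups if b.taken_at <= target_time),
                    key=lambda b: b.taken_at)
    last_full = None
    for index, backup in enumerate(usable):
        if backup.kind == "full":
            last_full = index
    if last_full is None:
        raise ValueError("no full backup at or before target_time")
    return usable[last_full:]


if __name__ == "__main__":
    history = [
        Backup(0, "full"),
        Backup(60, "incremental"),
        Backup(120, "incremental"),
        Backup(1440, "full"),
        Backup(1500, "incremental"),
    ]
    chain = restore_chain(history, target_time=130)
    print([b.taken_at for b in chain])
    # Data written after the last backup in the chain cannot be recovered.
    print("unrecoverable minutes:", 130 - chain[-1].taken_at)
```

Frequent incrementals shrink the unrecoverable window, while periodic full backups keep the chain short so that the restore itself stays fast.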
Backup-restore is an important disaster recovery solution. During system O&M, you must ensure the timeliness and recoverability of backups.
In addition, many users choose to use the backup-restore feature to migrate applications across clusters in the following scenarios:
When there are many applications, you need to migrate them in batches, and applications that call each other may temporarily run in different clusters. In this case, you can use ACK One Fleet multi-cluster Services [5] to implement cross-cluster access to Kubernetes Services. As shown in the following figure, ACK One Fleet multi-cluster Services can inject the Kubernetes Services (including endpoints) of Application 2 on Cluster 1 into Cluster 2, so that Application 1 on Cluster 2 can access Application 2 on Cluster 1.
With a leased line, you can register clusters with ACK One so that on-premises IDC clusters and non-Alibaba Cloud Kubernetes clusters can also use ACK One Fleet multi-cluster Services.
Notes:
When a disaster occurs and the AZ is unavailable, the system switches from active to standby: GTM switches traffic to AZ 2, the application instances of ACK Cluster 2 are automatically scaled out, and the middleware and databases perform a high-availability switchover across availability zones.
Notes:
When the system runs as normal:
When a disaster occurs and the AZ becomes unavailable, the system switches between the active and standby nodes. The multi-cluster gateway (MSE cloud-native gateway) automatically switches traffic to ACK Cluster 2 of AZ 2. Application instances are automatically scaled out.
The preceding two scenarios use active-standby mode as an example to describe the system architecture. In the same architecture, DNS traffic distribution and ACK One multi-cluster gateways also support active-active scenarios. You can configure the traffic distribution ratio (for example, 50%:50%), and automatic failover is supported. In active-active scenarios, the number of application replicas in each cluster must be determined based on the traffic distribution ratio. Auto scaling must also be configured in each cluster so that it can absorb the additional traffic it receives during a traffic switch.
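The relationship between the traffic ratio and the replica count can be sketched with a few lines of Python. This is a back-of-the-envelope calculation under the assumption that required replicas scale linearly with traffic share; the function names and cluster names are hypothetical, and real sizing should rely on load tests and the autoscaler.

```python
import math
from typing import Dict


def replicas_per_cluster(total_replicas: int,
                         weights: Dict[str, int]) -> Dict[str, int]:
    """Split the replicas needed for all traffic according to traffic weights,
    rounding up so that no cluster is under-provisioned."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: math.ceil(total_replicas * weight / total_weight)
            for name, weight in weights.items()}


def replicas_after_failover(total_replicas: int,
                            weights: Dict[str, int],
                            failed: str) -> Dict[str, int]:
    """Replica counts the surviving clusters need once `failed` receives no traffic."""
    survivors = {name: w for name, w in weights.items() if name != failed}
    if not survivors:
        raise ValueError("no cluster left to take over traffic")
    return replicas_per_cluster(total_replicas, survivors)


if __name__ == "__main__":
    ratio = {"cluster-1": 50, "cluster-2": 50}
    print(replicas_per_cluster(10, ratio))
    print(replicas_after_failover(10, ratio, failed="cluster-1"))
```

The gap between the two results is the capacity that auto scaling has to provide during a traffic switch, which is why the maximum replica and node limits in each cluster should allow for the full load.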
The single-region multi-AZ solution is cost-effective. It relies on cloud products (including gateways, containers, middleware, and databases) that support multi-AZ deployment and multi-AZ high availability, so disaster recovery can be implemented quickly without major changes to the business system. However, this solution can only handle disasters and failures that affect a single zone, not an entire region.
The architecture of this solution is similar to that of the single-region multi-zone disaster recovery solution. The key points are as follows:
If the business is large and serves a great number of users across a wide geographic area, disaster recovery solutions in a single region cannot meet its high availability requirements. In this case, a multi-region disaster recovery solution is required. Deploying business systems independently in multiple regions ensures that the business system in each region forms its own closed loop and provides complete service capabilities.
Unlike the previous solution, multi-region unit-based deployment requires sharding rules that partition applications and data so that each unit provides complete service capabilities for a subset of the data shards. This provides fault isolation and horizontal scaling for a large user base. Generally, units are divided into a central unit that holds all user data and multiple sub-units that hold the detailed data of their own shards. This method requires support from the business system, custom traffic distribution rules, data splitting, and inter-unit coordination, so it is highly complex.
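A sharding rule of this kind can be sketched in Python as a stable hash that maps each user to a shard and a table that maps shards to units. Everything here is an illustrative assumption: the hash choice, the unit names, and in particular the fallback to the central unit when a sub-unit is unhealthy, which is one possible policy rather than something this article prescribes.

```python
import hashlib
from typing import Dict, Set


def shard_of(user_id: str, shard_count: int) -> int:
    """Map a user to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def unit_for(user_id: str,
             shard_to_unit: Dict[int, str],
             healthy_units: Set[str],
             central_unit: str = "central") -> str:
    """Route a user to the unit that owns the user's shard.
    Assumed policy: fall back to the central unit, which holds all user data,
    when the owning sub-unit is not healthy."""
    unit = shard_to_unit[shard_of(user_id, len(shard_to_unit))]
    return unit if unit in healthy_units else central_unit


if __name__ == "__main__":
    # Hypothetical layout: four shards spread over two sub-units.
    layout = {0: "unit-a", 1: "unit-a", 2: "unit-b", 3: "unit-b"}
    print(unit_for("user-1001", layout, healthy_units={"unit-a", "unit-b"}))
    print(unit_for("user-1001", layout, healthy_units=set()))
```

Hashing to a fixed number of shards first, and only then mapping shards to units, means a shard can later be moved to another unit by changing the table without rehashing every user.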
Various disaster events affect the availability of your business. However, you can mitigate or eliminate these impacts by leveraging the disaster recovery capabilities offered by Alibaba Cloud products. First, understand your business's availability requirements, and then select an appropriate disaster recovery strategy. After that, you can use Alibaba Cloud products, including containers (Container Service for Kubernetes, or ACK [1], and Distributed Cloud Container Platform for Kubernetes, or ACK One [2]), messages, caches, and databases, to design a disaster recovery architecture that meets the RTO and RPO your business requires.
[1] Container Service for Kubernetes (ACK)
https://www.alibabacloud.com/product/kubernetes
[2] Distributed Cloud Container Platform for Kubernetes (ACK One)
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/product-overview/ack-one-overview
[3] ACK One backup center
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/backup-center-overview
[4] ACK One registered clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/overview-9
[5] ACK One Fleet multi-cluster Services
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/mcs-overview
[6] ACK One GitOps application distribution
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/gitops-overview
[7] ACK Ingress
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/ingress-overview
[8] ACK HPA
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/horizontal-pod-autoscaling
[9] ACK Cluster Autoscaler
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/auto-scaling-of-nodes
[10] ACK One multi-cluster gateways
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/multi-cluster-gateway-overview
[11] Region
https://www.alibabacloud.com/help/en/cloud-migration-guide-for-beginners/latest/regions-and-zones#concept-z04-bg5-j8w
[12] Disk snapshots
https://www.alibabacloud.com/help/en/ecs/user-guide/copy-a-snapshot
[13] Apsara File Storage NAS
https://www.alibabacloud.com/help/en/nas/product-overview/what-is-nas
[14] Object Storage Service (OSS)
https://www.alibabacloud.com/help/en/oss/product-overview/what-is-oss
[15] Cloud Backup
https://www.alibabacloud.com/help/en/cloud-backup/product-overview/what-is-hbr
[16] Backup and restoration of ApsaraDB RDS for MySQL databases
https://www.alibabacloud.com/help/en/flink/developer-reference/log-service-connector
[17] Migrate data between ApsaraDB RDS for MySQL instances
https://www.alibabacloud.com/help/en/rds/apsaradb-rds-for-mysql/migrate-data-between-apsaradb-rds-for-mysql-instances
[18] Migrate applications from external Kubernetes clusters to ACK clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/migrate-applications-from-self-managed-kubernetes-clusters-to-ack-clusters
[19] Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/use-backup-center-to-migrate-applications-from-clusters-running-lower-kubernetes-versions
[20] Manage multiple clusters across accounts
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/use-cases/use-ack-one-to-manage-clusters-across-cloud-platforms-and-alibaba-cloud-accounts
[21] Migrate applications across clusters
https://www.alibabacloud.com/help/en/ack/distributed-cloud-container-platform-for-kubernetes/user-guide/migrate-applications-across-clusters-in-different-regions
[22] Global Traffic Manager (GTM)
https://www.alibabacloud.com/help/en/global-traffic-manager/latest/what-is-gtm
[23] Instance specifications of ApsaraMQ for RocketMQ
https://www.alibabacloud.com/help/en/apsaramq-for-rocketmq/cloud-message-queue-rocketmq-5-x-series/product-overview/instance-specifications
[24] Instance specifications of ApsaraMQ for Kafka
https://www.alibabacloud.com/help/en/apsaramq-for-kafka/cloud-message-queue-for-kafka/product-overview/instance-editions
[25] Tair disaster recovery
https://www.alibabacloud.com/help/en/tair/product-overview/disaster-recovery
[26] Build a high availability architecture of ApsaraDB RDS for MySQL databases
https://www.alibabacloud.com/help/en/rds/apsaradb-rds-for-mysql/build-a-high-availability-architecture
[27] Global Distributed Cache for Tair
https://www.alibabacloud.com/help/en/tair/user-guide/overview-of-global-distributed-cache-for-tair
[28] Global database networks (GDN) of PolarDB for MySQL
https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-49