High availability configurations for clusters and applications are the foundation of cluster stability, ensuring that applications continue to operate reliably even when there are unexpected failures in infrastructure.
However, during the rapid iteration of applications and the daily operation and maintenance of the cluster, human errors such as accidental deletion of cluster resources can occur. For critical business applications, it is recommended to perform periodic backups, plus a one-time backup before business iterations, rollbacks, and high-risk operations within the cluster, to reduce the recovery time objective (RTO) and recovery point objective (RPO) when related failures happen.
Therefore, effective disaster recovery measures serve as the last line of defense for business stability.
For businesses running in Kubernetes clusters, the objects and goals of disaster recovery have changed:
Since the business is orchestrated by Kubernetes, the objects of disaster recovery should be the cluster resources, with workloads (that is, Pods) at the core, together with the associated cloud resource information. For stateful applications, the data within storage volumes must also be covered by disaster recovery.
The recovery after containerization focuses more on the continuity of the business. The objective is to relaunch the workload, keep the relevant configuration unchanged, and restore the external service. The recovery can be either in-place (the backup cluster and recovery cluster are the same) or cross-cluster (they are not the same cluster).
ACK provides the backup center, a one-stop disaster recovery solution designed to meet these new characteristics and requirements.
Backup center overview:
Cluster O&M personnel can use the console to create periodic backup plans or one-time application backups with a single click. Compared with etcd backup, the backup center supports selecting the applications to back up by namespace, label, and resource type. For stateful applications, it can back up the data in the storage volumes mounted by the business at the same time.
For enterprises with established GitOps processes, the data protection feature of the backup center can be used to perform disaster recovery exclusively for storage volume data.
Before recovery, simple adjustments to resources such as namespaces and image registry mappings can be made through configuration.
For more complex and advanced adjustment needs, flexible configurations such as traffic redirection, replica count adjustments, and configuration file modifications can be achieved through ConfigMap settings.
When recovery is required, you can restore all or part of the applications and storage volumes in the target backup. In addition to applying the pre-configured adjustment strategies, the backup center automatically adjusts the resource recovery order and certain configurations during recovery, such as switching storage drivers in cross-cloud scenarios, to ensure compatibility with ACK system components and the Alibaba Cloud ecosystem.
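The namespace and image registry mappings described above can be pictured as a simple rewrite of each backed-up manifest before it is applied. The following is a minimal sketch only: the function name `remap_manifest`, the mapping dictionaries, and the example registry hosts are hypothetical, and the backup center's actual configuration format is not shown in this article.

```python
import copy


def remap_manifest(manifest, namespace_map, registry_map):
    """Return a copy of a workload manifest with namespace and image registry remapped.

    namespace_map and registry_map are hypothetical {old: new} dictionaries.
    """
    out = copy.deepcopy(manifest)

    # Move the resource to the mapped namespace, if a mapping exists.
    meta = out.setdefault("metadata", {})
    if meta.get("namespace") in namespace_map:
        meta["namespace"] = namespace_map[meta["namespace"]]

    # Rewrite the registry prefix of every container image in the Pod template.
    pod_spec = out.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        image = container.get("image", "")
        for old, new in registry_map.items():
            if image.startswith(old + "/"):
                container["image"] = new + image[len(old):]
                break
    return out


backed_up = {
    "kind": "StatefulSet",
    "metadata": {"name": "mysql", "namespace": "dev"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "mysql", "image": "registry.old.example/team/mysql:8.0"},
    ]}}},
}
restored = remap_manifest(
    backed_up,
    namespace_map={"dev": "prod"},
    registry_map={"registry.old.example": "registry.new.example"},
)
```

The input manifest is left untouched, so the same backup can be restored repeatedly with different mappings.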
The application backup console of the cluster provides the relevant usage process guidelines.
After creating an OSS bucket for storing backups and associating it with a backup vault, you can create a backup plan or initiate an immediate backup in the console.
The status and details of backups are displayed in the backup record list.
To restore a backup, simply click "Immediate Restore" for the target backup. If there are no advanced configuration requirements, the backup and restoration of application data can be accomplished within a single console.
Next, using a MySQL stateful application as an example, we will discuss the challenges faced in achieving the goal of coherent business recovery.
As Kubernetes continues to evolve, the number of official resource types it provides is increasing, along with various versions of these resources and user-defined resource types. This means the number of resources involved in backup and recovery operations is also growing.
Taking MySQL as an example, during deployment, you might use Secret to store accounts and passwords, ConfigMap to store startup configurations, PVC and PV to record underlying storage information, and Service and Ingress to store ports and network resource information.
With so many resource types, the first challenge is how to ensure the completeness of backups. Any missing resource can lead to the failure of the application during recovery.
- The backup center, based on the definitions of Kubernetes resources, automatically appends dependent resources during backup according to their dependencies. For example, if MySQL uses a custom resource, the relevant CRD will be automatically appended during backup, ensuring successful deployment of the custom resource in a new cluster.
An improper recovery order can also cause the operational state of the business to be lost. For instance, if MySQL's StatefulSet (STS) is recovered first, the Kubernetes STS controller will automatically create replicas (Pods) based on the STS's replica count and template. As a result, the configuration information patched onto the backed-up Pod resources would be overwritten and lost.
- By default, the backup center maintains a recovery order for official Kubernetes resources, restoring resources that controllers would otherwise create dynamically (such as Pods) before the resources that own them (such as StatefulSets).
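An ordered restore of this kind can be sketched as a sort by resource kind. The priority list below is hypothetical and only illustrates the idea that Pods come before the StatefulSet that would otherwise recreate them; the order actually used by the backup center is not listed in this article.

```python
# Hypothetical priority list, for illustration only.
RESTORE_PRIORITY = [
    "CustomResourceDefinition",
    "Namespace",
    "StorageClass",
    "PersistentVolume",
    "PersistentVolumeClaim",
    "Secret",
    "ConfigMap",
    "Pod",          # restored before the controllers that would recreate it
    "StatefulSet",
    "Deployment",
    "Service",
    "Ingress",
]


def restore_order(resources):
    """Sort backed-up resources by kind priority; unknown kinds go last.

    sorted() is stable, so resources of the same kind keep their backup order.
    """
    rank = {kind: index for index, kind in enumerate(RESTORE_PRIORITY)}
    return sorted(resources, key=lambda r: rank.get(r["kind"], len(RESTORE_PRIORITY)))


backup = [
    {"kind": "StatefulSet", "name": "mysql"},
    {"kind": "Pod", "name": "mysql-0"},
    {"kind": "MysqlCluster", "name": "demo"},  # hypothetical custom resource
    {"kind": "CustomResourceDefinition", "name": "mysqlclusters.example.com"},
    {"kind": "Secret", "name": "mysql-auth"},
]
ordered_kinds = [r["kind"] for r in restore_order(backup)]
```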
Finally, if you attempt to use kubectl get -oyaml to obtain runtime resource configurations, you will find that before applying this configuration file, you need to manually clean up or adjust a large number of fields. For example, if the nodeName field appended by the controller after scheduling is retained during recovery, the scheduling phase will be skipped, leading to the inability to launch the application in a new cluster.
- Similarly, the backup center is compatible with the Kubernetes ecosystem and performs default adjustments to resources. If users have specific adjustment needs, they can customize correction strategies via configuration.
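To give a sense of the manual cleanup that the output of kubectl get -oyaml would need, here is a minimal sketch that strips common server-populated fields from a manifest that has already been loaded into a dictionary. The field list is an illustrative subset rather than the backup center's actual rule set, and the function name `strip_runtime_fields` is hypothetical.

```python
import copy

# Metadata fields that the API server populates at runtime (illustrative subset).
RUNTIME_METADATA_FIELDS = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def strip_runtime_fields(manifest):
    """Return a copy of a manifest without fields that only make sense in the source cluster."""
    out = copy.deepcopy(manifest)
    out.pop("status", None)
    metadata = out.get("metadata", {})
    for field in RUNTIME_METADATA_FIELDS:
        metadata.pop(field, None)
    # A retained nodeName would bypass scheduling in the new cluster.
    if out.get("kind") == "Pod":
        out.get("spec", {}).pop("nodeName", None)
    return out


live_pod = {
    "kind": "Pod",
    "metadata": {"name": "mysql-0", "uid": "1234", "resourceVersion": "42"},
    "spec": {"nodeName": "node-a", "containers": [{"name": "mysql"}]},
    "status": {"phase": "Running"},
}
clean_pod = strip_runtime_fields(live_pod)
```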
All automatic adjustments for backup and recovery are strongly dependent on the version of Kubernetes. The backup center, while remaining compatible with older Kubernetes cluster versions, will continue to iterate alongside the community.
When deploying MySQL, you might need cloud disks to store the actual database data, and NAS and OSS to store runtime information such as logs.
Different storage systems have varying native disaster recovery methods, such as snapshots for cloud disks and recycle bins for NAS.
Even if storage sources are backed up and recovered, applications cannot directly use the recovered storage sources like cloud disks. This is because applications require PVC and PV resources as a bridge to connect with the underlying storage layer. In Kubernetes clusters, applications specify mounting information using PVC, which has a one-to-one binding relationship with PV, and PV records the actual storage source information, such as the cloud disk instance ID.
For ACK users who previously used the deprecated FlexVolume storage driver, or for those migrating from IDCs or third-party clouds to ACK, there is a challenge of inconsistent storage drivers within Kubernetes clusters.
The above sections introduce the characteristics, challenges, and solutions provided by the backup center for containerized business disaster recovery. However, you might still have some questions:
Where is my backup data stored? Which cloud product ensures the SLA for backup data?
If I use scheduled backups, such as once daily at midnight, won’t the data volume and overhead be significant?
These are concerns that users often have.
In fact, each backup (and recovery) task can be decomposed into up to three sub-tasks: backup sub-tasks for cluster resources, block storage data, and file system storage data. The backup strategies for these three sub-tasks, such as TTL, frequency, and target applications, are consistent.
Cluster resource backup: It is developed based on the open-source Velero community and integrated with Alibaba Cloud ecosystem and ACK system components via internal plugins. When users perform a backup, they only need to focus on their business. All cluster resources (of multiple API versions) will be backed up and stored in the backup vault. The backup vault is actually linked to the OSS bucket provided by the user.
Block storage data protection: Based on the cloud disk snapshot feature of Alibaba Cloud, it offers quick backup and guarantees data consistency for individual disks (multi-disk consistency through snapshot-consistent groups is pending). For widely used ESSD cloud disks, IA snapshots are enabled by default, so backups become available for recovery within seconds. Additionally, snapshots are incremental.
File system data protection: Based on the Alibaba Cloud backup service, it provides capabilities such as incremental backup, compression, deduplication, and encryption to speed up the backup and recovery process. Through the storage class conversion feature of the backup center, data backed up from a file system can be restored to other storage types, such as an ext4-mounted cloud disk or NAS.
The disaster recovery capabilities on the cloud can also assist on-premises Kubernetes clusters.
By connecting on-premises IDC clusters or Kubernetes clusters from other cloud vendors to ACK One’s registered clusters and deploying the backup center components, cloud-based disaster recovery can be easily achieved. Similar to ACK clusters, cluster resources will be backed up to Alibaba Cloud OSS, and storage volume data will be backed up to Alibaba Cloud backup service.
By backing up in the registered cluster and restoring in the ACK cluster, easy migration to the cloud can be achieved.
For storage volumes of stateful applications, data migration to the cloud can also be realized through the storage class conversion feature.
Overview of registered clusters in ACK One: https://www.alibabacloud.com/help/en/ack/overview-9
Since its release, an increasing number of users have been using the backup center to achieve seamless migration across major Kubernetes versions, migration across VPCs, and migration of IDCs to the cloud. Here is a summary of some capabilities that users are particularly interested in:
After backing up file system data, the storage media can be changed through storage class conversion. This has already been mentioned, so it won't be elaborated further here.
Kubernetes itself can convert resources between API versions. Building on this capability, Velero supports, for example, backing up a Deployment in a 1.16 cluster and recovering it in a 1.30 cluster:
During backup, all API versions of the resources are backed up, such as Deployment being backed up with extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1.
During recovery, the preferred version of the recovery cluster is used first, that is, apps/v1.
If the recommended version does not exist in the backup, then a compatible version supported by the recovery cluster is restored.
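The selection logic in the steps above can be sketched as follows. This is a simplified illustration, not Velero's implementation: the function name `choose_restore_version` and its parameters are hypothetical, and the fallback simply takes the first backed-up version that the recovery cluster serves.

```python
def choose_restore_version(backed_up, preferred, served):
    """Pick the API version to restore a resource with.

    backed_up: API versions stored in the backup, in backup order.
    preferred: the recovery cluster's preferred version for this resource.
    served:    set of versions the recovery cluster can serve.
    Returns None when no backed-up version is usable.
    """
    if preferred in backed_up:
        return preferred
    for version in backed_up:
        if version in served:
            return version
    return None


deployment_choice = choose_restore_version(
    backed_up=["extensions/v1beta1", "apps/v1beta1", "apps/v1beta2", "apps/v1"],
    preferred="apps/v1",
    served={"apps/v1"},
)
```

A result of `None` corresponds to the case where nothing in the backup is compatible with the recovery cluster, so the resource needs extra conversion or manual creation.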
In version 1.16, most resources in important groups such as core, apps, networking.k8s.io, and storage.k8s.io are already available at v1, but some resources, such as Ingress, exist only in beta API versions that were removed in 1.22, so they cannot be restored directly in clusters of version 1.22+. On top of the community efforts, the backup center further enhances compatibility for such resources, enabling automatic switching (or manual creation by the customer).
Different recovery scenarios may have varying requirements for Service restoration, such as whether to reuse ports, load balancer instances, or enable forced listening. These requirements can be addressed through configurable adjustment strategies. For cluster business migration scenarios, this feature ensures that the application is launched normally and traffic can be manually switched after business checks, without affecting external services.
By default, the safer logic of skipping resources that already exist with the same name is applied. For scenarios requiring upgrades or changes, the backup center can instead attempt to update the existing resources using the Kubernetes JSON Merge Patch method.
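The difference between the two policies can be sketched in a few lines. The merge function below follows the JSON Merge Patch semantics of RFC 7386 (nested objects merge recursively, a null value deletes a key, lists are replaced wholesale). How the backup center actually builds and submits the patch is not described here, so `restore_resource` and its `update_existing` flag are hypothetical.

```python
def json_merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) to target and return the result."""
    if not isinstance(patch, dict):
        return patch  # scalars and lists replace the target entirely
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "remove this key"
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result


def restore_resource(existing, backed_up, update_existing=False):
    """Decide what the cluster should hold after restoring one resource."""
    if existing is None:
        return backed_up  # nothing with that name yet: create it
    if not update_existing:
        return existing   # default policy: skip resources with the same name
    return json_merge_patch(existing, backed_up)


existing = {"spec": {"replicas": 1, "paused": True, "ports": [80]}}
backed_up = {"spec": {"replicas": 3, "paused": None, "ports": [3306]}}

skipped = restore_resource(existing, backed_up)
updated = restore_resource(existing, backed_up, update_existing=True)
```

Because lists are replaced rather than merged, fields such as container or port lists in the existing resource are overwritten by the backed-up version under the update policy.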
The video demo showcases the functions mentioned above. The configurations and business requirements for the backup and recovery clusters are as follows:
In the demo:
First, the installation status of components in the 1.16 cluster is shown, along with the deployed MySQL application. Focus is placed on checking the load balancer ID of the LoadBalancer type Service, the volume type, and simulated DB data. (The process of excluding OSS storage volume data backup by labeling the MySQL application is omitted.)
The backup is performed in the backup center, verifying the status of the backed-up storage volumes and cluster resources.
Then, the installation status of components in the 1.30 cluster is shown. In the backup center, it is confirmed that the backup has been synchronized to the current cluster and restored (in some scenarios, related ConfigMaps need to be pre-created for service traffic switchover).
The storage class conversion of the volume is checked, along with the restored application, emphasizing that the load balancer ID of the LoadBalancer type Service remains unchanged and the simulated DB data is recovered.