By Zhongxi (GitHub @zhongxig), head of AppActive on Alibaba Cloud's cloud-native high-availability architecture team, who works on disaster recovery architecture, fast failure recovery, and open-source projects.
Following the open-source release of Sentinel and Chaosblade by the High-Availability Architecture team, we are pleased to announce the latest addition to the open-source family: AppActive. Together, these three tools help enterprises build stable, reliable production systems and strengthen their ability to handle disaster tolerance, fault tolerance, capacity management, and other common stability issues.
On January 11, 2022, at the Cloud-native Practical Summit in Shanghai, Ding Yu, a researcher at Alibaba Cloud Intelligence, released the "White Paper on Application Multi-active Technology". At the same time, to promote the development of disaster recovery in the industry and establish disaster recovery standards for cloud-native businesses, Alibaba Cloud open-sourced the AppActive application multi-active middleware. This article covers the highlights of Ding Yu's talk at the summit about AppActive and its benefits.
In 2019, many of Alibaba's core systems had become fully cloud-based. The active geo-redundancy architecture followed suit and was incubated into a new Alibaba Cloud product, AHAS-MSHA, serving both Alibaba Group and cloud customers.
On January 11, 2022, the AHAS-MSHA code was officially open-sourced and renamed AppActive.
AppActive is an open-source middleware for building cloud-native, highly available multi-active disaster recovery architectures for business applications. Its main benefits include:
• Minute-level RTO. Recovery is fast. The average recovery time is within 30 seconds in Alibaba's internal production systems and about one minute in external customers' production systems.
• Make full use of resources. There are no idle resources. Multiple data centers and resources are fully utilized to avoid resource waste.
• High switching success rate. Thanks to a mature multi-active technology architecture and a visual operation and maintenance platform, the switching success rate is higher than with existing disaster recovery architectures. Across thousands of switches per year within Alibaba, the success rate exceeds 99.9%.
• Precise traffic control. Application multi-active keeps traffic closed-loop from the top layer to the bottom layer, and its precise traffic steering directs specific business traffic to the corresponding data center. Enterprises can build on this to develop canary releases, key traffic guarantees, and other features.
Over nearly nine years of production use within Alibaba Group and more than two years of commercial iteration with customers on the cloud, AHAS-MSHA has handled a wide variety of disaster recovery scenarios. Its adoption continues to grow, and the stability and functionality of the code have been thoroughly tested.
In 2021, many well-known companies and cloud platforms at home and abroad experienced serious service interruptions and downtime. This sounded the alarm for enterprises, and more and more of them have increased their investment in disaster recovery capabilities. To solve disaster recovery while keeping costs under control, supporting a future evolution toward multi-cloud architecture, and making disaster recovery predictable, many enterprises are choosing to try a multi-active disaster recovery mode.
However, the industry has no unified understanding of multi-active, and different enterprises define "multi-active" differently. Many enterprises believe they have already achieved it. When a failure actually occurs, however, they find that the system's ability to escape the fault is weak and that business recovery cannot be decoupled from fault location. This drags down production and causes problems such as negative publicity and financial loss.
In addition, once they learn about "multi-active", some enterprises instinctively invest resources in technical exploration. Lacking experience, they often repeat work and waste human and material resources. As cloud-native technology develops, more and more customers use it to build their systems, and how to build a stable, highly available system on a cloud-native architecture is a core challenge. Misunderstanding the concept inflates what enterprises spend on infrastructure, application transformation, and operation and maintenance. The result is inefficient, misused, or even useless deployments that prevent them from enjoying the stability dividend of "multi-active". "Multi-active" therefore needs a relatively unified standard and understanding, so that users can understand and use it better and improve the stability of their business systems.
Given the current state of cloud-native development and market understanding, Zhongxi, the project leader of AppActive, said that open-sourcing and explaining application multi-active can provide an initial definition of the standard and its implementation and help developers reach a shared understanding. Enterprises building this architecture can draw on existing, mature experience and avoid redundant waste of resources. At the same time, different enterprises have different business scenarios and strengths, which in turn pushes application multi-active toward more mature forms and capabilities. "I hope to rely on the power of the community to make it a genuinely accessible technology rather than an out-of-reach one, helping more enterprises and individuals build production-level highly available architectures."
The standard definitions of application high availability include LRA (Local Region Active), UDA (Ultra Distance Active), HCA (Hybrid Cloud Active), and BFA (Business Flow Active). In the AppActive v0.1 release, we prioritized the basic capabilities of BFA and UDA. Subsequent releases will improve BFA and UDA and add LRA and HCA capabilities. This article focuses on BFA and UDA.
BFA describes how application multi-active ultimately presents itself to the business: a multi-active disaster recovery system can allocate production traffic in a fine-grained way based on business characteristics.
For the BFA metric, AppActive supports automatic traffic correction and strong routing that keeps traffic closed-loop within the specified data center. Both are forms of fine-grained traffic allocation.
When misrouted traffic flows into a data center, the plug-ins at every layer of that data center handle it according to unified scheduling rules:
• The access layer identifies the misrouted traffic and automatically redirects it to the correct data center.
• The service layer identifies the misrouted traffic and automatically redirects it to the correct data center.
• The data layer identifies the wrong traffic. To ensure data quality, an exception is thrown and the write fails.
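The layered handling above can be illustrated with a minimal Python sketch. It is not AppActive's actual plug-in code: the rule format, the data center names `center-a` and `center-b`, and the `Layer` class are all hypothetical and only show the pattern of correcting at the upper layers and rejecting at the data layer.

```python
from dataclasses import dataclass

# Hypothetical unified scheduling rule (not AppActive's real rule format):
# route-ID buckets 0-49 belong to "center-a", 50-99 to "center-b".
RULES = {"center-a": range(0, 50), "center-b": range(50, 100)}


def owner_of(route_id: int) -> str:
    """Return the data center that owns a route ID under the shared rule."""
    bucket = route_id % 100
    for center, ids in RULES.items():
        if bucket in ids:
            return center
    raise LookupError(f"no data center owns route id {route_id}")


class WrongDataCenterError(Exception):
    """Raised by the data layer when a write reaches the wrong data center."""


@dataclass
class Layer:
    name: str
    local_center: str
    corrects: bool  # True: forward misrouted traffic; False: reject it

    def handle(self, route_id: int) -> str:
        target = owner_of(route_id)
        if target == self.local_center:
            return f"{self.name}: handled locally in {self.local_center}"
        if self.corrects:
            # Access and service layers redirect to the owning data center.
            return f"{self.name}: forwarded to {target}"
        # The data layer refuses the write to protect data quality.
        raise WrongDataCenterError(
            f"{self.name}: write for route id {route_id} belongs to {target}"
        )


access = Layer("access", "center-a", corrects=True)
service = Layer("service", "center-a", corrects=True)
data = Layer("data", "center-a", corrects=False)
```

In this sketch, `access.handle(73)` and `service.handle(73)` forward the request to `center-b`, while `data.handle(73)` raises `WrongDataCenterError`, so a misrouted write can never silently land in the wrong database.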
UDA means that the business system still has good access performance when the distance between the data centers exceeds 300 kilometers. When the systems enter the disaster recovery state, the RTO and RPO are at the minute level.
For the UDA metric, AppActive maintains good access performance.
The access layer supports traffic parsing: it parses each incoming request and sends it to the application machines in the appropriate data center. With the application-side Servlet, Dubbo, and MySQL plug-ins, business requests stay closed-loop within a single data center and ultimately read from and write to the database in that data center.
In ultra-long-distance scenarios, the business system still has good access performance because traffic is enclosed inside the data center.
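A back-of-the-envelope calculation shows why keeping traffic inside one data center matters at this distance. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s and ignores switching and queuing delays, so it gives a lower bound. The figure of 20 sequential cross-data-center calls is purely illustrative and does not come from AppActive.

```python
FIBER_SPEED_KM_PER_S = 200_000  # approximate speed of light in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on one network round trip over fiber, in milliseconds."""
    return 2 * distance_km * 1000 / FIBER_SPEED_KM_PER_S


def added_latency_ms(distance_km: float, cross_center_calls: int) -> float:
    """Extra latency when a request makes sequential calls to a remote data center."""
    return cross_center_calls * min_round_trip_ms(distance_km)


# One round trip between data centers 300 km apart costs at least 3 ms.
single_hop = min_round_trip_ms(300)
# A request that makes 20 sequential remote calls pays at least 60 ms extra.
chatty_request = added_latency_ms(300, 20)
# A request that stays closed-loop in its own data center pays none of this.
closed_loop_request = added_latency_ms(300, 0)
```

Under these assumptions the penalty grows linearly with the number of remote calls, which is why closing the loop within one data center removes the distance from the request path.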
In the disaster recovery state, RPO is guaranteed by open-source data synchronization components or commercial synchronization tools. For RTO, AppActive 0.1 provides only basic traffic switching. Later versions will evolve into a production-level RTO assurance tool.
AppActive is both a definition and an implementation of application multi-active, covering the data plane and the control plane. The data plane is divided into four parts, all of which add capabilities as plug-ins without changing the technical components the enterprise already uses:
• Access gateway. As the first hop for business traffic entering the data center, the access gateway identifies and distributes application multi-active ingress traffic. It has two core capabilities: data center routing and application routing.
• Service layer. This layer carries synchronous business calls within and across data centers and generally involves the roles of Consumer, Provider, and Registry. It has three core capabilities: traffic routing, traffic protection, and fault isolation. It prevents dirty writes caused by misrouted calls and accelerates service recovery during traffic switching.
• Message layer. This layer carries asynchronous business calls within and across data centers, using messages for peak shaving and valley filling. It generally involves the roles of Producer, Consumer, and Broker. It has three core capabilities: traffic routing, traffic protection, and fault isolation. It prevents dirty data caused by misdelivered messages and protects messages from being lost during traffic switching.
• Data layer. It covers data reading and writing, data storage, and data synchronization for business applications. It has three core capabilities: traffic routing, data consistency protection, and data synchronization.
The control plane covers the day-to-day operation of multi-active disaster recovery rules and traffic switching in disaster scenarios.
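Conceptually, a traffic switch can be viewed as publishing a new version of the routing rule that every data plane layer consults. The Python sketch below is a simplified illustration under that assumption and is not AppActive's control plane API: `RoutingRule`, `switch_traffic`, the bucket scheme, and the data center names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RoutingRule:
    """Hypothetical rule: maps each data center to the route-ID buckets it serves."""
    version: int
    buckets: Dict[str, Set[int]] = field(default_factory=dict)

    def owner_of(self, route_id: int, bucket_count: int = 100) -> str:
        bucket = route_id % bucket_count
        for center, owned in self.buckets.items():
            if bucket in owned:
                return center
        raise LookupError(f"bucket {bucket} is not assigned")


def switch_traffic(rule: RoutingRule, failed: str, target: str) -> RoutingRule:
    """Return a new rule version that moves all of `failed`'s buckets to `target`."""
    if failed == target:
        raise ValueError("failed and target data centers must differ")
    if failed not in rule.buckets or target not in rule.buckets:
        raise KeyError("both data centers must exist in the rule")
    # Copy the assignment so the previous rule version stays untouched.
    new_buckets = {center: set(owned) for center, owned in rule.buckets.items()}
    new_buckets[target] |= new_buckets[failed]
    new_buckets[failed] = set()
    return RoutingRule(version=rule.version + 1, buckets=new_buckets)


before = RoutingRule(1, {"center-a": set(range(0, 50)), "center-b": set(range(50, 100))})
after = switch_traffic(before, failed="center-b", target="center-a")
```

Returning a new, versioned rule instead of mutating the old one keeps the previous assignment available for comparison or rollback. A production switch would also need to coordinate writes and data synchronization during the change, which this sketch leaves out.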
AppActive is currently at v0.1, and the following parts have been open-sourced:
• The above-mentioned data plane defines the basic implementation of all layers.
• The Nginx plug-in implementation of the access layer gateway.
• The Dubbo 2.x plug-in implementation in the service layer.
• The MySQL plug-in implementation in the data layer.
• The basic capability of traffic switching in the control plane.
Based on the capabilities of v0.1, developers can run and verify the basic functions of multi-active applications.
"Ultra distance active" and "unitization" originated at Alibaba and have since been recognized by the industry. Alibaba has always hoped that the application multi-active product ecosystem can be standardized and open, and that it can contribute to the industry.
With standardized application multi-active technology, business applications can interoperate across different cloud vendors, different infrastructures, and different chips. While making full use of resources, business applications can reach an RTO of minutes or even seconds, so that enterprises can truly operate without fear of failure.
The first open-source version of AppActive is just a starting point for the application multi-active field. We welcome you to participate and build an application multi-active ecosystem together.