Well-Architected Framework: Design Principles

Last Updated: Sep 25, 2023

In a distributed system, reliability concerns are complex. They span the design, development, and operation and maintenance stages of a software system, and they apply to every layer, from IaaS and PaaS up to SaaS. To keep the system running stably, we recommend following the design principles below.

Failure-Oriented Architecture Design Principle

Abnormal events such as network delays, hardware failures, software errors, and traffic surges are inevitable. We recommend designing for the failures these events cause from the system design stage onward: build in redundancy, isolation, degradation, and elasticity so that the system stays highly available and reliable when failures and accidents occur.
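As one illustration of degradation, the following minimal Python sketch retries a failing dependency a bounded number of times and then falls back to a simpler default result instead of failing the whole request. The names `call_with_degradation`, `flaky_recommendations`, and `default_recommendations` are hypothetical and are not part of any framework API; a production system would typically add timeouts, backoff, and circuit breaking on top of this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_degradation(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Call `primary` up to `retries + 1` times, then degrade to `fallback`."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < retries and delay_seconds > 0:
                time.sleep(delay_seconds)
    # Every attempt failed: serve the degraded result.
    return fallback()


# Hypothetical dependency that is currently failing.
def flaky_recommendations() -> list:
    raise TimeoutError("recommendation service unavailable")


# Hypothetical degraded response: a static default list.
def default_recommendations() -> list:
    return ["bestseller-1", "bestseller-2"]


result = call_with_degradation(flaky_recommendations, default_recommendations)
print(result)  # ['bestseller-1', 'bestseller-2']
```

The caller still receives a usable, if less personalized, response, which keeps a failure in one dependency from spreading to the rest of the request path.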

Fine-Grained Operability and Control Principle

As the business expands and system services are decomposed further, the complexity of distributed systems increases dramatically. Cloud services also iterate faster, with multiple versions in use, and some businesses have strict real-time requirements. Together, these factors significantly increase the uncertainty and complexity of operation and maintenance. We recommend improving operability, certainty, and stability through fine-grained management and observability practices such as version control, gray release, monitoring and alerting, and automatic inspection.
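A gray release can be sketched as deterministic, percentage-based routing: each user is hashed into one of 100 buckets, and only users whose bucket falls below the current rollout percentage are served the new version. The function names and the version labels `v1` and `v2` below are assumptions made for illustration; real traffic management is usually handled by a gateway or service mesh rather than by application code like this.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def select_version(
    user_id: str,
    gray_percent: int,
    stable: str = "v1",
    gray: str = "v2",
) -> str:
    """Return the version to serve, sending `gray_percent`% of users to `gray`."""
    if not 0 <= gray_percent <= 100:
        raise ValueError("gray_percent must be between 0 and 100")
    return gray if rollout_bucket(user_id) < gray_percent else stable


# Widen the rollout step by step; the set of gray users only ever grows.
for percent in (0, 10, 50, 100):
    users = [f"user-{i}" for i in range(1000)]
    gray_users = sum(select_version(u, percent) == "v2" for u in users)
    print(f"{percent:3d}% target -> {gray_users} of {len(users)} users on v2")
```

Because the bucket depends only on the user ID, a given user sees a consistent version across requests, and setting the percentage back to 0 rolls everyone back to the stable version.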

Risk-Oriented Emergency Rapid Recovery Principle

Even with extensive technical measures for redundancy and high availability, zero failures cannot be guaranteed in every scenario. It is therefore necessary to establish an efficient emergency management process and a stable technical platform that support real-time discovery of failure risks, effective coordination of emergency teams, accurate recording of the handling process, rapid loss mitigation and recovery, and post-incident review. The goals are to improve emergency response efficiency, reduce the impact of failures, prevent similar failures from recurring, and raise the overall availability of the system.
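To make "accurate recording of the handling process" concrete, the following minimal Python sketch keeps a timestamped incident timeline and derives two figures commonly used in post-incident review: time to detect and time to recover. The `Incident` class, the phase names, and the sample timestamps are hypothetical and serve only as an illustration, not as the format of any particular emergency management platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    title: str
    started_at: datetime
    # Each event is (timestamp, phase, note).
    events: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, at: datetime, phase: str, note: str) -> None:
        """Append one handling step to the timeline."""
        self.events.append((at, phase, note))

    def _first(self, phase: str) -> Optional[datetime]:
        times = [at for at, p, _ in self.events if p == phase]
        return min(times) if times else None

    def time_to_detect(self) -> Optional[timedelta]:
        """Elapsed time from failure start to first detection."""
        detected = self._first("detected")
        return detected - self.started_at if detected else None

    def time_to_recover(self) -> Optional[timedelta]:
        """Elapsed time from failure start to first recovery."""
        recovered = self._first("recovered")
        return recovered - self.started_at if recovered else None


# Sample timeline with made-up timestamps.
incident = Incident("checkout latency spike", datetime(2023, 9, 25, 10, 0))
incident.record(datetime(2023, 9, 25, 10, 3), "detected", "latency alert fired")
incident.record(datetime(2023, 9, 25, 10, 12), "mitigated", "traffic shifted to standby")
incident.record(datetime(2023, 9, 25, 10, 25), "recovered", "primary restored")

print(incident.time_to_detect())   # 0:03:00
print(incident.time_to_recover())  # 0:25:00
```

Recording each step as it happens gives the post-incident review a factual timeline to work from, and tracking these durations across incidents shows whether detection and recovery are getting faster.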