Challenges and Considerations about Alibaba Cloud Application Scaling

This article describes the problems and solutions Alibaba Cloud encountered while designing and implementing intelligent application scaling strategies in a production environment.

By Yanxun, Core Development Engineer of Alibaba Cloud EDAS and Andy Shi, Technical Evangelist of Alibaba Cloud

Cloud-native technology stacks are increasingly widespread, but how can Kubernetes be adopted in a more efficient and accessible way so that it delivers the real value of cloud-native? This is a new challenge and a hot topic. The focus on cloud-native technology has gradually shifted from "usage" to "better usage." In response, CNCF SIG App Delivery teamed up with Alibaba Cloud's Cloud-Native Application Platform Team to launch a series of articles called From Zero to One: Building a Cloud-Native Application Management Platform. These articles aim to help readers better implement and practice core cloud-native technologies and build their own application-centered Kubernetes platform.

Background

Alibaba Cloud Enterprise Distributed Application Service (EDAS) is an all-in-one PaaS platform for application lifecycle management and monitoring. It is also the first Internet-level commercial platform to implement the Open Application Model (OAM) on the public cloud. Today, the kernel of the EDAS application management layer runs on native Kubernetes clusters and is built on the KubeVela open-source project. The platform has served thousands of cloud application developers in an efficient, stable, intelligent, and scalable manner. In this article, we use the underlying technology of EDAS as a concrete example to describe the problems Alibaba Cloud encountered, and the solutions it adopted, while designing and implementing intelligent application scaling strategies in a production environment. We also include best practices for building a cloud-native application platform.

Challenges and Considerations about Application Scaling

As a core product for application management and delivery, EDAS migrated its overall architecture from dedicated virtual machines to Kubernetes container clusters early on. Like most Kubernetes-based PaaS platforms, EDAS at that stage implemented automatic application scaling based on CPU and memory metrics, using the native Horizontal Pod Autoscaler (HPA) of Kubernetes. However, as the number of users grew and their demands diversified, the native HPA-based application scaling policy gradually exposed several shortcomings.

First, automatic scaling is not supported for fine-grained load metrics at an application level, such as RT and QPS.

Kubernetes is a "Platform for Platform" project, so its built-in capabilities mainly target container-level management and orchestration. For products that focus on applications and users, scaling metrics such as CPU and memory are too coarse-grained. Although HPA offers some support for custom metrics, its overall extensibility is limited, and custom metrics are hard to plug in. When we tried to refine metrics down to the application or source-code level, the HPA code, which is part of the Kubernetes code base, needed modification. Therefore, we had to think about how to implement fine-grained application scaling policies through an external framework with strong extensibility.
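To make "application-level metrics" concrete, here is a minimal sketch of how QPS and RT could drive a replica count. It reuses the proportional rule that HPA documents for any metric (desired = ceil(current replicas × current value / target value)). The function names, the per-pod targets, and the idea of scaling RT proportionally are illustrative assumptions, not the EDAS implementation.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """HPA-style proportional rule: ceil(current_replicas * current / target)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return math.ceil(current_replicas * current_value / target_value)


def replicas_for_app(
    current_replicas: int,
    qps_per_pod: float,
    target_qps_per_pod: float,
    rt_ms: float,
    target_rt_ms: float,
) -> int:
    """Evaluate each application metric independently and keep the largest answer.

    Treating response time (RT) proportionally is a simplification made for
    this sketch; a real policy may use a different model for latency.
    """
    by_qps = desired_replicas(current_replicas, qps_per_pod, target_qps_per_pod)
    by_rt = desired_replicas(current_replicas, rt_ms, target_rt_ms)
    return max(by_qps, by_rt, 1)


if __name__ == "__main__":
    # 4 pods, each serving 150 QPS against a target of 100 QPS per pod.
    print(replicas_for_app(4, 150.0, 100.0, 80.0, 200.0))
```

In this sketch the QPS signal asks for more pods while the RT signal alone would allow fewer, so the larger value wins.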

Second, the application's demand for scale-to-zero is not supported.

Scale-to-zero is a typical automatic scaling scenario in Serverless and FaaS. It helps users release idle resources and reduces platform usage costs. In modern microservices applications, many microservices that users host on the cloud share some characteristics of Serverless applications, such as being stateless and responding to traffic on demand. Thus, scale-to-zero is an important requirement for them as well. However, the built-in HPA in Kubernetes is not designed for this scenario and does not provide this capability. EDAS is a full-featured PaaS product and needs atomic capabilities that are independent and free from platform binding. This makes it impossible to cover all user scenarios simply by introducing a Serverless solution, such as OpenFaaS or Knative.
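The scale-to-zero behavior described above can be sketched as a small decision function: wake the workload when there is work, and drop to zero only after it has been idle for a cooldown period. The `ZeroScalePolicy` type and its field are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeroScalePolicy:
    # Hypothetical parameter: how long the workload must stay idle
    # before its replicas are removed entirely.
    cooldown_seconds: float


def target_replicas(
    policy: ZeroScalePolicy,
    current_replicas: int,
    pending_requests: int,
    seconds_since_last_request: float,
) -> int:
    """Return the replica count a scale-to-zero controller would aim for."""
    if pending_requests > 0:
        # Traffic is waiting: make sure at least one replica exists.
        return max(current_replicas, 1)
    if seconds_since_last_request >= policy.cooldown_seconds:
        # Idle long enough: release all resources.
        return 0
    # Idle, but still inside the cooldown window: keep things as they are.
    return current_replicas


if __name__ == "__main__":
    policy = ZeroScalePolicy(cooldown_seconds=300)
    print(target_replicas(policy, 3, 0, 600))
```

The cooldown window avoids flapping between zero and one replica when traffic arrives in short bursts.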

Third, scheduled scaling is not supported.

Besides scale-to-zero, scheduled scaling is an indispensable feature for EDAS users. Similarly, this application O&M capability must be an independent, atomic capability. We cannot introduce a complete set of solutions from another platform just to meet one requirement.
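As an illustration of what a scheduled scaling policy has to decide, the following sketch maps the time of day to a replica count, including windows that wrap past midnight. `ScheduleWindow` and `scheduled_replicas` are hypothetical names, and a production policy would also need to handle time zones and calendar rules.

```python
from dataclasses import dataclass
from datetime import time
from typing import Sequence


@dataclass(frozen=True)
class ScheduleWindow:
    start: time      # inclusive
    end: time        # exclusive
    replicas: int


def scheduled_replicas(windows: Sequence[ScheduleWindow], now: time, default_replicas: int) -> int:
    """Return the replica count of the first window containing `now`."""
    for window in windows:
        if window.start <= window.end:
            active = window.start <= now < window.end
        else:
            # The window wraps around midnight, e.g. 22:00 to 02:00.
            active = now >= window.start or now < window.end
        if active:
            return window.replicas
    return default_replicas


if __name__ == "__main__":
    windows = [
        ScheduleWindow(time(9, 0), time(18, 0), replicas=10),  # business hours
        ScheduleWindow(time(22, 0), time(2, 0), replicas=0),   # overnight pause
    ]
    print(scheduled_replicas(windows, time(10, 30), default_replicas=2))
```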

To solve the preceding problems, Alibaba Cloud planned a new version of EDAS with a new automatic scaling capability. At the same time, the underlying architecture of EDAS has been evolving on the basis of the Open Application Model (OAM) since the beginning of 2020. With this evolution, the team aims to introduce a standardized and pluggable application definition model to replace the original Application CRD of EDAS. The team can then provide users with an application-centered, higher-level abstraction rather than forcing them to learn the underlying concepts of Kubernetes. The team can also use the extensibility of the model to plug capabilities from the cloud-native ecosystem into EDAS with minimal effort. Therefore, the design and implementation of the new automatic elastic scaling component are integrated with the OAM-based architecture of EDAS.

In this new architecture, the automatic elastic scaling policy of an application is a Trait of that application. The "application" here is a higher-level abstraction on top of Kubernetes that EDAS exposes to users through OAM and that is described in user-facing primitives. This raises a question: "How should the user-defined, application-oriented elastic scaling policy be implemented or selected at the Kubernetes implementation layer?"

Combining the three specific challenges mentioned earlier and the OAM-based Kubernetes-native design of the new EDAS, the team decided to introduce a horizontal scaling component from the open-source community to solve the preceding problems. The team summarizes three main selection requirements for EDAS scenarios:

This horizontal scaling component should provide simple and stable atomic capabilities that are not bound to a specific scenario solution, such as Serverless. This is also the basic specification and requirement of the OAM model for "application traits."
The scaling metrics of this horizontal scaling component should be pluggable, so the team can easily extend application-centered elastic policies, for example, policies based on schedules, the number of messages in a message queue, application monitoring metrics, and AI predictions.
The component should natively support scale-to-zero while still complying with the first requirement.

After evaluating the options in the community, the team chose the open-source KEDA project, which was initiated by Microsoft and is hosted by CNCF. KEDA natively supports scale-to-zero. More importantly, for application-level horizontal scaling it decouples the scaled object from the scaling metrics and defines an abstract interface for each through the Scaler + Metrics Adapter mechanism. This provides a powerful plug-in mechanism and a unified way to define all scaling policies. In addition, the design and architecture of KEDA are relatively simple and do not rely on opaque, hard-to-maintain techniques. Many built-in scalers can be used directly, which meets the overall demands of EDAS.
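The value of decoupling the scaled object from the metric source can be shown with a small plug-in sketch. The `Scaler` interface below is loosely inspired by that idea, but its method names are hypothetical and do not reproduce KEDA's actual Go interfaces. `QueueLengthScaler` is an invented example plug-in.

```python
import math
from abc import ABC, abstractmethod


class Scaler(ABC):
    """Hypothetical plug-in interface: one metric source per scaler."""

    @abstractmethod
    def is_active(self) -> bool:
        """Whether there is any work at all (drives scale-to-zero)."""

    @abstractmethod
    def metric_value(self) -> float:
        """Current value of the metric this scaler watches."""

    @abstractmethod
    def target_per_replica(self) -> float:
        """How much of the metric one replica is expected to handle."""


class QueueLengthScaler(Scaler):
    """Example plug-in: scale on the number of messages waiting in a queue."""

    def __init__(self, queue_length: int, messages_per_replica: int) -> None:
        self._queue_length = queue_length
        self._messages_per_replica = messages_per_replica

    def is_active(self) -> bool:
        return self._queue_length > 0

    def metric_value(self) -> float:
        return float(self._queue_length)

    def target_per_replica(self) -> float:
        return float(self._messages_per_replica)


def reconcile(scaler: Scaler, max_replicas: int) -> int:
    """The scaling loop only sees the interface, never the metric source."""
    if not scaler.is_active():
        return 0
    wanted = math.ceil(scaler.metric_value() / scaler.target_per_replica())
    return max(1, min(wanted, max_replicas))


if __name__ == "__main__":
    print(reconcile(QueueLengthScaler(queue_length=45, messages_per_replica=10), max_replicas=20))
```

Adding a new metric source in this design means writing another `Scaler` subclass. The `reconcile` loop stays unchanged, which is the property the team was looking for.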

EDAS Cloud-Native PaaS Architecture based on OAM and KEDA

In terms of technical architecture, the kernel of Alibaba Cloud EDAS is built on the KubeVela open-source project from the OAM community. Thanks to the native Kubernetes extension mechanism provided by OAM, the EDAS R&D team, unlike a traditional PaaS team, does not have to perform massive secondary development or modify the user-side API when launching features from the cloud-native open-source community, such as KEDA. The team only needs to register the CRD of KEDA as an Autoscale Trait of EDAS according to the OAM specification. Users can then use the newly added horizontal scaling capability once the monitoring data connection is complete. The overall architecture is shown on the chart below:

In its implementation, EDAS drives KEDA to scale workloads horizontally and quickly, based on the fine-grained application-level monitoring data provided by Alibaba Cloud ARMS. To support this, an ARMS Scaler was added to KEDA. EDAS also fixed several problems and enhanced some aspects of KEDA v1, including:

Incorrect capacity values were produced because the metric values of multiple triggers of the same type were added together instead of being calculated independently.
An over-length name was trimmed to 63 characters when KEDA created the HPA. Errors could occur because the trimmed name might not comply with the DNS specification.
Triggers could not be disabled, which posed a stability risk in the production environment.

The EDAS Team has submitted (or is submitting) these problems to the KEDA upstream, and some of them have been fixed in KEDA v2.
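Two of the issues listed above are easy to illustrate. The sketch below first contrasts adding up the values of same-type triggers with evaluating each trigger independently, then shows one way to shorten a name to 63 characters without ending on a hyphen. Both are illustrations under assumptions: taking the maximum across triggers mirrors how HPA combines multiple metrics, and the hash-suffix scheme in `safe_label` is an invented approach, not the actual KEDA patch.

```python
import hashlib
import math
import re
from typing import Sequence, Tuple

Trigger = Tuple[float, float]  # (current metric value, target value per replica)

# DNS-1123 label rule: lowercase alphanumerics and '-', alphanumeric at both ends.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def replicas_summed(triggers: Sequence[Trigger]) -> int:
    """The problematic behavior: metric values of all triggers are added up."""
    total_value = sum(value for value, _ in triggers)
    target = triggers[0][1]
    return math.ceil(total_value / target)


def replicas_independent(triggers: Sequence[Trigger]) -> int:
    """Each trigger is evaluated on its own; the largest demand wins."""
    return max(math.ceil(value / target) for value, target in triggers)


def safe_label(name: str, limit: int = 63) -> str:
    """Shorten an over-length name while keeping it a valid DNS label.

    Names that already fit are returned unchanged and are assumed to be valid.
    """
    if len(name) <= limit:
        return name
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    # Leave room for '-' plus the digest, then drop any trailing separators.
    head = name[: limit - len(digest) - 1].rstrip("-.")
    return f"{head}-{digest}"


if __name__ == "__main__":
    triggers = [(300.0, 100.0), (250.0, 100.0)]
    print(replicas_summed(triggers), replicas_independent(triggers))

    long_name = "keda-hpa-" + "a" * 53 + "-" + "b" * 20
    print(bool(DNS_LABEL.match(long_name[:63])), bool(DNS_LABEL.match(safe_label(long_name))))
```

With two triggers that each need three replicas, the summed variant asks for six, which is the kind of inflated capacity value described above. The naive 63-character cut of the example name ends with a hyphen, so it fails the label check, while the shortened form passes.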

Kubernetes has a long-standing problem where automatic scaling and gray release often conflict. To address this problem, EDAS uses the semantics of the OAM model layer to carry out the mutual exclusion of these two capabilities.
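One simple way to picture mutual exclusion at the model layer is a validation step that rejects an application when both traits are active at once. This is only a sketch of the idea. The trait names `autoscale` and `rollout` and the validation function are hypothetical, and the article does not describe how EDAS actually enforces the rule.

```python
from typing import Iterable


class TraitConflictError(ValueError):
    """Raised when two mutually exclusive traits are attached to one component."""


# Hypothetical trait names; the real EDAS trait identifiers are not given here.
EXCLUSIVE_TRAIT_PAIRS = [frozenset({"autoscale", "rollout"})]


def validate_traits(traits: Iterable[str]) -> None:
    """Reject a trait set that contains a mutually exclusive pair."""
    present = set(traits)
    for pair in EXCLUSIVE_TRAIT_PAIRS:
        if pair <= present:
            names = " and ".join(sorted(pair))
            raise TraitConflictError(f"traits {names} cannot be active at the same time")


if __name__ == "__main__":
    validate_traits(["autoscale", "ingress"])  # accepted
    try:
        validate_traits(["autoscale", "rollout"])
    except TraitConflictError as exc:
        print(exc)
```

Because the check works on trait names declared in the application model, it does not need to inspect the underlying Kubernetes objects.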

Current Work and Future Plans

EDAS is currently working with open-source communities to add many new capabilities to the KEDA-based Autoscaler Trait, including:

Triggers can be disabled.
A Decider abstraction is provided, so more decision-making logic can be plugged into the scaling process.
A Dry Run function is supported.
Gray release, rollback, and observation of capacity changes are supported.
Webhook notifications are supported.
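The Decider and Dry Run items above could fit together as follows: each decider adjusts a proposed replica count, and a dry run reports the final decision without applying it. All names here (`Decider`, `limit_step`, `clamp`, `plan_scaling`) are hypothetical and only sketch the idea. They are not the EDAS or KEDA API.

```python
from typing import Callable, Sequence

# A decider receives (current replicas, proposed replicas) and returns an
# adjusted proposal. Deciders are applied in order.
Decider = Callable[[int, int], int]


def limit_step(max_step: int) -> Decider:
    """Allow the replica count to change by at most `max_step` per decision."""
    def decide(current: int, proposed: int) -> int:
        return max(current - max_step, min(proposed, current + max_step))
    return decide


def clamp(min_replicas: int, max_replicas: int) -> Decider:
    """Keep the proposal inside a fixed capacity range."""
    def decide(current: int, proposed: int) -> int:
        return max(min_replicas, min(proposed, max_replicas))
    return decide


def plan_scaling(
    current: int,
    proposed: int,
    deciders: Sequence[Decider],
    dry_run: bool,
    apply: Callable[[int], None],
) -> int:
    """Run the decider chain; call `apply` only when this is not a dry run."""
    for decider in deciders:
        proposed = decider(current, proposed)
    if not dry_run and proposed != current:
        apply(proposed)
    return proposed


if __name__ == "__main__":
    applied = []
    chain = [limit_step(5), clamp(1, 8)]
    print(plan_scaling(4, 20, chain, dry_run=True, apply=applied.append), applied)
```

In a dry run the caller still learns what the chain would have decided, which makes it possible to observe a new policy before letting it change capacity.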

In the future, the EDAS Team will mainly focus on integrating the current architecture with the AIOps capabilities of EDAS. The goal is a more intelligent elastic scaling experience across the entire platform, including:

1. More Intelligent Decision-Making Mechanisms

Comprehensive decision-making based on the upstream and downstream application state
Comprehensive decision-making based on adaptive traffic limiting
Comprehensive decision-making based on expert systems, change-freeze windows, and promotion rules
Comprehensive decision-making based on historical data analysis
Capacity diagnosis and automatic recommendation of scaling policies

2. More Controllable Scaling Process

Webhook notification during scaling changes
Interactive scaling change operation after manual confirmation
Gray release, rollback, and observation of scaling changes
Dry Run Function

3. Richer Trigger System

QoS Trigger
Database Metric Trigger
Message Queue Metric Trigger

In the next version, these KEDA-based innovations and enhancements will bring more powerful, intelligent, and stable application auto-scaling capabilities and a friendlier user experience.

Summary

Using the automatic elastic scaling of EDAS as an example, this article introduced the challenges that the Alibaba Cloud enterprise application platform faced, and the solutions it adopted, when integrating KEDA as its horizontal scaling component. The work builds on the OAM and KubeVela projects in a classic PaaS scenario. In the future, this KEDA-based platform will integrate with a wider range of scaling metrics and more intelligent decision-making mechanisms.

As the cloud-native ecosystem evolves, Alibaba Cloud EDAS is being applied on a large scale in the field of cloud-native application management. EDAS brings application versioning, dependency management, O&M feature interaction, batch delivery, and other enhancements, along with a wide range of best practices and experience. With a standardized and open product architecture, EDAS can integrate with the "new forces" of cloud-native communities, such as KEDA, and deliver powerful application management capabilities from open-source communities to users in a standardized and extensible manner. In this way, it pursues user-centered technological innovation and moves towards the next era of cloud-native PaaS.
