By Moyuan
In 2020, the COVID-19 pandemic swept the world, causing widespread business and factory shutdowns and supply chain disruptions, with a huge impact on the global economy. 65% of enterprises are considering moving to the cloud to strengthen their IT capabilities against other potential systemic risks in the future. As the most advanced approach to cloud migration, cloud-native has become the preferred choice for most enterprises carrying out IT transformation.
Capgemini, a well-known consulting firm, conducted a survey in 2020 called Cloud-Native Comes of Age. They spoke with roughly 900 executives about how cloud-native applications are enabling business agility and innovation. The survey shows that only 15% of companies have built new applications in a cloud-native environment, but the proportion will rise to 52% over the next three years. The report defines enterprises that deploy more than 20% of applications in cloud-native environments as leaders. How do they view cloud-native technologies?
However, according to the FinOps Kubernetes Report released by CNCF in 2021, after migrating to the Kubernetes platform, 68% of the respondents said the cost of computing resources in their enterprises had increased, and 36% said the cost had soared by more than 20%. Even though cost reduction and efficiency gains are benefits that most leading companies agree on, many enterprises still run into obstacles during cloud-native transformation and end up paying more. Why does adopting cloud-native technology fail to deliver the ideal outcome?
Raymond is the head of the IT platform at an Internet e-commerce company. Over the past two years, he has led the team through the cloud-native transformation of all the company's businesses. His reasons for choosing cloud-native technology as the platform architecture were simple. Cloud-native technology, represented by microservices, containers, and DevOps, can unify the delivery, operation, and maintenance of different types of applications and reduce management costs. It can automate build and delivery through pipelines and speed up research and development. It can share resources elastically between applications through container technology and reduce resource waste. It can further raise cluster resource utilization through colocation and preemption between different types of applications.
Raymond's team is responsible for the stable operation of the company's five platforms. According to the business features, O&M convenience, security level, and cost considerations, Raymond divides the business into three clusters:
The primary site has high service stability requirements. The cluster is planned mainly around static node pools, and scheduled scaling expands capacity in advance, before the service peak arrives. When load is low during the day, the colocated transcoding service reuses the cluster's idle capacity on a time-sharing basis, which improves resource efficiency.
The livestreaming service and the big data service share one cluster because ad hoc data lake queries, the livestreaming service, and big data ETL jobs all consume a very large amount of computing resources per unit of time. However, their capacity demand is relatively random, so a highly elastic cluster suits both services.
The micro-business is placed independently in a cluster, mainly for security and isolating tenant data and business data. Independent clusters also allow better cost accounting.
Raymond is a very senior expert in the cloud-native field, and his technology selection, cluster splitting, and optimization strategies seemed impeccable. In the first month after the move to cloud-native, the platform was stable and efficient, and everything seemed to be moving toward the expected results.
“The cost increased 70% last month?” Raymond muttered to himself after getting the latest bill. What is the problem?
Previously, Raymond's team used a more traditional, mature, static enterprise IT cost management model. The cycle of this model is usually monthly or quarterly. Within each cycle, IT asset procurement achieves the goal of enterprise IT cost management through resource planning, cost estimation, cost budgeting, and cost control.
The advantage of this model is that the cost budget derived from each cost management cycle is fixed, which is very friendly from an IT asset management perspective. However, the disadvantages are also clear. When business capacity changes frequently, the cost estimation stage can deviate widely from reality, resulting in a lot of waste.
Cloud-native is often used to reduce costs and increase efficiency through capabilities such as intelligent scheduling, auto scaling, colocation, and time-sharing preemption. In essence, exclusive use of resources becomes shared use, and the static supply of resources becomes dynamic. Adopting any new technology inevitably transforms the architecture of the existing system. However, the dynamic nature of a cloud-native architecture often breaks the traditional IT cost management system in the enterprise, and IT cost management goes out of control. Once that happens, the various optimization strategies have nothing to stand on.
When Raymond tried to find clues about problems through bills, he got a hundred pages of monthly bill details. It was almost impossible to trace back the applications and departments that caused abnormal costs from the bill details. The problem Raymond encountered is a problem that almost every person in charge of cloud-native architecture must overcome.
What makes cost management difficult for cloud-native IT enterprises?
In the traditional IT cost management model, business units map directly to billing units. For example, a portal website consists of two ECS instances, an SLB instance as the access gateway, and an RDS instance. Its business units and billing units correspond one-to-one, and the bill is the cost.
However, in cloud-native scenarios, when an application is deployed in a container cluster such as Kubernetes, all resources are pooled. The minimum metering unit of the service is a pod, and the lifecycle of the pod does not match the lifecycle of the node that generates bills. In most scenarios, when applications are redeployed, the pods of the service are rescheduled to other nodes. As a result, the business unit and billing unit may not be able to achieve a one-to-one matching relationship in the three dimensions of logic, space, and time.
This makes it difficult to arrive at concrete results when the business units of an enterprise want to measure, plan, and estimate the budget of a business.
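The mismatch in time and space can be made concrete with a small sketch. It assumes, purely for illustration, that a pod is charged a CPU-proportional share of each node it runs on; the `Placement` type, the prices, and the core counts are hypothetical and are not taken from any real bill.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """One stretch of time a pod spent on one node (hypothetical fields)."""
    node_hourly_price: float  # what the node is billed per hour
    node_cpu_cores: float     # allocatable cores on the node
    pod_cpu_request: float    # cores the pod requested
    hours: float              # how long the pod ran on this node


def pod_cost(placements):
    """Sum the pod's CPU-proportional share of every node it ran on."""
    total = 0.0
    for p in placements:
        share = p.pod_cpu_request / p.node_cpu_cores
        total += share * p.node_hourly_price * p.hours
    return total


# A pod requesting 2 cores is rescheduled from an 8-core node to a 16-core node.
history = [
    Placement(node_hourly_price=0.40, node_cpu_cores=8, pod_cpu_request=2, hours=10),
    Placement(node_hourly_price=0.80, node_cpu_cores=16, pod_cpu_request=2, hours=14),
]
cost = pod_cost(history)
print(round(cost, 2))
```

Neither node's bill line mentions the pod, and the pod's cost is spread over two bill lines, which is why the bill alone cannot answer what a business unit spent.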
In the traditional enterprise IT management model, the relationship between planning/budgeting and resource delivery is static. Business departments can submit budgets on a monthly, quarterly, and annual basis. Then, the IT department conducts unified procurement and distribution. Containers adopt technologies and solutions (such as auto scaling) to solve the problem of resource waste in the static capacity planning model. Dynamic resource delivery is used to control capacity costs.
However, the dynamic resource delivery model may introduce other cost traps in actual production. Most traditional static planning models use subscription (annual or monthly) billing, whereas the dynamic resource delivery model mixes subscription and pay-as-you-go billing. Some scenarios also introduce special payment policies, such as savings plans, reserved instances, and preemptible instances. By comparison, the unit price of a monthly or annual subscription is about 30-50% of the pay-as-you-go unit price. If too large a share of resources is delivered dynamically at the higher pay-as-you-go price, a lot of IT spending is wasted.
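The trade-off between the two billing methods can be sketched as a break-even check. The example assumes a single resource, a subscription that is paid for the whole period regardless of use, and pay-as-you-go that is paid only while the resource runs; the function name and the sample numbers are illustrative, and only the 30-50% price ratio comes from the text above.

```python
def cheaper_billing(subscription_ratio, needed_fraction):
    """Pick the cheaper billing method for one resource.

    subscription_ratio: subscription unit price / pay-as-you-go unit price
    needed_fraction: fraction of the period the resource is really needed
    """
    if not 0 < subscription_ratio <= 1:
        raise ValueError("subscription_ratio must be in (0, 1]")
    if not 0 <= needed_fraction <= 1:
        raise ValueError("needed_fraction must be in [0, 1]")
    # Normalize: running pay-as-you-go for the whole period costs 1.0.
    subscription_cost = subscription_ratio  # paid for the whole period
    payg_cost = needed_fraction             # paid only while running
    return "pay-as-you-go" if payg_cost < subscription_cost else "subscription"


# With a subscription at 40% of the pay-as-you-go price:
print(cheaper_billing(0.4, 0.25))  # needed a quarter of the time
print(cheaper_billing(0.4, 0.90))  # needed almost all the time
```

Under these assumptions, the break-even point is simply the price ratio: a resource needed for a larger fraction of the period than that ratio is cheaper on subscription.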
In addition, the budget and procurement of the traditional static capacity planning model are implemented in one phase, so IT cost management does not need to pay attention to cost trends. However, when a large number of dynamic resource delivery models are implemented, the IT administrator of an enterprise needs to pay attention to the total cost changes and the cost trend and even predict the cost in some scenarios to ensure the cost of the cluster will not exceed the budget on an unexpected scale.
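As a minimal sketch of what watching the cost trend could look like, the example below projects the month-end cost from the average daily spend so far and compares it with the budget. The linear run-rate projection, the function names, and the daily figures are all assumptions made for illustration; a production system would likely use a richer forecasting method.

```python
def project_month_end(daily_costs, days_in_month):
    """Linear run-rate projection: average daily spend times days in month."""
    if not daily_costs:
        raise ValueError("need at least one day of cost data")
    average = sum(daily_costs) / len(daily_costs)
    return average * days_in_month


def exceeds_budget(daily_costs, days_in_month, budget):
    """True if the projected month-end cost is above the budget."""
    return project_month_end(daily_costs, days_in_month) > budget


# Hypothetical daily costs for the first five days of a 30-day month.
spend = [100.0, 120.0, 140.0, 160.0, 180.0]
projected = project_month_end(spend, 30)
alert = exceeds_budget(spend, 30, budget=3500.0)
print(projected, alert)
```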
For cost control, the traditional IT cost management model focuses on improving efficiency first: machine utilization is improved, and costs are then reduced in the next capacity planning phase. In cloud-native IT cost management, efficiency gains and cost reduction happen at the same time. Enterprises can adjust resource quotas based on monitoring and intelligent recommendations to improve resource utilization, and they can reduce resource costs through auto scaling and dynamic resource delivery. Doing both at once significantly shortens the cycle of the enterprise IT cost management model and places more demands on budget management, quota management, cost trend prediction, and cost trend alerting.
The traditional IT cost management model has a relatively narrow set of optimization methods, usually guided by resource utilization and similar indicators. In cloud-native scenarios, optimization methods emerge one after another. However, any optimization scheme challenges the stability of the existing architecture.
In essence, optimization in cloud-native scenarios focuses on making scheduling and resources dynamic. Users can raise resource utilization and the overall cluster allocation level, or reduce the total core-hour cost, through migration, time-sharing, preemption, and scaling. Most optimizations target domain-specific scenarios. Before implementing cloud-native IT cost optimization solutions, enterprises need to measure and evaluate both the risks brought by architectural changes and the expected benefits of the optimization.
The four problems above are obstacles that no enterprise can bypass when managing IT costs during cloud-native transformation. They slow the pace of cloud-native transformation and trouble a large number of cloud-native technology leaders, including Raymond. Cloud-native IT cost management was created to solve these problems.
Alibaba Cloud has one of the most complete offerings of container products among cloud providers globally. As early as 2006, we began to promote the adoption of cloud-native technology within Alibaba Group. Sixteen years of cloud-native practice have given Alibaba Cloud the insight and understanding to help enterprises realize their IT transformation.
In recent years, as enterprises accelerate their move to cloud-native, more of them have started to discuss and adopt the concept of FinOps. FinOps is a cloud operation model that combines systems, best practices, and culture to improve an organization's ability to understand cloud costs. It is an approach that brings financial responsibility to cloud spending, enabling teams to make informed business decisions. FinOps enhances collaboration among IT, engineering, finance, procurement, and the enterprise. It enables IT to evolve into a service organization focused on leveraging cloud technology to add value to the business. When cloud-native technology meets the concept of FinOps, the result is Cloud Native FinOps, an evolution of FinOps for cloud-native scenarios.
Alibaba Cloud Container Service has launched an enterprise cloud-native IT cost management solution that provides enterprises with IT cost management, cost visualization, and cost optimization in cloud-native scenarios. The solution has five core functions.
Alibaba Cloud Container Service proposes a unique cost estimation model that combines billing and metering to resolve the mismatched lifecycles of business units and billing units in container scenarios. The model takes into account the cost strategy (payment type, savings plans, vouchers, user discounts, and bid price fluctuation), the allocation factors (CPU, memory, GPU cards, and GPU memory), the resource form (ECS, ECI, or HPC), and other factors to estimate costs at the pod level and to allocate cluster costs proportionally. Bill analysis aggregates all resource costs of the cluster over a billing period, and the result is combined with pod-level cost allocation. Together, they form a complete cost allocation and estimation model for cloud-native container scenarios.
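The idea of combining metering (pod-level estimates) with billing (the aggregated cluster bill) can be illustrated with a simplified sketch. This is not the model used by Alibaba Cloud Container Service; it assumes only two allocation factors (CPU and memory) with a hypothetical fixed weight, and it ignores discounts, GPUs, and resource forms. All names and numbers are invented for the example.

```python
def estimate_pod_cost(cpu_req, mem_req_gib, node_cpu, node_mem_gib,
                      node_hourly_price, hours, cpu_weight=0.5):
    """Estimate a pod's cost as a weighted share of its node's price.

    cpu_weight is a hypothetical allocation factor; memory gets the rest.
    """
    mem_weight = 1.0 - cpu_weight
    share = (cpu_weight * cpu_req / node_cpu
             + mem_weight * mem_req_gib / node_mem_gib)
    return share * node_hourly_price * hours


def allocate_bill(total_bill, estimated_by_group):
    """Split the real bill across groups in proportion to their estimates."""
    total_estimate = sum(estimated_by_group.values())
    if total_estimate <= 0:
        raise ValueError("estimates must sum to a positive number")
    return {group: total_bill * estimate / total_estimate
            for group, estimate in estimated_by_group.items()}


# Metering side: a pod with 2 cores and 8 GiB on an 8-core, 32 GiB node.
pod_estimate = estimate_pod_cost(cpu_req=2, mem_req_gib=8, node_cpu=8,
                                 node_mem_gib=32, node_hourly_price=1.0,
                                 hours=24)

# Billing side: split a 1000-unit cluster bill across two namespaces.
allocation = allocate_bill(1000.0, {"shop": 30.0, "data": 10.0})
print(pod_estimate, allocation)
```

The estimate is available in near real time from requests and prices, while the proportional split reconciles those estimates with what the bill actually charged for the period.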
The solution supports cost insight in four dimensions: cluster, namespace, node pool, and application (label wildcard matching).
The cluster dimension focuses on the distribution of cloud resources, the trend of resource costs, the ratio of cluster allocation level to waste, and the trend and prediction of cluster cost. It helps IT administrators accurately judge the cost consumption trend and avoid exceeding the budget.
The namespace dimension focuses on cost allocation. It supports short-term cost estimation and long-term cost allocation, as well as correlation analysis of scheduling level, resource usage, and cost trends. It helps department administrators estimate costs, drill down into cost waste, and improve department resource utilization.
The node pool dimension focuses on resource cost planning and management. It helps IT asset administrators optimize the resource mix and payment policies through correlation analysis of instance type, unit price per core-hour, scheduling level, and utilization level.
The application (label wildcard matching) dimension focuses on cost optimization in domain scenarios, such as big data, AI, offline jobs, online applications, and other upper-layer application scenarios. You can use cost insight in the application dimension to estimate real-time costs and calculate costs at the task level.
With cost insight in these four dimensions, full-scenario cost optimization functions and solutions are backed by data, and cost reduction and efficiency gains can be realized.
Alibaba Cloud Container Service provides full-scenario resource profiling, cost optimization capabilities, and solutions for different enterprise business scenarios (please see the end of this article):
In addition, most of the cost optimization strategies of enterprises need to be supported by business scenarios. Customization and secondary development still exist in many scenarios. Therefore, the cost insight capabilities provided by cloud-native IT cost management are completely decoupled from those provided by upper-layer optimization solutions. The cost insight capabilities of four dimensions can be used to measure and evaluate cost optimization methods that cover all scenarios.
Multi-cloud is a new trend for enterprises migrating to the cloud. The billing models of different cloud vendors differ considerably: domestic cloud service providers offer annual and monthly subscriptions, international cloud service providers charge by credit card, and some providers support savings plans and reserved instances. All of these make cost analysis harder in multi-cloud management. The enterprise cloud-native IT cost management of Alibaba Cloud Container Service provides a unified billing and query interface, with default implementations for cloud service vendors, and supports access to cost data from mainstream cloud service vendors and self-built IDC data centers. Costs are managed through a consistent cost allocation and estimation model for cloud-native container scenarios. Together with ACK One (Alibaba Cloud Distributed Cloud Container Platform), the enterprise-level cloud-native distributed cloud container platform, it provides a unified control plane for multi-cloud management and asset management.
Enterprise cloud-native IT cost management is not only a product capability or solution but also an evolution of enterprise IT management, organizational process, and culture in the cloud-native era. The Alibaba Cloud Container Service Team and the Alibaba Cloud Apsara Infrastructure Team provide products and expert services that fully cover the FinOps concept through Alibaba Cloud Asset Manager.
As a domestic cloud product evaluated under the General Maturity Model for Cloud Resource-Oriented Financial Operation Capability, Alibaba Cloud Asset Manager assists enterprises with cost process management, cost insight, cost optimization, and cost operation. It helps enterprises establish an overall cloud-native IT cost platform and accelerates their IT innovation and IT decision-making.
In the face of Raymond's dilemma, how can he optimize costs through the enterprise cloud-native IT cost management solution provided by Alibaba Cloud Container Service?
Step 1: Raymond uses the cost analysis capabilities of the cluster to compare each cluster's cost trend with its cost budget and reach a preliminary conclusion about the cost anomaly.
The cluster costs show that most of the waste is in cluster B, so the drill-down analysis can focus on cluster B.
Step 2: View the cost composition of the cluster and determine the optimization direction and drill-down strategy
In this cluster, computing resources make up most of the cost, so further analysis should focus on resource utilization and the unit price per core-hour.
Step 3: Check the resource utilization of the cluster and the unit price cost
The scheduling level of the cluster has reached 78%, which is close to ideal: there is still room to schedule more workloads without much waste. However, the actual resource utilization is only 3%, which indicates that resources have been allocated but are barely used. In addition, the per-core unit price of one node pool that contains spot instances is close to the pay-as-you-go unit price, which indicates that the selected spot instance specifications are unreasonable and are driving up the per-core price.
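The three indicators in this step are simple ratios, sketched below. The 78% and 3% figures come from the text; the core counts and node prices are hypothetical values chosen only to reproduce those ratios and to show a spot price that sits close to the pay-as-you-go price.

```python
def ratio(part, whole):
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Hypothetical cluster totals, in CPU cores.
allocatable_cores = 1000.0
requested_cores = 780.0   # sum of pod requests
used_cores = 30.0         # measured average usage

scheduling_level = ratio(requested_cores, allocatable_cores)
utilization = ratio(used_cores, allocatable_cores)
idle_requested_cores = requested_cores - used_cores  # reserved but unused


def core_hour_price(node_hourly_price, node_cores):
    """Unit price of one core for one hour on a given node type."""
    return ratio(node_hourly_price, node_cores)


# Hypothetical node types: a 16-core spot node and an 8-core pay-as-you-go node.
spot_price = core_hour_price(0.96, 16)
payg_price = core_hour_price(0.50, 8)
spot_saving = 1 - spot_price / payg_price

print(scheduling_level, utilization, idle_requested_cores, round(spot_saving, 2))
```

A large gap between the scheduling level and utilization points at over-requested or idle workloads, while a small spot saving points at the instance selection in the node pool.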
Step 4: Drill down the application dimension to locate the problem application
The namespace dimension shows that the capacity of some namespaces changes in peaks and troughs. After capacity is scaled out, resource utilization does not fluctuate or change significantly, which indicates that scheduled scaling brings no benefit to the business.
The resource waste list for the namespace shows the names of the applications where a lot of waste occurs. Filtering by the application's label shows that the application is currently idle, yet it accounts for 34.74% of the cluster's overall cost.
Raymond confirmed with the R&D team that the scheduled scaling configuration had been applied to a test service that had not been launched yet, and that the configured number of replicas was relatively large, which wasted a lot of resources. In addition, because the spot instance combination in the cluster was expensive, the zones and specifications for new spot instances needed to be reconfigured. Raymond reconfigured the scheduled scaling rules, corrected the spot instance combination, and solved the problem that had plagued him for so long.
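A waste list of this kind could be approximated as follows. This is a rough sketch, not the product's implementation: it assumes each application has an allocated cost plus requested and used core-hours, and it flags an application as idle when its usage falls below a hypothetical threshold. The application names and figures are invented, except that the idle service is given the 34.74% share mentioned above.

```python
def waste_report(apps, idle_threshold=0.05):
    """Rank applications by cost share and flag the idle ones.

    apps maps a name to (cost, requested_core_hours, used_core_hours).
    """
    total_cost = sum(cost for cost, _, _ in apps.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    rows = []
    for name, (cost, requested, used) in apps.items():
        usage_ratio = used / requested if requested else 0.0
        rows.append({
            "app": name,
            "cost_share": cost / total_cost,
            "idle": usage_ratio < idle_threshold,
        })
    rows.sort(key=lambda row: row["cost_share"], reverse=True)
    return rows


# Hypothetical applications in one namespace.
report = waste_report({
    "test-service": (3474.0, 5000.0, 0.0),
    "web": (4526.0, 6000.0, 3000.0),
    "etl": (2000.0, 3000.0, 1500.0),
})
idle_apps = [row["app"] for row in report if row["idle"]]
print(idle_apps)
```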
Looking back, Raymond's problems are all small things that can easily occur in actual production, yet these small things can cause large financial losses in enterprise IT cost management. The more complex the IT system is, the more automated the operation and maintenance system needs to be. Similarly, the richer the means of reducing costs and increasing efficiency in cloud-native, the more data-driven and transparent the IT cost management scheme needs to be. Cost reduction and efficiency gains are the goal, and the emphasis is on the result rather than the process. With enterprise cloud-native IT cost management, enterprise IT cost optimization can be achieved in a transparent, data-driven, and automated way.
In the foreseeable future, more enterprises will discuss and adopt the concept of FinOps, and capabilities and solutions for reducing costs and increasing efficiency will keep emerging. However, the concept of IT cost management in most enterprises has not kept pace with the evolution of the architecture, which in effect places a greater burden on their cloud-native transformation. If you want to fully drive and implement cloud-native IT cost optimization strategies, you first need to put the concepts, tools, and processes of cloud-native IT cost management in place. Only observable, quantifiable, and measurable optimization solutions can truly prove their value.
Alibaba Cloud Enterprise Cloud-Native IT Cost Management helps enterprises implement the concepts, tools, and processes of enterprise IT cost management. It enables enterprises to manage and optimize IT costs digitally during their cloud-native transformation and to become practitioners and leaders in the field of FinOps.