Practice on the Construction of a Production-level Observability System of Alibaba Cloud ACK

Watch the replay of the Apsara Conference 2024 at this link!

By Shichun Feng, Head of the Alibaba Cloud Container Service Observability Team, sharing the Practice on the Construction of a Production-level Observability System of Alibaba Cloud Container Service for Kubernetes (ACK).

_1

Let us first review the official annual statistics of CNCF for the last year.

In 2022, Kubernetes was used in 76% of online production environments, and this figure rose to 89% in 2023. In addition, the top 10 projects related to the Kubernetes ecosystem also accounted for no less than 50% of production environment usage.

The Kubernetes containerized architecture has become the mainstream choice and de facto standard for production-level systems.

_2

Kubernetes containerization brings many benefits, but it also increases system architecture complexity. The maturity of an Infra team is tested by whether it can use more effective means to build a more stable O&M system.

Observability is one of the core capabilities for building an O&M system.

In the Magic Quadrant™ for Container Management released by Gartner in 2024, Alibaba Cloud continues to be a global leader and is the only vendor from Asia in that position.

Observability is a core evaluation item of container platform capabilities.

According to the analysis report on the public cloud container platform field released by the consulting firm Forrester in 2022, Alibaba Cloud was for the first time ranked as a global leader in this field, on par with Google.

ACK is highly recognized by analysts for its observability, powerful capabilities, seamless product experience, and efficient problem identification. That is why I would like to explain how essential observability is for building a user O&M system.

My talk covers three parts:

  1. An overview of the ACK observability system.
  2. Recent progress in Alibaba Cloud container observability in key areas such as AI scenarios, container networks, and storage.
  3. How the container observability system, as a data driving force, helps users build systems such as FinOps and AIOps.

First, let me introduce the function of the observability system and the typical O&M scenarios where it is used.

_3

The first and most important scenario, which many of our customers ask about, is how we ensure that their business services run stably on our container service clusters.

Here is an example of a typical best practice. During the Paris Olympics this year, ACK Pro clusters provided the cloud infrastructure for efficiency-improving features such as "bullet time" and for multiple Olympic online systems. The observability capabilities also provided important protection for these online systems, helping to ensure the smooth and successful operation of the Paris Olympics.

This year, the global dashboard is used to monitor online systems in real time. AI scenarios also bring new challenges to the observability of GPUs and job scheduling, which I will cover in detail later.

_4

The second scenario is how to perform performance tuning to ensure the stability of production-level large-scale clusters in container scenarios.

The containerized architecture adds an additional container layer, which also increases the complexity of O&M. ACK enhances the observability of the container layer to make the container layer transparent.

It helps many customers ensure the stability of their businesses and clusters through the transparent container layer in production scenarios.

As shown in the example, we take the promotional business of a leading e-commerce platform as a case study. During peak business hours, we observe sudden increases in access to core control plane components of the cluster, such as the gateway and the API server, and help optimize performance to ensure stability during traffic peaks.

These two scenarios test both the scenario coverage of the observability system and the construction of the O&M system. They determine whether an exception can be observed and, once observed, whether it can be quickly routed to the "right person" responsible for handling it. This ultimately ensures a low mean time to resolution (MTTR) and a high overall SLA.
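As a small worked example of the MTTR figure mentioned above, the following sketch averages the time between detection and resolution over a set of incidents. The incident records and the function name are hypothetical and only serve the illustration; they are not taken from any ACK tooling.

```python
from datetime import datetime


def mean_time_to_resolution(incidents):
    """Return the mean resolution time in minutes.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = sum(
        (resolved - detected).total_seconds() for detected, resolved in incidents
    )
    return total_seconds / len(incidents) / 60


# Hypothetical incident records, for illustration only.
incidents = [
    (datetime(2024, 9, 1, 10, 0), datetime(2024, 9, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 9, 3, 14, 0), datetime(2024, 9, 3, 14, 10)),  # 10 minutes
    (datetime(2024, 9, 7, 2, 15), datetime(2024, 9, 7, 3, 5)),    # 50 minutes
]

print(mean_time_to_resolution(incidents))  # 30.0
```

Faster detection and faster routing to the right owner both shrink the gap between the two timestamps, which is how better observability lowers this number.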

_5

The last typical observability scenario, one encountered daily, is exception diagnosis for routine business.

The preceding figure shows the troubleshooting path of an online service. The ACK observability system provides comprehensive observability across all layers of the link and finally locates the problem.

First, once the O&M system is established, when a business exception occurs, it will initially be detected through alerts and notifications. For example, if pods are down, they can be restarted through container methods such as rescheduling and elasticity.

Then, the container observability provides comprehensive coverage across the entire link, including the business gateway traffic to easily observe traffic impact. By using the Prometheus multi-dimensional metrics, we can analyze the time when exceptions occur and the specific resource dimensions affected.
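To make the idea of multi-dimensional analysis concrete, here is a minimal sketch in plain Python. It assumes metric samples are available as `(timestamp, labels, value)` tuples, similar in spirit to labeled Prometheus series, and finds when each value of a chosen label first exceeds a threshold. The sample data, label names, and function name are hypothetical.

```python
from collections import defaultdict


def first_breach_by_dimension(samples, dimension, threshold):
    """For each value of `dimension`, return the earliest timestamp at which
    the metric, summed over all other labels, exceeds `threshold`.
    Dimension values that never exceed the threshold are omitted."""
    totals = defaultdict(float)  # (dimension value, timestamp) -> summed value
    for ts, labels, value in samples:
        totals[(labels[dimension], ts)] += value

    first = {}
    for (dim_value, ts), total in totals.items():
        if total > threshold and (dim_value not in first or ts < first[dim_value]):
            first[dim_value] = ts
    return first


# Hypothetical error-rate samples (errors per second), for illustration only.
samples = [
    (100, {"namespace": "shop", "pod": "cart-1"}, 2.0),
    (100, {"namespace": "shop", "pod": "cart-2"}, 1.0),
    (100, {"namespace": "pay", "pod": "pay-1"}, 0.5),
    (160, {"namespace": "shop", "pod": "cart-1"}, 9.0),
    (160, {"namespace": "shop", "pod": "cart-2"}, 1.0),
    (160, {"namespace": "pay", "pod": "pay-1"}, 0.5),
]

print(first_breach_by_dimension(samples, "namespace", 5.0))  # {'shop': 160}
print(first_breach_by_dimension(samples, "pod", 5.0))        # {'cart-1': 160}
```

Slicing the same data first by namespace and then by pod narrows the exception down to both a time and a specific resource.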

Container logs collected by Simple Log Service (SLS) can help you further dive into the application logic for more precise localization. Additionally, through distributed tracing and code-level profiling, you can identify the root cause of the exception and then resolve it to reach the desired state.

This scenario is a great test for the end-to-end coverage of each layer of the observability system. In the process of tracing the root cause of an exception, if there are gaps in the observability data link, it will require more manual effort to investigate the issue, ultimately resulting in a higher MTTR for fault recovery.

Next, let me briefly introduce the ACK observability system and our out-of-the-box product capabilities.

_6

After introducing the typical scenarios in which container observability is used, let me give an overview of the specific capabilities of the current container observability system.

This is a big picture of the container observability system, including an overview of the system and the value-added scenarios we provide based on the observability data.

First, in terms of the construction of container data sources, we divide them into four layers. Among them, the upper layer is closer to the user's business system, and the lower layer is closer to the infrastructure layer.

From top to bottom, the application/microservice layer is responsible for monitoring user business traffic, tracing, and profiling. The container monitoring layer monitors the health of the cluster and the applications deployed on the cluster. The operating system layer provides transparent kernel observation capabilities. The underlying cloud infrastructure layer provides observation capabilities of cloud infrastructure related to the cluster environment, such as hosts, networks, storage, and middleware.

For the container monitoring layer and the operating system layer, we have recently enhanced key scenarios.

We provide powerful data computing, storage, and analysis capabilities through the data/monitoring platform service capabilities of our Alibaba Cloud Observability team.

In the container scenario, observability data also drives further value.

From top to bottom, we can first better perceive the business layer: profile the problems of business applications running on the container architecture, locate the root cause of business exceptions, and manage flow control.

The container layer provides observability for the control plane to address the challenges of large-scale production-level clusters. The expert diagnosis system supports comprehensive exception diagnosis of the cluster based on multi-dimensional observation data and helps with cluster capacity planning.

In cluster optimization analysis, the data-driven power of observability is particularly evident, especially in terms of security, FinOps cost efficiency, and high reliability of the cluster.

In terms of data integration, the alert center provides default alert rules based on the experience of the ACK development team, covering routine O&M issues and offering out-of-the-box protection. In addition, the cluster-managed node pool will detect exceptions through rich observation data and automatically heal the cluster and other O&M environment problems.

In terms of data-driven automation, it supports advanced elasticity capabilities such as metric-based HPA to help automatically maintain the stability of frequently changing businesses and strike a balance between cost efficiency and stability.
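The scaling decision behind metric-based HPA can be illustrated with the rule documented for the upstream Kubernetes Horizontal Pod Autoscaler: the desired replica count is `ceil(currentReplicas * currentMetric / targetMetric)`, and no scaling happens while the ratio stays within a tolerance (0.1 by default). The sketch below is a simplified Python rendering of that rule only. The function name and the replica bounds are illustrative, and it ignores stabilization windows, pod readiness, and multiple metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Simplified HPA rule: scale proportionally to the metric ratio,
    skip changes inside the tolerance band, then clamp to the bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 200, 100))                    # 8  (load doubled)
print(desired_replicas(4, 105, 100))                    # 4  (within tolerance)
print(desired_replicas(4, 50, 100))                     # 2  (load halved)
print(desired_replicas(10, 300, 100, max_replicas=20))  # 20 (clamped)
```

The tolerance band is what keeps a frequently changing workload from flapping between replica counts, and the upper bound is where the cost side of the balance is enforced.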

_7

In the observability system of ACK, the event data link has been enhanced in many ways. In addition to the original community Kubernetes events, we have added event sources for cluster resource management and control, the cluster control plane, operating system kernel exceptions, and the underlying cloud infrastructure layer of the cluster. Together, these form end-to-end, full-stack container observability along the event chain.

_8

In the metric system, we continue to expand the observability capabilities for all container scenarios within the unified Prometheus metrics protocol. For example, we enhance GPU observability to support major scenarios such as AI training on containers.

In terms of product consistency and user experience, through the upgraded pre-built monitoring dashboard, ACK allows users to uniformly view monitoring data while managing clusters and applications, providing a seamless experience that integrates monitoring, management, and control.

In ACK Pro clusters, all users get the O&M system and experience that we have accumulated with major customers, such as the monitoring dashboards for the control plane components API server and etcd.

_9

In terms of tracing, ACK currently provides tracing capabilities at three levels to meet the requirements of different scenarios:

  1. eBPF-based non-intrusive application monitoring;
  2. OpenTracing data access that supports the OpenTelemetry standard protocol;
  3. Java and Golang (precompiled) APM capabilities that are intrusive but offer powerful profiling.

_10

Thanks to the maturity of eBPF, we now provide eBPF-based application monitoring that is non-intrusive, reaches deep into the kernel layer, and has low overhead.

It supports cluster topology awareness. This allows you to observe the network traffic of all layers in the cluster in a non-intrusive manner. You can also drill down to the details to view the end-to-end network traffic.

So far, we've taken a quick overview of the current Alibaba Cloud container observability system. Next, I would like to focus on sharing with you the new observability challenges we have recently faced under the key scenarios of our customers and our current solutions.

_11

Let's first focus on the typical observability requirements in AI scenarios.

At present, our users' requirements for container platform capabilities in AI scenarios have gradually changed from simply "being able to run" to meeting the standards of a "production-level" system.

The containerized architecture provides high flexibility and resource reuse and is also a mainstream choice for AI scenarios.

Based on the experience accumulated in customer cases, we have summarized the GPU observability requirements of AI scenarios:

  1. GPUs are expensive, and GPU tasks fail at a high rate. Automatically detecting faulty GPUs and enabling self-healing of tasks and environments is a fundamental requirement to ensure uninterrupted AI production tasks.
  2. In the production environment, such as when performing inference tasks, ensuring the stability of the cluster and AI business tasks requires not only the O&M system mentioned earlier but also addressing the additional challenge of observing GPU resources.
  3. To tune models and understand why tasks are running slowly, we need to provide more transparent observability capabilities to help optimize model performance and assist with parameter tuning.

_12

The current containerized architecture for general AI production systems can be vertically divided into multiple layers as shown in the figure.

Among these layers, frameworks like Ray, Slurm, Spark, and large-scale RAG can be connected to ACK by using the ACK AI suite for targeted optimization. Each layer also provides enhanced observation capabilities. You can also use the unified metric monitoring capabilities of Alibaba Cloud Prometheus to obtain a consistent observation experience.

_13

ACK provides out-of-the-box GPU monitoring dashboards, GPU task failure status detection, and cost analysis capabilities for GPU applications.

This helps ensure the stability of AI business and optimize task parameters.

_14

In addition to the GPU performance monitoring metrics provided by ACK, we have also introduced a low-overhead, non-intrusive, and lightweight GPU profiling capability based on eBPF.

This feature allows you to analyze the time-consuming bottlenecks of AI tasks at the kernel level in PyTorch scenarios. This improves the performance optimization and diagnosis efficiency of AI tasks.

_15

Here is a GPU profiling timeline chart that correlates multi-level metrics to comprehensively analyze the time consumption of host processes, system calls, CUDA kernels, NCCL communication, and PyTorch operators, so that you can quickly locate the performance bottleneck of AI tasks.

_16

The second major scenario is the container network observation capability.

Due to the long process, extensive time span, and high experience requirements for network issue diagnosis, locating container network problems is a huge challenge.

The container network is the most complex scenario we deal with. The container network experts on our team spend a lot of time troubleshooting user network issues case by case.

_17

Here is KubeSkoop, which is a toolkit for diagnosing Kubernetes network issues developed by network experts in our container service team.

It provides enhanced container network observability based on eBPF, reducing network issue diagnosis from manual work with tools such as telnet, traceroute, and Wireshark to a one-click diagnosis and a quick look at the monitoring data.

_18

For everyday network connectivity issues, KubeSkoop provides end-to-end full-link connectivity checking capabilities.

With the data stored in Prometheus, you can look back at historical access relationships, troubleshoot past exceptions, and draw the network topology of the entire cluster.

_19

For network latency issues, KubeSkoop goes beyond connectivity checks to detect network latency exceptions.

Due to the dynamic scheduling nature and complex network structure of the Kubernetes architecture, packet capturing also needs to be performed in a distributed manner, simultaneously across multiple nodes, to construct a complete end-to-end picture of packet transmission for diagnosis.
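To illustrate why captures from several nodes have to be combined, the sketch below merges per-node packet records into a single timeline ordered by timestamp. It is not how KubeSkoop is implemented. The record format and node names are hypothetical, and it assumes the node clocks are synchronized, which a real tool would have to verify or correct for.

```python
import heapq


def merge_captures(captures_by_node):
    """Merge per-node packet records into one timeline.

    `captures_by_node` maps a node name to a list of (timestamp, description)
    records that is already sorted by timestamp. Returns a list of
    (timestamp, node, description) tuples in global timestamp order.
    """
    streams = [
        [(ts, node, desc) for ts, desc in records]
        for node, records in captures_by_node.items()
    ]
    # heapq.merge lazily merges already-sorted inputs without re-sorting them.
    return list(heapq.merge(*streams))


# Hypothetical captures of one TCP handshake seen from two nodes.
captures = {
    "node-a": [(1.000, "client pod sends SYN"), (1.030, "client pod receives SYN-ACK")],
    "node-b": [(1.010, "server pod receives SYN"), (1.020, "server pod sends SYN-ACK")],
}

for ts, node, desc in merge_captures(captures):
    print(f"{ts:.3f} {node}: {desc}")
```

Only in the merged view does it become visible on which hop, and in which direction, the time between two packets was spent.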

_20

Further dissecting the technical principles, KubeSkoop provides standard Prometheus monitoring metrics and standard event logs through the Net-Exporter component on ACK.

It collects kernel-level network stack information based on the eBPF technology.

_21

Container storage is also a scenario that requires observability to ensure business stability.

Disk mounting exceptions, write issues, and capacity issues are also common in the O&M environment.

In some high-throughput scenarios, such as frequent I/O writes in AI training, you also need to observe the operations and optimize performance.

_22

ACK provides unified access to various storage media in container scenarios through the standard Container Storage Interface (CSI).

We have also enhanced the CSI implementation of ACK to provide stronger container storage observability, including Prometheus metrics and standard Kubernetes events and logs.

This gives upper-layer stateful workloads enhanced container storage and observation capabilities in a unified manner.

_23

This is especially important for high-throughput scenarios, such as multi-node multi-GPU AI training and large-scale biological computing.

Throughput estimation capabilities help you select appropriate storage media solutions and optimize I/O bottlenecks.
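As a rough back-of-the-envelope illustration of throughput estimation, the sketch below multiplies the number of GPUs by the per-GPU sample rate and the average sample size, then checks the result against a storage limit with some headroom. This is a simplified assumption about how such an estimate could be made, not the method ACK uses. All names and numbers are hypothetical.

```python
def required_read_throughput_mb_s(num_gpus, samples_per_sec_per_gpu, avg_sample_size_mb):
    """Estimate the aggregate read throughput a training job needs, in MB/s."""
    return num_gpus * samples_per_sec_per_gpu * avg_sample_size_mb


def storage_is_sufficient(required_mb_s, storage_limit_mb_s, headroom=0.8):
    """Return True if the requirement fits within `headroom` of the storage limit."""
    return required_mb_s <= storage_limit_mb_s * headroom


# Hypothetical job: 16 GPUs, each consuming 500 samples/s of 0.25 MB each.
required = required_read_throughput_mb_s(16, 500, 0.25)
print(required)                               # 2000.0
print(storage_is_sufficient(required, 2000))  # False: no headroom left
print(storage_is_sufficient(required, 3000))  # True
```

If the estimate exceeds what the chosen storage medium can sustain, the GPUs will wait on I/O, so the comparison is worth doing before the job is scheduled.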

_24

The SysOM kernel-level container observation capability provided by the Alibaba Cloud operating system team last year has also helped solve many containerization problems after a year of accumulation.

The most typical problem is the memory "black hole": for example, a Java application in a container uses more memory than the JVM itself accounts for, such as page cache, and the pod is eventually OOM-killed in the container environment.

In the SysOM kernel-level memory profiling, you can clearly see the sources of memory consumption. Combining this with the Koordinator hybrid solution allows you to solve these memory "black hole" problems in a non-intrusive and fine-grained manner.
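The idea of breaking memory consumption down by source can be sketched with the counters that Linux cgroup v2 exposes in a container's `memory.stat` file, where `anon`, `file`, and `slab` report anonymous memory, page cache, and kernel slab memory in bytes. The example below is not SysOM. It parses a hard-coded, hypothetical sample and treats anonymous memory beyond the configured JVM heap as "outside the heap", which is a rough approximation.

```python
MIB = 1024 ** 2


def parse_memory_stat(text):
    """Parse 'key value' lines in the cgroup v2 memory.stat format into a dict."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def memory_breakdown_mib(stats, jvm_heap_bytes):
    """Split container memory into coarse sources, in MiB."""
    anon = stats.get("anon", 0)
    return {
        "jvm_heap": jvm_heap_bytes / MIB,
        # Rough approximation: anonymous memory not explained by the heap
        # (thread stacks, metaspace, direct buffers, native allocations).
        "anon_outside_heap": max(anon - jvm_heap_bytes, 0) / MIB,
        "page_cache": stats.get("file", 0) / MIB,
        "kernel_slab": stats.get("slab", 0) / MIB,
    }


# Hypothetical memory.stat excerpt, for illustration only.
SAMPLE_MEMORY_STAT = """
anon 1610612736
file 805306368
slab 104857600
"""

breakdown = memory_breakdown_mib(parse_memory_stat(SAMPLE_MEMORY_STAT), 1024 * MIB)
print(breakdown)
```

In this sample, a JVM sized for a 1 GiB heap sits in a container whose page cache and non-heap memory add well over another gigabyte, which is exactly the kind of consumption that stays invisible from inside the application.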

_25

The multi-cloud architecture is also increasingly adopted by many users. In multi-cloud scenarios, the ACK ONE multi-cluster fleet lets you manage ACK clusters, as well as user-created Kubernetes clusters connected as ACK registered clusters, in a unified manner, and provides unified monitoring dashboards and cost analysis capabilities.

In complex multi-cloud cluster management scenarios, ZEEKR uses ACK ONE to manage multiple Kubernetes clusters and unify cost analysis, reducing resource usage by 25%.

The above covers the construction of observability data links in key vertical scenarios. To further leverage the value of this data, we hope to enable users to build Ops systems with its help.

_26

The Alibaba Cloud Container FinOps suite provides powerful data-driven cost efficiency based on observability capabilities to help make cost-optimized decisions.

The FinOps suite has already helped multiple leading customers achieve cost optimizations to varying degrees, yielding excellent results.

_27

The ACK FinOps suite provides out-of-the-box multi-dimensional cost dashboards for clusters and supports fine-grained ledger analysis down to the Pod level.

It also provides cluster cost waste analysis, along with optimization and savings recommendations.

Driven by data, cost reduction and efficiency enhancement can be achieved in container scenarios.
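To show what pod-level cost allocation and waste analysis can look like in principle, the sketch below splits the cost of one node across its pods in proportion to their CPU and memory requests and reports the unrequested remainder as idle cost. The weighting scheme, names, and prices are assumptions made for illustration and do not describe the cost model of the ACK FinOps suite.

```python
def allocate_node_cost(node_cost, node_cpu, node_mem_gib, pods, cpu_weight=0.5):
    """Split `node_cost` across pods by their resource requests.

    `pods` maps a pod name to (cpu_request_cores, mem_request_gib).
    CPU and memory shares are blended with `cpu_weight` (assumed 50/50 here).
    The part of the node that no pod requested is reported under "__idle__".
    """
    mem_weight = 1 - cpu_weight
    costs = {}
    for name, (cpu, mem) in pods.items():
        share = cpu_weight * cpu / node_cpu + mem_weight * mem / node_mem_gib
        costs[name] = node_cost * share
    costs["__idle__"] = node_cost - sum(costs.values())
    return costs


# Hypothetical node: 8 cores, 32 GiB, costing 8.0 currency units per day.
pods = {
    "web": (2, 8),    # 2 cores, 8 GiB requested
    "batch": (4, 8),  # 4 cores, 8 GiB requested
}

print(allocate_node_cost(8.0, 8, 32, pods))
# {'web': 2.0, 'batch': 3.0, '__idle__': 3.0}
```

The idle share is the starting point for waste analysis: it is money spent on capacity that no workload has asked for.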

_28

With so many container observability capabilities introduced, you may be wondering where to find the specific functionality you need.

Don't worry, our container observability capabilities are also moving into the next era.

We will launch the ACK AI Assistant 2.0.

It combines the data driving force of the ACK observability system with the expert diagnosis experience accumulated by the Alibaba Cloud Container Service team.

Instead of looking for abnormal data in a complex dashboard, you can be informed about data and status anomalies directly through ChatOps. This shortens the interaction path and significantly reduces the Mean Time To Resolution (MTTR) for troubleshooting.

The AI assistant, powered by Alibaba Cloud's Qwen as a robust intelligent engine, combined with the extensive and sophisticated observability system of ACK, represents the next-generation observable functionality that I, as a technologist, have long dreamed of.
