
New Scenarios and New Capabilities: Observability Innovations in the AI-Native Era



With the explosive popularity of generative AI, AI-native applications have become a key focus for enterprises. New technologies, data types, and workflows bring complex application call-chain topologies, intricate component dependencies, and a need for deep insight into large models, which together create diverse and demanding operational requirements. Meeting these requirements calls for observability that safeguards service SLAs and a seamless end-user experience.

Observability solutions tailored to the AI-native application stack are therefore becoming increasingly important. At Apsara Conference 2024, Alibaba Cloud officially launched an AI-native full-stack observability platform, aimed at helping businesses efficiently and cost-effectively build observability systems for AI-native scenarios.

For AI infrastructure, Alibaba Cloud provides high-performance compute nodes through its "Lingjun" service, which integrates hardware and software design. A Lingjun cluster consists of Panjiu servers and a high-performance RDMA network, delivered in a group-compute node configuration. With the AI-native full-stack observability platform, monitoring data from components such as GPU, RDMA, and Nimitz is reported to Prometheus through the Pushgateway protocol. More than 1,000 IaaS-layer metrics covering CPU, memory, disk, network, and computing power can be collected and observed this way, providing both cluster-level and node-level observability for resources such as GPUs and RDMA.
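The push-based reporting path described above can be illustrated with a minimal sketch. It renders a few samples in the Prometheus text exposition format and builds the HTTP request a Pushgateway expects on its `/metrics/job/<job>/instance/<instance>` path. The metric names, label values, and gateway address are hypothetical and chosen only for illustration; they are not the names used by the Lingjun components, and the request is built but deliberately not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def render_exposition(samples):
    """Render (name, labels, value) tuples in the Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}" if label_text else f"{name} {value}")
    return "\n".join(lines) + "\n"


def build_push_request(gateway_url, job, instance, samples):
    """Build (but do not send) a PUT request that replaces one job/instance metric group."""
    base = gateway_url.rstrip("/")
    url = f"{base}/metrics/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"
    body = render_exposition(samples).encode("utf-8")
    headers = {"Content-Type": "text/plain; version=0.0.4"}
    return Request(url, data=body, method="PUT", headers=headers)


# Hypothetical node-level samples: one GPU gauge and one RDMA counter.
samples = [
    ("node_gpu_utilization_percent", {"gpu": "0"}, 87.5),
    ("node_rdma_tx_bytes_total", {"device": "mlx5_0"}, 123456),
]

# Hypothetical gateway address; pass the request to urllib.request.urlopen to actually push.
request = build_push_request("http://pushgateway.example:9091", "lingjun_node", "node-01", samples)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"), end="")
```

Grouping by `job` and `instance` means each node overwrites only its own metric group on every push, which is one simple way to keep node-level series separate while still aggregating them at the cluster level in queries.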

To further enhance enterprises' ability to manage heterogeneous resources, Alibaba Cloud offers the ACK Lingjun managed cluster and AI suite. This solution abstracts the management of heterogeneous resources into a unified system and provides a standard Kubernetes cluster environment, offering an efficient and flexible one-stop AI platform. The ACK Lingjun cluster comes with Prometheus pre-installed, enabling one-click collection of observability data for all resources and components across the entire Kubernetes cluster. On each node, an enhanced GPU-Exporter from Alibaba Cloud exposes DCGM service metrics, which are then displayed along the cluster, node, and pod dimensions to show the status of GPU resources. In addition, the AI suite supports efficient management and isolation of GPU programs through GPU shared scheduling and topology-aware scheduling, and its GPU monitoring 2.0 is built on NVIDIA DCGM. These features help enterprises observe and manage their heterogeneous computing resources more intuitively.
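To make the cluster, node, and pod dimensions concrete, here is a small sketch that parses exporter-style samples and averages GPU utilization at each level. It assumes the metric name `DCGM_FI_DEV_GPU_UTIL` and the labels `Hostname`, `namespace`, and `pod` as emitted by the open-source NVIDIA dcgm-exporter; the enhanced Alibaba Cloud exporter may use different names, and the scrape text below is made-up sample data.

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text, metric):
    """Yield (labels, value) for every sample of `metric` in exposition-format text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == metric:
            yield dict(LABEL_RE.findall(match.group("labels"))), float(match.group("value"))


def mean_by(samples, *label_names):
    """Average sample values grouped by the given label names (no names = whole cluster)."""
    buckets = defaultdict(list)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in label_names)
        buckets[key].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# Made-up scrape output for three GPUs on two nodes.
SCRAPE = """
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a",namespace="train",pod="job-1"} 90
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a",namespace="train",pod="job-1"} 70
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b",namespace="serve",pod="llm-0"} 40
"""

samples = list(parse_samples(SCRAPE, "DCGM_FI_DEV_GPU_UTIL"))
cluster = mean_by(samples)                               # one value for the whole cluster
per_node = mean_by(samples, "Hostname")                  # one value per node
per_pod = mean_by(samples, "Hostname", "namespace", "pod")  # one value per pod

print(cluster, per_node, per_pod, sep="\n")
```

In a real deployment this roll-up would normally be written as a PromQL aggregation over the same labels; the Python version only shows how the three dimensions relate to one another.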

To enable enterprises to build machine learning platforms more efficiently and conveniently, Alibaba Cloud provides the Platform for AI (PAI) at the PaaS layer. This platform covers the entire process, from data preparation to model development, training, and deployment. The AI-native full-stack observability platform enhances PAI with full-stack observability capabilities, supporting online inference metrics for EAS, resource metrics at the DLC training job, node, and LRN levels, as well as metrics for container components, nodes, and pods. It also collects and stores observability data from the underlying infrastructure's compute nodes, providing a ready-to-use, complete observability dashboard.

MaaS (Model as a Service) is a new AI-native service concept that spares enterprises the complexities of model training, deployment, and maintenance. By simply calling an API and providing the appropriate input data, users can obtain model predictions or analysis results, which lowers the barrier for businesses and individuals to apply AI technology. Alibaba Cloud uses Alibaba Cloud Model Studio to provide various model services, such as inference and fine-tuning, through standardized APIs. Alibaba Cloud Model Studio writes observability data from its large models and business gateways into the AI-native full-stack observability platform. Prometheus uses a streaming ETL tool to convert this log data into metrics in real time and distributes the metrics to each tenant, which provides comprehensive observability at the API level of Alibaba Cloud Model Studio. The collected observability data serves several use cases, including performance monitoring (latency, throughput, resource utilization), stability assessment (data quality, drift, anomaly insights), cost management (tokens, computational resources), and security and compliance (privacy, regulatory compliance).
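The log-to-metric conversion described above can be sketched as a fold over a stream of gateway log lines. This is not the streaming ETL tool itself; it is a minimal illustration that assumes a hypothetical JSON log schema (`tenant`, `model`, `status`, `latency_ms`, `input_tokens`, `output_tokens`) and made-up sample records, and it shows how latency, error, and token metrics for performance and cost monitoring could be derived per tenant.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TenantMetrics:
    requests: int = 0
    errors: int = 0
    latency_ms_sum: float = 0.0
    tokens: int = 0

    @property
    def avg_latency_ms(self):
        return self.latency_ms_sum / self.requests if self.requests else 0.0


def logs_to_metrics(log_lines):
    """Fold an iterable of JSON log lines into counters keyed by (tenant, model)."""
    metrics = defaultdict(TenantMetrics)
    for line in log_lines:
        try:
            record = json.loads(line)
            key = (record["tenant"], record["model"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip malformed lines instead of stopping the stream
        entry = metrics[key]
        entry.requests += 1
        entry.errors += int(record.get("status", 200) >= 500)
        entry.latency_ms_sum += float(record.get("latency_ms", 0))
        entry.tokens += int(record.get("input_tokens", 0)) + int(record.get("output_tokens", 0))
    return dict(metrics)


# Made-up gateway log lines in the hypothetical schema, including one malformed line.
LOGS = [
    '{"tenant": "acme", "model": "qwen-max", "status": 200, "latency_ms": 120, "input_tokens": 30, "output_tokens": 70}',
    '{"tenant": "acme", "model": "qwen-max", "status": 500, "latency_ms": 80, "input_tokens": 10, "output_tokens": 0}',
    'not json',
    '{"tenant": "globex", "model": "qwen-plus", "status": 200, "latency_ms": 50, "input_tokens": 5, "output_tokens": 15}',
]

metrics = logs_to_metrics(LOGS)
for (tenant, model), entry in sorted(metrics.items()):
    print(tenant, model, entry.requests, entry.errors, entry.avg_latency_ms, entry.tokens)
```

Because `logs_to_metrics` accepts any iterable, the same function works on a generator that tails a log source; keying by tenant is what allows the resulting metrics to be distributed to each tenant separately.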

Compared with cloud-native applications, AI-native applications aim for better model efficiency and effectiveness. Observability for AI-native applications therefore focuses on improving inference performance, optimizing the quality of model inputs and outputs, and managing resource consumption effectively. It also has to handle higher-dimensional data, especially the complexity of natural language processing, which requires semantic analysis of model outputs. Collecting and reporting data that follows LLM trace semantics in turn requires automatic instrumentation and a connection to the server side for reporting.

To better observe AI-native applications developed in Python, Alibaba Cloud has introduced a self-developed Python Agent based on the OpenTelemetry Python Agent. This agent supports mainstream frameworks and models, both domestic and international, such as LlamaIndex, LangChain, Qwen, and OpenAI. It also supports the latest OpenTelemetry LLM semantic conventions, enabling detailed instrumentation with support for passing custom attributes. This provides richer metrics, traces, and continuous profiling data, along with flexible sampling policies, fine-grained control, and dynamic configuration. It also offers multiple performance diagnostics and data visualization dashboards, significantly lowering the barrier to observability and providing a solid foundation for the stable operation and efficient maintenance of AI-native applications.
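As a rough illustration of what LLM-oriented span attributes look like, the sketch below records one model call as a span-like dictionary. It uses only the standard library and is neither the Alibaba Cloud Python Agent nor the OpenTelemetry SDK. The attribute names `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens` follow the OpenTelemetry generative AI semantic conventions as they are understood here (these conventions are still evolving), while `app.tenant`, `fake_chat`, and `FINISHED_SPANS` are hypothetical names introduced for the example.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for an exporter; a real agent would ship spans to a backend


@contextmanager
def llm_span(operation, model, **custom_attributes):
    """Record one LLM call as a span-like dict with GenAI-style attribute names."""
    span = {
        "name": f"{operation} {model}",
        "attributes": {
            "gen_ai.operation.name": operation,
            "gen_ai.request.model": model,
            **custom_attributes,  # custom attributes travel with the span
        },
    }
    start = time.perf_counter()
    try:
        yield span["attributes"]
    except Exception as exc:
        span["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        span["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(span)


def fake_chat(prompt):
    """Hypothetical model call that returns text plus token usage."""
    return {"text": prompt.upper(), "input_tokens": len(prompt.split()), "output_tokens": 3}


with llm_span("chat", "qwen-max", **{"app.tenant": "acme"}) as attributes:
    reply = fake_chat("hello observability world")
    attributes["gen_ai.usage.input_tokens"] = reply["input_tokens"]
    attributes["gen_ai.usage.output_tokens"] = reply["output_tokens"]

print(FINISHED_SPANS[-1]["name"], FINISHED_SPANS[-1]["attributes"])
```

An agent that instruments frameworks automatically would wrap calls like `fake_chat` without code changes; the manual context manager here only makes the recorded attributes visible.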

With the AI-native observability solution, Alibaba Cloud offers out-of-the-box, end-to-end, real-time observability, alerting, and diagnostic capabilities that cover both large model applications and the underlying infrastructure. This helps businesses use resources efficiently and keep their operations running reliably throughout complex digital transformation processes.

Alibaba Cloud Native
