In 2025, AI agents are moving from the lab to large-scale production. From code assistants used by developers daily to intelligent customer service in enterprise service scenarios, to multi-agent collaboration systems of ever-increasing complexity, AI agents are reshaping software development and business operations at an unprecedented pace.
However, once agents are actually running, a critical problem emerges: the actual runtime behavior of AI agents is difficult to observe, trace, and govern.
A coding agent autonomously and without authorization modifies core configuration files overnight, with no way to know what changed or why. An intelligent customer service agent autonomously issues a "cancel order" instruction, yet the decision logic, tool calling chain, and token resource consumption cannot be reviewed. A multi-agent collaborative job fails midway, and the failure node and root cause are difficult to pinpoint.
These issues point to a common requirement: AI agents need comprehensive observability. Moreover, this observability cannot remain at the shallow statistical dimension of "request success/failure" — it must deeply cover AI agent-specific runtime aspects such as LLM invocation, tool execution, multi-round inference, and memory retrieval.
Building on the OpenTelemetry (OTel) community standard and its own extensive observability practice, Alibaba Cloud has developed a complete data collection solution that covers three types of agents. On top of the OTel GenAI semantic conventions, Alibaba Cloud has also released the LoongSuite GenAI semantic conventions for observability. This article systematically introduces the design philosophy, technical implementation, and usage of the solution.
The AI agent market is thriving and highly diverse. The runtime models, deployment environments, and use cases of different agent types vary significantly, and their observability and audit needs differ accordingly. We classify mainstream AI agents on the market into three categories:
| Agent Form | Example | Runtime Characteristics |
|---|---|---|
| Coding Agent (Intelligent Code Assistant) | Claude Code, Cursor, Codex, and Qoder | Runs on the developer's local machine as a CLI or IDE plugin, operating deeply on the file system and terminal. |
| Personal General Assistant | OpenClaw, Hermes Agent, and QwenPaw | Runs as a standalone service, providing end users with multi-turn dialogue, tool calls, and intent routing. |
| Framework-based Agent | Built on frameworks such as LangChain, AgentScope, and Dify | Developers implement the business logic in mainstream programming languages, which gives high flexibility across many scenarios. Runs as a standard Python, Go, or Java application. |
No matter what form is adopted, AI agents will encounter three common problems after large-scale use:
Core design principle: Adapt the data collection capability to the native running mode of the AI Agent instead of forcing the Agent to adapt to the data collection tools.
Coding agents run on the developer's local machine, where all core behaviors — code edits, file creation, terminal command execution — happen in the local environment, completely invisible to traditional server-side agents. To address this, we built LoongSuite Pilot, a client-side data collection platform purpose-built for coding agents.
| Agent | Coverage |
|---|---|
| Claude Code | Complete event chain including user questions, tool calling (before/after), job completion, context compression, sub-agent lifecycle, and notifications |
| Codex | Session startup, user questions, tool calling (before/after), and job completion |
| Cursor | 12 types of event coverage, including session lifecycle, tool calling, questions, and sub-agents |
| Qoder (IDE) | IDE history + local database dual-channel collection |
| QoderWork | Hook logs + database + session file three-way parallel collection, the most comprehensive coverage |
3.2 Personal General-Purpose Assistant: One-Line Command for Full Observability and Audit
Personal general-purpose assistants usually run as standalone services, providing end users with dialogue and task-execution capabilities. For this type of agent, we provide a dedicated plugin that enables full tracing with a single command.
Design philosophy
Take OpenClaw as an example. Its built-in diagnostics-otel extension can export metrics and some traces, but it uses an event-driven architecture: a span is created independently for each event, with no parent-child relationships between spans and no trace context propagation. In essence, the output is a set of "standalone data points". The LoongSuite OpenClaw plugin, by contrast, is designed for complete distributed tracing: all spans share the same traceId and are linked into a call tree through explicit parent-child relationships.
Span Semantic Model
| Span name | Type | Record content |
|---|---|---|
| enter_ai_application_system | ENTRY | Request entry — who sent the message and which channel it came from |
| invoke_agent | AGENT | Agent invocation — which agent is executing and the session ID |
| react | STEP | A ReAct iteration procedure — reflection, tool selection, and model invocation |
| chat | LLM | LLM invocation — model, token consumption, and input/output messages |
| execute_tool | TOOL | Tool call: tool name, parameters, return value, and errors |
Spans of each type are linked into a complete trace tree through parent-child relationships. For a single request, O&M personnel can view the number of LLM calls, token consumption, the list of tool calls, the slowest nodes, and any errors.
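The span model above can be illustrated with a small, standard-library-only sketch. It is not the plugin's code or the OpenTelemetry SDK API: the `Span` class, the `summarize` helper, the token numbers, and the `search_orders` tool are hypothetical, and using `gen_ai.tool.name` to hold the tool name is an assumption. The sketch only shows how a shared trace ID plus explicit parent-child links let a single request be summarized.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Span:
    name: str          # span name, e.g. "chat" or "execute_tool"
    kind: str          # ENTRY / AGENT / STEP / LLM / TOOL
    trace_id: str      # shared by every span of one request
    parent: Optional["Span"] = field(default=None, repr=False)
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name, kind, attributes=None):
        """Create a child span that inherits this span's trace_id."""
        span = Span(name, kind, self.trace_id, parent=self,
                    attributes=attributes or {})
        self.children.append(span)
        return span

    def walk(self):
        """Yield this span and all of its descendants, depth first."""
        yield self
        for c in self.children:
            yield from c.walk()


def summarize(root):
    """Per-request summary: LLM call count, token total, tools used."""
    spans = list(root.walk())
    tokens = sum(
        s.attributes.get("gen_ai.usage.input_tokens", 0)
        + s.attributes.get("gen_ai.usage.output_tokens", 0)
        for s in spans
    )
    return {
        "llm_calls": sum(1 for s in spans if s.kind == "LLM"),
        "total_tokens": tokens,
        "tools": [s.attributes.get("gen_ai.tool.name")
                  for s in spans if s.kind == "TOOL"],
    }


# ENTRY -> AGENT -> STEP -> LLM / TOOL, with made-up token counts.
entry = Span("enter_ai_application_system", "ENTRY", trace_id="trace-1")
agent = entry.child("invoke_agent", "AGENT")
step1 = agent.child("react", "STEP")
step1.child("chat", "LLM", {"gen_ai.usage.input_tokens": 120,
                            "gen_ai.usage.output_tokens": 30})
step1.child("execute_tool", "TOOL", {"gen_ai.tool.name": "search_orders"})
step2 = agent.child("react", "STEP")
step2.child("chat", "LLM", {"gen_ai.usage.input_tokens": 200,
                            "gen_ai.usage.output_tokens": 50})

summary = summarize(entry)
```

Because every span hangs off one root, a flat set of independent event spans could not be summarized this way without some other correlation key.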
Essential differences from built-in observability
Compared with OpenClaw's built-in observability, the LoongSuite plugin differs in two respects:
Trace integrity. Built-in observability is typically flat: events are independent and uncorrelated. Our plugin relies on the OTel context propagation mechanism so that ENTRY → AGENT → STEP → LLM / TOOL forms a complete call tree that reconstructs the full picture of a request.
Data richness. Built-in observability often records only basic metrics such as model usage, whereas our plugin records fields such as gen_ai.input.messages, gen_ai.output.messages, gen_ai.system.instructions, gen_ai.tool.call.arguments, and gen_ai.tool.call.result in full to support in-depth auditing and troubleshooting.
The same plug-in mechanism already covers personal general-purpose assistants such as Hermes Agent and QwenPaw.
For agent applications built on frameworks such as LangChain, AgentScope, and Dify, the runtime behaves like a traditional Python application. We provide the LoongSuite Python Agent (deeply customized from OpenTelemetry Python Contrib), which achieves zero-code automatic instrumentation with a single command.
Quick start
# 1. Install the LoongSuite Python Agent
pip install loongsuite-distro
# 2. Auto-detect and install the required instrumentation libraries
loongsuite-bootstrap
# 3. Start with one command; probes are injected automatically
loongsuite-instrument \
--traces_exporter otlp \
--service_name my-agent-app \
python my_agent_app.py
loongsuite-bootstrap automatically scans the current environment for installed frameworks (such as langchain, dashscope, and mcp) and installs the corresponding instrumentation packages, so developers do not need to select and install them manually.
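The detection idea can be sketched in a few lines. This is an illustration only, not how loongsuite-bootstrap is actually implemented: the `INSTRUMENTATION_FOR` mapping and the `example-instrumentation-*` package names are hypothetical placeholders, and the sketch assumes that installed frameworks can be identified by their distribution names through `importlib.metadata`.

```python
from importlib import metadata

# Hypothetical mapping from framework distribution name to the
# instrumentation package that would be installed for it.
INSTRUMENTATION_FOR = {
    "langchain": "example-instrumentation-langchain",
    "dashscope": "example-instrumentation-dashscope",
    "mcp": "example-instrumentation-mcp",
}


def is_installed(dist_name):
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def plan_instrumentations(mapping=INSTRUMENTATION_FOR, installed=is_installed):
    """List the instrumentation packages needed for installed frameworks."""
    return sorted(pkg for framework, pkg in mapping.items()
                  if installed(framework))


# Simulate an environment that has langchain and mcp but not dashscope.
fake_env = {"langchain", "mcp"}
plan = plan_instrumentations(installed=lambda name: name in fake_env)
```

Passing the `installed` check as a parameter keeps the planning step testable without touching the real environment.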
Framework Coverage
The LoongSuite Python Agent currently ships 16 instrumentation libraries, covering the mainstream AI agent development frameworks:
| Instrumentation | Supported frameworks | Type |
|---|---|---|
| LangChain / LangGraph | langchain_core >= 0.1.0, langgraph >= 0.2 | Trace |
| AgentScope | agentscope >= 1.0.0 | Trace + Metrics |
| Dify | dify | Trace |
| MCP Client | mcp >= 1.3.0 | Trace |
| OpenAI / OpenAI Agents | openai >= 1.26.0 | Trace + Metrics |
| Claude Agent SDK | claude-agent-sdk >= 0.1.0 | Trace |
| Google ADK | google-adk >= 0.1.0 | Trace |
| CrewAI | crewai >= 0.80.0 | Trace |
| Qwen-Agent | qwen-agent >= 0.0.20 | Trace |
| QwenPaw (CoPaw) | qwenpaw >= 1.1.0 | Trace |
| Hermes Agent | openai >= 1.0.0 | Trace + Metrics |
| Agno | agno | Trace |
| LiteLLM | litellm >= 1.0.0 | Trace |
| DashScope | dashscope >= 1.0.0 | Trace |
| Mem0 | mem0ai >= 1.0.0 | Trace |
| Vertex AI | google-cloud-aiplatform >= 1.64 | Trace |
Automatically Recognized Span Types
The probe automatically detects and generates multiple GenAI span types, covering the entire agent lifecycle:
Once the collection capabilities above are connected, users get observability views along the following dimensions. Taking Claude Code as an example, to enable agent observability you only need to log in to the CloudMonitor 2.0 console, click the corresponding card in the access center, and follow the steps; installation and onboarding complete with a single command.

The complete execution process of the agent is presented as a trace tree, from the user request entry (ENTRY) to the agent decision (AGENT), inference step (STEP), LLM call (LLM), and tool execution (TOOL), so the hierarchy is clear at a glance. For complex tasks with multiple ReAct rounds, you can use the Step span to quickly locate the iteration that has a problem, and then open the LLM or Tool span within that round to analyze the root cause.
Troubleshooting pattern: When an agent executes a 10-round ReAct process, first use the Step spans to identify the round in which the problem occurred, and then analyze the specific step within that round. This top-down approach greatly improves fault-locating efficiency for complex agents.
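The top-down pattern can be sketched over a flat list of exported spans. The span dictionaries, the `locate_failure` helper, and the field names `span_id`, `parent_id`, and `status` are hypothetical, and the sketch assumes that a failing child marks its parent Step span as an error.

```python
def locate_failure(spans):
    """Return (round_number, failing_child_span) for the first failed round.

    Step spans are assumed to appear in execution order and to carry an
    ERROR status when one of their children failed.
    """
    steps = [s for s in spans if s["name"] == "react"]
    for round_no, step in enumerate(steps, start=1):
        if step["status"] != "ERROR":
            continue
        # Second stage: inspect only the children of the failing round.
        for s in spans:
            if s["parent_id"] == step["span_id"] and s["status"] == "ERROR":
                return round_no, s
        return round_no, None
    return None


# Ten ReAct rounds; the tool call in round 7 fails (made-up data).
spans = []
for i in range(1, 11):
    failed = i == 7
    step_id = f"step-{i}"
    spans.append({"span_id": step_id, "parent_id": "agent", "name": "react",
                  "status": "ERROR" if failed else "OK"})
    spans.append({"span_id": f"llm-{i}", "parent_id": step_id, "name": "chat",
                  "status": "OK"})
    spans.append({"span_id": f"tool-{i}", "parent_id": step_id,
                  "name": "execute_tool",
                  "status": "ERROR" if failed else "OK"})

round_no, culprit = locate_failure(spans)
```

Only the Step spans are scanned in the first pass, and only the children of the failing round in the second.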

Based on gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens, as well as cost fields extended by Alibaba Cloud (input_cost, output_cost, and total_cost), you can:
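One such analysis is per-call cost attribution. The sketch below derives the three cost fields from token counts; the unit prices in `PRICE_PER_1K` are invented for illustration, and the `with_cost` helper is hypothetical rather than part of any LoongSuite API.

```python
from decimal import Decimal

# Hypothetical unit prices per 1,000 tokens; real prices depend on the model.
PRICE_PER_1K = {"input": Decimal("0.002"), "output": Decimal("0.006")}


def with_cost(attrs):
    """Return a copy of the span attributes with total tokens and costs."""
    tokens_in = attrs["gen_ai.usage.input_tokens"]
    tokens_out = attrs["gen_ai.usage.output_tokens"]
    input_cost = PRICE_PER_1K["input"] * tokens_in / 1000
    output_cost = PRICE_PER_1K["output"] * tokens_out / 1000
    return {
        **attrs,
        "gen_ai.usage.total_tokens": tokens_in + tokens_out,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


calls = [
    {"gen_ai.usage.input_tokens": 1500, "gen_ai.usage.output_tokens": 500},
    {"gen_ai.usage.input_tokens": 500, "gen_ai.usage.output_tokens": 1000},
]
costed = [with_cost(c) for c in calls]
request_cost = sum(c["total_cost"] for c in costed)
```

`Decimal` is used instead of floats so that small per-call amounts add up without rounding drift.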

Together, gen_ai.session.id, gen_ai.turn.id, and gen_ai.step.id form a three-level identification system that enables:
Full conversation traceability across multiple rounds of conversation
Step-level fine-grained analysis in a single-round dialogue
Session path analysis and user behavior insights
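As a minimal sketch of how the three identifiers nest, spans can be grouped by session, then turn, then step. The span records and ID values below are made up, and `group_by_session_turn_step` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_session_turn_step(spans):
    """Nest span names as session -> turn -> step using the three IDs."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for s in spans:
        session = s["gen_ai.session.id"]
        turn = s["gen_ai.turn.id"]
        step = s["gen_ai.step.id"]
        tree[session][turn][step].append(s["name"])
    return tree


def span(session, turn, step, name):
    return {"gen_ai.session.id": session, "gen_ai.turn.id": turn,
            "gen_ai.step.id": step, "name": name}


spans = [
    span("s1", "t1", "st1", "chat"),
    span("s1", "t1", "st1", "execute_tool"),
    span("s1", "t1", "st2", "chat"),
    span("s1", "t2", "st1", "chat"),
    span("s2", "t1", "st1", "chat"),
]
tree = group_by_session_turn_step(spans)
turns_in_s1 = sorted(tree["s1"])
```

The same grouping supports both cross-turn tracing (all turns of `s1`) and step-level analysis within one turn.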

The tools an agent calls, the arguments passed, the results returned, and the call duration are all recorded. For a coding agent, this means that every file read or write and every command execution is documented. For MCP protocol calls, complete request-response auditing is also provided.
Behavior Analysis Dashboard
The count cards at the top break tool calls down by behavior type (command execution, file reads and writes, search, web browsing, and MCP calls) and highlight categories with abnormally high call volume in red or orange, giving a quick snapshot of the overall behavior mix. The right side shows the number of active sessions and users, making it easy to relate behavior volume to usage scale. The session statistics table below is broken out by session and records the number of calls in each behavior dimension, so you can locate the sessions and users where high-frequency operations are concentrated.
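The per-session breakdown can be approximated with a small classifier. The tool-to-behavior mapping in `BEHAVIOR_OF`, the `mcp__` name prefix used to recognize MCP tools, and the sample tool names are all assumptions made for illustration; the dashboard's actual classification rules are not described here.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from tool name to behavior type.
BEHAVIOR_OF = {
    "Bash": "command_execution",
    "Read": "file_read_write",
    "Write": "file_read_write",
    "Grep": "search",
    "WebFetch": "web_browsing",
}


def classify(tool_name):
    """Map a tool name to a behavior type; MCP tools are prefix-matched."""
    if tool_name.startswith("mcp__"):
        return "mcp_call"
    return BEHAVIOR_OF.get(tool_name, "other")


def session_stats(calls):
    """Count calls per behavior type for each session."""
    stats = defaultdict(Counter)
    for session_id, tool_name in calls:
        stats[session_id][classify(tool_name)] += 1
    return stats


calls = [
    ("s1", "Bash"), ("s1", "Read"), ("s1", "Write"),
    ("s1", "mcp__github__create_issue"),
    ("s2", "Grep"), ("s2", "Bash"), ("s2", "Bash"),
]
stats = session_stats(calls)
```

Each row of the resulting structure corresponds to one row of the session statistics table.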

Tool Call Distribution
The tool call distribution page presents tool usage from two perspectives. The pie chart on the left shows the share of each tool type across all tool calls (such as Read, Write, Bash, and TodoWrite), helping the team understand which tool capabilities the agent relies on most. The pie chart on the right shows the distribution of MCP tool calls separately, revealing which external capabilities are called most often in cross-system integration. The trend chart below plots the number of calls for each tool type over time, making it easy to spot shifts in call patterns. For example, a surge in Bash calls on a certain day may indicate batch script tasks or abnormal behavior.
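A surge of the kind described above can be flagged with a simple baseline comparison. This is one possible heuristic, not the dashboard's method: the median baseline, the `factor` threshold of 3, and the daily counts are all assumptions.

```python
from statistics import median


def find_surges(daily_counts, factor=3.0):
    """Return the days whose call count exceeds factor x the median."""
    baseline = median(daily_counts.values())
    return [day for day, n in daily_counts.items() if n > factor * baseline]


# Made-up daily Bash call counts for one team.
bash_calls = {"day-1": 40, "day-2": 38, "day-3": 45, "day-4": 310, "day-5": 42}
surge_days = find_surges(bash_calls)
```

The median is used rather than the mean so that the surge itself does not inflate the baseline.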

Security Audit Overview
The Overview page condenses the security posture of AI agents into a single-screen risk snapshot, based on counts of high-risk operations across multiple dimensions within a specified time window. The funnel on the left narrows step by step from all sessions to sessions with security risks, showing the proportion of the risk surface at a glance. On the right, metrics such as high-risk command execution, outbound web requests, outbound command-line requests, sensitive file access, and prompt injection are displayed side by side. With the comparison data, the security team can quickly judge whether the current risk level is abnormal without digging into details.
Particularly noteworthy is the count of high-risk operations that follow a prompt injection event. Ordinary high-risk operations may stem from legitimate task requirements, whereas high-risk behavior triggered by injection is a strong threat signal: it means the injected malicious instructions have actually driven the agent to act. Even allowing for false positives, such signals should trigger the highest level of manual review rather than wait for further confirmation. The “number of tool-calling sessions following prompt injection” is therefore the highest-confidence Indicator of Compromise (IoC) in the entire overview. Three such sessions often deserve higher priority than hundreds of ordinary high-risk commands.
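The ordering that matters for this indicator (a tool call that happens after an injection hit in the same session) can be sketched as follows. The event tuples, the event type names `prompt_injection` and `tool_call`, and the helper name are hypothetical.

```python
def sessions_with_tool_calls_after_injection(events):
    """Return sessions in which a tool call follows a prompt injection hit.

    events: iterable of (session_id, timestamp, event_type) tuples.
    """
    first_injection = {}
    flagged = set()
    for session_id, ts, event_type in sorted(events, key=lambda e: e[1]):
        if event_type == "prompt_injection":
            first_injection.setdefault(session_id, ts)
        elif event_type == "tool_call" and session_id in first_injection:
            flagged.add(session_id)
    return flagged


events = [
    ("s1", 1, "tool_call"), ("s1", 2, "prompt_injection"), ("s1", 3, "tool_call"),
    ("s2", 4, "tool_call"), ("s2", 5, "tool_call"),          # no injection
    ("s3", 6, "prompt_injection"),                           # no tool call after
    ("s4", 7, "tool_call"), ("s4", 8, "prompt_injection"),   # tool call before
]
ioc_sessions = sessions_with_tool_calls_after_injection(events)
```

Sessions where the tool call precedes the injection, or where no tool call follows it, are deliberately not counted.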

High-Risk Session Tracing
The page provides a two-stage drill-down. The upper layer is a risk score table for high-risk sessions: it aggregates the risk counts of each dimension (injection hits, high-risk operations, sensitive file accesses, and outbound information) by session and sorts sessions by overall risk score, surfacing those that most need manual intervention. Instead of screening logs one by one, the security team starts tracing from the highest-risk session, which greatly shortens the window from discovery to response.
The lower layer is a high-risk event summary table that drills down to individual events: the exact time, user, session, event type, tool name involved, threat type, and complete context content, giving security analysts the original evidence needed for a final determination.
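The session ranking in the upper table could be produced by a weighted sum over the per-dimension counts. The article does not state the scoring formula, so the weights in `WEIGHTS`, the dimension keys, and the sample counts below are assumptions chosen only to show the mechanics.

```python
# Hypothetical weights; injection hits are weighted highest.
WEIGHTS = {
    "injection_hits": 10,
    "high_risk_operations": 3,
    "sensitive_file_accesses": 2,
    "outbound_requests": 1,
}


def risk_score(counts):
    """Weighted sum of a session's per-dimension risk counts."""
    return sum(weight * counts.get(dim, 0) for dim, weight in WEIGHTS.items())


def rank_sessions(counts_by_session):
    """Return (session_id, score) pairs, highest score first."""
    scored = [(sid, risk_score(c)) for sid, c in counts_by_session.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


counts_by_session = {
    "a": {"injection_hits": 2, "high_risk_operations": 1},
    "b": {"high_risk_operations": 5, "sensitive_file_accesses": 1},
    "c": {"outbound_requests": 4},
}
ranking = rank_sessions(counts_by_session)
```

Sorting on the negated score with the session ID as a tiebreaker keeps the order deterministic when two sessions score the same.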

The data capabilities of Alibaba Cloud's AI agent observability system are built on the self-developed LoongSuite GenAI Observability Semantic conventions. This specification builds on the community's OTel GenAI standard and fills semantic gaps that arise in real business scenarios.
As early as the beginning of 2024, OpenTelemetry started driving GenAI semantics specification development, aiming to establish a unified observability data language. Community standards have laid an important foundation:
However, community standards inherently need to balance broad applicability with long-term stability, resulting in a relatively cautious pace of evolution. The current OTel GenAI semantic conventions are still in Development status, and many new concepts and scenarios are still being absorbed and converged.
In practice at Alibaba and Ant Group, we encountered many more complex and granular real-world scenarios. For example, a seemingly simple scenario of "ordering milk tea with Qwen" actually involves cross-domain coordination among multiple business systems, including Qwen Agent, Flash Sale Agent, Amap Agent, and Alipay Agent. These scenarios place higher demands on semantic expressiveness.
To this end, based on the OTel GenAI community standard and drawing from extensive internal hands-on experience, we released the LoongSuite GenAI Observability Semantic conventions. In 2026, the specification was officially open-sourced as a vendor extension standard for OTel GenAI, with plans to gradually contribute optimization capabilities upstream to the community.
Extension 1: Entry Span and Step Span — Making Complex Agent Call Chains Readable
Problem background: When an agent executes a long-running job, a single trace may contain hundreds or even thousands of spans. The native standard cannot distinguish business levels, making call chains cluttered and difficult to analyze.
Semantic Modeling:
These semantic conventions have been implemented in multiple scenarios, including OpenClaw, QwenPaw, and Hermes Agent.
Extension 2: Skill Semantics — Making Business Function Domains Observable
Background: In scenarios such as e-commerce shopping assistants, commands are routed to the corresponding Skill after the agent understands the intent. Existing semantic conventions lack an abstraction of the business function aggregation layer of Skill.
Semantic Modeling: A gen_ai.skill.* attribute family is added:
| Property | Type | Description |
|---|---|---|
| gen_ai.skill.name | string | Skill name (e.g., add_to_cart) |
| gen_ai.skill.id | string | Skill instance identifier to distinguish canary/A/B experiments |
| gen_ai.skill.description | string | Feature description |
| gen_ai.skill.version | string | Version number |
At the current stage, these attributes are attached to the execute_tool span so that they could be adopted quickly. In parallel, we have implemented a standalone invoke_skill span and submitted a proposal to the OTel community (#3540).
Downstream value: An observability platform can aggregate and analyze data by functional domain to quickly identify "which Skill has the highest error rate", compare "whether latency regressed after a new Skill version launched", and measure "the proportion of Skill execution time spent on LLM calls".
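The first of these questions, which Skill has the highest error rate, reduces to a group-by on gen_ai.skill.name. The sketch below assumes the attribute is attached to execute_tool spans as described; the span records, the `status` field, and the `checkout` skill are made up for illustration.

```python
from collections import Counter


def error_rate_by_skill(spans):
    """Compute the error rate per skill from spans tagged with a skill name."""
    totals, errors = Counter(), Counter()
    for s in spans:
        skill = s["attributes"].get("gen_ai.skill.name")
        if skill is None:
            continue  # span is not associated with any skill
        totals[skill] += 1
        if s["status"] == "ERROR":
            errors[skill] += 1
    return {skill: errors[skill] / totals[skill] for skill in totals}


def tool_span(skill, status):
    attrs = {"gen_ai.skill.name": skill} if skill else {}
    return {"name": "execute_tool", "status": status, "attributes": attrs}


spans = [
    tool_span("add_to_cart", "OK"), tool_span("add_to_cart", "OK"),
    tool_span("add_to_cart", "OK"), tool_span("add_to_cart", "ERROR"),
    tool_span("checkout", "OK"), tool_span("checkout", "ERROR"),
    tool_span(None, "ERROR"),  # untagged span, ignored
]
rates = error_rate_by_skill(spans)
worst_skill = max(rates, key=rates.get)
```

The same grouping key could be combined with gen_ai.skill.version to compare latency or error rate across versions of one Skill.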
The value of semantic conventions lies not only in documents, but also in engineering implementation. We implemented GenAI Utils in the probe as an engineering capability layer for the LoongSuite SemConv:
Supported Invocation types include LLMInvocation, InvokeAgentInvocation, CreateAgentInvocation, ExecuteToolInvocation, EmbeddingInvocation, RetrieveInvocation, RerankInvocation, and MemoryInvocation, covering the entire lifecycle of GenAI.
GenAI Utils is available for Python, Node.js, and Go, and a Java version will be released soon. The Python and Node.js versions are already open source, and the rest will be open-sourced in turn.
The Alibaba Cloud Agent observability and audit solution is applicable to the following scenarios:
| Role | Scenarios | Issues addressed |
|---|---|---|
| Enterprise Security Administrator | Audit the operation behavior of coding agents | Track the agent's read and write operations and command executions on the codebase to prevent data breaches and unauthorized operations |
| R&D Effectiveness Team | Monitor team AI-assisted development efficiency | Analyze agent usage frequency, code adoption rate, and task completion time, and evaluate the AI input-output ratio |
| FinOps / Cost Administrator | Manage the token cost of large models | Split token consumption by project / team / individual, identify cost anomalies, and develop budget strategies |
| AI Application Developer | Debug agent applications built on LangChain / Dify / AgentScope | Locate agent decision errors, tool call failures, and LLM output exceptions by using the trace tree |
| Platform Operations Staff | Ensure the stability of agent services | Monitor agent call chain latency, error rate, and the health of dependent services |
| Compliance Auditors | Meet internal enterprise compliance requirements | Provide complete agent operation logs and audit trails |
| Agent Product Team | Optimize the Personal General Assistant product experience | Analyze user session paths, tool usage preferences, and churn nodes to guide product iteration |
The spread of AI agents has greatly improved productivity at work, and it has also raised new requirements for observability, auditability, and governance. Unlike traditional microservices and web applications, AI agents introduce new runtime patterns such as LLM calls, tool execution, and multi-turn reasoning, which call for dedicated data collection and semantic standards.
The Alibaba Cloud LoongSuite solution provides full coverage for the following types of mainstream agents:
More importantly, the LoongSuite GenAI Observability Semantic conventions, built on the OTel GenAI semantic conventions, are open source. Key semantic extensions such as Entry Span, Step Span, and Skill semantics fill the gaps that community standards leave in real business scenarios. Together with the GenAI Utils engineering layer, this ensures consistent implementation of the standard and efficient iteration.
The ultimate goal of unified semantic conventions is not to produce a single document, but to enable all users and vendors who adopt the specification to see, analyze, govern, and evolve rapidly growing GenAI applications.