The Second Half of the Enterprise Agent Era: How to Make Agents Smarter the More They Are Used?

This article introduces AgentLoop, Alibaba Cloud's one-stop platform that enables enterprise AI agents to continuously self-evolve through full-stack observability and automated evaluation.

Agent evolution typically shows up in two scenarios. The first is personal productivity agents (e.g., coding or general-purpose assistants), which use memory and user profiling to improve with usage. Anthropic’s Economic Index notes that long-term Claude users see a 3–5% higher success rate than new users. The second is enterprise business agents (e.g., customer service or internal data analysts). Unlike personal agents, enterprise agents often remain stuck in manual monitoring and optimization loops and fail to accumulate institutional knowledge effectively. This article focuses on the latter.

I. The Current State of Enterprises Manually Crafting Agent Evolution Flywheels

The evolution flywheel is typically divided into four steps: data collection, dataset construction, performance evaluation, and evolutionary asset accumulation. The flywheel pipeline is similar for models and Agents, but far more factors influence Agent behavior.

A model task is a single model call: one input and one output. An Agent task, in contrast, forms a topological graph. In addition to model calls, it involves retrievals, planning, tool calls, browser interactions, intermediate states, reflections and decisions, rollbacks, and even multiple parallel subtasks.

Because more factors affect Agent behavior, the evolution flywheel raises new engineering challenges that the earlier LLM-as-Judge paradigm struggles to handle.

Challenge 1: Data Collection – From Tuples to Topologies

The LLM-as-Judge paradigm collects a (prompt, completion) tuple; the schema is clean, and storing logs is enough. Assessing Agent behavior requires collecting a trajectory (execution path), in which the input and output shape differs at every step: retrieval returns a list of chunks, tools return structured JSON, browsers return DOM fragments, and models return token streams. Stringing these heterogeneous events together in chronological and causal order, without losing intermediate states or parent-child call relationships and while also recording token usage, latency, and error codes, leads to storage and tracking costs that are dozens of times higher than those of LLM-as-Judge. In addition, OpenTelemetry's GenAI semantic conventions (semconv) are still in the draft stage and there is currently no de facto standard, so enterprises are basically reinventing the wheel.
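To make the contrast with a flat (prompt, completion) tuple concrete, here is a minimal Python sketch of a trajectory stored as a tree of spans. The `Span` and `Trajectory` classes and all field names are hypothetical illustrations; they are not AgentLoop's schema and not the OpenTelemetry GenAI conventions.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Span:
    """One step of an agent run. All field names are illustrative."""
    span_id: str
    parent_id: Optional[str]      # None for the root step
    kind: str                     # e.g. "plan", "retrieval", "tool", "llm"
    start_ms: int
    end_ms: int
    payload: Any = None           # chunks, JSON, DOM fragment, or tokens
    tokens: int = 0
    error: Optional[str] = None


@dataclass
class Trajectory:
    task_id: str
    spans: List[Span] = field(default_factory=list)

    def children(self, span_id: Optional[str]) -> List[Span]:
        """Direct children of a span, in chronological order."""
        kids = [s for s in self.spans if s.parent_id == span_id]
        return sorted(kids, key=lambda s: s.start_ms)

    def render(self, span_id: Optional[str] = None, depth: int = 0) -> List[str]:
        """Flatten the call tree into indented lines, keeping parent-child links."""
        lines: List[str] = []
        for s in self.children(span_id):
            lines.append(f"{'  ' * depth}{s.kind} ({s.end_ms - s.start_ms} ms)")
            lines.extend(self.render(s.span_id, depth + 1))
        return lines

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)


# Spans arrive out of order on purpose; the tree is rebuilt from parent_id.
traj = Trajectory("task-1", [
    Span("a", None, "plan", 0, 900, tokens=120),
    Span("c", "a", "tool", 400, 850, payload={"status": "ok"}),
    Span("b", "a", "retrieval", 50, 300, payload=["chunk-1", "chunk-2"]),
    Span("d", "c", "llm", 450, 800, payload="token stream", tokens=310),
])
print("\n".join(traj.render()))
print("tokens:", traj.total_tokens())
```

The untyped `payload` field is where the heterogeneity lives: each span kind carries a different shape, while the shared envelope (IDs, timing, tokens, error) is what allows the events to be ordered and linked.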

Challenge 2: Dataset Construction – The Ambiguity of "Good" Trajectories

Under LLM-as-Judge, dataset construction means selecting prompt-completion pairs from logs and filtering them by token length, confidence, and human feedback. A trajectory (execution path) contains far more. Taking a coding Agent as an example, it includes:

Planning: How it breaks down tasks into sub-goals
Retrieval: Which files were grepped, and which keywords were searched
Tool Calling: Input parameters, output parameters, and time cost for each git / grep / test run
Intermediate States: After executing each step, what updated understanding of the task it has
Reflection / Decision Branches: At which step did it change its mind, and why
Model Calling: The prompt, response, and token consumption of each LLM call
Final Output: The submitted diff

Strung together, this entire sequence is the Trajectory of the task.

However, whether a given trajectory is a good sample is very difficult to define manually. For example, the final result may be correct even though the Agent took three incorrect steps along the way. Or the final result may be incorrect while the first 5 reasoning steps are correct; should those 5 steps be extracted separately as training signals? Furthermore, the trajectory contains real business data returned at runtime (orders, customer names, internal interface responses). Sanitization is not as simple as string replacement; structured sanitization is required before the data can enter the dataset.
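As a rough illustration of what structured sanitization could mean, the Python sketch below walks a nested tool response and masks values by field name instead of running a string replace over the serialized text. The key list, the salt, and the function names are assumptions made for this example only.

```python
import hashlib
from typing import Any

# Hypothetical set of sensitive keys; a real deployment would derive this
# from a data-classification policy rather than a hard-coded list.
SENSITIVE_KEYS = {"customer_name", "phone", "email", "order_id"}


def pseudonym(value: Any, salt: str = "demo-salt") -> str:
    """Replace a value with a stable placeholder derived from a salted hash."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"<redacted:{digest[:8]}>"


def sanitize(node: Any) -> Any:
    """Walk dicts and lists recursively, masking values under sensitive keys."""
    if isinstance(node, dict):
        return {
            key: pseudonym(val) if key in SENSITIVE_KEYS else sanitize(val)
            for key, val in node.items()
        }
    if isinstance(node, list):
        return [sanitize(item) for item in node]
    return node  # scalars under non-sensitive keys pass through unchanged


tool_output = {
    "status": "ok",
    "orders": [
        {"order_id": "A-1001", "customer_name": "Alice", "amount": 42.5},
        {"order_id": "A-1002", "customer_name": "Alice", "amount": 13.0},
    ],
}
clean = sanitize(tool_output)
print(clean)
```

Because the placeholder is deterministic, the same customer maps to the same token across steps, so relationships inside a trajectory survive sanitization. Free-text fields such as model responses would still need a separate detection pass, which this sketch does not attempt.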

Challenge 3: Evaluation – Beyond Single-Point Scoring

For LLM-as-Judge, scoring targets a single point. In the Agent era, evaluation needs to be conducted on three levels: step level (whether the tool call in each step is correct), trajectory level (whether the entire path is reasonable, without detours, rollbacks, or infinite loops), and outcome level (whether the final deliverable meets the requirements).

The conclusions across these three levels may be completely inconsistent.
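The following Python sketch shows how the three levels can disagree on one and the same run. The scoring rules are deliberately simplistic placeholders chosen for illustration (success ratio, repeated-call penalty, exact match); they are not AgentLoop's evaluators.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    tool: str
    args: str
    ok: bool  # did this individual call succeed?


def step_score(steps: List[Step]) -> float:
    """Step level: fraction of individual calls that succeeded."""
    return sum(s.ok for s in steps) / len(steps) if steps else 0.0


def trajectory_score(steps: List[Step]) -> float:
    """Trajectory level: penalize repeating an identical call (loop or detour)."""
    if not steps:
        return 0.0
    unique = {(s.tool, s.args) for s in steps}
    return len(unique) / len(steps)


def outcome_score(final_answer: str, expected: str) -> float:
    """Outcome level: exact match, a stand-in for a real acceptance check."""
    return 1.0 if final_answer.strip() == expected.strip() else 0.0


def evaluate(steps: List[Step], final_answer: str, expected: str) -> Dict[str, float]:
    return {
        "step": step_score(steps),
        "trajectory": trajectory_score(steps),
        "outcome": outcome_score(final_answer, expected),
    }


# Every call succeeds and the answer is right, yet the path loops on one grep.
run = [
    Step("grep", "foo", True),
    Step("grep", "foo", True),
    Step("grep", "foo", True),
    Step("run_tests", "", True),
]
scores = evaluate(run, "patch applied", "patch applied")
print(scores)
```

Here the step and outcome levels both report a perfect run while the trajectory level flags the repeated call, which is why a single aggregate score can hide the problem.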

Challenge 4: Asset Accumulation – Lack of Standardization

The asset forms of models are very clear: SFT data, DPO pairs, and LoRA weights. The industry has reached consensus, and the toolchains are mature.

The asset forms in the Agent era are still diverging. Trajectories can flow back into prompt improvements, be structured into a few-shot experience library, be turned into episodic memory, or be extracted into reusable skills or sub-processes. Each form digests trajectories differently, and none of them has a unified container like model weights. As a result, even if an enterprise completes the first three steps, the form in which the final-step assets are stored, where they live, and who consumes them often remain open questions.

Therefore, although Agents have gone live and serve more and more users, the evolutionary assets an enterprise owns may not have grown at all. This is the true state of enterprise Agent evolution.

II. Alibaba Cloud's AgentLoop Practice

AgentLoop is a one-stop self-evolution platform for enterprise-level Agents launched by Alibaba Cloud. It provides core capabilities such as full-stack Agent observation and auditing, Agent evaluation and experimentation, and Agent asset management and continuous optimization, helping enterprises build an Agent evolution data flywheel.

To address the pain points of building an evolution flywheel for enterprise Agents, AgentLoop's solution is:

Core Component 1: Full-Stack Observability Analysis: Complete Trajectory Execution Paths

Through LoongSuite's open-source auto-instrumentation framework, AgentLoop upgrades the collection object from a tuple to a complete Trajectory.

LoongSuite integrates three levels of semantic specifications: the OTel GenAI community standard (including STEP / MCP span extensions contributed by Alibaba), the AgentLoop product-side data contract, and the collection layer's own extensions (exclusive fields for session / turn / step / cost), covering a total of 55 GenAI semantic fields. In a line-by-line comparison with third-party source code, LoongSuite's effective field coverage is 84%, while the highest competing product is only 51%.

The Trajectory collected by LoongSuite provides four types of cross-verified diagnostic views: Call Tree (drilling down step by step into each span's share of latency), Reasoning Trajectory (restoring the ReAct thought-tool-observation sequence to detect invalid loops), Timeline (distinguishing serial from parallel execution and blocking waits), and Tracing Topology Map (restoring global call relationships).

By correlating data across these four diagnostic views, a 23-second latency issue can be pinpointed to specific redundant LLM loop calls.
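A toy Python version of that kind of diagnosis is sketched below: one function computes each span kind's share of latency (the Call Tree idea), another flags identical calls that repeat (the invalid-loop idea). The span tuples and their durations are invented for illustration and only happen to add up to 23 seconds; they are not taken from AgentLoop.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (kind, call signature, duration in ms); values are made up for this sketch.
SpanRow = Tuple[str, str, int]


def latency_share(spans: List[SpanRow]) -> Dict[str, float]:
    """Share of total latency attributed to each span kind."""
    total = sum(duration for _, _, duration in spans)
    if total == 0:
        return {}
    by_kind: Counter = Counter()
    for kind, _, duration in spans:
        by_kind[kind] += duration
    return {kind: ms / total for kind, ms in by_kind.items()}


def redundant_calls(spans: List[SpanRow], min_repeats: int = 3) -> Dict[Tuple[str, str], int]:
    """Identical calls repeated at least min_repeats times, with the
    milliseconds spent on every repeat after the first."""
    groups: Dict[Tuple[str, str], List[int]] = {}
    for kind, signature, duration in spans:
        groups.setdefault((kind, signature), []).append(duration)
    return {
        key: sum(durations[1:])
        for key, durations in groups.items()
        if len(durations) >= min_repeats
    }


spans = [
    ("retrieval", "refund policy", 1000),
    ("tool", "get_order(42)", 2000),
    ("llm", "summarize order 42", 5000),
    ("llm", "summarize order 42", 5000),
    ("llm", "summarize order 42", 5000),
    ("llm", "summarize order 42", 5000),
]
share = latency_share(spans)
wasted = redundant_calls(spans)
print(share)
print(wasted)
```

In this made-up trace the LLM spans account for 20 of the 23 seconds, and 15 of those come from repeating the same call, which is the sort of conclusion the combined views are meant to surface.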

Core Component 2: Agent Ontology & Automated Dataset Pipelines

Having Trajectories alone is not enough. Without further processing, the collected observation data remains isolated metadata, that is, individual spans with no connection to one another.

On top of Trajectory, AgentLoop therefore does a second thing: based on UModel, it builds a topology of Agent entity relationships, called Agent Ontology. Its function is to turn the collected observation data into a graph: automatically discovering the entity relationship topology among Agent → Tool → Model, breaking down data silos, and enabling deterministic correlation and reasoning analysis.

With Agent Ontology, every Trajectory becomes a relational map with a topological structure. It shows which Agent called which tools, which model was called behind which tool, which step was a key decision node, and which step was merely auxiliary. Operations and algorithm teams can look at problems from an Agent perspective instead of searching for a needle in a haystack of flat logs.
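The idea of lifting spans into an entity graph can be sketched in a few lines of Python. This is only an illustration of the concept: the tuple layout, the entity names, and both functions are hypothetical, and UModel's actual data model is not shown in this article.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# Each span: (span_id, parent_id, entity_type, entity_name). Names are made up.
SpanRow = Tuple[str, Optional[str], str, str]


def build_topology(spans: List[SpanRow]) -> Dict[str, Set[str]]:
    """Collapse span parent-child links into entity-level edges."""
    entity_of = {sid: f"{etype}:{name}" for sid, _, etype, name in spans}
    edges: Dict[str, Set[str]] = defaultdict(set)
    for sid, parent, _, _ in spans:
        if parent in entity_of and entity_of[parent] != entity_of[sid]:
            edges[entity_of[parent]].add(entity_of[sid])
    return dict(edges)


def reachable(edges: Dict[str, Set[str]], start: str, prefix: str) -> Set[str]:
    """Entities of one type (by prefix) reachable from a starting entity."""
    seen: Set[str] = set()
    found: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
                if nxt.startswith(prefix):
                    found.add(nxt)
    return found


spans = [
    ("1", None, "agent", "support-bot"),
    ("2", "1", "tool", "order_lookup"),
    ("3", "2", "model", "model-a"),
    ("4", "1", "tool", "refund_policy_search"),
    ("5", "1", "tool", "order_lookup"),  # second call, same entity
]
edges = build_topology(spans)
print(edges)
print(reachable(edges, "agent:support-bot", "model:"))
```

Once spans are collapsed this way, a question like "which models does this Agent ultimately depend on?" becomes a graph traversal rather than a log search.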

On top of the Ontology, AgentLoop adds an automated Pipeline: Trace2Dataset. Its logic is as follows: full online production runtime data (Trajectories) passes through an orchestrated Pipeline of data source access → data dimensionality reduction (filtering / deduplication / sampling) → feature extraction (intent / difficulty / scenario tags) → AI review and rewriting → writing to target datasets. In this way it automatically builds Golden Datasets (high-quality classic samples) and BadCase Datasets (typical failure cases).

Overall, the Pipeline can save over 90% of Token consumption and time costs.
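The stages of such a pipeline can be sketched as plain Python functions. Everything below is an assumption made for illustration: the trace fields, the tagging rules, and the score thresholds are invented, and the AI review and rewriting stage is left out because it would require a model call.

```python
from typing import Dict, List

Trace = Dict[str, object]


def dedupe(traces: List[Trace]) -> List[Trace]:
    """Dimensionality reduction: keep the first trace per normalized query."""
    seen = set()
    kept = []
    for t in traces:
        key = str(t["query"]).strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept


def tag(trace: Trace) -> Trace:
    """Feature extraction; these rules are placeholders for a real classifier."""
    difficulty = "hard" if int(trace["steps"]) > 5 else "easy"
    intent = "refund" if "refund" in str(trace["query"]).lower() else "other"
    return {**trace, "difficulty": difficulty, "intent": intent}


def route(traces: List[Trace], golden_min: float = 0.9,
          bad_max: float = 0.4) -> Dict[str, List[Trace]]:
    """Write to target datasets; mid-range scores are dropped as uninformative."""
    out: Dict[str, List[Trace]] = {"golden": [], "badcase": []}
    for t in traces:
        score = float(t["score"])
        if score >= golden_min:
            out["golden"].append(t)
        elif score <= bad_max:
            out["badcase"].append(t)
    return out


def trace2dataset(traces: List[Trace]) -> Dict[str, List[Trace]]:
    return route([tag(t) for t in dedupe(traces)])


raw = [
    {"query": "Refund my order", "steps": 3, "score": 0.95},
    {"query": "refund my order ", "steps": 4, "score": 0.2},   # duplicate query
    {"query": "Where is my parcel?", "steps": 8, "score": 0.3},
    {"query": "Change address", "steps": 2, "score": 0.7},     # mid-range score
]
datasets = trace2dataset(raw)
print(datasets)
```

Filtering and deduplicating before any model-based review is one plausible reason a pipeline of this shape saves tokens: the expensive stage only sees the traces that survive the cheap ones.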

Core Component 3: Agent-as-a-Judge Evaluation Framework

Once data is collected and datasets are constructed, the next question is evaluation.

In the paper "Agent-as-a-Judge: Evaluate Agents with Agents", the Meta AI and KAUST teams constructed the DevAI benchmark. With 55 real AI development tasks and 365 hierarchical user requirements, it requires evaluators to not only look at the final deliverables but also check whether each intermediate step meets the structured requirements.

The paper ran three evaluation methods on the same benchmark: human experts, LLM-as-a-Judge, and Agent-as-a-Judge. Consistency with human expert evaluations increased from about 65% with LLM-as-a-Judge to 90% with Agent-as-a-Judge. The paper also reports cost: human evaluation consumed about 86.5 hours of expert labor, far more expensive than either LLM-as-a-Judge or Agent-as-a-Judge, and the evaluation cost of Agent-as-a-Judge is only about 1/30 of that of human evaluation.
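For intuition about what a consistency figure of this kind measures, the Python sketch below computes the fraction of requirement-level verdicts on which a judge agrees with the human verdict. This is one plausible reading of the metric, not the paper's exact definition, and the verdicts are made up.

```python
from typing import Dict


def alignment_rate(judge: Dict[str, bool], human: Dict[str, bool]) -> float:
    """Fraction of shared requirements where the judge matches the human verdict."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no requirements in common")
    return sum(judge[r] == human[r] for r in shared) / len(shared)


# Made-up verdicts (True = requirement satisfied) for five requirements.
human = {"R1": True, "R2": False, "R3": True, "R4": True, "R5": False}
llm_judge = {"R1": True, "R2": True, "R3": True, "R4": False, "R5": False}
agent_judge = {"R1": True, "R2": False, "R3": True, "R4": True, "R5": True}

print("LLM judge:", alignment_rate(llm_judge, human))
print("Agent judge:", alignment_rate(agent_judge, human))
```

Scoring at the requirement level rather than per task is what lets intermediate steps count, since a task can satisfy some of its requirements and fail others.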

Therefore, AgentLoop adopts the Agent-as-a-Judge evaluator, which can be understood as productizing the Agent-as-a-Judge evaluation paradigm. The evaluator itself is an Agent that plans, calls tools, replays trajectories, and makes judgments based on multi-step reasoning from intermediate states.

AgentLoop provides 13 standard evaluators, including Agent task completion, Agent response evidence support, Agent tool invocation success rate, etc., and supports custom modes.

These evaluators support:

Q&A Accuracy: Multi-turn fact-checking + hallucination detection
Skill Execution Quality: Tool call chain validation and result checking
Intent Achievement: Evaluation of complex task goal satisfaction
Security and Compliance: Authorization / sensitive information / harmful content detection
Contextual Consistency: Cross-turn memory and state tracking
Business Customization: Users can construct evaluators for specific business scenarios via custom Prompt + Skill + Tool

Overall, by combining comprehensive automated data collection, Agent Ontology, the automated dataset construction Pipeline, and Agent-as-a-Judge evaluators, AgentLoop achieves continuous evaluation and serves as the infrastructure for the evolution flywheel.

Core Component 4: Contextual Engineering via Memory & Experience Libraries

However, full-stack data collection, topological recognition, and evaluation are essentially just tools for scoring Agent performance. The ultimate goal of building the flywheel is to use those scores to turn evolutionary assets into better Agent performance.

AgentLoop breaks down this problem into two paths:

Path 1: Data-driven Agent tuning. Automatically collect BadCases from evaluation results → cluster failure patterns → rewrite the Agent end to end (jointly revising Prompt / Skill / toolchain) → verify the improvement with regression testing. This is the path for quickly lifting the baseline; it yields fast results but depends on the pace of human iteration.

Path 2: Trajectory-driven self-evolution closed loop. Automatically record complete call trajectories and context during Agent runtime, automatically extract reusable experience rules from successful/failed Trajectories, inject experience rules into Agent context on demand (Just-in-Time loading), evaluate the post-injection performance, and continuously iterate and optimize the experience library.
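The closed loop in Path 2 can be sketched in Python as a small experience library that selects rules just in time and updates them from outcomes. The class names, the tag-overlap matching, and the neutral prior of 0.5 are all assumptions for illustration; a production system would more likely retrieve rules by embedding similarity and extract them with a model.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Rule:
    text: str
    tags: Set[str]
    wins: int = 0   # times the rule was injected and the task succeeded
    uses: int = 0   # times the rule was injected at all

    @property
    def success_rate(self) -> float:
        return self.wins / self.uses if self.uses else 0.5  # neutral prior


@dataclass
class ExperienceLibrary:
    rules: List[Rule] = field(default_factory=list)

    def select(self, task_tags: Set[str], k: int = 2) -> List[Rule]:
        """Just-in-time loading: only rules whose tags overlap the task."""
        hits = [r for r in self.rules if r.tags & task_tags]
        hits.sort(key=lambda r: (len(r.tags & task_tags), r.success_rate),
                  reverse=True)
        return hits[:k]

    def build_context(self, base_prompt: str, task_tags: Set[str]) -> str:
        """Inject the selected rules into the Agent's context."""
        chosen = self.select(task_tags)
        if not chosen:
            return base_prompt
        lessons = "\n".join(f"- {r.text}" for r in chosen)
        return f"{base_prompt}\n\nLessons from earlier runs:\n{lessons}"

    def feedback(self, rule: Rule, success: bool) -> None:
        """Post-injection evaluation feeds back into rule ranking."""
        rule.uses += 1
        rule.wins += int(success)


lib = ExperienceLibrary([
    Rule("Verify the order status before issuing a refund.", {"refund", "order"}),
    Rule("Ask for the tracking number first.", {"shipping"}),
    Rule("Quote the refund policy section you relied on.", {"refund"}),
])
prompt = lib.build_context("You are a support agent.", {"refund", "order"})
print(prompt)
lib.feedback(lib.rules[0], success=True)
```

Loading only the matching rules keeps the context short, and the `feedback` step is what closes the loop: rules that stop helping sink in the ranking over time.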

To productize the above two paths, AgentLoop provides two independent components: the memory library and the experience library.

The memory library covers four strategies (fact, episode, summary, and custom) and stores user preferences and historical context in a long-term searchable layer, which is automatically injected the next time a similar request arrives. The experience library focuses on extracting and reusing success patterns, generalizing them into experience rules through co-construction with business experts in various industries. These rules are categorized as long-term memories or Skills and are automatically activated when similar scenarios appear again.

AgentLoop’s memory library and experience library draw on successful industry practices in the self-evolution field, including Hermes' trajectory self-reflection, DreamGym's RL training framework that synthesizes experience replays, and Reflexion's episodic reflection (a mechanism for feeding failure experience back to the Agent).

In summary, full-stack observation collects complete Trajectories, Agent Ontology turns the data into a graph, Pipelines build datasets automatically, standardized evaluators assess real performance, and the memory and experience libraries feed good experiences back into the Agent context. Together they constitute a self-operating evolution flywheel.

III. The Evolution Flywheel: Catalyst for the Next Phase of Enterprise Agents

Because the evolution flywheel infrastructure is not yet mature, and converting evaluation results into Agent evolution assets relies heavily on industry experience, most enterprise Agents start falling behind as soon as they go live. The expectation that Agents become smarter the more they are used is therefore hard to meet.

LangChain's "State of Agent Engineering" found that 22.8% of production teams do not perform evaluations at all, offline evaluation coverage is only 52.4%, online evaluation coverage is only 37.3%, and 32% of teams list "quality" as the number one barrier in production environments. Databricks' "State of AI Agents" reports that the number of enterprises adopting evaluation is only 17% of the number adopting governance.

The realistic dilemma faced by most enterprises is a vicious cycle: without the evolution flywheel infrastructure, they dare not scale up; without scaling up, they have no observation data; and without data, they cannot evolve.

Alibaba Cloud AgentLoop hopes to work with enterprises to start the second half of the enterprise Agent era on a complete evolution flywheel infrastructure. AgentLoop is currently in beta. To apply for beta access, join our user service DingTalk group (Group ID: 168330022816).
