Qwen-AgentWorld: Language World Models for General Agents

group_capybaras_flat

Today we release Qwen-AgentWorld, a native language world model that simulates agent environments across seven domains:

Native world modeling: environment modeling is the training objective from continual pre-training onward (CPT → SFT → RL), not a post hoc adaptation on top of a general-purpose LLM.
Seven domains, one model: a single model simulates text-based (MCP, Search, Terminal, SWE) and GUI-based (Web, OS, Android) environments, with knowledge transferring across domains.

Together with the model, we release AgentWorldBench, a seven-domain evaluation benchmark with paired ground-truth observations from real environments. Both are available on Hugging Face and ModelScope.

Language agents are trained to act in interactive environments, but no language model has been explicitly trained to model the environments themselves — to predict what happens next given the current state and an agent’s action.

Roadmap: Qwen-AgentWorld represents our attempt to investigate how world modeling built on language models can further push the boundaries of general agent capabilities.

We explore both how to achieve language world modeling and how to apply it to advance general agents:

First, we build a foundation model for agentic environment simulation: Qwen-AgentWorld is the first language world model to cover seven agent interaction domains within a single model (MCP, Search, Terminal, SWE, Web, OS, Android), trained through CPT → SFT → RL on more than 10M real environment interaction trajectories. On AgentWorldBench, Qwen-AgentWorld-397B-A17B achieves the highest overall simulation quality, outperforming GPT-5.4, Claude Opus 4.8, and Gemini 3.1 Pro.
Second, we investigate the role of world modeling in agent training through two complementary paradigms: as a decoupled environment simulator, it provides superior scalability and controllability for agentic RL. Controllable simulated RL shapes agent behavior in ways that real environments cannot and significantly outperforms RL trained solely in real-world environments; as a unified agent foundation model, LWM warm-up enables effective transfer to multi-turn agentic tasks across seven benchmarks, including three entirely out of domain, without requiring any RL fine-tuning on agentic tasks, providing initial validation that language world models can serve as a foundation for building stronger agent models.

teaser
Qwen-AgentWorld: a native language world model across seven unified domains, with two complementary paradigms for enhancing general agents.

Part I: Building the Foundation Model for Agentic Environment Simulation

Seven Domains, One Model

Qwen-AgentWorld covers seven categories of interactive environments. For the three GUI domains, environment observations take the form of renderable code (accessibility tree XML, HTML, UI hierarchy markup) rather than pixel frames, enabling text-only world modeling of visual environments.

table

Training Pipeline

pipeline_overview
Three-stage training pipeline: CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity.

Qwen-AgentWorld is trained end-to-end with environment modeling as the explicit objective from continual pre-training onward. The three-stage pipeline follows one principle: CPT injects, SFT activates, RL sharpens.

Stage 1: Continual Pre-Training (CPT) injects environment knowledge through non-thinking trajectories. The data draws from dedicated agent infrastructure (containerized execution sandboxes, MCP servers, Android/web/OS emulators), open environment interaction traces, and in-house agentic trajectories. Beyond environment data, we incorporate specialized-domain world knowledge corpora spanning industrial control, cybersecurity, law, medicine, finance, and current affairs. A key contribution is turn-level information-theoretic loss masking: four surface-level statistics per (action, observation) pair identify turns carrying genuine environment information, and mask the rest from the loss while retaining them as context.

Stage 2: Supervised Fine-Tuning (SFT) activates next-state prediction as an explicit thinking pattern via <think>...</think> blocks. We use rejection sampling to select high-quality thinking trajectories, resulting in 7,094 training samples.

Stage 3: Reinforcement Learning (RL) sharpens output quality with hybrid rewards. We use GSPO for RL training. The reward combines a rubric-based LLM judge evaluating multi-dimensional quality with rule-based verifiers for domains where exact correctness can be checked programmatically.

AgentWorldBench

bench_overview
Overview of AgentWorldBench: domain distribution, source benchmarks, evaluation dimensions, and per-domain trajectory statistics.

To evaluate language world models, we introduce AgentWorldBench, a comprehensive benchmark constructed from real-world observations of 5 frontier model trajectories on 9 established benchmarks, such as Tool Decathlon, Terminal-Bench 1.0 & 2.0, and OSWorld-Verified. Every evaluation sample is paired with a ground-truth observation obtained from real environment execution, enabling reference-grounded scoring. AgentWorldBench evaluates world modeling quality through open-ended rubric judging across 5 dimensions — format, factuality, consistency, realism, and quality — probing the reasoning, knowledge, and long-context capabilities.

Performance

bench_results_bar
AgentWorldBench results: five-dimensional rubric mean per domain. Qwen-AgentWorld-397B-A17B achieves the highest overall score (58.71), outperforming GPT-5.4 (58.25) and other frontier models.

Qwen-AgentWorld-397B-A17B achieves the highest overall average (58.71), surpassing GPT-5.4 (58.25) and all other frontier models. The advantage is most pronounced on Terminal and SWE, the two domains where predictions require accurate modeling of code execution state and tool API behavior.

At the 35B-A3B scale, the three-stage pipeline lifts the overall average by +8.66 points (47.73 → 56.39), bringing Qwen-AgentWorld-35B-A3B above Claude Sonnet 4.6 (56.04). The improvement is consistent across both text and GUI domains.

Inside the World Model’s Mind

Beyond aggregate performance, what makes a language world model interesting is how it reasons. We analyze 129 thinking traces across four text-based domains and find three emergent reasoning patterns.

lwm_reasoning_patterns
LWM reasoning patterns: deliberative self-correction, information leakage prevention, and multi-step causal reasoning.

Deliberative self-correction. The model uses “Wait!” as a cognitive interrupt to revise intermediate predictions. Across 129 turns, we count 1,347 such interrupts (10.4 per turn), spanning factual errors, epistemological limits (“I cannot actually execute np.random.seed(42)”), and perspective-taking.

Information leakage prevention. In the Search domain, the model holds a reference answer that the agent is trying to find. When the query is unrelated, the model prevents leakage by ensuring snippets do not accidentally reveal the target — the world-model equivalent of theory of mind.

Multi-step causal reasoning. Predicting the output of curl -s localhost:3000 | python3 -m json.tool requires a six-step chain: Node.js missing → server never started → no listener on port 3000 → curl fails silently → empty pipe → json.tool raises JSONDecodeError.

Part II: Investigating the Role of World Modeling in Agent Training

We investigate two complementary paradigms through which world modeling enhances general agents.

Why World Modeling Matters for Agents?

Not to Replace Real Environments, Not for Cost Reduction, but as a Complementary Axis for Pushing the Frontier

What a language world model does. In an agent-environment interaction loop, the policy decides what to do and the world model predicts _what happens next_. A language world model takes the current interaction history and an agent’s action, and predicts what the environment would return: the terminal output, the API response, the updated DOM. This is not template-based generation. Faithful simulation requires multi-step causal reasoning (chaining six system-knowledge steps to predict a curl pipeline failure), stateful tracking (maintaining referential integrity across nine sequential Notion API calls), and domain-specific knowledge (Unix semantics, API schemas, browser rendering rules).

Why not just/only use real environments? Real-environment interaction remains the gold standard for grounding agent behavior. Language world models are not designed to replace it, nor are they primarily a cost-reduction measure. Instead, LWMs open a complementary axis to supplement real environments:

(1) Scalability and controllability beyond real environments. An LWM enables turn-level scaling of diverse environments without dedicated infrastructure (sandboxes, GUI virtual machines), spanning extreme scenarios, real-world tasks, and high-value professional domains where real execution is infeasible due to irreversible operations or proprietary deployments. Beyond scalability, LWMs offer precise controllability: targeted perturbations that are rare or absent in real environments can systematically expose agent weaknesses. Training against these perturbations helps agents handle edge cases that real-environment training alone cannot cover, ultimately surpassing agents trained solely in real environments.

(2) Internalized world prediction as an agent capability. A capable general agent should possess both decision-making and world-modeling abilities. World modeling enables agents to predict future environment states to refine action selection, effectively performing mental simulation as an internal planning step — whereas traditional agent training focuses solely on state-to-action decision-making. Next-state prediction is thus internalized as a meta-reasoning pattern similar to “reflection” but oriented toward the future: _predict before you act_. Furthermore, accurate next-state prediction itself requires reasoning, knowledge, instruction following, and long-context handling — capabilities that are foundational to general agents.

What makes general-purpose language environment simulation possible? Building a general-purpose language world model requires three ingredients working together. First, _environment diversity_: training on trajectories from as many distinct environments as possible, so the model encounters the full spectrum of state-transition patterns rather than memorizing a narrow set. Second, _cross-domain generalization_: our experiments show that training on a single text domain yields gains on all other text domains, suggesting shared underlying environment modeling capabilities that compound as domain coverage grows. Third, _world knowledge through CPT_: environment trajectories alone cannot provide the factual grounding needed for faithful simulation. Simulating a regulatory compliance platform requires legal knowledge; simulating search-engine responses on current events requires up-to-date factual coverage. By incorporating specialized-domain world knowledge corpora (industrial control, cybersecurity, law, medicine, finance, current affairs) during continual pre-training, the model acquires the factual substrate on which environment simulation depends. These three ingredients, environment diversity, cross-domain transfer, and world knowledge, together enable a single model to serve as a general-purpose simulator across seven agent interaction domains.

Paradigm I: Decoupled Simulation

As a standalone simulator, where the policy agent and the world model are separate models, Qwen-AgentWorld provides scalability and controllability that real environments cannot. In this Sim RL setup, the world model replaces the real environment during agent RL training: the agent acts, the world model predicts the next observation, and the agent learns from these simulated rollouts. Key findings:

Zero-shot environment generalization. Qwen-AgentWorld simulates 4k OpenClaw environments entirely absent from training, yielding Sim RL gains of +4.3 on Claw-Eval and +7.1 on QwenClawBench with no domain-specific adaptation.
Controlled simulation matters. Uncontrolled Sim RL provides negligible improvement; controllable perturbations lift MCPMark by +12.3 and WideSearch by +16.3, far exceeding uncontrolled Sim RL.
Surpassing real-environment training. Controllable Sim RL exceeds Real RL trained against a live search engine (50.3% vs. 45.6% F1), while shaping more targeted agent behavior through adversarial snippet design.
Fictional worlds work. Agents trained in fully invented, self-consistent worlds generalize to real search tasks, while structurally preventing the agent from confusing simulated facts with real-world knowledge.
State is the bottleneck. Sim RL effectiveness depends on providing the world model with a sufficiently detailed initial state; without it, simulation fidelity degrades and downstream gains diminish.

Generalizable Environment Scaling

We test whether the world model generalizes to environments entirely absent from training. OpenClaw is an open-source agent platform whose tasks span scheduling, coding, email triage, browser automation, and file management — entirely out of distribution for Qwen-AgentWorld. We simulate 4,000 OpenClaw environments for agent RL training, without any domain-specific adaptation, and also ablate the simulator itself: using Qwen3.6-Plus as the simulator yields negligible improvement, while Qwen-AgentWorld-397B-A17B produces substantial gains — confirming that world-model quality is the bottleneck for Sim RL. The agent learns little from interacting with an unfaithful simulator.

table2
All scores averaged over 3 independent rollouts with 256K maximum sequence length.

Controllable Simulation

The more powerful capability is controllability: using natural-language instructions to shape the simulator’s behavior during training. We validate two modes.

MCP: environment adaptation. We synthesize simulation system prompts from real MCP tool-use trajectories: each prompt specifies the tool schemas and server configuration, summarizes the hidden environment state (database contents, permission settings, service availability), and defines controllable simulation instructions that shape how the simulator responds at each turn. Control instructions inject targeted perturbations — intermittent API errors, paginated responses requiring follow-up calls, incomplete intermediate results that force multi-step retrieval, and partial failures in batch operations — to systematically expose agent weaknesses that real deployments rarely produce.

The results reveal a sharp contrast: standard Sim RL without control instructions provides no meaningful gain (Tool Decathlon actually drops from 32.4 to 31.5), because the simulator lacks sufficient grounding to produce faithful responses. With controllable simulation, Tool Decathlon improves by +3.7 and MCPMark by +12.3. Controllability is not merely a factor in the magnitude of improvement — it is a prerequisite for Sim RL to work at all in this domain. The larger gain on MCPMark (+12.3 vs. +3.7) suggests that controllable simulation is especially effective for tasks requiring many sequential tool calls and careful handling of intermediate results.

table3

Search: fictional-world construction. We construct 1,000 self-contained fictional environments, each anchored by a relational database (300–500 rows) of internally consistent fictional facts. A time-shifted environment might contain a 2029 smartphone market ranking with real brand names but non-existent model numbers. Since answers exist only within the fictional setting, the agent cannot bypass the search tool by answering from parametric memory; since all facts are invented, the agent cannot confuse simulated facts with real-world knowledge.

table4

Sim RL vs. Real RL

widesearch_rl_comparison
Sim RL vs. Real RL on WideSearch: controllable Sim RL tracks or slightly exceeds Real RL trained with a live search engine.

Performance. We directly compare controllable Sim RL against Real RL (trained with a live search engine) on WideSearch. Sim RL tracks or slightly exceeds Real RL: F1 by Item reaches 50.3% at step 60, compared to 45.6% for Real RL.

widesearch_tool_use_comparison
Tool usage divergence: Sim-RL-trained agents increase web_extractor calls while Real-RL-trained agents decrease them, reflecting how controllable simulation shapes distinct agent behaviors.

Behavior. The more informative signal comes from agent behavior. Both regimes reduce web_search calls from ~5 to ~3.5 per trajectory, but web_extractor calls diverge sharply: Sim RL increases usage from 2.5 to 4.0, while Real RL decreases from 2.5 to 1.5. Because the simulated snippets deliberately withhold detailed content, the Sim-RL-trained agent learns that extracting full pages is necessary for assembling complete answers. Controllable simulation shapes agent behavior in targeted ways that real environments cannot.

Paradigm II: Agent Foundation Model

In Paradigm I, the agent and world model are separate models. Here we unify them: the same model that selects actions also predicts environment states. LWM training instills next-state prediction as an internalized reasoning capability. Key findings:

Radical cross-task generalization. Single-turn, non-agentic LWM RL warm-up with no tool calls transfers to multi-turn, tool-calling agentic tasks across seven benchmarks of five domains.
Domain generalization. Gains emerge on completely out-of-distribution domains entirely absent from LWM training (+11.3 on Claw-Eval, +9.7 on QwenClawBench, +9.0 on BFCL v4), confirming transferable capabilities rather than domain-specific shortcuts.
Next-state prediction as meta-reasoning pattern. LWM training teaches the agent to mentally simulate environment responses before acting, which generalizes across task formats and domains.

We validate this by running LWM RL on Qwen3.5-35B-A3B-SFT — a single-turn task with no tool calls — then evaluating directly on multi-turn, tool-calling agentic tasks across seven benchmarks without additional fine-tuning, including three out-of-domain benchmarks absent from LWM training.

table5

The out-of-domain results are particularly notable: the LWM training pipeline contains no Claw or function-calling data, yet gains of +11.3, +9.7, and +9.0 emerge on domains entirely absent from world-model training.

Build with Qwen-AgentWorld

Deployment

We have open-sourced Qwen-AgentWorld-35B-A3B (Hugging Face, ModelScope), a language world model built on a MoE architecture with 35B total parameters / 3B active parameters, supporting a 256K context window. It can be deployed and used in the following ways.

# SGLang
python -m sglang.launch_server \
    --model-path Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

# vLLM
vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --trust-remote-code

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-AgentWorld-35B-A3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-AgentWorld-35B-A3B")

Evaluation

AgentWorldBench is available on Hugging Face and Modelscope as per-domain JSONL files, each containing interaction trajectories with ground-truth observations from real environments. Evaluation uses a three-step pipeline via eval/eval.py: (1) infer — run the world model to generate predicted observations, (2) judge — score each prediction against the ground truth across five dimensions (format, factuality, consistency, realism, quality) using an LLM judge, (3) aggregate — compute per-domain and overall scores. Both the world model and the judge use OpenAI-compatible APIs, supporting SGLang, vLLM, or proprietary endpoints. See the GitHub README for full setup, data format, and example commands.

Summary

Qwen-AgentWorld is a native language world model covering seven agent interaction domains within a single model at two scales (35B-A3B and 397B-A17B). A three-stage recipe, CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity, progressively builds world-modeling capability from the ground up. We investigate two complementary paradigms through which world modeling enhances general agents. As a decoupled simulator, we validate the effectiveness of controllable simulation on Tool Decathlon, MCPMark, and WideSearch, surpassing both uncontrolled simulation and real-environment training. As a unified agent foundation model, LWM warm-up transfers to multi-turn agentic tasks across seven benchmarks, including three entirely out-of-domain, providing initial validation that language world models can serve as a foundation for building stronger agent models. Language world modeling opens a complementary axis for scaling general agents beyond what real-environment interaction alone can provide.

Citation

@article{zuo2026qwen,
  title={Qwen-agentworld: language world models for general agents},
  author={Zuo, Yuxin and Xiao, Zikai and Sheng, Li and Huang, Fei and Tu, Jianhong and Liu, Yuxuan and Tang, Tianyi and Hu, Xiaomeng and Su, Yang and Lan, Qingfeng and others},
  journal={arXiv preprint arXiv:2606.24597},
  year={2026}
}

Source

Community