
Today we release Qwen-AgentWorld, a native language world model that simulates agent environments across seven domains:
Together with the model, we release AgentWorldBench, a seven-domain evaluation benchmark with paired ground-truth observations from real environments. Both are available on Hugging Face and ModelScope.
Language agents are trained to act in interactive environments, but no language model has been explicitly trained to model the environments themselves — to predict what happens next given the current state and an agent’s action.
Roadmap: Qwen-AgentWorld represents our attempt to investigate how world modeling built on language models can further push the boundaries of general agent capabilities.
We explore both how to achieve language world modeling and how to apply it to advance general agents:

Qwen-AgentWorld: a native language world model across seven unified domains, with two complementary paradigms for enhancing general agents.
Qwen-AgentWorld covers seven categories of interactive environments. For the three GUI domains, environment observations take the form of renderable code (accessibility tree XML, HTML, UI hierarchy markup) rather than pixel frames, enabling text-only world modeling of visual environments.


Three-stage training pipeline: CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity.
Qwen-AgentWorld is trained end-to-end with environment modeling as the explicit objective from continual pre-training onward. The three-stage pipeline follows one principle: CPT injects, SFT activates, RL sharpens.
Stage 1: Continual Pre-Training (CPT) injects environment knowledge through non-thinking trajectories. The data draws from dedicated agent infrastructure (containerized execution sandboxes, MCP servers, Android/web/OS emulators), open environment interaction traces, and in-house agentic trajectories. Beyond environment data, we incorporate specialized-domain world knowledge corpora spanning industrial control, cybersecurity, law, medicine, finance, and current affairs. A key contribution is turn-level information-theoretic loss masking: four surface-level statistics per (action, observation) pair identify turns carrying genuine environment information, and mask the rest from the loss while retaining them as context.
Stage 2: Supervised Fine-Tuning (SFT) activates next-state prediction as an explicit thinking pattern via <think>...</think> blocks. We use rejection sampling to select high-quality thinking trajectories, resulting in 7,094 training samples.
Stage 3: Reinforcement Learning (RL) sharpens output quality with hybrid rewards. We use GSPO for RL training. The reward combines a rubric-based LLM judge evaluating multi-dimensional quality with rule-based verifiers for domains where exact correctness can be checked programmatically.

Overview of AgentWorldBench: domain distribution, source benchmarks, evaluation dimensions, and per-domain trajectory statistics.
To evaluate language world models, we introduce AgentWorldBench, a comprehensive benchmark constructed from real-world observations of 5 frontier model trajectories on 9 established benchmarks, such as Tool Decathlon, Terminal-Bench 1.0 & 2.0, and OSWorld-Verified. Every evaluation sample is paired with a ground-truth observation obtained from real environment execution, enabling reference-grounded scoring. AgentWorldBench evaluates world modeling quality through open-ended rubric judging across 5 dimensions — format, factuality, consistency, realism, and quality — probing the reasoning, knowledge, and long-context capabilities.

AgentWorldBench results: five-dimensional rubric mean per domain. Qwen-AgentWorld-397B-A17B achieves the highest overall score (58.71), outperforming GPT-5.4 (58.25) and other frontier models.
Qwen-AgentWorld-397B-A17B achieves the highest overall average (58.71), surpassing GPT-5.4 (58.25) and all other frontier models. The advantage is most pronounced on Terminal and SWE, the two domains where predictions require accurate modeling of code execution state and tool API behavior.
At the 35B-A3B scale, the three-stage pipeline lifts the overall average by +8.66 points (47.73 → 56.39), bringing Qwen-AgentWorld-35B-A3B above Claude Sonnet 4.6 (56.04). The improvement is consistent across both text and GUI domains.
Beyond aggregate performance, what makes a language world model interesting is how it reasons. We analyze 129 thinking traces across four text-based domains and find three emergent reasoning patterns.

LWM reasoning patterns: deliberative self-correction, information leakage prevention, and multi-step causal reasoning.
Deliberative self-correction. The model uses “Wait!” as a cognitive interrupt to revise intermediate predictions. Across 129 turns, we count 1,347 such interrupts (10.4 per turn), spanning factual errors, epistemological limits (“I cannot actually execute np.random.seed(42)”), and perspective-taking.
Information leakage prevention. In the Search domain, the model holds a reference answer that the agent is trying to find. When the query is unrelated, the model prevents leakage by ensuring snippets do not accidentally reveal the target — the world-model equivalent of theory of mind.
Multi-step causal reasoning. Predicting the output of curl -s localhost:3000 | python3 -m json.tool requires a six-step chain: Node.js missing → server never started → no listener on port 3000 → curl fails silently → empty pipe → json.tool raises JSONDecodeError.
We investigate two complementary paradigms through which world modeling enhances general agents.
Not to Replace Real Environments, Not for Cost Reduction, but as a Complementary Axis for Pushing the Frontier
What a language world model does. In an agent-environment interaction loop, the policy decides what to do and the world model predicts _what happens next_. A language world model takes the current interaction history and an agent’s action, and predicts what the environment would return: the terminal output, the API response, the updated DOM. This is not template-based generation. Faithful simulation requires multi-step causal reasoning (chaining six system-knowledge steps to predict a curl pipeline failure), stateful tracking (maintaining referential integrity across nine sequential Notion API calls), and domain-specific knowledge (Unix semantics, API schemas, browser rendering rules).
Why not just/only use real environments? Real-environment interaction remains the gold standard for grounding agent behavior. Language world models are not designed to replace it, nor are they primarily a cost-reduction measure. Instead, LWMs open a complementary axis to supplement real environments:
(1) Scalability and controllability beyond real environments. An LWM enables turn-level scaling of diverse environments without dedicated infrastructure (sandboxes, GUI virtual machines), spanning extreme scenarios, real-world tasks, and high-value professional domains where real execution is infeasible due to irreversible operations or proprietary deployments. Beyond scalability, LWMs offer precise controllability: targeted perturbations that are rare or absent in real environments can systematically expose agent weaknesses. Training against these perturbations helps agents handle edge cases that real-environment training alone cannot cover, ultimately surpassing agents trained solely in real environments.
(2) Internalized world prediction as an agent capability. A capable general agent should possess both decision-making and world-modeling abilities. World modeling enables agents to predict future environment states to refine action selection, effectively performing mental simulation as an internal planning step — whereas traditional agent training focuses solely on state-to-action decision-making. Next-state prediction is thus internalized as a meta-reasoning pattern similar to “reflection” but oriented toward the future: _predict before you act_. Furthermore, accurate next-state prediction itself requires reasoning, knowledge, instruction following, and long-context handling — capabilities that are foundational to general agents.
What makes general-purpose language environment simulation possible? Building a general-purpose language world model requires three ingredients working together. First, _environment diversity_: training on trajectories from as many distinct environments as possible, so the model encounters the full spectrum of state-transition patterns rather than memorizing a narrow set. Second, _cross-domain generalization_: our experiments show that training on a single text domain yields gains on all other text domains, suggesting shared underlying environment modeling capabilities that compound as domain coverage grows. Third, _world knowledge through CPT_: environment trajectories alone cannot provide the factual grounding needed for faithful simulation. Simulating a regulatory compliance platform requires legal knowledge; simulating search-engine responses on current events requires up-to-date factual coverage. By incorporating specialized-domain world knowledge corpora (industrial control, cybersecurity, law, medicine, finance, current affairs) during continual pre-training, the model acquires the factual substrate on which environment simulation depends. These three ingredients, environment diversity, cross-domain transfer, and world knowledge, together enable a single model to serve as a general-purpose simulator across seven agent interaction domains.
As a standalone simulator, where the policy agent and the world model are separate models, Qwen-AgentWorld provides scalability and controllability that real environments cannot. In this Sim RL setup, the world model replaces the real environment during agent RL training: the agent acts, the world model predicts the next observation, and the agent learns from these simulated rollouts. Key findings:
We test whether the world model generalizes to environments entirely absent from training. OpenClaw is an open-source agent platform whose tasks span scheduling, coding, email triage, browser automation, and file management — entirely out of distribution for Qwen-AgentWorld. We simulate 4,000 OpenClaw environments for agent RL training, without any domain-specific adaptation, and also ablate the simulator itself: using Qwen3.6-Plus as the simulator yields negligible improvement, while Qwen-AgentWorld-397B-A17B produces substantial gains — confirming that world-model quality is the bottleneck for Sim RL. The agent learns little from interacting with an unfaithful simulator.

All scores averaged over 3 independent rollouts with 256K maximum sequence length.
The more powerful capability is controllability: using natural-language instructions to shape the simulator’s behavior during training. We validate two modes.
MCP: environment adaptation. We synthesize simulation system prompts from real MCP tool-use trajectories: each prompt specifies the tool schemas and server configuration, summarizes the hidden environment state (database contents, permission settings, service availability), and defines controllable simulation instructions that shape how the simulator responds at each turn. Control instructions inject targeted perturbations — intermittent API errors, paginated responses requiring follow-up calls, incomplete intermediate results that force multi-step retrieval, and partial failures in batch operations — to systematically expose agent weaknesses that real deployments rarely produce.
The results reveal a sharp contrast: standard Sim RL without control instructions provides no meaningful gain (Tool Decathlon actually drops from 32.4 to 31.5), because the simulator lacks sufficient grounding to produce faithful responses. With controllable simulation, Tool Decathlon improves by +3.7 and MCPMark by +12.3. Controllability is not merely a factor in the magnitude of improvement — it is a prerequisite for Sim RL to work at all in this domain. The larger gain on MCPMark (+12.3 vs. +3.7) suggests that controllable simulation is especially effective for tasks requiring many sequential tool calls and careful handling of intermediate results.

Search: fictional-world construction. We construct 1,000 self-contained fictional environments, each anchored by a relational database (300–500 rows) of internally consistent fictional facts. A time-shifted environment might contain a 2029 smartphone market ranking with real brand names but non-existent model numbers. Since answers exist only within the fictional setting, the agent cannot bypass the search tool by answering from parametric memory; since all facts are invented, the agent cannot confuse simulated facts with real-world knowledge.


Sim RL vs. Real RL on WideSearch: controllable Sim RL tracks or slightly exceeds Real RL trained with a live search engine.
Performance. We directly compare controllable Sim RL against Real RL (trained with a live search engine) on WideSearch. Sim RL tracks or slightly exceeds Real RL: F1 by Item reaches 50.3% at step 60, compared to 45.6% for Real RL.

Tool usage divergence: Sim-RL-trained agents increase web_extractor calls while Real-RL-trained agents decrease them, reflecting how controllable simulation shapes distinct agent behaviors.
Behavior. The more informative signal comes from agent behavior. Both regimes reduce web_search calls from ~5 to ~3.5 per trajectory, but web_extractor calls diverge sharply: Sim RL increases usage from 2.5 to 4.0, while Real RL decreases from 2.5 to 1.5. Because the simulated snippets deliberately withhold detailed content, the Sim-RL-trained agent learns that extracting full pages is necessary for assembling complete answers. Controllable simulation shapes agent behavior in targeted ways that real environments cannot.
In Paradigm I, the agent and world model are separate models. Here we unify them: the same model that selects actions also predicts environment states. LWM training instills next-state prediction as an internalized reasoning capability. Key findings:
We validate this by running LWM RL on Qwen3.5-35B-A3B-SFT — a single-turn task with no tool calls — then evaluating directly on multi-turn, tool-calling agentic tasks across seven benchmarks without additional fine-tuning, including three out-of-domain benchmarks absent from LWM training.

The out-of-domain results are particularly notable: the LWM training pipeline contains no Claw or function-calling data, yet gains of +11.3, +9.7, and +9.0 emerge on domains entirely absent from world-model training.
We have open-sourced Qwen-AgentWorld-35B-A3B (Hugging Face, ModelScope), a language world model built on a MoE architecture with 35B total parameters / 3B active parameters, supporting a 256K context window. It can be deployed and used in the following ways.
# SGLang
python -m sglang.launch_server \
--model-path Qwen/Qwen-AgentWorld-35B-A3B \
--port 8000 \
--tensor-parallel-size 4 \
--context-length 262144 \
--reasoning-parser qwen3
# vLLM
vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--trust-remote-code
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-AgentWorld-35B-A3B",
torch_dtype="auto",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-AgentWorld-35B-A3B")
AgentWorldBench is available on Hugging Face and Modelscope as per-domain JSONL files, each containing interaction trajectories with ground-truth observations from real environments. Evaluation uses a three-step pipeline via eval/eval.py: (1) infer — run the world model to generate predicted observations, (2) judge — score each prediction against the ground truth across five dimensions (format, factuality, consistency, realism, quality) using an LLM judge, (3) aggregate — compute per-domain and overall scores. Both the world model and the judge use OpenAI-compatible APIs, supporting SGLang, vLLM, or proprietary endpoints. See the GitHub README for full setup, data format, and example commands.
Qwen-AgentWorld is a native language world model covering seven agent interaction domains within a single model at two scales (35B-A3B and 397B-A17B). A three-stage recipe, CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity, progressively builds world-modeling capability from the ground up. We investigate two complementary paradigms through which world modeling enhances general agents. As a decoupled simulator, we validate the effectiveness of controllable simulation on Tool Decathlon, MCPMark, and WideSearch, surpassing both uncontrolled simulation and real-environment training. As a unified agent foundation model, LWM warm-up transfers to multi-turn agentic tasks across seven benchmarks, including three entirely out-of-domain, providing initial validation that language world models can serve as a foundation for building stronger agent models. Language world modeling opens a complementary axis for scaling general agents beyond what real-environment interaction alone can provide.
@article{zuo2026qwen,
title={Qwen-agentworld: language world models for general agents},
author={Zuo, Yuxin and Xiao, Zikai and Sheng, Li and Huang, Fei and Tu, Jianhong and Liu, Yuxuan and Tang, Tianyi and Hu, Xiaomeng and Su, Yang and Lan, Qingfeng and others},
journal={arXiv preprint arXiv:2606.24597},
year={2026}
}
1,438 posts | 500 followers
FollowAlibaba Cloud Community - June 17, 2026
Alibaba Cloud Community - June 17, 2026
Alibaba Cloud Community - June 17, 2026
Alibaba Cloud Community - April 2, 2026
Alibaba Cloud Community - June 1, 2026
Alibaba Cloud Community - February 17, 2026
1,438 posts | 500 followers
Follow
Alibaba Cloud Model Studio
A one-stop generative AI platform to build intelligent applications that understand your business, based on Qwen model series such as Qwen-Max and other popular models
Learn More
Qwen
Full-range, open-source, multimodal, and multi-functional
Learn More
Alibaba Cloud for Generative AI
Accelerate innovation with generative AI to create new business success
Learn More
AI Acceleration Solution
Accelerate AI-driven business and AI model training and inference with Alibaba Cloud GPU technology
Learn MoreMore Posts by Alibaba Cloud Community