World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds.
To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity.
Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.
MaskWAM architecture. (a) Training: Noisy RGB and mask latents are channel-concatenated and denoised by a unified DiT. The model jointly optimizes future RGB frames, future masks, and action chunks under a single flow-matching framework with decoupled noise schedules τ_v and τ_a. (b) Attention mask: A block-wise causal attention mask enables joint training over RGB, mask, and action tokens. (c) Inference: Optionally conditioned on first-frame masks to resolve spatial ambiguities, the model uses KV-caching to generate actions efficiently from partially denoised latents.
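As a rough illustration of panel (b), the following is a minimal sketch of a generic block-wise causal attention mask. It is not the authors' implementation: the function name `blockwise_causal_mask`, the token counts, and the block order (video-frame blocks first, action block last) are all assumptions made for the example.

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Return a boolean (T, T) mask where True means 'query may attend to key'.

    Tokens are grouped into consecutive blocks. A token attends to every token
    in its own block and in all earlier blocks, but never to later blocks.
    """
    # block_ids[i] is the index of the block that token i belongs to.
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # Query i (row) may see key j (column) when j's block is not later than i's.
    return block_ids[:, None] >= block_ids[None, :]

# Hypothetical layout: two frames of 3 video tokens each (RGB and mask latents
# share tokens because they are channel-concatenated), then 2 action tokens.
mask = blockwise_causal_mask([3, 3, 2])
print(mask.astype(int))
```

Under this assumed layout, the action tokens in the last block can read every video token, while the first frame's tokens see only each other.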
| Method | Type | Task 1 | Task 2 | Task 3 | Task 4 | Avg |
|---|---|---|---|---|---|---|
| π0 | VLA | 57 | 54 | 54 | 58 | 55.8 |
| π0.5 | VLA | 83 | 55 | 74 | 77 | 72.3 |
| FastWAM | WAM | 88 | 76 | 77 | 75 | 79.0 |
| Ours (RGB-only) | WAM | 86 | 77 | 76 | 78 | 79.3 |
| Ours | WAM | 91 | 82 | 81 | 83 | 84.3 |
Real-robot evaluation on language-clear tasks. Success rates are reported in %. We collect an average of 100 demonstrations per task, and each model is evaluated over 100 trials per task with diverse object placements.
| Method | Type | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|---|
| WorldVLA | VLA | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| GR00T-N1 | VLA | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π0 | VLA | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| π0.5 | VLA | 98.6 | 98.2 | 98.0 | 92.4 | 96.8 |
| Motus | WAM | 96.8 | 99.8 | 96.6 | 97.6 | 97.7 |
| FastWAM | WAM | 98.2 | 100.0 | 97.0 | 95.2 | 97.6 |
| Ours (RGB-only) | WAM | 96.8 | 99.6 | 97.0 | 95.8 | 97.3 |
| Ours (Mask-only) | WAM | 97.2 | 99.8 | 97.4 | 96.0 | 97.6 |
| Ours | WAM | 98.8 | 100.0 | 98.2 | 96.4 | 98.4 |
Performance on LIBERO. MaskWAM sets a new state of the art with a 98.4% average, surpassing recent VLAs (π0.5) and WAMs (Motus, FastWAM). The auxiliary mask-prediction objective lifts our RGB-only variant from 97.3% to 98.4%, improving the base policy even without visual prompts at deployment. Attention maps show that the RGB-only model attends to spurious backgrounds, while mask supervision focuses MaskWAM on task-relevant regions.
| Method | Hammer | Bell | Card | Burger | Stand | Shoe | Avg |
|---|---|---|---|---|---|---|---|
| π0 | 68 | 72 | 81 | 79 | 63 | 74 | 72.8 |
| FastWAM | 83 | 87 | 92 | 94 | 80 | 90 | 87.7 |
| Ours (RGB-only) | 82 | 87 | 91 | 93 | 79 | 92 | 87.3 |
| Ours (Mask-only) | 85 | 90 | 93 | 93 | 81 | 91 | 88.8 |
| Ours | 88 | 93 | 95 | 97 | 85 | 95 | 92.2 |
Performance on RoboTwin 2.0. MaskWAM reaches a state-of-the-art 92.2% average across six randomized tasks, beating π0 and FastWAM by 19.4 and 4.5 percentage points, respectively. The Mask-only variant (88.8%) already outperforms RGB-only (87.3%), and unifying both modalities maximizes performance (92.2%), confirming that auxiliary mask futures focus the model on task-relevant regions.
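The averages and gaps quoted above can be recomputed directly from the table. The short check below uses only the per-task numbers reported there; the variable names are illustrative, and `pi0` stands for π0. It also shows that the 19.4 and 4.5 figures are absolute differences between averages, not relative gains.

```python
# Per-task success rates (%) copied from the RoboTwin 2.0 table above.
results = {
    "pi0": [68, 72, 81, 79, 63, 74],
    "FastWAM": [83, 87, 92, 94, 80, 90],
    "Ours (RGB-only)": [82, 87, 91, 93, 79, 92],
    "Ours (Mask-only)": [85, 90, 93, 93, 81, 91],
    "Ours": [88, 93, 95, 97, 85, 95],
}

# Averages rounded to one decimal place, as in the table.
averages = {name: round(sum(r) / len(r), 1) for name, r in results.items()}

# Absolute gaps between averages, in percentage points.
gap_pi0 = round(averages["Ours"] - averages["pi0"], 1)
gap_fastwam = round(averages["Ours"] - averages["FastWAM"], 1)

# Relative improvement over pi0, in percent, for comparison.
rel_pi0 = round(100 * gap_pi0 / averages["pi0"], 1)

print(averages)
print(gap_pi0, gap_fastwam)  # 19.4 4.5
print(rel_pi0)               # 26.6
```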
Performance on Language-Ambiguous Tasks.
We test how explicit spatial prompting resolves target uncertainty across four settings: one in-distribution and three zero-shot axes.
(1) In-Distribution: MaskWAM reaches a 92.9% success rate; masks outperform textual coordinates, confirming that spatial cues are better delivered visually than via text.
(2) Distractors: With unseen clutter, MaskWAM holds 90.4% (vs. 52.9% for π0-mask).
(3) Novel Instances: On unseen target objects (e.g., a novel cup for a trained bowl), MaskWAM reaches 74.6% (vs. 44.6%), showing category-level skill transfer.
(4) Lighting: Under illumination shifts, MaskWAM stays robust at 81.7%.
Across all settings, mask prediction consistently beats both visual-prompt VLA and textual-coordinate baselines.
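One way to see why a mask prompt can carry more spatial information than a textual coordinate is the toy comparison below. It is only a sketch: the `<box>` string format and both helper functions are hypothetical, and the mask is attached at pixel level for simplicity, whereas MaskWAM concatenates mask latents with RGB latents.

```python
import numpy as np

def mask_to_box_text(mask):
    """Summarize a binary mask as a bounding-box string (hypothetical format)."""
    ys, xs = np.nonzero(mask)
    x0, y0, x1, y1 = int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
    return f"<box>{x0},{y0},{x1},{y1}</box>"

def add_mask_channel(rgb, mask):
    """Append the mask as a fourth channel: (H, W, 3) -> (H, W, 4)."""
    return np.concatenate([rgb, mask[..., None].astype(rgb.dtype)], axis=-1)

# Two different object shapes that share the same bounding box.
square = np.zeros((6, 6), dtype=bool)
square[1:5, 1:5] = True
ring = square.copy()
ring[2:4, 2:4] = False  # hollow out the centre

rgb = np.zeros((6, 6, 3), dtype=np.float32)  # placeholder first frame

print(mask_to_box_text(square))  # <box>1,1,4,4</box>
print(mask_to_box_text(ring))    # <box>1,1,4,4</box>
print(np.array_equal(add_mask_channel(rgb, square),
                     add_mask_channel(rgb, ring)))  # False
```

The two shapes collapse to the same text prompt, while the mask input keeps them distinct.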
Long-horizon, high-precision task. MaskWAM successfully completes all 8 targets.
Visualization of predicted RGB frames, masks, and attention masks on real-world data. For visualization only, we decode full future RGB and mask sequences offline; during real-world deployment, MaskWAM uses partial denoising and does not generate full future sequences at test time.
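To make "partial denoising" concrete, here is a minimal sketch of stopping a flow-matching sampler early. It assumes a straight-line flow with τ = 0 as pure noise and τ = 1 as clean data, plain Euler steps, and a toy oracle velocity in place of the trained network; the names, step counts, and latent shapes are illustrative and not taken from MaskWAM.

```python
import numpy as np

def partial_denoise(x_noise, velocity_fn, num_steps, steps_to_run):
    """Euler-integrate dx/dtau = v(x, tau) from tau = 0, stopping early.

    Assumed convention: tau = 0 is pure noise and tau = 1 is clean data.
    Returns the latent and the tau reached after `steps_to_run` steps.
    """
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for k in range(steps_to_run):
        tau = k * dt
        x = x + dt * velocity_fn(x, tau)
    return x, steps_to_run * dt

rng = np.random.default_rng(0)
x_clean = rng.normal(size=(4, 8))  # stand-in for clean future latents
x_noise = rng.normal(size=(4, 8))  # initial Gaussian noise

def oracle_velocity(x, tau):
    """Toy velocity that points straight at x_clean (replaces the network)."""
    return (x_clean - x) / (1.0 - tau)

# Run only 3 of 10 steps: the latent is 30% of the way from noise to data.
x_partial, tau_reached = partial_denoise(
    x_noise, oracle_velocity, num_steps=10, steps_to_run=3
)
print(np.allclose(x_partial, 0.7 * x_noise + 0.3 * x_clean))  # True
```

In this sketch the sampler never reaches τ = 1, so no full future sequence is produced; a policy head would read its action from `x_partial` instead.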
If you find MaskWAM useful in your research, please cite:
@article{yu2026maskwam,
title = {MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models},
author = {Yu, Hanyang and Lin, Haitao and Zhang, Jingbo and Zhang, Wenyao and Gu, Chenghao and Li, Heng and Tan, Ping},
journal = {arXiv preprint arXiv:2606.13515},
year = {2026},
eprint = {2606.13515},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}