MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

Abstract

World-Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from two fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds.

To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity.

Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

Method Overview

MaskWAM architecture. (a) Training: Noisy RGB and mask latents are channel-concatenated and denoised by a unified DiT. The model is trained to predict future RGB frames, future masks, and action chunks under a joint flow-matching framework with decoupled noise schedules τ_v and τ_a. (b) Attention mask: A block-wise causal attention mask enables unified training across RGB, masks, and actions. (c) Inference: Optionally conditioned on first-frame masks to resolve spatial ambiguities, the model uses KV-caching to generate actions efficiently from partially denoised latents.
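
The training setup in (a) can be illustrated with a minimal sketch. It assumes the common linear flow-matching interpolation x_τ = (1 − τ)·x₀ + τ·ε with velocity target ε − x₀, reads τ_v as the noise level of the RGB+mask latents and τ_a as the noise level of the action chunk, and samples both independently and uniformly. The source states only that the schedules are decoupled, so the interpolation convention, the sampling distribution, and every function name below are illustrative assumptions, not the authors' implementation.

```python
import random

def channel_concat(rgb_latent, mask_latent):
    """Concatenate RGB and mask channels at every latent position.

    Each latent is a list of positions; each position is a list of channels.
    """
    if len(rgb_latent) != len(mask_latent):
        raise ValueError("RGB and mask latents must cover the same positions")
    return [r + m for r, m in zip(rgb_latent, mask_latent)]

def noisy_sample(x0, tau, rng):
    """Linear flow-matching interpolation (assumed convention).

    Returns (x_tau, target) with
      x_tau  = (1 - tau) * x0 + tau * eps
      target = eps - x0   (the velocity the network would regress)
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_tau = [(1.0 - tau) * a + tau * e for a, e in zip(x0, eps)]
    target = [e - a for a, e in zip(x0, eps)]
    return x_tau, target

def training_example(video_latent, action_chunk, rng):
    """Build one training example with independently drawn noise levels."""
    tau_v = rng.random()  # noise level for the RGB+mask latents
    tau_a = rng.random()  # noise level for the action chunk, drawn separately
    noisy_video, video_target = noisy_sample(video_latent, tau_v, rng)
    noisy_action, action_target = noisy_sample(action_chunk, tau_a, rng)
    return {
        "tau_v": tau_v,
        "tau_a": tau_a,
        "noisy_video": noisy_video,
        "video_target": video_target,
        "noisy_action": noisy_action,
        "action_target": action_target,
    }

# Toy usage: 2 latent positions, 2 RGB channels and 1 mask channel each.
rng = random.Random(0)
rgb = [[0.1, 0.2], [0.3, 0.4]]
mask = [[1.0], [0.0]]
joint = channel_concat(rgb, mask)               # 2 positions x 3 channels
flat_video = [c for pos in joint for c in pos]  # flatten for the toy sampler
example = training_example(flat_video, [0.5, -0.5], rng)
```

Because τ_v and τ_a are drawn separately, a single example can pair heavily noised video latents with a nearly clean action chunk, or the reverse.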

Experiments

Language-Clear Tasks

Real-Robot Demos

Real-Robot Success Rates

Method	Type	Task 1	Task 2	Task 3	Task 4	Avg
π₀	VLA	57	54	54	58	55.8
π_0.5	VLA	83	55	74	77	72.3
FastWAM	WAM	88	76	77	75	79.0
Ours (RGB-only)	WAM	86	77	76	78	79.3
Ours	WAM	91	82	81	83	84.3

Real-robot evaluation on language-clear tasks. Success rates are reported in %. We collect an average of 100 demonstrations per task, and each model is evaluated over 100 trials per task with diverse object placements.

Simulation Benchmarks

LIBERO Benchmark

Method	Type	Spatial	Object	Goal	Long	Avg
WorldVLA	VLA	87.6	96.2	83.4	60.0	81.8
GR00T-N1	VLA	94.4	97.6	93.0	90.6	93.9
π₀	VLA	96.8	98.8	95.8	85.2	94.1
π_0.5	VLA	98.6	98.2	98.0	92.4	96.8
Motus	WAM	96.8	99.8	96.6	97.6	97.7
FastWAM	WAM	98.2	100.0	97.0	95.2	97.6
Ours (RGB-only)	WAM	96.8	99.6	97.0	95.8	97.3
Ours (Mask-only)	WAM	97.2	99.8	97.4	96.0	97.6
Ours	WAM	98.8	100.0	98.2	96.4	98.4

Performance on LIBERO. MaskWAM sets a new state of the art with a 98.4% average, surpassing recent VLAs (π_0.5) and WAMs (Motus, FastWAM). The auxiliary mask-prediction objective lifts the average from 97.3% (our RGB-only variant) to 98.4%, improving the base policy even without visual prompts at deployment. Attention maps show that the RGB-only model attends to spurious backgrounds, while mask supervision focuses MaskWAM on task-relevant regions.

RoboTwin 2.0 Benchmark

Method	Hammer	Bell	Card	Burger	Stand	Shoe	Avg
π₀	68	72	81	79	63	74	72.8
FastWAM	83	87	92	94	80	90	87.7
Ours (RGB-only)	82	87	91	93	79	92	87.3
Ours (Mask-only)	85	90	93	93	81	91	88.8
Ours	88	93	95	97	85	95	92.2

Performance on RoboTwin 2.0. MaskWAM reaches a state-of-the-art 92.2% average across six randomized tasks, beating π₀ and FastWAM by 19.4 and 4.5 percentage points, respectively. The Mask-only variant (88.8%) already outperforms RGB-only (87.3%), and unifying both modalities yields the best performance (92.2%), confirming that auxiliary mask futures focus the model on task-relevant regions.
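
The averages and margins quoted above follow directly from the table rows. The short sketch below recomputes them; the margins are differences between the rounded averages, so they are absolute gaps in success rate rather than relative improvements. The dictionary keys and helper names are only illustrative.

```python
# Per-task success rates (%) copied from the RoboTwin 2.0 table
# (Hammer, Bell, Card, Burger, Stand, Shoe).
robotwin = {
    "pi0": [68, 72, 81, 79, 63, 74],
    "FastWAM": [83, 87, 92, 94, 80, 90],
    "Ours (RGB-only)": [82, 87, 91, 93, 79, 92],
    "Ours (Mask-only)": [85, 90, 93, 93, 81, 91],
    "Ours": [88, 93, 95, 97, 85, 95],
}

def average(rates):
    """Mean success rate, rounded to one decimal as in the table."""
    return round(sum(rates) / len(rates), 1)

averages = {name: average(rates) for name, rates in robotwin.items()}

def margin(method, baseline):
    """Absolute gap between two reported (rounded) averages."""
    return round(averages[method] - averages[baseline], 1)

gap_vs_pi0 = margin("Ours", "pi0")
gap_vs_fastwam = margin("Ours", "FastWAM")
gap_mask_vs_rgb = margin("Ours (Mask-only)", "Ours (RGB-only)")
```

Taking the difference of the unrounded means instead would give 19.3 rather than 19.4 for the gap to π₀, so the quoted margins appear to be computed from the rounded table entries.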

Language-Ambiguous Tasks

Demo videos cover four settings: In-Distribution, Distractors, Novel Instances, and Lighting.

Quantitative Results

Performance on Language-Ambiguous Tasks. We test how explicit spatial prompting resolves target uncertainty across four settings: one in-distribution and three zero-shot axes. (1) In-Distribution: MaskWAM reaches a 92.9% success rate; masks outperform textual coordinates, confirming that spatial cues are better delivered visually than via text. (2) Distractors: With unseen clutter, MaskWAM holds 90.4% (vs. 52.9% for π₀-mask). (3) Novel Instances: On unseen target objects (e.g., a novel cup for a trained bowl), MaskWAM reaches 74.6% (vs. 44.6%), showing category-level skill transfer. (4) Lighting: Under illumination shifts, it stays robust at 81.7%. Across all settings, mask prediction consistently beats both visual-prompt VLA and textual-coordinate baselines.

Long-Horizon High-Precision Task

Long-horizon, high-precision task. MaskWAM successfully completes all 8 targets.

Visualization

Visualization of predicted RGB frames, masks, and attention masks on real data. For visualization only, we decode full future RGB and mask sequences offline; during real-world deployment, MaskWAM uses partial denoising and does not generate full future sequences at test time.
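
The difference between full offline decoding and partial denoising can be sketched with a toy Euler integrator for a flow-matching ODE. This is only an illustration of the general idea under stated assumptions: the linear path x_τ = (1 − τ)·x₀ + τ·ε, an oracle velocity that stands in for the learned network, and hypothetical names `euler_denoise`, `make_oracle_velocity`, and `tau_stop`. The source does not specify the step count or stopping level, and the sketch does not model how actions are read out from the partially denoised latents or how KV-caching is used.

```python
def euler_denoise(x_start, velocity_fn, tau_stop, num_steps):
    """Integrate dx/dtau = v(x, tau) from tau = 1 (pure noise) down to tau_stop.

    tau_stop = 0.0 corresponds to full denoising; tau_stop > 0 stops early
    and returns a partially denoised latent.
    """
    if not 0.0 <= tau_stop < 1.0:
        raise ValueError("tau_stop must lie in [0, 1)")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    step = (1.0 - tau_stop) / num_steps
    x = list(x_start)
    for k in range(num_steps):
        tau = 1.0 - k * step  # current noise level, always > tau_stop
        v = velocity_fn(x, tau)
        x = [xi - step * vi for xi, vi in zip(x, v)]  # move toward lower tau
    return x

def make_oracle_velocity(x0):
    """Toy stand-in for a learned model (assumption, not the real network).

    On the linear path x_tau = (1 - tau) * x0 + tau * eps the velocity
    eps - x0 can be written as (x_tau - x0) / tau.
    """
    def velocity(x, tau):
        return [(xi - ci) / tau for xi, ci in zip(x, x0)]
    return velocity

clean = [0.5, -1.0, 2.0]   # toy "clean" latent
noise = [1.2, 0.3, -0.7]   # toy starting noise
v = make_oracle_velocity(clean)

full = euler_denoise(noise, v, tau_stop=0.0, num_steps=10)    # offline-style decode
partial = euler_denoise(noise, v, tau_stop=0.6, num_steps=4)  # early stop
```

In this toy setting the full run recovers the clean latent, while the early-stopped run returns a latent that still sits at noise level 0.6 on the same path, at less than half the number of model evaluations.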

BibTeX

If you find MaskWAM useful in your research, please cite:

@article{yu2026maskwam,
  title         = {MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models},
  author        = {Yu, Hanyang and Lin, Haitao and Zhang, Jingbo and Zhang, Wenyao and Gu, Chenghao and Li, Heng and Tan, Ping},
  journal       = {arXiv preprint arXiv:2606.13515},
  year          = {2026},
  eprint        = {2606.13515},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
