MaskWAM:
Unifying Mask Prompting and Prediction for World-Action Models

1 The Hong Kong University of Science and Technology
2 Tencent Robotics X
3 Tsinghua University
Work done during an internship at Tencent Robotics X; Corresponding author

Video Presentation

Abstract

World-Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from two fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds.

To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity.

Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

MaskWAM teaser

Method Overview

MaskWAM method overview

MaskWAM architecture. (a) Training: Noisy RGB and mask latents are channel-concatenated and denoised by a unified DiT. The model jointly learns to predict future RGB frames, future masks, and action chunks under a single flow-matching objective with decoupled noise schedules τ_v and τ_a. (b) Attention mask: A block-wise causal attention mask enables unified RGB, mask, and action training. (c) Inference: Optionally conditioned on first-frame masks to resolve spatial ambiguities, the model uses KV caching to generate actions efficiently from partially denoised latents.
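The training-side ideas in the caption (channel concatenation, independently sampled noise levels, and a block-wise causal attention mask) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, the linear interpolation path, the reading of τ_v as the video noise level and τ_a as the action noise level, and the block layout are all hypothetical.

```python
import numpy as np


def noise_jointly(rgb, mask, action, tau_v, tau_a, rng):
    """Build noisy inputs and velocity targets for one training sample.

    rgb, mask: clean latents of shape (C, T, H, W); action: clean chunk (K, D).
    tau_v, tau_a: independently sampled noise levels in [0, 1], 1 = pure noise
    (this time convention is an assumption of the sketch).
    """
    # Channel-concatenate RGB and mask latents so one backbone denoises both.
    video = np.concatenate([rgb, mask], axis=0)
    eps_v = rng.standard_normal(video.shape)
    eps_a = rng.standard_normal(action.shape)
    # Linear interpolation between data and noise (assumed path).
    video_t = (1.0 - tau_v) * video + tau_v * eps_v
    action_t = (1.0 - tau_a) * action + tau_a * eps_a
    # Velocity targets a network would regress under this path.
    return video_t, action_t, eps_v - video, eps_a - action


def blockwise_causal_mask(block_sizes):
    """Boolean (N, N) mask: True where query token i may attend to key token j.

    Tokens see every token in their own block and in all earlier blocks.
    """
    block_id = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_id[:, None] >= block_id[None, :]


rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 2, 8, 8))
mask = rng.standard_normal((4, 2, 8, 8))
action = rng.standard_normal((16, 7))
video_t, action_t, v_video, v_action = noise_jointly(rgb, mask, action, 0.3, 0.8, rng)
attn = blockwise_causal_mask([2, 3, 2])  # three hypothetical token blocks
```

Sampling the two noise levels separately means the action chunk can be trained against video latents at any noise level, which is what makes it plausible to read actions off partially denoised latents at inference.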

Experiments

Language-Clear Tasks

Real-Robot Demos

Real-Robot Success Rates

Method           Type  Task 1  Task 2  Task 3  Task 4  Avg
π0               VLA   57      54      54      58      55.8
π0.5             VLA   83      55      74      77      72.3
FastWAM          WAM   88      76      77      75      79.0
Ours (RGB-only)  WAM   86      77      76      78      79.3
Ours             WAM   91      82      81      83      84.3

Real-robot evaluation on language-clear tasks. Success rates are reported in %. We collect an average of 100 demonstrations per task, and each model is evaluated over 100 trials per task with diverse object placements.

Simulation Benchmarks

LIBERO Benchmark
Method            Type  Spatial  Object  Goal  Long  Avg
WorldVLA          VLA   87.6     96.2    83.4  60.0  81.8
GR00T-N1          VLA   94.4     97.6    93.0  90.6  93.9
π0                VLA   96.8     98.8    95.8  85.2  94.1
π0.5              VLA   98.6     98.2    98.0  92.4  96.8
Motus             WAM   96.8     99.8    96.6  97.6  97.7
FastWAM           WAM   98.2     100.0   97.0  95.2  97.6
Ours (RGB-only)   WAM   96.8     99.6    97.0  95.8  97.3
Ours (Mask-only)  WAM   97.2     99.8    97.4  96.0  97.6
Ours              WAM   98.8     100.0   98.2  96.4  98.4

Performance on LIBERO. MaskWAM sets a new state-of-the-art at 98.4% average, surpassing recent VLAs (π0.5) and WAMs (Motus, FastWAM). The auxiliary mask-prediction objective lifts our RGB-only variant from 97.3% to 98.4%, improving the base policy even without visual prompts at deployment. Attention maps show the RGB-only model attends to spurious backgrounds, while mask supervision focuses MaskWAM on task-relevant regions.

RoboTwin 2.0 Benchmark
Method            Hammer  Bell  Card  Burger  Stand  Shoe  Avg
π0                68      72    81    79      63     74    72.8
FastWAM           83      87    92    94      80     90    87.7
Ours (RGB-only)   82      87    91    93      79     92    87.3
Ours (Mask-only)  85      90    93    93      81     91    88.8
Ours              88      93    95    97      85     95    92.2

Performance on RoboTwin 2.0. MaskWAM reaches a state-of-the-art 92.2% average across six randomized tasks, beating π0 and FastWAM by 19.4 and 4.5 percentage points, respectively. The Mask-only variant (88.8%) already outperforms RGB-only (87.3%), and unifying both modalities maximizes performance (92.2%), confirming that auxiliary mask futures focus the model on task-relevant regions.
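The Avg column and the margins quoted in the caption can be re-derived from the per-task numbers in the table. The short check below is only a sanity calculation; it assumes Avg is the unweighted mean over the six tasks rounded to one decimal, and that the margins are differences between those rounded averages, which is what reproduces the reported figures.

```python
from statistics import mean

# Per-task success rates (%) copied from the RoboTwin 2.0 table:
# Hammer, Bell, Card, Burger, Stand, Shoe.
robotwin = {
    "π0": [68, 72, 81, 79, 63, 74],
    "FastWAM": [83, 87, 92, 94, 80, 90],
    "Ours (RGB-only)": [82, 87, 91, 93, 79, 92],
    "Ours (Mask-only)": [85, 90, 93, 93, 81, 91],
    "Ours": [88, 93, 95, 97, 85, 95],
}

# Unweighted mean per method, rounded to one decimal as in the table.
averages = {name: round(mean(scores), 1) for name, scores in robotwin.items()}

# Margins in percentage points, taken between the rounded averages.
margin_vs_pi0 = round(averages["Ours"] - averages["π0"], 1)
margin_vs_fastwam = round(averages["Ours"] - averages["FastWAM"], 1)

for name, avg in averages.items():
    print(f"{name:<18}{avg:5.1f}")
print(f"Ours vs π0: +{margin_vs_pi0} pp, Ours vs FastWAM: +{margin_vs_fastwam} pp")
```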

Language-Ambiguous Tasks

In-Distribution

Distractors

Novel Instances

Lighting

Quantitative Results

Performance on Language-Ambiguous Tasks. We test how explicit spatial prompting resolves target uncertainty across four settings: one in-distribution and three zero-shot axes. (1) In-Distribution: MaskWAM reaches a 92.9% success rate; masks outperform textual coordinates, confirming that spatial cues are better delivered visually than via text. (2) Distractors: With unseen clutter, MaskWAM holds 90.4% (vs. 52.9% for π0-mask). (3) Novel Instances: On unseen target objects (e.g., a novel cup for a trained bowl), MaskWAM reaches 74.6% (vs. 44.6%), showing category-level skill transfer. (4) Lighting: Under illumination shifts, it stays robust at 81.7%. Across all settings, mask prediction consistently beats both visual-prompt VLA and textual-coordinate baselines.

Long-Horizon High-Precision Task

Long-horizon, high-precision task. MaskWAM successfully completes all 8 targets.

Visualization

Visualization of predicted RGB frames, masks, and attention masks in real data. For visualization only, we decode full future RGB and mask sequences offline; during real-world deployment, MaskWAM uses partial denoising and does not generate full future sequences at test time.
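To make the distinction between offline full decoding and test-time partial denoising concrete, the toy below integrates a flow with Euler steps and simply stops the video branch early. Everything here is an assumption for illustration: the time convention (t = 1 is pure noise, t = 0 is clean data), the stop time, the step counts, and especially the oracle velocity, which stands in for a trained network and is only available because the toy knows the clean target.

```python
import numpy as np


def euler_denoise(x, velocity_fn, t_start, t_stop, num_steps):
    """Integrate dx/dt = v(x, t) from t_start down to t_stop with Euler steps."""
    ts = np.linspace(t_start, t_stop, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * velocity_fn(x, t)
    return x


def oracle_velocity(clean):
    """Exact velocity of the straight path x_t = (1 - t) * clean + t * noise."""
    return lambda x, t: (x - clean) / t


rng = np.random.default_rng(0)
clean_video = rng.standard_normal((8, 2, 8, 8))
clean_action = rng.standard_normal((16, 7))
noise_video = rng.standard_normal(clean_video.shape)
noise_action = rng.standard_normal(clean_action.shape)

# Video latents: stop halfway, so no full future sequence is ever produced.
video_partial = euler_denoise(noise_video, oracle_velocity(clean_video), 1.0, 0.5, 2)

# Actions: denoise all the way to t = 0. A real model would condition this
# step on video_partial (for example through cached keys and values);
# the toy oracle ignores it.
action = euler_denoise(noise_action, oracle_velocity(clean_action), 1.0, 0.0, 10)
```

The point of the sketch is the asymmetry: the video branch pays for two steps and is left noisy, while the action branch is integrated to the end.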

BibTeX

If you find MaskWAM useful in your research, please cite:

@article{yu2026maskwam,
  title         = {MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models},
  author        = {Yu, Hanyang and Lin, Haitao and Zhang, Jingbo and Zhang, Wenyao and Gu, Chenghao and Li, Heng and Tan, Ping},
  journal       = {arXiv preprint arXiv:2606.13515},
  year          = {2026},
  eprint        = {2606.13515},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}