---
source: arxiv
url: http://arxiv.org/abs/2603.08546v1
published_at: '2026-03-09T16:13:32'
authors:
- Yixuan Wang
- Rhythm Syed
- Fangyu Wu
- Mengchao Zhang
- Aykut Onol
- Jose Barreiros
- Hooshang Nayyeri
- Tony Dear
- Huan Zhang
- Yunzhu Li
topics:
- world-model
- robot-policy-training
- policy-evaluation
- action-conditioned-video-prediction
- sim2real
relevance_score: 0.95
run_id: materialize-outputs
language_code: en
---

# Interactive World Simulator for Robot Policy Training and Evaluation

## Summary
This paper proposes Interactive World Simulator (IWS), an interactive world model for robot policy training and evaluation. It learns action-conditioned video prediction from a moderate-scale real robot interaction dataset and achieves stable long-horizon interaction on a single consumer GPU.

## Problem
- Existing robot world models are often **too slow**, frequently relying on heavy diffusion sampling, making them unsuitable for real-time interaction and large-scale data generation.
- Existing methods also often become unstable during **long-horizon rollout prediction**, where accumulated errors lead to robot pose drift and inconsistent object interactions, making them difficult to use for reliable policy training and evaluation.
- This matters because robot imitation learning and policy iteration depend heavily on large amounts of data and frequent evaluation, while real-robot data collection/evaluation is expensive, slow, and hard to reproduce.

## Approach
- The core idea is to first compress images into a **2D latent space**, perform action-conditioned future prediction only in latent space, and then decode back to pixels, making the system faster and more stable.
- The method has two stages: first, train an autoencoder with a **CNN encoder + consistency-model decoder** to obtain high-fidelity reconstructions; then freeze the autoencoder and train an **action-conditioned consistency latent dynamics model** to predict the next-frame latent.
- The dynamics model conditions on the latents and actions of several past frames as context, denoises a noised latent of the current frame, and can capture multimodal futures; the network is implemented with **3D convolutions, FiLM conditioning, and spatiotemporal attention**.
- To support long-horizon rollout, inference uses an **autoregressive sliding window**, and small noise is injected into the context during training so the model stays robust to the error accumulation that arises when its own predictions are fed back as inputs.
- This world model can be directly used for two types of applications: teleoperating inside the simulator to collect demonstration data for training imitation policies, and performing reproducible policy evaluation in the simulator.
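The FiLM conditioning mentioned above modulates latent feature maps with per-channel scale and shift derived from the action. A minimal sketch, assuming a single linear "hypernetwork" and small toy shapes (the paper's exact layer sizes and architecture are not given in the excerpt):

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM: feature-wise affine modulation, gamma * h + beta per channel."""
    # features: (C, H, W); gamma, beta: (C,)
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 16, 16))   # hypothetical latent feature map
action = rng.standard_normal(7)             # e.g. a 7-DoF action vector

# A tiny linear map from the action to per-channel (gamma, beta);
# initializing gamma around 1 keeps modulation close to identity at first.
W = rng.standard_normal((16, 7)) * 0.1
gb = W @ action
gamma, beta = 1.0 + gb[:8], gb[8:]

modulated = film(latent, gamma, beta)
assert modulated.shape == latent.shape
```

With `gamma = 1` and `beta = 0`, `film` is the identity, which is why conditioning layers like this are typically initialized near that point.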
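The autoregressive sliding-window rollout with context-noise injection can be sketched as follows; `dynamics_step` stands in for the learned consistency dynamics model, and the window length, latent shapes, and noise scale are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def rollout(dynamics_step, context, actions, noise_std=0.0, rng=None):
    """Autoregressive rollout: predict one latent per step from a sliding
    window of past latents, feeding each prediction back as context.
    A small noise_std > 0 during training mimics the corrupted context the
    model will see at test time, when inputs are its own imperfect outputs."""
    rng = rng or np.random.default_rng(0)
    window = list(context)                  # past latents; window size is fixed
    k = len(window)
    preds = []
    for a in actions:
        ctx = np.stack(window[-k:])         # most recent k latents
        if noise_std > 0:                   # corrupt context for robustness
            ctx = ctx + noise_std * rng.standard_normal(ctx.shape)
        z_next = dynamics_step(ctx, a)      # one denoised next-frame latent
        preds.append(z_next)
        window.append(z_next)               # slide the window forward
    return np.stack(preds)

# Toy dynamics: next latent = mean of the context window plus the action.
toy = lambda ctx, a: ctx.mean(axis=0) + a
context = [np.zeros(4) for _ in range(3)]
actions = [np.ones(4) * 0.1] * 5
traj = rollout(toy, context, actions)
assert traj.shape == (5, 4)
```

At training time one would call this with `noise_std > 0`; at inference time the same loop runs with clean context, which is what makes long rollouts stable.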

## Results
- **Long-horizon interaction and speed**: IWS runs stably at **15 FPS** on a **single RTX 4090** and supports long-horizon interactive rollout for **more than 10 minutes**.
- **Video prediction metrics outperform baselines**: On action-conditioned prediction aggregated over 7 tasks for **192 steps (19.2 seconds)**, IWS achieves **MSE 0.005±0.005**, **LPIPS 0.051±0.019**, **FID 63.50±13.78**, **PSNR 25.82±2.72**, **SSIM 0.831±0.039**, **UIQI 0.960±0.019**, and **FVD 243.20±103.58**; all are better than DINO-WM, UVA, Dreamer4, and Cosmos (for example, baseline FVD values are **1752.57, 2213.29, 1747.26, 799.34**, respectively).
- **Task coverage**: Experiments cover **6 real-world tasks + 1 simulated task**, including rigid objects, deformable objects, articulated objects, object piles, and multi-object interactions, on the **ALOHA bimanual robot** platform.
- **Data efficiency and accessibility**: In the real world, about **600 episodes** are collected per task, each with **200 steps**, requiring about **6 hours per task** for one person; for the simulated task, a scripted policy generates **10,000 episodes** of random interaction data. The authors claim that a moderate-scale dataset is sufficient to train an effective interactive world model.
- **For policy training**: Data generated by the world model is used to train imitation policies such as **DP, ACT, π0, π0.5**. Across various mixing ratios from **100% simulator data to 100% real data**, policy performance is **comparable** to training on the same amount of real-world data; however, the excerpt **does not provide specific success-rate numbers**.
- **For policy evaluation**: The authors report a **strong correlation** between simulator evaluation and real-world performance across multiple tasks and training checkpoints, suggesting it can serve as a faithful surrogate; however, the excerpt **does not provide quantitative values such as correlation coefficients**.
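The sim-to-real evaluation correlation above would typically be summarized with a Pearson coefficient over per-checkpoint success rates. A minimal sketch with entirely made-up numbers (the excerpt reports no quantitative values):

```python
import numpy as np

# Hypothetical success rates for five training checkpoints, evaluated in
# the world model vs. on the real robot. These values are illustrative only.
sim_scores  = np.array([0.20, 0.40, 0.50, 0.70, 0.80])
real_scores = np.array([0.15, 0.35, 0.55, 0.65, 0.85])

r = np.corrcoef(sim_scores, real_scores)[0, 1]  # Pearson correlation
assert 0.9 < r <= 1.0  # a faithful surrogate would show r close to 1
```

A high `r` across checkpoints is what would let the simulator rank policies in the same order as real-world evaluation.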

## Link
- [http://arxiv.org/abs/2603.08546v1](http://arxiv.org/abs/2603.08546v1)
