Recoleta Item Note

Interactive World Simulator for Robot Policy Training and Evaluation

world-model · robot-policy-training · policy-evaluation · action-conditioned-video-prediction · sim2real

This paper proposes Interactive World Simulator (IWS), an interactive world model for robot policy training and evaluation. It learns action-conditioned video prediction from a moderate-scale real robot interaction dataset and achieves stable long-horizon interaction on a single consumer GPU.

  • Existing robot world models are often too slow for real-time interaction and large-scale data generation because they typically rely on heavy multi-step diffusion sampling.
  • Existing methods also often become unstable during long-horizon rollout prediction, where accumulated errors lead to robot pose drift and inconsistent object interactions, making them difficult to use for reliable policy training and evaluation.
  • This matters because robot imitation learning and policy iteration depend heavily on large amounts of data and frequent evaluation, while real-robot data collection/evaluation is expensive, slow, and hard to reproduce.
  • The core idea is to first compress images into a 2D latent space, perform action-conditioned future prediction only in latent space, and then decode back to pixels, making the system faster and more stable.
  • The method has two stages: first, train an autoencoder with a CNN encoder + consistency-model decoder to obtain high-fidelity reconstructions; then freeze the autoencoder and train an action-conditioned consistency latent dynamics model to predict the next-frame latent.
  • The dynamics model takes the latents and actions of several past frames as context, denoises a noised latent of the target frame, and thereby learns multimodal futures; the network is built from 3D convolutions, FiLM conditioning, and spatiotemporal attention.
  • To support long-horizon rollout, inference uses an autoregressive sliding window, and small noise is injected into the context latents during training so the model remains robust to error propagation when its own predictions are fed back as inputs.
  • This world model can be directly used for two types of applications: teleoperating inside the simulator to collect demonstration data for training imitation policies, and performing reproducible policy evaluation in the simulator.
  • Long-horizon interaction and speed: IWS runs stably at 15 FPS on a single RTX 4090 and supports long-horizon interactive rollout for more than 10 minutes.
  • Video prediction metrics outperform baselines: On action-conditioned prediction aggregated over 7 tasks for 192 steps (19.2 seconds), IWS achieves MSE 0.005±0.005, LPIPS 0.051±0.019, FID 63.50±13.78, PSNR 25.82±2.72, SSIM 0.831±0.039, UIQI 0.960±0.019, and FVD 243.20±103.58; all are better than DINO-WM, UVA, Dreamer4, and Cosmos (for example, baseline FVD values are 1752.57, 2213.29, 1747.26, 799.34, respectively).
  • Task coverage: Experiments cover 6 real-world tasks + 1 simulated task, including rigid objects, deformable objects, articulated objects, object piles, and multi-object interactions, on the ALOHA bimanual robot platform.
  • Data efficiency and accessibility: In the real world, about 600 episodes are collected per task, each with 200 steps, requiring about 6 hours per task for one person; for the simulated task, a scripted policy generates 10,000 episodes of random interaction data. The authors claim that a moderate-scale dataset is sufficient to train an effective interactive world model.
  • For policy training: Data generated by the world model is used to train imitation policies such as DP, ACT, π0, π0.5. Across various mixing ratios from 100% simulator data to 100% real data, policy performance is comparable to training on the same amount of real-world data; however, the excerpt does not provide specific success-rate numbers.
  • For policy evaluation: The authors report a strong correlation between simulator evaluation and real-world performance across multiple tasks and training checkpoints, suggesting it can serve as a faithful surrogate; however, the excerpt does not provide quantitative values such as correlation coefficients.
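The FiLM-based action conditioning mentioned in the architecture bullet can be sketched as below. This is a generic FiLM module, not the authors' implementation; the channel count and the 14-dimensional action (a plausible size for a bimanual arm, purely an assumption) are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: an MLP maps the action vector to a
    per-channel scale (gamma) and shift (beta) applied to the feature map.
    All dimensions here are illustrative, not taken from the paper."""
    def __init__(self, action_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(action_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) latent feature map; action: (B, action_dim)
        gamma, beta = self.to_gamma_beta(action).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]            # broadcast over H, W
        beta = beta[:, :, None, None]
        return (1 + gamma) * feats + beta          # residual-style modulation

film = FiLM(action_dim=14, channels=64)            # hypothetical sizes
feats = torch.randn(2, 64, 16, 16)
action = torch.randn(2, 14)
out = film(feats, action)
```

In the paper's dynamics network such a block would sit alongside the 3D convolutions and spatiotemporal attention, letting the action modulate every spatial position of the latent uniformly.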
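The autoregressive sliding-window rollout and the training-time context-noise trick can be sketched as follows. Here `encode`, `dynamics`, and `decode` are hypothetical placeholders for the paper's frozen autoencoder and consistency latent dynamics model; the smoke test at the bottom uses toy identity/additive stand-ins.

```python
import torch

def rollout(encode, dynamics, decode, seed_frames, actions, window=4):
    """Sliding-window autoregressive rollout (a sketch, not the authors' code).
    `dynamics(context_latents, action)` is assumed to return the next-frame
    latent in a single consistency-model denoising step."""
    latents = [encode(f) for f in seed_frames]     # seed the context window
    frames = []
    for a in actions:
        ctx = torch.stack(latents[-window:], dim=1)  # last `window` latents
        z_next = dynamics(ctx, a)                    # predict next latent
        latents.append(z_next)                       # feed prediction back in
        frames.append(decode(z_next))                # decode for display
    return frames

def noised_context(latents, sigma=0.05):
    """Training-time robustness trick from the paper: perturb context latents
    with small Gaussian noise so the model tolerates the imperfect latents it
    will produce itself at rollout time. Sigma is an assumed value."""
    return [z + sigma * torch.randn_like(z) for z in latents]

# Smoke test with toy components (all shapes illustrative).
encode = lambda frame: frame                       # identity "encoder"
dynamics = lambda ctx, a: ctx[:, -1] + a           # toy next-latent rule
decode = lambda z: z                               # identity "decoder"
seed = [torch.zeros(1, 8) for _ in range(4)]
acts = [torch.ones(1, 8) for _ in range(6)]
frames = rollout(encode, dynamics, decode, seed, acts)
```

Because each step consumes only the last few latents, memory stays constant over arbitrarily long rollouts, which is what makes the reported 10-minute interactive sessions feasible.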