Recoleta Item Note

WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems

world-model, trajectory-prediction, mixture-of-experts, robot-morphology, scalable-pretraining

Summary

WestWorld is a scalable trajectory world model that jointly learns dynamics across a wide variety of robotic systems while explicitly encoding knowledge of each robot's morphological structure. Its core contribution is to combine two ideas, that “different robots should use different dynamics submodules” and that “robot structural priors matter,” in order to improve zero-shot prediction, few-shot adaptation, and downstream control performance.

Problem

Existing multi-robot trajectory world models typically force all systems into a single shared dense model. As the number of robot types grows, this easily leads to gradient conflicts and negative transfer, resulting in poor scalability.
Different robots differ in sensor/actuator dimensions and sampling frequencies, as well as in morphological and kinematic structure, making shared representations difficult to learn well.
Many methods treat trajectories only as token sequences and ignore the physical prior of robot morphological structure, leading to weak zero-shot generalization to unseen robots.

Approach

The authors propose WestWorld, a unified trajectory world model pretrained on 89 simulated and real-world environments to predict future states from past state-action history.
To handle dynamics differences across robots, the model uses a system-aware Mixture-of-Experts (Sys-MoE): it first learns a system embedding, then uses it to assign weights to multiple experts, allowing different robots to dynamically combine different experts instead of forcing all systems to share the same parameter set.
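The routing idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes dense softmax gating, linear experts, and fixed toy embeddings, whereas the note only says that a learned system embedding assigns weights to multiple experts. The names `sys_moe`, `gate_weights`, and `experts` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sys_moe(hidden, system_embedding, gate_weights, experts):
    """Mix expert outputs using gate weights derived from the system embedding.

    Assumption: dense softmax gating and linear experts; the paper's router
    and expert architecture may differ.
    """
    gate = softmax(matvec(gate_weights, system_embedding))
    outputs = [matvec(expert, hidden) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(len(outputs[0]))
    ]
    return mixed, gate


# Toy setup with random weights (illustrative values only).
rng = random.Random(0)
dim, emb_dim, num_experts = 4, 3, 2
experts = [
    [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(num_experts)
]
gate_weights = [
    [rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(num_experts)
]
hidden = [0.5, -0.2, 0.1, 0.9]

# Two hypothetical robots with different system embeddings.
mixed_a, gate_a = sys_moe(hidden, [1.0, 0.0, 0.0], gate_weights, experts)
mixed_b, gate_b = sys_moe(hidden, [0.0, 1.0, 0.0], gate_weights, experts)
print(gate_a, gate_b)
```

Because the gate depends only on the system embedding, two robots with the same hidden state can still receive different expert mixtures, which is the mechanism the note credits with reducing interference between systems.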
To inject physical structural priors, the model constructs a structural embedding: it represents the robot as a kinematic tree, converts it to a binary tree via the left-child right-sibling (LCRS) representation, extracts pre-/in-/post-order traversal indices and object IDs, and embeds these structural indices into the state/action channel representations.
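To make the structural indexing concrete, the sketch below converts a small kinematic tree to its left-child right-sibling (LCRS) binary form and reads off pre-, in-, and post-order positions for each body. The five-body robot, the function names, and the exact traversal conventions are assumptions for illustration; the note does not specify how the paper orders siblings or assigns object IDs.

```python
def to_lcrs(children, root):
    """Return (left, right) maps: left = first child, right = next sibling."""
    left, right = {}, {}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            left[node] = kids[0]
        for current, sibling in zip(kids, kids[1:]):
            right[current] = sibling
        stack.extend(kids)
    return left, right


def traversal_indices(left, right, root):
    """Map each node to its (pre, in, post) position in the LCRS binary tree."""
    pre, ino, post = [], [], []

    def visit(node):
        if node is None:
            return
        pre.append(node)
        visit(left.get(node))
        ino.append(node)
        visit(right.get(node))
        post.append(node)

    visit(root)
    return {n: (pre.index(n), ino.index(n), post.index(n)) for n in pre}


# Hypothetical kinematic tree: a torso with two two-link legs.
kinematic_tree = {
    "torso": ["left_thigh", "right_thigh"],
    "left_thigh": ["left_shin"],
    "right_thigh": ["right_shin"],
}
left, right = to_lcrs(kinematic_tree, "torso")
indices = traversal_indices(left, right, "torso")
for body, triple in indices.items():
    print(body, triple)
```

Each triple could then index a learned embedding table, so that channels attached to the same body share a structural code; that lookup step is omitted here.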
On the input side, each scalar state/action channel is first normalized, discretized, and embedded; in the backbone, self-attention models relationships among state channels, cross-attention injects action conditioning, and then Mamba-style SSM and MoE experts perform dynamics modeling.
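The input-side tokenization can be illustrated as follows. This sketch assumes min-max normalization with clipping and uniform bins; the note states only that channels are normalized, discretized, and embedded, so the range, bin count, and helper names here are hypothetical.

```python
def normalize(value, low, high):
    """Min-max scale a scalar to [0, 1], clipping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def discretize(value, low, high, num_bins):
    """Map a scalar channel value to an integer token in [0, num_bins - 1]."""
    unit = normalize(value, low, high)
    return min(num_bins - 1, int(unit * num_bins))


def bin_center(token, low, high, num_bins):
    """Decode a token back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / num_bins


# Assumed channel range and vocabulary size (not from the paper).
LOW, HIGH, NUM_BINS = -1.0, 1.0, 256

joint_angle = 0.3172
token = discretize(joint_angle, LOW, HIGH, NUM_BINS)
decoded = bin_center(token, LOW, HIGH, NUM_BINS)
print(token, decoded)
```

Treating every scalar channel as its own token is what lets one model accept robots with different state and action dimensions: a robot with more sensors simply contributes more tokens per time step.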
The training objective is cross-entropy prediction over discretized next-state tokens; at inference time, multi-step prediction can be completed in a single forward pass.
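As a rough illustration of this objective, the snippet below computes the cross-entropy between predicted logits over bins and the true next-state token for each channel, then averages over channels. The averaging scheme and function names are assumptions; the note does not say how the paper weights channels or time steps.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target bin under softmax(logits)."""
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_partition - logits[target]


def next_state_loss(logits_per_channel, target_tokens):
    """Average cross-entropy over all discretized next-state channels."""
    losses = [
        cross_entropy(logits, target)
        for logits, target in zip(logits_per_channel, target_tokens)
    ]
    return sum(losses) / len(losses)


# Toy example with 4 bins and 2 state channels (illustrative values only).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0], 2)
batch_loss = next_state_loss(
    [[0.0, 0.0, 5.0, 0.0], [2.0, 0.0, 0.0, 0.0]],
    [2, 0],
)
print(uniform_loss, confident_loss, batch_loss)
```

A uniform prediction over K bins costs log K, so the loss falls below that value only when the model concentrates probability on the correct bin.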

Results

In zero-shot long-horizon prediction (50-step history input, 100-step prediction), WestWorld outperforms the baselines on all 3 unseen environments (baseline values in each row are ordered TDM / TrajWorld / MLPEnsemble):
- Walker2d: MAE 16.350 vs TDM 20.122 / TrajWorld 22.261 / MLPEnsemble 26.006; MSE 5.064 vs 6.428 / 8.623 / 12.028.
- Hopper: MAE 13.731 vs 17.634 / 17.388 / 19.987; MSE 3.368 vs 5.076 / 5.441 / 7.216.
- Franka: MAE 7.737 vs 23.686 / 13.102 / 12.164; MSE 2.539 vs 8.435 / 5.127 / 4.271.
In few-shot adaptation (fine-tuning with only 10 episodes per dataset), WestWorld also performs best on 3 real robot systems:
- Cassie: MAE 5.316±0.108 vs TrajWorld 7.834±0.167; MSE 0.808±0.025 vs 1.697±0.109.
- A1: MAE 4.227±0.120 vs TrajWorld 5.138±0.200; MSE 0.628±0.040 vs 0.900±0.050.
- UR5: MAE 4.925±0.317 vs TrajWorld 8.066±0.799; MSE 0.831±0.150 vs 2.117±0.433.
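To put the MAE figures above on a common scale, the following sketch computes WestWorld's relative MAE reduction against the strongest listed baseline in each setting. The input values are copied from the lists above; the percentages it prints are derived here for convenience and are not reported in the excerpt.

```python
# MAE values copied from the Results lists: (WestWorld, [baselines...]).
reported_mae = {
    "Walker2d (zero-shot)": (16.350, [20.122, 22.261, 26.006]),
    "Hopper (zero-shot)": (13.731, [17.634, 17.388, 19.987]),
    "Franka (zero-shot)": (7.737, [23.686, 13.102, 12.164]),
    "Cassie (few-shot)": (5.316, [7.834]),
    "A1 (few-shot)": (4.227, [5.138]),
    "UR5 (few-shot)": (4.925, [8.066]),
}


def relative_reduction(ours, baselines):
    """Fractional MAE reduction relative to the best (lowest) baseline."""
    best = min(baselines)
    return (best - ours) / best


reductions = {
    name: relative_reduction(ours, baselines)
    for name, (ours, baselines) in reported_mae.items()
}
for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
```

Note that the strongest baseline differs by environment (TDM on Walker2d, TrajWorld on Hopper, MLPEnsemble on Franka), so comparing against a single fixed baseline would give a different picture.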
Scalability: as the number of pretraining environments increases (N = 1, 2, 5, 10, 20, 30), the paper states that WestWorld’s error remains relatively low and changes little, whereas TrajWorld degrades significantly as more environments are added. The excerpt gives no specific values for this experiment, but the paper emphasizes it as a main scalability conclusion.
Downstream and real-world deployment: the authors claim that WestWorld significantly improves downstream model-based control performance across different robots and that it has been deployed on a real Unitree Go1, where it demonstrates stable locomotion. The excerpt does not provide specific control scores or success rates.

Link

http://arxiv.org/abs/2603.14392v1
