Recoleta Item Note
MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation
MetaWorld-X is a hierarchical world model framework for humanoid loco-manipulation. It decomposes complex control into multiple expert policies with human motion priors, and uses a VLM-supervised router to compose these…
humanoid-controlworld-modelmixture-of-expertsvision-language-modelsloco-manipulation
Summary
MetaWorld-X is a hierarchical world model framework for humanoid loco-manipulation. It decomposes complex control into multiple expert policies with human motion priors, and uses a VLM-supervised router to compose these experts according to task semantics, thereby improving naturalness, stability, and compositional generalization.
Problem
- The paper aims to solve the following issue: when a single monolithic policy learns locomotion and manipulation simultaneously on high-degree-of-freedom humanoid robots, it is prone to cross-skill gradient interference, motion pattern conflicts, jitter, falls, and unnatural movements.
- This matters because if humanoid robots are to execute real-world multi-stage tasks, they must maintain balance, move, and perform fine manipulation at the same time; optimizing only task return often sacrifices motion naturalness and stability.
- Existing world model or MoE methods either suffer from long-horizon rollout bias and policy mismatch, or lack explicit semantic guidance, making it difficult to achieve stable and composable skill orchestration.
Approach
- The core method is “divide and conquer”: first train a Specialized Expert Pool (SEP), learning basic skills such as standing, walking, running, sitting, carrying, and reaching as separate experts, thereby avoiding conflicts among different skills within a single policy.
- Each expert is trained with human motion data + imitation-constrained reinforcement learning: motion retargeting maps MoCap/SMPL movements to the robot, then an energy-based reward based on joint position/velocity error is used to match human motion, making the resulting behavior more natural and biomechanically consistent.
- The framework retains world model / latent planning: imitation alignment rewards are incorporated into the world model’s reward head and value function, and MPPI/CEM planning is performed in latent space to improve sample efficiency and anticipatory control.
- Then an Intelligent Routing Mechanism (IRM) is trained: it takes the current observation and task semantics as input, outputs mixture weights for each expert, and the final action is the weighted sum of expert actions.
- This router is trained through VLM-supervised distillation: it first uses task-level semantic relevance for coarse alignment, then refines with few-shot demonstrations, enabling a transition from reliance on VLM guidance to autonomous routing, while supporting zero-shot/few-shot compositional generalization.
Results
- On the basic skill evaluations in Humanoid-bench, IRM is stronger than strong baselines in both return and convergence speed: for example, on Walk, Ours reaches 1118.7±7.1, higher than TD-MPC2 644.2±162.3 and DreamerV3 428.2±14.5; convergence takes only 0.5M steps, better than TD-MPC2’s 1.8M and DreamerV3’s 6.0M.
- The advantage is especially large on Run: Ours achieves 2056.9±13.6, compared with TD-MPC2 66.1±4.7 and DreamerV3 298.5±84.5; convergence takes 1.0M steps, better than TD-MPC2’s 2.0M and DreamerV3’s 6.0M.
- Other basic skills also lead: Stand 815.9±0.3 vs TD-MPC2 749.8±63.1; Sit 862.2±2.1 vs 733.9±120.6; Carry 963.5±5.1 vs 438.0±72.9; and convergence is typically within 0.5–0.6M, significantly faster than the baselines’ 1.1–6.0M.
- In success rate over 10 independent trials, Ours reaches 9/10 on Stand/Walk/Run/Carry, and Sit 8/10; compared with TD-MPC2 at only 3/10、3/10、2/10、3/10、4/10, while PPO is mostly 0/10.
- On complex manipulation tasks, MetaWorld-X also outperforms baselines: Door 470.0±2.2 vs TD-MPC2 285.0±12.0, Basketball 250.0±11.9 vs 148.4±3.3, Push 70.0±2.1 vs -113.8±6.8, Truck 1500.0±15.6 vs 1213.2±1.1, Package -5200.0±47.2 vs -6788.5±552.7.
- Ablations show that both key components are important: on the Door task, the Full Model achieves a return of 303.95 with 12.64w training steps; removing the Router drops performance to 296.57 / 20.36w; removing VLM or IL causes task failure (among them, w/o IL has return 193.61, but cannot converge/adapt effectively).
Link
Built with Recoleta
Run your own research radar
Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.