RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
Summary
RoboCasa365 is a large-scale household mobile manipulation simulation benchmark for training and evaluating generalist robots, focused on addressing the lack of a reproducible, systematic, large-scale benchmark. It simultaneously scales up tasks, environments, and demonstration data, and uses this to analyze key factors affecting multi-task training, robot foundation model training, and lifelong learning.
Problem
- Existing robot learning systems make it difficult to reliably measure "how far we are from general-purpose household robots" because there is no reproducible, systematic, sufficiently large-scale evaluation benchmark.
- Real-world data collection and evaluation are costly and noisy, making it hard to systematically study how task diversity, environment variation, and data scale affect generalization.
- Existing simulation frameworks usually have too few tasks, narrow environments, and limited data scale, making them insufficient for training and fairly comparing generalist robot policies and robot foundation models.
Approach
- Build RoboCasa365: an extension of RoboCasa into a simulation framework with 365 everyday tasks, 2,500 kitchen environments, and 2,000+ hours of robot interaction data.
- At the task level, it includes 65 atomic tasks and 300 composite tasks; composite tasks are first generated by an LLM as high-level activities and task blueprints, then manually implemented in the simulator.
- At the environment level, it uses 50 real residential kitchen layouts × 50 styles = 2,500 pretraining environments, kept separate from 10 held-out target environments for stricter generalization evaluation.
- At the data level, it provides 30k pretraining human demonstrations and 25k target-task human demonstrations, and uses MimicGen on 60 atomic tasks to expand from 100 seed demonstrations per task to 10k per task, producing 1,615 hours of synthetic data.
- The benchmark evaluation covers three settings: large-scale multi-task training, foundation model pretraining + finetuning, and lifelong learning, and compares methods including Diffusion Policy, pi_0, pi_0.5, and GR00T N1.5.
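The environment and data scaling described above is simple combinatorics. A minimal sketch (all names hypothetical, not the RoboCasa365 API) of how the numbers compose:

```python
from itertools import product

# 50 real kitchen layouts x 50 visual styles = 2,500 pretraining environments
layouts = [f"layout_{i:02d}" for i in range(50)]   # hypothetical identifiers
styles = [f"style_{j:02d}" for j in range(50)]
pretrain_envs = [f"{l}/{s}" for l, s in product(layouts, styles)]
assert len(pretrain_envs) == 2500

# MimicGen-style expansion: 60 atomic tasks, 100 human seed demos each,
# expanded to 10,000 synthetic demos per task
seed_demos = 60 * 100           # 6,000 human seed demonstrations
synthetic_demos = 60 * 10_000   # 600,000 synthetic demonstrations
expansion_factor = synthetic_demos // seed_demos
print(expansion_factor)  # → 100 (a 100x expansion per task)
```

This only illustrates the scale arithmetic; the actual seed-to-synthetic pipeline is MimicGen's trajectory-adaptation machinery, not a loop over names.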
Results
- Benchmark scale claim: 365 tasks, 2,500 kitchen environments, 612 hours of human demonstrations + 1,615 hours of synthetic demonstrations; the paper claims it is among the first simulation frameworks to simultaneously provide "hundreds of tasks, thousands of environments, large-scale high-quality data, and a systematic benchmark."
- Multi-task training (300 pretraining tasks, evaluation on 50 target tasks): GR00T N1.5 achieves an average success rate of 20.0%, outperforming pi_0.5 at 16.9%, pi_0 at 15.0%, and Diffusion Policy at 6.1%. By task type, GR00T N1.5 scores 43.0% / 9.6% / 4.4% on Atomic / Composite-Seen / Composite-Unseen respectively, showing that long-horizon composite tasks and unseen tasks are substantially harder.
- Benefit of foundation model training: On 50 target tasks, with 10% / 30% / 100% of target data, GR00T's average success rate under "target-data-only training" is 21.0% / 34.3% / 43.7%, versus 35.9% / 42.2% / 51.1% under "pretraining + target finetuning." The paper explicitly claims pretraining yields roughly a 3× data-efficiency improvement.
- The largest gains are on unseen composite tasks: On Composite-Unseen, with 100% target data, "target-only training" reaches 33.3%, while "pretraining + finetuning" reaches 42.1%; with 10% data, the numbers are 11.2% vs 22.7%.
- Strong zero-/few-shot transfer on atomic tasks, weak on composite tasks: With pretraining only, Atomic reaches 41.9%, but Composite-Seen / Unseen are only 0.0% / 0.2%, indicating that pretrained knowledge transfers more easily to short-horizon skills but remains insufficient for long-horizon planning.
- Lifelong learning shows significant catastrophic forgetting: In four-stage sequential training, the Atomic success rate drops from 41.5% in Phase 1 to 10.6% in Phase 4, and success on 2-3-stage tasks drops from 24.5% to 1.7%, showing that as the model learns new, longer-horizon tasks, performance on earlier tasks keeps declining.
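The 3× data-efficiency claim can be read directly off the numbers above: with pretraining, 10% of target data already beats what target-only training achieves with 30%. A small check (figures copied from the results above):

```python
# Average success rates on 50 target tasks, keyed by target-data fraction
target_only = {0.10: 21.0, 0.30: 34.3, 1.00: 43.7}
pretrained  = {0.10: 35.9, 0.30: 42.2, 1.00: 51.1}

# Pretraining + 10% target data (35.9%) exceeds target-only at 30% (34.3%),
# i.e. roughly a 3x reduction in target data for comparable performance
assert pretrained[0.10] > target_only[0.30]

# Absolute gain from pretraining at each data fraction: the benefit is
# largest in the low-data regime and shrinks as target data grows
gains = {f: round(pretrained[f] - target_only[f], 1) for f in target_only}
print(gains)  # → {0.1: 14.9, 0.3: 7.9, 1.0: 7.4}
```

Note the comparison underlying the 3× claim is approximate: 10% with pretraining is matched against 30% without, not an exact interpolation.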
Link