---
source: arxiv
url: http://arxiv.org/abs/2603.04356v1
published_at: '2026-03-04T18:20:03'
authors:
- Soroush Nasiriany
- Sepehr Nasiriany
- Abhiram Maddukuri
- Yuke Zhu
topics:
- robot-benchmark
- simulation-framework
- generalist-robot-policy
- robot-foundation-model
- lifelong-learning
- mobile-manipulation
relevance_score: 0.97
run_id: materialize-outputs
language_code: en
---

# RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

## Summary
RoboCasa365 is a large-scale household mobile manipulation simulation benchmark for training and evaluating generalist robots, built to address the lack of a **reproducible, systematic, large-scale benchmark**. It scales up tasks, environments, and demonstration data simultaneously, and uses this scale to analyze the key factors affecting multi-task training, robot foundation model training, and lifelong learning.

## Problem
- Existing robot learning systems make it difficult to reliably measure "how far we are from general-purpose household robots" because there is no **reproducible, systematic, sufficiently large-scale** evaluation benchmark.
- Real-world data collection and evaluation are costly and noisy, making it hard to systematically study how **task diversity, environment variation, and data scale** affect generalization.
- Existing simulation frameworks usually have too few tasks, narrow environments, and limited data scale, making them insufficient to support training and fair comparison of **generalist robot policies / robot foundation models**.

## Approach
- Build RoboCasa365: an extension of RoboCasa into a simulation framework with **365 everyday tasks**, **2,500 kitchen environments**, and **2,000+ hours** of robot interaction data.
- At the task level, it includes **65 atomic tasks** and **300 composite tasks**; composite tasks are first generated by an LLM as high-level activities and task blueprints, then manually implemented in the simulator.
- At the environment level, it uses **50 real residential kitchen layouts × 50 styles = 2,500 pretraining environments**, separated from **10 target environments** for stricter generalization evaluation.
- At the data level, it provides **30k pretraining human demonstrations** and **25k target-task human demonstrations**, and uses MimicGen on **60 atomic tasks** to expand from **100 seed demonstrations per task to 10k per task**, producing **1,615 hours** of synthetic data.
- The benchmark evaluation covers three settings: **large-scale multi-task training, foundation model pretraining + finetuning, and lifelong learning**, and compares methods including Diffusion Policy, pi_0, pi_0.5, and GR00T N1.5.
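The scale figures in the bullets above compose multiplicatively. A minimal tally sketch, using only numbers stated in this summary (the 600k total synthetic-demo count is derived arithmetic, not a figure quoted in the paper):

```python
# Tally of RoboCasa365's reported scale. All inputs come from the
# summary above; the total synthetic-demo count is derived, not quoted.

atomic_tasks = 65
composite_tasks = 300
total_tasks = atomic_tasks + composite_tasks       # 365 everyday tasks

layouts, styles = 50, 50
pretraining_envs = layouts * styles                # 2,500 kitchen environments

mimicgen_tasks = 60                                # atomic tasks expanded by MimicGen
demos_per_task = 10_000                            # expanded from 100 seeds each
synthetic_demos = mimicgen_tasks * demos_per_task  # 600,000 generated demonstrations

print(total_tasks, pretraining_envs, synthetic_demos)  # 365 2500 600000
```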

## Results
- **Benchmark scale claim**: 365 tasks, 2,500 kitchen environments, **612 hours of human demonstrations + 1,615 hours of synthetic demonstrations**; the paper claims it is among the first simulation frameworks to simultaneously provide "hundreds of tasks, thousands of environments, large-scale high-quality data, and a systematic benchmark."
- **Multi-task training (300 pretraining tasks, evaluation on 50 target tasks)**: GR00T N1.5 achieves an average success rate of **20.0%**, outperforming pi_0.5 at **16.9%**, pi_0 at **15.0%**, and Diffusion Policy at **6.1%**. By task type, GR00T scores **43.0 / 9.6 / 4.4** on Atomic / Composite-Seen / Composite-Unseen respectively, showing that long-horizon composite tasks and unseen tasks are substantially harder.
- **Benefit of foundation model pretraining**: On the 50 target tasks, GR00T trained on target data alone reaches **21.0% / 34.3% / 43.7%** average success with 10% / 30% / 100% of the target data, versus **35.9% / 42.2% / 51.1%** with pretraining followed by target finetuning. The paper explicitly claims pretraining yields roughly a **3× data-efficiency improvement**.
- **The largest gains are on unseen composite tasks**: On Composite-Unseen, with 100% target data, "target-only training" reaches **33.3%**, while "pretraining + finetuning" reaches **42.1%**; with 10% data, the numbers are **11.2% vs 22.7%**.
- **Strong zero-/few-shot transfer on atomic tasks, weak on composite tasks**: With pretraining only, Atomic reaches **41.9%**, but Composite-Seen / Unseen are only **0.0% / 0.2%**, indicating that pretrained knowledge transfers more easily to short-horizon skills but remains insufficient for long-horizon planning.
- **Lifelong learning shows severe catastrophic forgetting**: Across four training stages, atomic-task success drops from **41.5%** after stage 1 to **10.6%** after stage 4, and tasks introduced in stages 2-3 drop from **24.5%** to **1.7%**, showing that performance on earlier tasks keeps declining as the model learns new, longer-horizon tasks.
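The ~3× data-efficiency claim can be read directly off the finetuning numbers: with pretraining, 10% of the target data already outperforms target-only training on 30% of the data. A minimal sanity check using the success rates quoted above:

```python
# Average success rates (%) on the 50 target tasks, copied from the
# results above, keyed by the fraction of target demonstrations used.
target_only = {0.10: 21.0, 0.30: 34.3, 1.00: 43.7}
pretrain_finetune = {0.10: 35.9, 0.30: 42.2, 1.00: 51.1}

# Pretraining + 10% target data beats target-only training with 30%
# of the data -- the sense in which pretraining buys ~3x efficiency.
assert pretrain_finetune[0.10] > target_only[0.30]

# Pretraining also helps at every data fraction.
for frac in target_only:
    assert pretrain_finetune[frac] > target_only[frac]
```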

## Link
- [http://arxiv.org/abs/2603.04356v1](http://arxiv.org/abs/2603.04356v1)
