---
source: arxiv
url: http://arxiv.org/abs/2603.03596v2
published_at: '2026-03-04T00:03:02'
authors:
- Marcel Torne
- Karl Pertsch
- Homer Walke
- Kyle Vedder
- Suraj Nair
- Brian Ichter
- Allen Z. Ren
- Haohuan Wang
- Jiaming Tang
- Kyle Stachowicz
- Karan Dhabalia
- Michael Equi
- Quan Vuong
- Jost Tobias Springenberg
- Sergey Levine
- Chelsea Finn
- Danny Driess
topics:
- vision-language-action
- robot-memory
- long-horizon-control
- embodied-foundation-model
- dexterous-manipulation
relevance_score: 0.97
run_id: materialize-outputs
language_code: en
---

# MEM: Multi-Scale Embodied Memory for Vision Language Action Models

## Summary
MEM adds **multi-scale memory** to robot vision-language-action models: video memory retains fine-grained detail from the past few seconds, while language compression retains semantic events spanning more than ten minutes. The method targets real-world long-horizon manipulation tasks, especially scenarios like kitchen cleanup and cooking that require continuously tracking progress, handling occlusions, and retrying after failures.

## Problem
- Existing end-to-end robot policies typically condition only on the current observation, or naively concatenate a handful of past observations; this is insufficient for **long-horizon, multi-stage** tasks because computation and latency grow unmanageably with history length.
- Robots need two distinct kinds of memory: **short-term fine-grained memory** for recovering from occlusions, estimating dynamics, and re-grasping; and **long-term semantic memory** for tracking task progress, such as which steps have been completed and which cabinet doors are still open.
- Committing to a single memory format (images only, language only, keyframes only, etc.) forces undesirable trade-offs among spatial precision, temporal coverage, and inference efficiency, limiting performance on complex real-world robotic tasks.

## Approach
- The policy is split into two levels: the **high-level policy** takes the current observation, task goal, and existing language memory to output the next subtask instruction and update the language memory; the **low-level policy** takes a recent observation sequence and executes actions for the subtask.
- **Long-term memory** is represented as natural-language summaries: instead of storing the full history, the model continuously maintains a short semantic summary of "what has happened and still matters"; training labels are generated automatically by an external LLM from subtask sequences and success/failure markers, with explicit compression and forgetting.
- **Short-term memory** is represented with an efficient video encoder: the ViT alternates spatial attention with causal temporal attention, compressing the multi-frame visual history into the current timestep's representation; only the current timestep's tokens are fed into the VLA backbone, keeping latency bounded.
- The video encoder **does not add new learnable parameters**; it mainly changes the attention pattern and temporal positional encoding, so it can directly inherit pretrained vision-language model weights.
- The method is integrated into the **\(\pi_{0.6}\)** VLA: pretraining uses a 6-frame input (5 past frames plus the current frame, at a 1-second stride), and post-training/inference scales to an **18-frame (54-second)** observation window; on top of this, the language memory supports tasks requiring semantic memory of up to **15 minutes**.
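The high-level/low-level split with a compressed language memory can be sketched as a control loop. This is a minimal illustration with hypothetical names (`high_level`, `low_level`, `update_language_memory`); the simple keep-the-last-N forgetting rule stands in for the learned, LLM-supervised compression the paper describes:

```python
def update_language_memory(memory, event, max_items=8):
    """Append a new semantic event and forget the oldest entries past a budget.
    A hand-written stand-in for the paper's learned compression, whose training
    labels come from an external LLM with explicit compression and forgetting."""
    memory = memory + [event]
    return memory[-max_items:]


def control_step(observation, goal, memory, high_level, low_level):
    # High-level policy: picks the next subtask and updates the language memory.
    subtask, memory = high_level(observation, goal, memory)
    # Low-level policy: emits actions for that subtask from recent observations.
    actions = low_level(observation, subtask)
    return actions, memory


# Stub policies, for illustration only.
def high_level(obs, goal, memory):
    subtask = f"subtask for {goal}"
    return subtask, update_language_memory(memory, f"started: {subtask}")


def low_level(obs, subtask):
    return [f"action toward {subtask}"]
```

The key design point this sketch mirrors is that only the short summary (not the full observation history) flows back into the high-level policy at each step.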
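The alternating attention pattern in the video encoder can be made concrete as explicit boolean masks. This is a sketch under the assumption that spatial attention mixes tokens within one frame while causal temporal attention mixes the same spatial position across the current and past frames; the excerpt describes the pattern only at this level, so the exact masking is an assumption:

```python
def frame_of(idx, tokens_per_frame):
    # Frame index of a flattened token position.
    return idx // tokens_per_frame


def spatial_mask(num_frames, tokens_per_frame):
    """Within-frame (spatial) attention: token i may attend to j
    only when both belong to the same frame."""
    n = num_frames * tokens_per_frame
    return [[frame_of(i, tokens_per_frame) == frame_of(j, tokens_per_frame)
             for j in range(n)] for i in range(n)]


def causal_temporal_mask(num_frames, tokens_per_frame):
    """Causal temporal attention: attend across frames at the same
    spatial position, but never to a future frame."""
    n = num_frames * tokens_per_frame

    def pos(idx):
        return idx % tokens_per_frame

    return [[pos(i) == pos(j)
             and frame_of(j, tokens_per_frame) <= frame_of(i, tokens_per_frame)
             for j in range(n)] for i in range(n)]
```

Because both patterns are pure masking choices, alternating them changes no parameter shapes, which is consistent with the paper's claim that pretrained vision-language weights can be inherited directly.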

## Results
- The paper claims that MEM enables policies to complete real-world robot tasks requiring up to **15 minutes** of memory, including **kitchen clean-up** and **making a grilled cheese sandwich**, as well as long-horizon manipulation such as recipe setup.
- At the implementation level, the memory scales supported by MEM include: short-term video memory scalable to **18 frames / 54 seconds**, and long-term language memory covering task trajectories of up to **15 minutes**.
- In the experimental setup, training for the long-horizon recipe setup task used **42 recipes**, and evaluation was conducted on **5 unseen recipes**, unseen kitchens, and unseen objects; each policy/task used **10 rollouts**, and results are reported as mean ± standard error.
- The paper explicitly claims that, compared with the memory-free **\(\pi_{0.6}\)**, MEM **significantly improves success rates** on long-horizon tasks and achieves **state-of-the-art performance** on multiple complex manipulation tasks; however, the provided excerpt **does not include specific success-rate or score values**, so precise per-task gains cannot be listed.
- Ablation findings: both **short-term video memory** and **long-term language memory** are essential; removing either component clearly degrades long-horizon performance. The authors also report that "naive language memory" (concatenating historical instructions without compression) is clearly weaker than MEM's compressed language memory, because uncompressed histories create a larger train-inference distribution shift.
- On shorter tasks, the paper also claims that MEM provides **in-context adaptation**: for example, adjusting grasp height after a failed grasp, or changing the door-opening direction based on feedback; the excerpt gives no quantitative numbers for this, but it is one of the paper's core capability claims.

## Link
- [http://arxiv.org/abs/2603.03596v2](http://arxiv.org/abs/2603.03596v2)
