MEM: Multi-Scale Embodied Memory for Vision Language Action Models
Summary
MEM proposes a method for adding multi-scale memory to robot vision-language-action models: using video to remember fine-grained details from the past few seconds, and using language compression to remember semantic events spanning ten minutes or more. It targets real-world long-horizon manipulation tasks, especially scenarios like kitchen cleanup and cooking that require continuously tracking progress and handling occlusions or retrying after failures.
Problem
- Existing end-to-end robot policies usually look only at the current observation, or directly concatenate a small number of past observations; neither suffices for long-horizon, multi-stage tasks, since the former discards history and the latter makes computation and latency unmanageable as the history grows.
- Robots need two different kinds of memory: short-term fine-grained memory for recovering from occlusion, dynamic estimation, and re-grasping; and long-term semantic memory for remembering task progress, such as which steps have been completed and which cabinet doors are still open.
- If only a single memory form is used (images only, language only, keyframes only, etc.), it often leads to undesirable trade-offs among spatial precision, temporal coverage, and inference efficiency, limiting performance on complex real-world robotic tasks.
Approach
- The policy is split into two levels: the high-level policy takes the current observation, task goal, and existing language memory to output the next subtask instruction and update the language memory; the low-level policy takes a recent observation sequence and executes actions for the subtask.
- Long-term memory is represented with natural-language summaries: instead of storing the full history, the model continuously maintains a short semantic summary of “what has happened and is still important”; training labels are automatically generated by an external LLM based on subtask sequences and success/failure markers, with explicit compression and forgetting.
- Short-term memory is represented with an efficient video encoder: spatial attention and causal temporal attention are alternated in the ViT, compressing multi-frame visual history into the current timestep representation, and only the current timestep tokens are fed into the VLA backbone, thereby controlling latency.
- The video encoder does not add new learnable parameters; it mainly changes the attention pattern and temporal positional encoding, so it can directly inherit pretrained vision-language model weights.
- The method is integrated into the $\pi_{0.6}$ VLA: pretraining uses 6-frame input (5 past frames plus the current frame, at a 1-second stride), and post-training/inference can scale to 18 frames, i.e. 54 seconds of observation memory; combined with the language memory, the system supports tasks requiring semantic memory of up to 15 minutes.
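The two-level split described above can be sketched as a control loop. The stubs below are hypothetical placeholders (a real system would query a VLM for the high-level policy and a VLA for the low-level one), and the fixed-length truncation of the summary stands in for the paper's learned compression and forgetting:

```python
from dataclasses import dataclass

MAX_SUMMARY_CHARS = 200  # hypothetical budget standing in for learned compression

@dataclass
class LanguageMemory:
    """Short natural-language summary of 'what has happened and is still important'."""
    summary: str = ""

def high_level_policy(observation, goal, memory):
    # Stub: a real high-level VLM would condition on the observation, goal,
    # and current language memory to emit the next subtask instruction.
    subtask = f"next step toward: {goal}"
    # Update the language memory, keeping it bounded (compression + forgetting).
    memory.summary = (memory.summary + f" | started '{subtask}'")[-MAX_SUMMARY_CHARS:]
    return subtask, memory

def low_level_policy(recent_frames, subtask):
    # Stub: a real low-level VLA would emit motor actions for the subtask
    # from a short window of recent observations (short-term video memory).
    return [f"action for '{subtask}' at recent frame {i}" for i in range(len(recent_frames))]

def control_loop(frames, goal, horizon=3, window=3):
    memory = LanguageMemory()
    actions = []
    for t in range(min(horizon, len(frames))):
        subtask, memory = high_level_policy(frames[t], goal, memory)
        recent = frames[max(0, t - window + 1): t + 1]
        actions.extend(low_level_policy(recent, subtask))
    return actions, memory
```

The key design point this illustrates is the asymmetry: only the short bounded summary persists across subtasks, while raw frames are consumed in a sliding window and discarded.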
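The alternating spatial/causal-temporal attention pattern can be illustrated in plain NumPy. This is a minimal single-head sketch without projections or positional encodings, not the paper's actual ViT; the shapes and `video_encode` name are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    # Scaled dot-product attention; mask=False entries are blocked.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def video_encode(frames):
    # frames: (T, N, D) — T timesteps, N patch tokens per frame, D channels.
    T, N, D = frames.shape
    # Spatial attention: tokens attend only within their own frame.
    x = attend(frames, frames, frames)            # (T, N, D)
    # Causal temporal attention: each spatial location attends to itself
    # at the current and past timesteps only (lower-triangular mask).
    xt = x.transpose(1, 0, 2)                     # (N, T, D)
    causal = np.tril(np.ones((T, T), dtype=bool))
    xt = attend(xt, xt, xt, mask=causal)
    x = xt.transpose(1, 0, 2)                     # (T, N, D)
    # Only current-timestep tokens are fed to the VLA backbone,
    # so backbone cost stays constant as T grows.
    return x[-1]                                  # (N, D)
```

Because neither step adds learnable parameters beyond standard attention, this pattern is compatible with inheriting pretrained vision-language weights, as the summary notes.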
Results
- The paper claims that MEM enables policies to complete real-world robot tasks requiring memory of up to 15 minutes, including kitchen clean-up and making a grilled cheese sandwich, as well as long-horizon manipulation such as recipe setup.
- At the implementation level, the memory scales supported by MEM include: short-term video memory scalable to 18 frames / 54 seconds, and long-term language memory covering task trajectories of up to 15 minutes.
- In the experimental setup, training for the long-horizon recipe setup task used 42 recipes, and evaluation was conducted on 5 unseen recipes, unseen kitchens, and unseen objects; each policy/task used 10 rollouts, and results are reported as mean ± standard error.
- The paper explicitly claims that, compared with the memory-free $\pi_{0.6}$, MEM significantly improves success rate on long-horizon tasks and achieves state-of-the-art performance on multiple complex manipulation tasks; however, the provided excerpt does not include specific success-rate or score values, so precise per-task gains cannot be listed.
- Ablation findings: both short-term video memory and long-term language memory are essential; removing either component clearly weakens long-horizon task performance. The authors also claim that “naive language memory” (directly concatenating historical instructions without compression) is clearly weaker than MEM’s compressed language memory, because the train-inference distribution shift is more severe.
- On shorter tasks, the paper also claims MEM provides in-context adaptation: for example, adjusting grasp height after a failed grasp, or changing the door-opening direction based on feedback; the excerpt gives no quantitative numbers for this, but it is one of the paper's core capability claims.
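The "mean ± standard error over 10 rollouts" protocol mentioned above can be computed from binary success outcomes as follows; this is a generic stdlib sketch of the statistic, not the authors' evaluation code:

```python
import math

def success_summary(rollouts):
    """Mean success rate and standard error of the mean for 0/1 rollout outcomes."""
    n = len(rollouts)
    mean = sum(rollouts) / n
    # Sample variance (Bessel-corrected), then SEM = sample std / sqrt(n).
    var = sum((r - mean) ** 2 for r in rollouts) / (n - 1)
    return mean, math.sqrt(var / n)
```

For example, 8 successes out of 10 rollouts gives a mean of 0.80 with a standard error of about 0.13, which is why per-task comparisons at n = 10 need fairly large gaps to be meaningful.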