Recoleta Item Note

TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation

vision-language-action, long-horizon-manipulation, temporal-memory, kv-cache, training-free, sim2real

TempoFit is a training-free temporal memory plug-in for pretrained Vision-Language-Action (VLA) policies, designed to transform originally "single-frame decision-making" models into long-horizon manipulation policies that can leverage historical information. It directly reuses the model's internal layer-wise attention K/V cache, rather than stacking historical images or training additional temporal modules, making it easier to plug in while maintaining near-real-time inference.

  • Many existing VLAs are essentially memoryless at inference time: each step only looks at the current observation, which makes them prone to repeated actions, missed steps, and failed phase transitions in non-Markovian long-horizon tasks involving occlusion, state aliasing, and subtle post-action changes.
  • Directly stacking historical frames does introduce temporal information, but it inflates the visual token count and inference latency and adds many near-duplicate pixels, so it is an inefficient way to provide memory.
  • Training additional memory/fusion modules usually requires retraining or fine-tuning, and also changes the original single-frame inference graph, making true plug-and-play upgrades for strong pretrained VLAs difficult.
  • Core idea: treat the K/V in Transformer prefix attention as the model's native "state memory." TempoFit caches prefix K/V from past timesteps at several intermediate layers, instead of storing raw images or adding new tokens.
  • The retrieval method is K-to-K retrieval: it uses the current timestep's key to perform similarity matching against historical keys, and then reads out the corresponding historical K/V; the entire process has no additional parameters and no new training, while remaining consistent with the original attention geometry of the frozen backbone.
  • To prevent old information from interfering with current decisions, it introduces FGTB (Frame-Gap Temporal Bias): a fixed decay bias is added to more distant history so that retrieval favors content that is "recent and relevant."
  • The retrieved historical context is added back into the current K/V via a residual pathway, rather than by concatenating virtual tokens; then norm-preserving rescaling is applied to avoid distribution shift as much as possible under frozen weights.
  • The method is plug-and-play: it does not modify model parameters, does not change input context length, does not depend on the structure of the action head, and is compatible with different VLA backbones.
  • LIBERO-Long: on the $\pi_{0.5}$ baseline, the average success rate improves from 92.6% to 96.6% (+4.0 percentage points); on QwenGR00T, it improves from 90.8% to 94.4% (+3.6 percentage points).
  • Compared with trained temporal methods on LIBERO-Long, TempoFit matches or exceeds representative baselines without any training: MemoryVLA reaches 93.4% and HiF-VLA 96.4%, while TempoFit + $\pi_{0.5}$ reaches 96.6%.
  • On the difficult subtask "Put both pots on stove", $\pi_{0.5}$ improves from 58.0% to 84.0% (+26.0 percentage points), suggesting that the temporal memory is particularly effective for cross-stage temporal association.
  • CALVIN D-D: the average successful task length improves from 3.78 to 3.84 (QwenGR00T → TempoFit); step-wise success rates improve more noticeably on later instructions, e.g. the 5th task improves from 59.8 to 62.3.
  • CALVIN ABC-D: the average successful task length improves from 3.83 to 3.87 ($\pi_{0.5}$ → TempoFit); the 5th task improves from 61.4 to 62.0.
  • The paper also claims that the method incurs negligible inference overhead and maintains near-real-time control, but no more detailed latency numbers are provided in the given excerpt.
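
The mechanism described in the bullets above (cache prefix K/V, K-to-K retrieval, a frame-gap bias, residual fusion, norm-preserving rescaling) can be illustrated with a small sketch. This is not the paper's implementation: the excerpt gives no formulas, so the softmax weighting, the linear gap penalty `decay * gap`, the fusion weight `alpha`, and the names `LayerKVMemory` and `retrieve_and_fuse` are all assumptions made for illustration. Plain Python lists stand in for the tensors of one attention layer.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def rescale_to_norm(vec, target_norm):
    """Scale vec so its L2 norm equals target_norm (norm-preserving step)."""
    n = math.sqrt(dot(vec, vec))
    if n == 0.0:
        return list(vec)
    return [x * target_norm / n for x in vec]


class LayerKVMemory:
    """Hypothetical per-layer buffer of prefix K/V from past timesteps."""

    def __init__(self, max_frames=4):
        self.frames = deque(maxlen=max_frames)  # oldest frame is dropped first

    def write(self, keys, values):
        self.frames.append((keys, values))


def retrieve_and_fuse(keys, values, memory, decay=0.5, alpha=0.3):
    """Fuse historical K/V into the current K/V; no learned parameters.

    keys/values: lists of token vectors for the current timestep.
    decay and alpha are assumed hyperparameters, not values from the paper.
    """
    if not memory.frames:
        return keys, values
    gaps, hist_k, hist_v = [], [], []
    n_frames = len(memory.frames)
    for idx, (frame_k, frame_v) in enumerate(memory.frames):
        for k_tok, v_tok in zip(frame_k, frame_v):
            gaps.append(n_frames - idx)  # 1 = most recent stored frame
            hist_k.append(k_tok)
            hist_v.append(v_tok)
    scale = 1.0 / math.sqrt(len(keys[0]))
    fused_k, fused_v = [], []
    for k_cur, v_cur in zip(keys, values):
        # K-to-K similarity plus a fixed penalty that grows with frame gap.
        scores = [dot(k_cur, hk) * scale - decay * g
                  for hk, g in zip(hist_k, gaps)]
        w = softmax(scores)
        k_read = weighted_sum(w, hist_k)
        v_read = weighted_sum(w, hist_v)
        # Residual injection, then restore the original vector norms.
        k_new = [c + alpha * r for c, r in zip(k_cur, k_read)]
        v_new = [c + alpha * r for c, r in zip(v_cur, v_read)]
        fused_k.append(rescale_to_norm(k_new, math.sqrt(dot(k_cur, k_cur))))
        fused_v.append(rescale_to_norm(v_new, math.sqrt(dot(v_cur, v_cur))))
    return fused_k, fused_v


memory = LayerKVMemory(max_frames=2)
memory.write([[1.0, 0.0]], [[0.0, 2.0]])  # older frame
memory.write([[0.0, 1.0]], [[2.0, 0.0]])  # most recent frame
fused_k, fused_v = retrieve_and_fuse([[1.0, 1.0]], [[1.0, 1.0]], memory)
```

In this sketch the number of key/value tokens is unchanged after fusion, which mirrors the note's claim that the input context length is not altered; a larger `decay` shifts the readout toward the most recent stored frame.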