Recoleta Item Note

DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

vision-language-action, token-merging, depth-guided, training-free, robot-inference

DepthCache is a training-free visual token compression method for accelerating inference in Vision-Language-Action (VLA) models. It uses depth information to preferentially preserve the near-field manipulation area and critical boundaries, reducing inference latency while minimizing harm to robotic manipulation success rates.

  • VLA models are highly promising for robotic manipulation, but the large number of visual tokens and heavy language backbone lead to high inference latency, making real-time closed-loop control difficult.
  • Existing token pruning or uniform-ratio merging methods damage spatial relationships, especially hurting tasks such as grasping and alignment that depend on fine-grained geometric reasoning.
  • Existing merging methods often require modifications to the vision encoder, lack cross-architecture portability, and do not exploit the naturally available depth-structure prior in robotic scenes.
  • Use depth maps to partition unprotected image patches by distance: merge less in the near-field workspace and more in the distant background; protected tokens are not compressed.
  • Use a “dual protection” mechanism to preserve critical tokens: one portion comes from cross-attention in the language model, indicating semantic importance; the other comes from depth edges, indicating geometric boundary importance.
  • Instead of completing merging all at once within a single frame, distribute the merging across multiple consecutive frames to exploit temporal redundancy, maintain stable representations, and reduce computation at each step.
  • Monitor depth changes; if a region becomes dynamic, restore it to full resolution. For the wrist camera, add a state machine based on end-effector motion to dynamically decide whether to apply aggressive compression.
  • The entire method runs outside the vision encoder, requiring no model changes or retraining, and can be directly applied to different VLA architectures.
  • On the LIBERO benchmark across 3 different VLA models, DepthCache achieves 1.07×–1.28× inference speedup while reducing average success rate by at most 1.0 percentage point.
  • For OpenVLA: the baseline average success rate is 76.7%; DepthCache reaches 75.7% (-1.0) at 1.21× speedup with 78.9% token retention. In comparison, FastV reaches 64.0% (-12.7) at 1.39×, and SP-VLA reaches 71.9% (-4.8) at 1.50×.
  • For π0.5: the baseline is 97.9%; DepthCache reaches 97.6% (-0.3) at 1.28× speedup with 68.2% token retention. FastV reaches 77.6% (-20.3) at 1.30×, and ToSA reaches 73.8% (-24.1) at 0.94×, which is slower than the baseline.
  • For GR00T: the baseline is 93.1%; DepthCache reaches 92.9% (-0.2) at 1.07× speedup with 87.5% token retention.
  • Under steady-state conditions, the total patch tokens from two cameras drop from 512 to about 300.
  • On 3 core real-robot tasks (based on π0.5), total successes drop from 55/60 to 52/60, while average latency falls from 191 ms to 143 ms, a 1.33× speedup.
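
The depth-guided merging and "dual protection" ideas described above can be illustrated with a minimal sketch. This is not the authors' implementation. All function names, thresholds, and group sizes below are hypothetical. The sketch also simplifies: it merges tokens by averaging fixed-size groups in index order, whereas the note does not say how DepthCache selects merge partners, and it handles a single frame rather than spreading merges across frames.

```python
from statistics import fmean


def depth_edge_mask(depth, width, edge_thresh):
    """Mark patches whose depth differs sharply from a right or lower neighbour."""
    n = len(depth)
    edge = [False] * n
    for i in range(n):
        right = i + 1 if (i + 1) % width != 0 else None  # stay within the row
        below = i + width if i + width < n else None
        for j in (right, below):
            if j is not None and j < n and abs(depth[i] - depth[j]) > edge_thresh:
                edge[i] = edge[j] = True
    return edge


def protected_mask(depth, attn, width, top_k, edge_thresh):
    """Dual protection (assumed form): top-k attention patches plus depth-edge patches."""
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    top = set(order[:top_k])
    edge = depth_edge_mask(depth, width, edge_thresh)
    return [i in top or edge[i] for i in range(len(depth))]


def merge_tokens(tokens, depth, protected, near_limit, near_group=2, far_group=4):
    """Keep protected tokens; average the rest in groups, larger groups for far patches.

    Output order is protected tokens first, then near merges, then far merges.
    """
    idx = range(len(tokens))
    kept = [tokens[i] for i in idx if protected[i]]
    near = [i for i in idx if not protected[i] and depth[i] <= near_limit]
    far = [i for i in idx if not protected[i] and depth[i] > near_limit]
    for band, group in ((near, near_group), (far, far_group)):
        for start in range(0, len(band), group):
            chunk = [tokens[i] for i in band[start:start + group]]
            kept.append([fmean(col) for col in zip(*chunk)])  # element-wise mean
    return kept


# Toy 4x4 patch grid: the left two columns are near (0.4 m), the right two far (2.0 m).
WIDTH = 4
depth = [0.4 if i % WIDTH < 2 else 2.0 for i in range(16)]
attn = [1.0 if i == 0 else 0.0 for i in range(16)]  # one semantically important patch
tokens = [[float(i), 1.0] for i in range(16)]       # 2-dim stand-in features

mask = protected_mask(depth, attn, WIDTH, top_k=1, edge_thresh=0.5)
merged = merge_tokens(tokens, depth, mask, near_limit=1.0)
print(len(tokens), "->", len(merged))  # 16 -> 12
```

In this toy grid the near/far boundary protects the two middle columns, the single high-attention patch is kept as well, and the remaining far column collapses into one token while the near column is merged only pairwise.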