Recoleta Item Note

DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

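DepthCache is a training-free visual token compression method for accelerating inference in vision-language-action (VLA) models. It uses depth information to preferentially preserve the near-field manipulation region and key boundaries, reducing inference latency while keeping the impact on robot manipulation success rate as small as possible.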
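To make the idea of depth-guided merging concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the function name `depth_guided_merge`, the thresholds `near_depth` and `edge_jump`, and the 2x2 block averaging are all assumptions chosen for illustration. It only shows the general pattern of keeping near and depth-boundary patches at full resolution while averaging far, flat regions.

```python
def depth_guided_merge(tokens, depths, width, near_depth=1.0, edge_jump=0.5):
    """Illustrative only: keep near/boundary patches, average the rest per 2x2 block.

    tokens: list of feature vectors (one per patch, row-major order)
    depths: list of per-patch depth values, same order as tokens
    near_depth, edge_jump: hypothetical thresholds, not taken from the paper
    """
    height = len(tokens) // width

    def is_boundary(r, c):
        # A patch counts as a boundary if any 4-neighbour differs strongly in depth.
        d = depths[r * width + c]
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < height and 0 <= nc < width:
                if abs(depths[nr * width + nc] - d) > edge_jump:
                    return True
        return False

    kept, blocks = [], {}
    for r in range(height):
        for c in range(width):
            i = r * width + c
            if depths[i] <= near_depth or is_boundary(r, c):
                kept.append(tokens[i])  # preserved unchanged
            else:
                # Far, flat patches are grouped by the 2x2 block they fall in.
                blocks.setdefault((r // 2, c // 2), []).append(tokens[i])

    # Each group of far patches collapses into one averaged token.
    merged = [[sum(col) / len(group) for col in zip(*group)]
              for group in blocks.values()]
    return kept + merged


# Toy 4x4 grid: two near columns on the left, two far columns on the right.
width = 4
tokens = [[float(i), 1.0] for i in range(16)]
depths = [0.5, 0.5, 3.0, 3.0] * 4
out = depth_guided_merge(tokens, depths, width)
print(len(tokens), "->", len(out))
```

In this toy grid the far column adjacent to the depth jump is kept as a boundary, and only the outermost far column is merged. The sketch ignores positional encodings and token ordering, which a real VLA pipeline would have to handle.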

vision-language-action, token-merging, depth-guided, training-free, robot-inference

Summary

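On the LIBERO benchmark, across 3 different VLA models, DepthCache achieves a 1.07×–1.28× inference speedup with an average success-rate drop of less than 1%.
OpenVLA: the baseline average success rate is 76.7%; DepthCache reaches 75.7% (-1.0) at 1.21× speed with a token retention rate of 78.9%. By comparison, FastV reaches 64.0% (-12.7) at 1.39× and SP-VLA reaches 71.9% (-4.8) at 1.50×.
π0.5: the baseline is 97.9%; DepthCache reaches 97.6% (-0.3) at 1.28× with a token retention rate of 68.2%. FastV reaches 77.6% (-20.3) at 1.30× and ToSA reaches 73.8% (-24.1) at 0.94×.
GR00T: the baseline is 93.1%; DepthCache reaches 92.9% (-0.2) at 1.07× with a token retention rate of 87.5%.
In steady state, the total number of patch tokens across the two cameras drops from 512 to about 300.
On 3 core real-robot tasks (based on π0.5), total successes go from 55/60 to 52/60, and average latency drops from 191 ms to 143 ms, a 1.33× speedup.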
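The reported figures can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers already stated in this summary; the dictionary layout and variable names are illustrative, and the key `pi0.5` stands for π0.5.

```python
# Success rates (%) copied from the summary; names are illustrative.
results = {
    "OpenVLA": {"baseline": 76.7, "DepthCache": 75.7, "FastV": 64.0, "SP-VLA": 71.9},
    "pi0.5": {"baseline": 97.9, "DepthCache": 97.6, "FastV": 77.6, "ToSA": 73.8},
    "GR00T": {"baseline": 93.1, "DepthCache": 92.9},
}


def deltas(row):
    """Success-rate change of each method relative to the baseline, in points."""
    base = row["baseline"]
    return {m: round(v - base, 1) for m, v in row.items() if m != "baseline"}


success_deltas = {model: deltas(row) for model, row in results.items()}

# Mean DepthCache change across the three models.
mean_depthcache_delta = round(
    sum(d["DepthCache"] for d in success_deltas.values()) / len(success_deltas), 1
)

# Real-robot latency ratio and steady-state token fraction.
latency_speedup = 191 / 143
token_fraction = 300 / 512

print(success_deltas)
print(mean_depthcache_delta, round(latency_speedup, 3), round(token_fraction, 3))
```

The per-model deltas match the parenthesized values in the summary. The ratio of the rounded latencies comes out slightly above the reported 1.33×, which is consistent with the reported value having been computed from unrounded measurements. The steady-state figure of about 300 of 512 tokens corresponds to keeping a little under 59% of patch tokens.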
