VLA enters a stage of “lightweight modification, strong adaptation”
The strongest theme of the day was pushing pretrained vision-language-action (VLA) models from merely “usable” to “more robust and transferable.” One line of work changes how fine-tuning capacity is allocated: LoRA-SP replaces a fixed LoRA rank with a per-sample, dynamically activated rank, alleviating both capacity shortages and hyperparameter sensitivity across tasks and robot embodiments. Another line adds temporal memory without retraining the backbone: TempoFit reuses intermediate-layer K/V caches to give single-frame decision models long-horizon context. Together, they point to a shared trend: VLA is no longer only about scaling up base models, but about improving deployment adaptability through lightweight, plug-and-play mechanisms.
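The papers' exact designs aren't reproduced here, but a minimal sketch can make the per-sample dynamic-rank idea concrete: a small gating network scores each low-rank component from the input, so easy samples use few components and hard samples activate more adapter capacity. All names below (`DynamicRankLoRALinear`, the `gate` head, `max_rank`) are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn

class DynamicRankLoRALinear(nn.Module):
    """LoRA linear layer whose effective rank is gated per sample (sketch)."""

    def __init__(self, in_features, out_features, max_rank=16, alpha=32.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        # Low-rank factors sized for the *maximum* rank budget.
        self.lora_A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, max_rank))
        # Hypothetical per-sample gate: scores each rank component from the input.
        self.gate = nn.Linear(in_features, max_rank)
        self.scaling = alpha / max_rank

    def forward(self, x):  # x: (batch, seq, in_features)
        # Soft 0..1 gate per rank component, computed from pooled features.
        g = torch.sigmoid(self.gate(x.mean(dim=1)))   # (batch, max_rank)
        h = x @ self.lora_A.t()                       # (batch, seq, max_rank)
        h = h * g.unsqueeze(1)                        # suppress inactive ranks
        return self.base(x) + (h @ self.lora_B.t()) * self.scaling
```

In a setup like this, a sparsity penalty on `g` (e.g., an L1 term in the loss) would push each sample toward using only as many rank components as it needs, which is one plausible way to trade a single fixed rank hyperparameter for data-driven allocation.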
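The temporal-memory idea can be sketched the same way, under the assumption that each transformer layer keeps a short ring buffer of the K/V tensors it computed for past frames and lets the current frame's attention read them. `LayerKVMemory`, `window`, and `attention_with_memory` are hypothetical names for illustration.

```python
from collections import deque
import torch
import torch.nn.functional as F

class LayerKVMemory:
    """Per-layer ring buffer of K/V tensors from the last `window` frames (sketch)."""

    def __init__(self, window=8):
        self.mem = deque(maxlen=window)  # each entry: (K, V) for one past frame

    def extend(self, k, v):
        # Prepend cached K/V from past frames to the current frame's K/V.
        ks = [mk for mk, _ in self.mem] + [k]
        vs = [mv for _, mv in self.mem] + [v]
        self.mem.append((k.detach(), v.detach()))  # store read-only copies
        return torch.cat(ks, dim=2), torch.cat(vs, dim=2)

def attention_with_memory(q, k, v, memory):
    # q, k, v: (batch, heads, tokens, head_dim) for the current frame only.
    k_all, v_all = memory.extend(k, v)  # keys/values now span the temporal window
    return F.scaled_dot_product_attention(q, k_all, v_all)
```

Because the cached tensors are detached and only appended at read time, the frozen backbone's weights never change; a single-frame policy simply sees a longer key/value sequence at each layer, which is what makes this kind of memory plug-and-play.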
Representative sources
- Adaptive Capacity Allocation for Vision Language Action Fine-tuning — Donghoon Kim; Minji Bae; Unghui Nam; Gyeonghun Kim; Suyun Lee; Kyuhong Shim; …
- TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation — Jun Sun; Boyu Yang; Jiahao Zhang; Ning Ma; Chencheng Wu; Siqing Zhang; …