VLAs begin to emphasize on-demand inference and failure recovery
This body of work shifts the research focus from “bigger models” to “smarter scheduling.” Tri-System inserts a visual Critic between the high-level vision-language model (VLM) and the low-level vision-language-action model (VLA), and triggers replanning only on completion, an accident, or stagnation. Act-Think-Abstain, meanwhile, sorts each execution into one of three cases: act directly, think first, or refuse to act. The shared signal is clear: real-time performance, safety, and out-of-distribution robustness are becoming first-class goals in VLA system design.
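To make the two scheduling ideas concrete, here is a minimal Python sketch. It is not either paper's implementation: the names `CriticVerdict`, `Mode`, `should_replan`, and `route`, the scalar complexity score in [0, 1], and both thresholds are hypothetical choices made for illustration only.

```python
from enum import Enum


class CriticVerdict(Enum):
    """Hypothetical per-step outputs of a visual critic."""
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    ACCIDENT = "accident"
    STAGNATION = "stagnation"


class Mode(Enum):
    """The three cases: act directly, think first, or refuse to act."""
    ACT = "act"
    THINK = "think"
    ABSTAIN = "abstain"


def should_replan(verdict: CriticVerdict) -> bool:
    """Invoke the slow high-level planner only on the three trigger events."""
    return verdict is not CriticVerdict.IN_PROGRESS


def route(complexity: float, think_at: float = 0.3, abstain_at: float = 0.8) -> Mode:
    """Map an assumed complexity score in [0, 1] to one of the three modes.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity >= abstain_at:
        return Mode.ABSTAIN
    if complexity >= think_at:
        return Mode.THINK
    return Mode.ACT


# Over a five-step episode the planner is consulted twice, not five times.
verdicts = [
    CriticVerdict.IN_PROGRESS,
    CriticVerdict.IN_PROGRESS,
    CriticVerdict.STAGNATION,
    CriticVerdict.IN_PROGRESS,
    CriticVerdict.COMPLETED,
]
planner_calls = sum(should_replan(v) for v in verdicts)
```

In this sketch the expensive components run only when an event or a score demands it, which is the sense in which scheduling rather than model size carries the real-time and safety burden.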
Representative sources
- Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation — Pengfei Yi; Yingjie Ma; Wenjiang Xu; Yanan Hao; Shuai Gan; Wanting Li; …
- Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models — Riccardo Andrea Izzo; Gianluca Bardaro; Matteo Matteucci