ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models
Summary
ATA is a training-free inference framework for Vision-Language-Action (VLA) models. It improves robotic control performance through two implicit reasoning signals, attention-guided and action-guided, without adding annotations or retraining. Its core advantages are plug-and-play usability, low computational overhead, and, in some scenarios, simultaneous gains in success rate, robustness, and inference efficiency.
Problem
- Existing methods for adding “reasoning” capabilities to VLA models usually rely on CoT-style step-by-step annotations, visual annotations such as boxes/masks, and other costly data collection and labeling processes, making them hard to scale.
- Many methods also require additional training or retraining of large models, consuming substantial compute and lengthening inference sequences, which reduces real-time performance.
- Pure VLA models map observations directly to actions, and in complex manipulation tasks they are prone to cascading errors caused by early misjudgments, hurting task success rates and robustness.
Approach
- ATA is a training-free test-time enhancement method: it first runs a forward pass with the original model to extract implicit cues, then processes the image to “highlight important regions and suppress the background” before feeding it back into the same VLA model.
- Attention-guided: it takes the attention weights from the final query token to the image patches at the model’s intermediate layers, aggregates and normalizes them into a mask, and uses that mask to highlight the visual regions the model itself considers relevant to the task.
- Action-guided: it uses the robot end-effector pose and camera parameters to project the “likely motion direction” onto the image, constructing a fan-shaped/conical soft RoI that emphasizes regions related to action intent.
- The two signals are combined according to a schedule: typically, the first frame uses attention guidance, while early steps use action guidance, to reduce the propagation of early errors over a long prediction horizon.
- The method requires no CoT, boxes, masks, or extra supervision, and can be plugged into different VLA models such as OpenVLA, pi0-fast, HybridVLA, and GR00T-N1.5.
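The masking steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: every function name, the mean-then-min-max aggregation, the cone half-angle and softness, the brightness floor, and the `action_steps` cutoff are hypothetical choices. It also assumes the end-effector position and motion direction have already been projected into 2D pixel coordinates, which the paper does using the end-effector pose and camera parameters.

```python
import math

def attention_mask(layer_attn, grid_h, grid_w):
    """Average attention rows (final query token -> image patches) over layers,
    then min-max normalize into a grid_h x grid_w mask in [0, 1].
    Mean + min-max is an assumed aggregation, not taken from the paper."""
    n = grid_h * grid_w
    mean = [sum(layer[i] for layer in layer_attn) / len(layer_attn) for i in range(n)]
    lo, hi = min(mean), max(mean)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat attention map
    norm = [(v - lo) / span for v in mean]
    return [norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

def cone_mask(h, w, origin, direction, half_angle_deg=30.0, softness_deg=15.0):
    """Soft fan-shaped RoI: 1.0 inside the cone opening from `origin` (x, y)
    along `direction` (dx, dy), decaying linearly to 0.0 over `softness_deg`."""
    ox, oy = origin
    base = math.atan2(direction[1], direction[0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = x - ox, y - oy
            if dx == 0 and dy == 0:
                row.append(1.0)  # the cone apex itself is always kept
                continue
            diff = math.atan2(dy, dx) - base
            diff = abs(math.atan2(math.sin(diff), math.cos(diff)))  # wrap to [0, pi]
            excess = math.degrees(diff) - half_angle_deg
            row.append(1.0 if excess <= 0 else max(0.0, 1.0 - excess / softness_deg))
        mask.append(row)
    return mask

def upsample_nearest(mask, h, w):
    """Nearest-neighbor upsampling of a patch-level mask to image resolution."""
    gh, gw = len(mask), len(mask[0])
    return [[mask[y * gh // h][x * gw // w] for x in range(w)] for y in range(h)]

def apply_soft_mask(image, mask, floor=0.3):
    """Scale each pixel by floor + (1 - floor) * mask, so the background is
    dimmed rather than erased. `image` is a 2D list of grayscale values."""
    return [[px * (floor + (1.0 - floor) * m) for px, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]

def choose_guidance(step, action_steps=3):
    """Assumed schedule: attention guidance on the first frame, action guidance
    for the next `action_steps` steps, no guidance afterwards."""
    if step == 0:
        return "attention"
    if step <= action_steps:
        return "action"
    return "none"
```

In this sketch, a guided step would build one of the two masks according to `choose_guidance`, bring it to image resolution with `upsample_nearest` where needed, apply it with `apply_soft_mask`, and then run the unchanged VLA model on the masked image.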
Results
- In the LIBERO environment, ATA improves OpenVLA performance by 5.2% and pi0-fast by 2.0%.
- In the RLBench environment, ATA improves HybridVLA by 5.3%.
- In the real-world GR00T-N1.5 three-layer block stacking task (block size 3 cm × 3 cm × 3 cm), performance improves by up to 10% in complex scenarios.
- The paper claims that ATA improves task success rate and robustness while maintaining or even improving inference efficiency; the method introduces only one extra forward pass when guidance is applied, but the abstract does not provide a unified latency/throughput comparison.
- Experiments cover multiple mainstream VLA models (OpenVLA, pi0-fast, HybridVLA, GR00T-N1.5) across both simulation and real-robot settings, emphasizing its plug-and-play generalization.