Recoleta Item Note

Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

This paper proposes an adaptive inference framework for Vision-Language-Action (VLA) models that switches between direct execution (Act), additional reasoning (Think), and refusal to execute (Abstain) based on the…

Embodied AI

vision-language-actionadaptive-inferenceood-detectionrobot-safetyuncertainty-estimation

Open arXiv Source markdown

Summary

This paper proposes an adaptive inference framework for Vision-Language-Action (VLA) models that switches between direct execution (Act), additional reasoning (Think), and refusal to execute (Abstain) based on the complexity of the current state, aiming to balance efficiency, generalization, and safety. The key finding is that for judging task complexity, visual embeddings are more reliable than language or fused features.

Problem

Existing VLA systems often improve generalization through reasoning methods such as chain-of-thought, but reasoning at all times increases computational cost and latency and wastes resources on simple tasks.
These methods usually lack uncertainty / out-of-distribution recognition capability, and may become overconfident on OOD tasks, leading to catastrophic execution failures.
Robot deployment must simultaneously satisfy real-time performance, generalization, and safety, so a mechanism is needed to first determine whether the system should act directly at all.

Approach

Extract three types of embeddings—vision, text, and fused—from the VLM backbone of a pretrained VLA/SmolVLA; the authors also explicitly prevent the text encoder from seeing the image to isolate language uncertainty.
First apply PCA to reduce to 64 dimensions, then score features with two novelty estimators: GMM + Mahalanobis distance to model the global distribution, and 1-NN to capture local anomalies; the GMM uses Ledoit-Wolf shrinkage to stabilize covariance estimation.
Aggregate the scores into a small vector (mainly including GMM scores for vision/text/fused features and a visual kNN score), feed it into a lightweight MLP, and output a three-way decision: Act / Think / Abstain.
The “Think” branch is triggered only once at the first timestep of each episode, appending scene cues and subgoals to the text prompt before handing control back to the VLA; “Abstain” directly refuses high-risk OOD tasks.
To train the intermediate “partially OOD / Think” state, in addition to using LIBERO-PRO, the authors synthesize intermediate samples between ID and OOD features using Beta(0.5,0.5) mixup.

Results

Evaluated on LIBERO / LIBERO-PRO / a real robot (SO-ARM 101); the best configuration is MLP + GMM (vision-only) with Macro F1 = 84.34%, outperforming all alternatives.
Compared with a Baseline MLP trained directly on the raw embeddings, the proposed method is substantially stronger: the baseline reaches only 63.81% Macro F1; moreover, 86% of “Think” samples are misclassified as “Act”, showing that the baseline is overconfident in ambiguous scenarios.
Visual kNN is also competitive, reaching 73.90% F1, and the authors state that in the confusion matrix there is no confusion between “Act” and “Abstain”, meaning tasks that should be stopped are never mistakenly allowed to execute directly.
Multimodality does not provide gains: ensemble (all GMM + kNN) 71.41% F1, text-only 54.76% F1, and text-only fails to correctly identify even a single “Think” sample. This supports the argument that the semantic invariance of language can mask physical anomalies.
In terms of data efficiency, the baseline remains almost stuck at F1≈0.60 across different data scales; in contrast, vision-only GMM outperforms the baseline by 15% using only 1% of the data (fewer than 1000 samples), and approaches peak performance with 5% of the data. The abstract also reports that its vision-only configuration reaches 80% F1 using only 5% of the training data.
Ablation on the number of GMM components shows the best result at k=3; k=1 is clearly insufficient, while larger k brings diminishing returns and extra computational overhead.

Link

http://arxiv.org/abs/2603.05147v1

Built with Recoleta

Run your own research radar

Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.

View repo 5-minute quickstart