Recoleta Item Note

OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies

vision-language-action · test-time-guidance · generalist-robot-policy · collision-avoidance · semantic-grounding · robot-manipulation

OmniGuide proposes a unified test-time guidance framework that uses “attractive/repulsive energy fields” in 3D space to adjust the action sampling of generalist VLA robots. Its goal is to make existing generalist policies more reliable and safer on complex, cluttered, and high-precision tasks without retraining or adding more robot data.

  • Existing VLA generalist robot policies cover a wide range of tasks, but they often fail on complex spatial understanding, manipulation in cluttered scenes, fine-grained manipulation, and collision avoidance—they “can do many things, but none of them precisely enough.”
  • Common remedies rely on additional high-quality robot data and post-training/fine-tuning, which is costly and hard to scale, and may also damage the original generalization ability.
  • Different external capability sources (3D geometry, VLM semantic reasoning, human demonstrations) are powerful, but there is no unified way to convert them, at test time, into signals that directly guide VLA action generation.
  • The core mechanism is simple: express all external guidance information as differentiable energy functions that form fields in 3D space which “attract toward targets and repel away from obstacles,” then backpropagate the gradient onto the actions generated by the VLA to change the sampling direction.
  • The method applies to diffusion/flow-matching generative robot policies; at each denoising step, it first estimates the current “clean action,” then converts the action into the end-effector’s Cartesian trajectory through a differentiable kinematics/dynamics model.
  • It then computes task energy over the trajectory: for example, collision avoidance uses SDF-based repulsive energy, semantic pointing uses a Gaussian attractive energy constructed from a 3D target point localized by a VLM, and human demonstrations use monotonic matching between hand trajectories and robot trajectories to construct attractive energy.
  • The final update equals the “original VLA’s natural action prior + guidance gradient”; candidate-sampling filtering can also be performed at the initial noise stage, balancing naturalness, constraint satisfaction, and multimodality.
  • The framework can combine multiple heterogeneous guidance sources, and the authors emphasize that it requires no retraining, no additional robot data, and only real-time gradient computation to operate in dynamic environments.
  • The abstract reports that OmniGuide brings significant improvements in simulation and real-world environments, across multiple guidance sources and two SOTA generalist policies (e.g., π0.5, GR00T N1.6).
  • Quantitative result (explicitly stated in the abstract): success rate improves from 24.2% to 92.4%.
  • Quantitative result (explicitly stated in the abstract): collision-avoidance/safety rate improves from 7.0% to 93.5%.
  • The authors claim these gains are achieved without retraining, without additional robot data, and without significant execution latency.
  • The paper also claims that its unified framework can match or exceed prior methods designed specifically for a single guidance source, but the provided excerpt does not include more detailed task-level tables, dataset splits, or per-baseline numbers.
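The mechanism the bullets describe — Gaussian attractive energy toward a VLM-localized 3D target, SDF-based repulsive energy away from obstacles, and a gradient nudge applied to the denoised action — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the finite-difference gradient (standing in for autodiff backprop through a differentiable kinematics model), and all hyperparameters are assumptions.

```python
import numpy as np

def attractive_energy(traj, target, sigma=0.5):
    """Gaussian attractive energy: low when trajectory points are near the 3D target."""
    d2 = np.sum((traj - target) ** 2, axis=-1)
    return float(np.sum(1.0 - np.exp(-d2 / (2.0 * sigma ** 2))))

def repulsive_energy(traj, sdf, margin=0.03, weight=10.0):
    """SDF-based repulsive energy: penalize points closer than `margin` to obstacles."""
    d = np.array([sdf(p) for p in traj])
    return float(weight * np.sum(np.maximum(margin - d, 0.0) ** 2))

def total_energy(traj, target, sdf):
    return attractive_energy(traj, target) + repulsive_energy(traj, sdf)

def numerical_grad(f, x, eps=1e-5):
    """Finite-difference gradient, standing in for autodiff through the energy."""
    g = np.zeros_like(x)
    flat, gf = x.ravel(), g.ravel()
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + eps; hi = f(x)
        flat[i] = old - eps; lo = f(x)
        gf[i] = (hi - lo) / (2.0 * eps)
        flat[i] = old
    return g

def guided_step(actions, denoise, fk, target, sdf, guidance_scale=0.1):
    """One guided denoising step: estimate the clean action, map it through a
    (here hypothetical) differentiable kinematics `fk` to the end-effector
    trajectory, and descend the energy gradient."""
    clean = denoise(actions)
    grad = numerical_grad(lambda a: total_energy(fk(a), target, sdf), clean)
    return clean - guidance_scale * grad

def filter_candidates(noises, denoise, fk, target, sdf, keep=1):
    """Candidate filtering at the initial-noise stage: rank candidate noises by
    the energy of their denoised trajectories and keep the lowest-energy ones."""
    scored = sorted(noises, key=lambda z: total_energy(fk(denoise(z)), target, sdf))
    return scored[:keep]
```

With an identity denoiser and identity kinematics as stand-ins, one `guided_step` moves a trajectory toward the attractive target while the repulsive term keeps it outside the obstacle margin; in the actual framework this gradient is added on top of the policy's own denoising update rather than replacing it.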