Recoleta Item Note

AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

vision-language-action, uav-navigation, embodied-ai, end-to-end-control, sim2real

Summary

AerialVLA is a minimalist end-to-end VLA model for UAV vision-language navigation (VLN) that maps dual-view images and fuzzy language prompts directly to continuous control and landing actions. It aims to remove the dependence of existing UAV-VLN methods on oracle directional prompts and external object detectors, enabling navigation and precise landing in a more autonomous setting.

Problem

Existing UAV-VLN methods often rely on dense oracle directional guidance, so models behave more like prompt followers than agents that perform true spatial reasoning and autonomous navigation.
Many systems also depend on external object detectors to decide when to land, creating a disconnect between perception and control and reducing robustness in open environments.
UAVs in dynamic 3D environments require continuous control and fine-grained visual localization, which are critical for real-time performance, stability, and generalization, and directly affect usability in GPS-unreliable scenarios such as search-and-rescue and inspection.

Approach

Use minimalist dual-view perception: retain only forward-facing and downward-facing images, vertically concatenate them, and feed them into the OpenVLA-7B visual encoder to reduce redundant input and latency while supporting both forward navigation and ground alignment for landing.
Use fuzzy directional prompts: discretize relative orientation from onboard IMU/GPS into coarse language prompts such as “straight ahead” and “forward-right,” replacing step-by-step oracle instructions and forcing the model to rely more on active visual localization.
Use numerically tokenized action outputs: discretize continuous 3-DoF actions $\langle \Delta x, \Delta z, \Delta\psi \rangle$ into 99 bins and map them directly to existing numeric tokens in the LLM, avoiding the need to relearn a special action vocabulary.
Unify navigation and landing within a single policy: the model can output LAND or output a near-zero displacement action as a stop signal, eliminating the need for an external detector to trigger landing.
Training uses behavior cloning and adds geometric-consistency filtering, removing about 4% of training frames where fuzzy prompts and expert actions are clearly contradictory.
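
The fuzzy-prompt and action-tokenization steps above can be illustrated with a minimal Python sketch. It is not the paper's implementation: only the labels "straight ahead" and "forward-right" and the count of 99 bins come from this note. The remaining labels, the 45-degree sectors, the clockwise-positive bearing convention, the uniform binning, the value ranges in `ACTION_RANGES`, and the use of integer strings `0` to `98` as the numeric tokens are all assumptions made for illustration.

```python
import math

# Hypothetical 8-sector vocabulary. Only "straight ahead" and "forward-right"
# appear in the note; the other labels and the 45-degree sectors are assumptions.
FUZZY_LABELS = [
    "straight ahead", "forward-right", "right", "backward-right",
    "behind", "backward-left", "left", "forward-left",
]


def fuzzy_prompt(relative_bearing_deg: float) -> str:
    """Map a target bearing (degrees, clockwise-positive, 0 = current heading)
    to a coarse direction label."""
    sector = int(((relative_bearing_deg % 360.0) + 22.5) // 45.0) % 8
    return FUZZY_LABELS[sector]


N_BINS = 99  # bin count stated in the note

# Hypothetical per-dimension ranges (metres, metres, degrees); not from the paper.
ACTION_RANGES = {"dx": (-5.0, 5.0), "dz": (-5.0, 5.0), "dpsi": (-45.0, 45.0)}


def action_to_bin(value: float, lo: float, hi: float, n_bins: int = N_BINS) -> int:
    """Clip value to [lo, hi] and return the index of the nearest of n_bins
    uniformly spaced levels (an assumed binning scheme)."""
    clipped = min(max(value, lo), hi)
    return math.floor((clipped - lo) / (hi - lo) * (n_bins - 1) + 0.5)


def bin_to_action(index: int, lo: float, hi: float, n_bins: int = N_BINS) -> float:
    """Return the continuous value represented by a bin index."""
    return lo + index * (hi - lo) / (n_bins - 1)


def encode_action(dx: float, dz: float, dpsi: float) -> str:
    """Render a 3-DoF action as three space-separated integer strings."""
    values = {"dx": dx, "dz": dz, "dpsi": dpsi}
    return " ".join(
        str(action_to_bin(values[name], *ACTION_RANGES[name]))
        for name in ACTION_RANGES
    )


def decode_action(text: str) -> tuple:
    """Invert encode_action up to quantization error."""
    indices = [int(token) for token in text.split()]
    return tuple(
        bin_to_action(index, *ACTION_RANGES[name])
        for index, name in zip(indices, ACTION_RANGES)
    )
```

Under these assumed symmetric ranges, a zero action encodes to `49 49 49` and decodes back to exactly zero, which is one way a near-zero displacement could serve as a stop signal. The quantization error per dimension is at most half a bin width.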

Results

On the TravelUAV Seen test set, AerialVLA achieves 47.96% SR, 38.54% SPL, 65.88 NE, and 57.69% OSR.
Compared with the strongest baseline, LongFly, it improves Seen-set SR by 11.57 points (47.96 vs. 36.39) and SPL by 7.47 points (38.54 vs. 31.07); the paper also reports an SR advantage of 12.36 points (46.30 vs. 33.94) on the Hard subset.
Compared with NavFoM, Seen-set SR rises from 29.17% to 47.96%; compared with TravelUAV-DA, SR rises from 17.45% to 47.96%.
In terms of computational efficiency, AerialVLA requires 17 GB of VRAM and 0.38 s total latency on an RTX 4090, compared with TravelUAV’s 20 GB and 0.63 s; of the 0.38 s, VLA inference accounts for 0.35 s and fuzzy prompting adds only 0.03 s.
Data and training scale: it uses the TravelUAV UAV-Need-Help task, trained on 7,922 trajectories / 420k frames, with testing including 1,418 Seen trajectories, 629 Unseen Object trajectories, and 958 Unseen Map trajectories.
The abstract claims that in unseen scenarios it achieves “nearly 3× success rate” relative to leading baselines, but the current excerpt does not include the corresponding full table values, so this cannot be verified line by line.
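
The margins and the latency total quoted above can be re-derived from the figures in this note. The short check below uses only those figures; the dictionary layout and variable names are illustrative.

```python
# (AerialVLA value, comparison value) pairs copied from the Results lines above.
pairs = {
    "Seen SR vs LongFly": (47.96, 36.39),
    "Seen SPL vs LongFly": (38.54, 31.07),
    "Hard SR": (46.30, 33.94),
    "Seen SR vs NavFoM": (47.96, 29.17),
    "Seen SR vs TravelUAV-DA": (47.96, 17.45),
}

# Differences in percentage points, rounded to the precision of the note.
gains = {name: round(ours - other, 2) for name, (ours, other) in pairs.items()}

# Latency: VLA inference plus fuzzy prompting should equal the reported total.
total_latency_s = round(0.35 + 0.03, 2)
latency_saved_s = round(0.63 - total_latency_s, 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
print(f"total latency: {total_latency_s:.2f} s, saved vs TravelUAV: {latency_saved_s:.2f} s")
```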

Link

http://arxiv.org/abs/2603.14363v1
