VLAs shift toward active perception, lightweight multimodal fusion, and deployment-grade system optimization
Overview
Today’s robotics papers are highly concentrated: VLAs continue heating up, but the focus is not just on becoming larger or more talkative; it is on seeing better, parallelizing better, and getting closer to real deployment. The strongest signal comes from active perception. VLA-Thinker no longer treats the image as one-shot context, but lets the model revisit local regions during reasoning. This change is simple but powerful: it reaches 97.5% on LIBERO, 6.5 points above OpenVLA-OFT; on the Long subset, it is 10.4 points higher, suggesting that it mainly fixes disambiguation and error correction in long-horizon processes. The second theme is “enhance perception, but do not make the system heavy.” TacFiLM injects tactile signals as a conditioning signal into intermediate visual layers without increasing input token length, yet in real contact tasks it simultaneously improves success rate, reduces applied force, and shortens time. R3DP, meanwhile, uses a fast-slow branch design to incorporate a 3D foundation model, placing heavy computation on sparse keyframes and adding 3D understanding while preserving real-time performance. The third theme comes from the systems layer: OxyGen treats the VLA KV cache as a cross-task shared resource, moving multi-task inference and edge deployment into the serving layer.
Evolution
Compared with [Robotics research shifts toward closed-loop data… (2026-03-12)](day--2026-03-12--trend--435.md), robotics research is still centered on VLAs and long-horizon manipulation, but today emphasizes two things more strongly: first, letting models keep looking and keep perceiving during execution; second, turning multi-task inference and cross-robot scaling into deployable systems.
The clearest continuation signal comes from VLA-Thinker. In [Robotics research shifts toward closed-loop data… (2026-03-12)](day--2026-03-12--trend--435.md), “active perception” was more of a directional judgment; today we see an implementation that writes visual revisitation directly into the reasoning trajectory, with clear gains on LIBERO Long and RoboTwin long-horizon tasks.
The clearest shift signal comes from the systems side. [Robotics research shifts toward closed-loop data… (2026-03-12)](day--2026-03-12--trend--435.md) focused on data loops such as RADAR and RoboClaw, while today OxyGen shifts attention to KV sharing, cross-frame batching, and edge-side throughput, indicating that bottlenecks in the robotics stack are moving toward the runtime serving layer.
The new signal comes from tactile and 3D. Both TacFiLM and R3DP emphasize “add little burden, gain immediate benefit”: the former reduces force and time in contact-rich tasks, while the latter adds 3D understanding while preserving real-time performance.
Shifting: Attention in robot system closed loops shifts toward inference scheduling and edge deployment
Emerging: Dexterous manipulation extends from data infrastructure to tactile and 3D perception enhancement
Clusters
VLAs move toward active perception and long-horizon reasoning
The strongest signal today is that VLAs are beginning to shift from “look once, then act” toward continuous closed-loop perception. VLA-Thinker turns visual queries into reasoning actions: during thinking, it calls ZOOM-IN to revisit local regions before outputting an action. This design mainly improves disambiguation and mid-course error correction in long-horizon manipulation. On LIBERO, it reaches a 97.5% success rate, a 6.5-point gain over OpenVLA-OFT; on the Long subset, the gain is 10.4 points. On RoboTwin 2.0 long/extra-long-horizon tasks, it achieves an average success rate of 64.6%, up 18.1 points from 46.5%.
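The idea of treating visual revisitation as a reasoning action can be sketched as a control loop: the policy may emit a ZOOM-IN request that re-crops the observation, and the crop is appended to the reasoning context before the model commits to a motor action. This is a minimal illustrative sketch under assumed interfaces (the function names and the `{"type": ...}` step format are hypothetical, not VLA-Thinker's actual code):

```python
# Hypothetical sketch of "thinking with images": the policy can ask to
# revisit a local region (ZOOM-IN) before outputting a motor action.

def zoom_in(image, box):
    """Return the crop of `image` given a (row0, row1, col0, col1) box."""
    r0, r1, c0, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]

def reasoning_loop(image, policy, max_revisits=3):
    """Run the policy; honor up to `max_revisits` ZOOM-IN requests."""
    context = [image]
    for _ in range(max_revisits):
        step = policy(context)
        if step["type"] == "zoom_in":
            # Revisit a local region and append it to the reasoning context.
            context.append(zoom_in(image, step["box"]))
        else:
            return step  # a motor action
    return policy(context)  # forced to act once the revisit budget is spent

# Toy policy: zoom into the top-left quadrant once, then act.
def toy_policy(context):
    if len(context) == 1:
        return {"type": "zoom_in", "box": (0, 2, 0, 2)}
    return {"type": "action", "value": "grasp"}

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = reasoning_loop(image, toy_policy)
```

The revisit budget matters in practice: without a cap, a policy that keeps requesting crops would never act, which is exactly the failure mode a closed-loop design must bound.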
Representative sources
- VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning — Chaoyang Wang; Wenrui Bao; Sicheng Gao; Bingxin Xu; Yu Tian; Yogesh S. Rawat; …
Multimodality and 3D priors begin landing in lightweight form
Another clear theme is bringing new modalities and new spatial priors into policies without making the system heavy. TacFiLM uses FiLM to inject tactile embeddings into intermediate visual layers without increasing language input length. Across more than 700 real-robot rollouts, it achieves an in-distribution (ID) average success rate of 86.67%, 15.56 points above the second-best baseline, while reducing average peak force to 8.34 N. R3DP, meanwhile, places the heavy 3D model in a slow branch and uses a fast branch to complete intermediate-frame features; on RoboTwin's 10 tasks, it reaches a 69.0% average success rate, 32.9 points higher than DP-single, while cutting encoding latency from 73.1 ms to 40.3 ms.
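The FiLM mechanism itself is a per-channel affine modulation: the conditioning signal (here, a tactile embedding) is projected to a scale (gamma) and shift (beta) that are applied to an intermediate feature map, so nothing is added to the token sequence. The sketch below uses hypothetical shapes and a random projection, not TacFiLM's actual network:

```python
# Minimal FiLM sketch (assumed dimensions, not TacFiLM's implementation):
# a tactile embedding is mapped to per-channel (gamma, beta) that modulate
# a visual feature map, adding no extra input tokens.
import numpy as np

rng = np.random.default_rng(0)
C = 8  # visual feature channels

# Stand-in linear projection from a 4-dim tactile embedding to (gamma, beta).
# A trained model would learn this; gamma is often initialized near 1.
W = rng.normal(size=(2 * C, 4))

def film(visual_feat, tactile_emb):
    """visual_feat: (C, H, W) feature map; tactile_emb: (4,) vector."""
    gamma_beta = W @ tactile_emb               # (2C,)
    gamma, beta = gamma_beta[:C], gamma_beta[C:]
    # Broadcast the per-channel affine modulation over the spatial dims.
    return gamma[:, None, None] * visual_feat + beta[:, None, None]

feat = rng.normal(size=(C, 5, 5))
tac = rng.normal(size=4)
out = film(feat, tac)
```

Because the conditioning enters as an affine transform on existing activations, the cost is one small projection per layer, which is why this style of fusion keeps the system light.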
Representative sources
- Tactile Modality Fusion for Vision-Language-Action Models — Charlotte Morissette; Amin Abyaneh; Wei-Di Chang; Anas Houssaini; David Meger; Hsiu-Chin Lin; …
- R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation — Yuhao Zhang; Wanxi Dong; Yue Shi; Yi Liang; Jingnan Gao; Qiaochu Yang; …
System-level efficiency and cross-robot scaling become new focal points
A further set of papers today is no longer mainly competing on model size, but instead addressing deployment and scalability gaps. OxyGen manages VLA KV cache as a cross-task shared resource and achieves up to 3.7× multi-task inference speedup on a single RTX 4090, while sustaining 200+ tokens/s language throughput and a 70 Hz action rate. WestWorld, by contrast, targets a unified world model across multiple robots, using Sys-MoE and structural embeddings to handle heterogeneous morphologies; after pretraining on 89 environments, it achieves MAE 7.737 on unseen Franka, clearly outperforming TrajWorld's 13.102.
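The cross-task KV-sharing idea can be illustrated with a cache keyed by shared prompt prefixes: concurrent tasks that begin from the same instruction prefix reuse one cached prefill result instead of recomputing it per task. This is a hedged sketch of the general technique, not OxyGen's actual system (its class and function names are invented here):

```python
# Hypothetical sketch of prefix-keyed KV sharing across concurrent tasks.

class SharedKVCache:
    def __init__(self):
        self.store = {}    # prefix tokens (tuple) -> cached "KV" payload
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix_tokens, compute_kv):
        """Reuse the cached prefill for this prefix, computing it once."""
        key = tuple(prefix_tokens)
        if key not in self.store:
            self.misses += 1
            self.store[key] = compute_kv(prefix_tokens)
        else:
            self.hits += 1
        return self.store[key]

# Stand-in for an expensive transformer prefill pass.
def fake_prefill(tokens):
    return {"kv_len": len(tokens)}

cache = SharedKVCache()
# Two tasks share the same system/instruction prefix, so the second
# request is served from cache rather than recomputed.
kv_a = cache.get_or_compute(["sys", "pick"], fake_prefill)
kv_b = cache.get_or_compute(["sys", "pick"], fake_prefill)
```

A production serving layer adds what this sketch omits (eviction, GPU memory accounting, and batching of the non-shared suffixes), but the hit/miss split is where the multi-task speedup comes from.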
Representative sources
- OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism — Xiangyu Li; Huaizhi Tang; Xin Ding; Weijun Wang; Ting Cao; Yunxin Liu
- WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems — Yuchen Wang; Jiangtao Kong; Sizhe Wei; Xiaochang Li; Haohong Lin; Hongjue Zhao; …
VLAs and data infrastructure extend toward UAVs and humanoid systems
Robot embodiments and tasks are also continuing to expand outward. AerialVLA brings the VLA framework to UAV navigation, retaining only front-view and downward-view dual perspectives, and replaces dense oracle instructions with fuzzy directional prompts. On the TravelUAV Seen split, it reaches 47.96% SR, 11.57 points above LongFly, while reducing total latency to 0.38 seconds. OmniClone, meanwhile, turns whole-body teleoperation into general-purpose data infrastructure: end-to-end latency is about 80 ms, it significantly outperforms GMT and Twist2 on multiple dynamic categories in OmniBench, and it can use collected data to train a VLA, reaching 85.71% success on real-world Pick-and-Place.
Representative sources
- AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control — Peng Xu; Zhengnan Deng; Jiayan Deng; Zonghua Gu; Shaohua Wan
- OmniClone: Engineering a Robust, All-Rounder Whole-Body Humanoid Teleoperation System — Yixuan Li; Le Ma; Yutang Lin; Yushi Du; Mengya Liu; Kaizhe Hu; …