Recoleta Item Note
EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
autonomous-driving, vision-language-action, knowledge-distillation, trajectory-planning, perception-planning, closed-loop-evaluation
Summary
EvoDriveVLA proposes a collaborative distillation framework for autonomous driving VLA models that improves both perception and planning. The core idea is to use “self-anchoring” to protect visual representations and an “oracle teacher” to provide stronger trajectory supervision, thereby improving both open-loop and closed-loop driving performance.
Problem
- Existing autonomous driving VLAs often damage the general perception capabilities learned during pretraining when the visual encoder is unfrozen during fine-tuning, leading to perceptual degradation.
- Long-horizon trajectory planning is prone to instability. In conventional distillation, a teacher trained under the same conditions as the student has no clear planning advantage, so it struggles to provide high-quality guidance.
- Existing multi-trajectory distillation methods typically rely on predefined planning vocabularies, so trajectory diversity and scene adaptability remain limited, which affects generalization and safety in real driving.
Approach
- Proposes collaborative perception-planning distillation: jointly distilling perception and planning, rather than only distilling the final trajectory.
- On the perception side, it uses self-anchored visual distillation: the student’s current visual encoder is first copied to serve as a “self-anchored teacher,” and the student’s visual tokens are then constrained during training so they do not drift too far from the anchor’s. This preserves the original visual capabilities while the model adapts to driving tasks.
- Designs AnchorFormer, which uses instructions, vehicle state, and ground-truth future trajectories to assign different anchoring strengths to different visual regions; key regions more relevant to future trajectories are constrained more strongly.
- On the planning side, it builds a future-aware oracle teacher that uses future images and future ego states, first generating coarse trajectories and then applying coarse-to-fine refinement to obtain better trajectory candidates.
- It then uses MC-dropout sampling to generate more high-quality, diverse candidates with relatively low additional cost, and selects the candidate with the smallest cross-entropy to the ground truth as the soft target, performing two-level distillation on the student’s hidden states and logits.
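The two distillation ideas above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation. The function names (`anchored_token_loss`, `mc_dropout_logits`, `select_soft_target`) are hypothetical, the squared-error drift measure is an assumption (the note does not state which distance is used), the per-token `weights` merely stand in for whatever AnchorFormer would output, and the "teacher" is a toy linear head with dropout on its input features.

```python
import math
import random


def anchored_token_loss(student_tokens, anchor_tokens, weights):
    """Weighted mean-squared drift of student tokens from frozen anchor tokens.

    ASSUMPTION: squared error is used only for illustration.
    """
    total, norm = 0.0, 0.0
    for s, a, w in zip(student_tokens, anchor_tokens, weights):
        drift = sum((si - ai) ** 2 for si, ai in zip(s, a)) / len(s)
        total += w * drift  # larger weight = stronger anchoring for this token
        norm += w
    return total / norm


def cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each step."""
    nll = 0.0
    for logits, t in zip(step_logits, target_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        nll += log_z - logits[t]
    return nll / len(target_ids)


def mc_dropout_logits(hidden, W, p, rng):
    """One stochastic pass of a toy linear teacher head with inverted dropout."""
    out = []
    for h in hidden:
        kept = [x / (1.0 - p) if rng.random() >= p else 0.0 for x in h]
        out.append([sum(w * x for w, x in zip(row, kept)) for row in W])
    return out


def select_soft_target(hidden, W, target_ids, n_samples=8, p=0.1, seed=0):
    """Sample candidates with dropout on; keep the one closest to ground truth."""
    rng = random.Random(seed)
    candidates = [mc_dropout_logits(hidden, W, p, rng) for _ in range(n_samples)]
    return min(candidates, key=lambda c: cross_entropy(c, target_ids))


# Toy data (hypothetical): 2 trajectory steps, 3 hidden features, 4 tokens.
hidden = [[1.0, 0.5, -0.5], [0.2, -1.0, 0.8]]
W = [[0.9, 0.1, 0.0], [0.0, -0.7, 0.3], [0.2, 0.2, 0.9], [-0.5, 0.4, 0.1]]
targets = [0, 2]
best = select_soft_target(hidden, W, targets)
print(cross_entropy(best, targets))
```

In this sketch the selected candidate's logits would play the role of the soft target. The hidden-state level of the two-level distillation is omitted for brevity.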
Results
- Achieves SOTA on nuScenes open-loop evaluation. Under the ST-P3 protocol, for example, EvoDriveVLA attains an average L2 error of 0.26 m, ahead of DiMA (0.27 m), OpenDriveVLA (0.33 m), and OmniDrive (0.33 m); its 3s L2 is 0.43 m, ahead of DiMA (0.44 m).
- For nuScenes collision under the ST-P3 protocol, the average collision rate is 0.06%, tied with DistillDrive (0.06%) and better than DiMA (0.08%) and OpenDriveVLA (0.10%); the 3s collision rate is 0.12%, better than DiMA (0.15%).
- On nuScenes open-loop evaluation under the UniAD protocol, the average L2 is 0.52 m, ahead of DiMA (0.57 m), OpenDriveVLA (0.67 m), and GPT-Driver (0.84 m); the 1s/2s/3s L2 values are 0.16/0.44/0.96 m, respectively.
- However, EvoDriveVLA is not best on every collision metric under the UniAD protocol. Its average collision rate of 0.12% is better than OpenDriveVLA (0.30%) but worse than DiMA (0.07%), and the result varies by horizon: the 2s collision rate of 0.02% beats DiMA (0.05%), while the 3s rate of 0.33% is worse than DiMA (0.16%).
- On the NAVSIM closed-loop navtest, EvoDriveVLA achieves a PDMS of 85.3, ahead of PARA-Drive (84.0), TransFuser (84.0), UniAD (83.4), and QwenVL2.5-8B (83.3); its EP of 81.1 is higher than UniAD (78.8) and InternVL3-8B (78.9).
- Other closed-loop metrics also reach best or tied-best results: NC 98.0, DAC 93.3, TTC 93.1, and Comfort 100. Overall, the results indicate gains in closed-loop decision-making as well as in open-loop prediction accuracy.
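As a quick consistency check on the UniAD-protocol figures quoted above, the short snippet below recomputes the average L2 from the three per-horizon values. It assumes the reported average is the unweighted mean over the 1s, 2s, and 3s horizons, which the note does not state explicitly but which matches the quoted 0.52 m.

```python
# Per-horizon L2 errors (metres) quoted above for the UniAD protocol.
l2_by_horizon = {"1s": 0.16, "2s": 0.44, "3s": 0.96}

# ASSUMPTION: the reported average is the unweighted mean over horizons.
avg_l2 = sum(l2_by_horizon.values()) / len(l2_by_horizon)
print(f"average L2 = {avg_l2:.2f} m")  # prints "average L2 = 0.52 m"
```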