Recoleta Item Note

O3N: Omnidirectional Open-Vocabulary Occupancy Prediction


Embodied AI

open-vocabulary-occupancy, omnidirectional-perception, 3d-scene-understanding, mamba, embodied-perception


Summary

O3N is presented as the first purely visual, end-to-end framework for panoramic open-vocabulary 3D occupancy prediction. It aims to reconstruct geometry and scalable semantics simultaneously from a single 360° image. It primarily addresses panoramic distortion, long-range context modeling, and semantic alignment for unseen categories, and it reports state-of-the-art (SOTA) results on QuadOcc and Human360Occ.

Problem

Existing 3D occupancy prediction methods typically rely on limited-view inputs and closed category sets, so they struggle to meet the 360° safe-perception needs of embodied agents in the open world.
Panoramic equirectangular projection (ERP) images suffer from geometric distortion and non-uniform sampling, which disrupt spatial continuity and increase the risk of sparse semantics in distant regions and of overfitting during training.
In the open-vocabulary setting, alignment across the pixel, voxel, and text modalities can easily fail because training only sees base classes, leading to poor generalization to novel classes.
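
The non-uniform sampling of ERP images can be made concrete with a short sketch. It assumes the standard equirectangular layout, with rows evenly spaced in latitude and columns evenly spaced in longitude. The function name `erp_pixel_solid_angle` and the 64×128 resolution are illustrative choices, not taken from the paper.

```python
import math


def erp_pixel_solid_angle(row: int, height: int, width: int) -> float:
    """Solid angle (in steradians) covered by one pixel in a given ERP row.

    Assumes row 0 is the top of the image (latitude +pi/2) and that every
    row spans the same latitude interval, as in a standard ERP image.
    """
    lat_top = math.pi / 2 - math.pi * row / height
    lat_bottom = math.pi / 2 - math.pi * (row + 1) / height
    dlon = 2 * math.pi / width
    # Area of a latitude/longitude patch on the unit sphere.
    return dlon * (math.sin(lat_top) - math.sin(lat_bottom))


# Hypothetical resolution, chosen only for illustration.
HEIGHT, WIDTH = 64, 128
pole = erp_pixel_solid_angle(0, HEIGHT, WIDTH)
horizon = erp_pixel_solid_angle(HEIGHT // 2, HEIGHT, WIDTH)
print(f"horizon pixel covers {horizon / pole:.1f}x the solid angle of a pole pixel")
```

Every pixel has the same size in the image, but a pixel in the top row covers far less of the sphere than a pixel at the horizon, so image-space operations treat the scene unevenly.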

Approach

Proposes O3N: given a single panoramic RGB image and category text, it directly predicts open-vocabulary 3D occupancy; the paper claims it is the first purely visual end-to-end framework for this task.
Uses Polar-spiral Mamba (PsM) to perform spiral scanning and dual-branch modeling on polar/cylindrical voxels. Put simply, it aggregates information from near to far in an order that better matches 360° geometry, then fuses this with Cartesian voxels to improve long-range context and spatial continuity modeling.
Uses Occupancy Cost Aggregation (OCA) to first compute a cost volume measuring “how well voxel features match text features,” then performs spatial aggregation and category aggregation, instead of directly hard-aligning discrete features, thereby reducing open-vocabulary overfitting.
Uses Natural Modality Alignment (NMA) for gradient-free text-prototype alignment: it repeatedly fuses text embeddings with semantic prototypes derived from pixel features to obtain a more consistent shared semantic space, alleviating the gap among pixel/voxel/text modalities.
The framework can be trained on top of occupancy networks such as MonoScene and SGN, with losses consisting of semantic occupancy supervision, voxel-pixel alignment, and OCA loss.
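
The cost-volume idea behind OCA can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: cosine similarity is assumed as the matching score, a plain neighbor average stands in for the spatial aggregation step, category aggregation is omitted, and all function names and toy values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def cost_volume(voxel_feats, text_feats):
    """cost[v][c]: how well voxel v matches the text embedding of class c."""
    return [[cosine(v, t) for t in text_feats] for v in voxel_feats]


def spatial_aggregate(cost, neighbors):
    """Average each voxel's per-class cost with its neighbors' costs.

    A simple stand-in (assumption) for the spatial aggregation step.
    """
    smoothed = []
    for v, row in enumerate(cost):
        group = [v] + neighbors[v]
        smoothed.append(
            [sum(cost[u][c] for u in group) / len(group) for c in range(len(row))]
        )
    return smoothed


def predict(cost):
    """Pick the best-matching class index for every voxel."""
    return [max(range(len(row)), key=row.__getitem__) for row in cost]


# Toy data (hypothetical): two class text embeddings, three voxels in a row.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
voxel_feats = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]]
neighbors = [[1], [0, 2], [1]]

raw = cost_volume(voxel_feats, text_feats)
print(predict(raw))                                # [0, 0, 1]
print(predict(spatial_aggregate(raw, neighbors)))  # [0, 0, 0]
```

In this toy case the third voxel's noisy match flips to agree with its neighbor once costs are aggregated, which illustrates why working on the cost volume can be gentler than hard-aligning each voxel feature to a single text embedding.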

Results

On QuadOcc, the paper reports improvements of +2.21 mIoU and +3.01 Novel mIoU over the baseline.
On Human360Occ, the paper reports improvements of +0.86 mIoU and +1.54 Novel mIoU over the baseline.
The QuadOcc results shown in Figure 1 indicate that O3N reaches 16.54 mIoU and 21.16 Novel mIoU, and the paper claims this is SOTA on that benchmark.
The paper claims to outperform existing open-vocabulary occupancy methods on both panoramic occupancy benchmarks, QuadOcc and Human360Occ, and to surpass some fully supervised methods.
In the dataset setup, QuadOcc treats vehicle/road/building as novel classes, accounting for about 68% of all voxels; Human360Occ treats 7 classes as novel, accounting for about 75%. Because novel classes dominate both datasets, the open-vocabulary evaluation is fairly challenging.
The abstract also claims notable cross-scene generalization and semantic scalability, but the provided excerpt does not include more detailed full-table metrics broken down by dataset or model.
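
For readers unfamiliar with the metrics, the sketch below shows one common way to compute mIoU and a novel-class mIoU from flattened voxel labels. It assumes that Novel mIoU is the mean IoU restricted to the novel classes and that classes absent from both prediction and ground truth are skipped; the benchmarks' exact protocols (ignored labels, percentage scaling) are not given in this note. The function names and toy labels are hypothetical.

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class over flattened voxel labels; None if a class never appears."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(ious, classes=None):
    """Mean IoU over all classes, or over a subset such as the novel classes."""
    selected = range(len(ious)) if classes is None else classes
    values = [ious[c] for c in selected if ious[c] is not None]
    return sum(values) / len(values) if values else 0.0


# Toy labels (hypothetical): three classes, with classes 0 and 1 treated as novel.
pred = [0, 0, 1, 2, 2, 1]
target = [0, 1, 1, 2, 2, 2]
ious = per_class_iou(pred, target, num_classes=3)
print(mean_iou(ious))                  # mean over all classes
print(mean_iou(ious, classes=[0, 1]))  # mean over novel classes only
```

Under this reading, a gain in Novel mIoU reflects improvement only on classes that had no labels during training, which is why it is reported separately from overall mIoU.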

Link

http://arxiv.org/abs/2603.12144v1
