Recoleta Item Note

GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models

vision-language-action, 3d-representation, depth-aware-reasoning, robot-manipulation, gaussian-tokens

Summary

GST-VLA proposes a VLA framework that compresses monocular depth and semantic features into 3D Gaussian spatial tokens, and adds a supervisable depth-aware chain of thought to explicitly reason about 3D geometry before action generation. The paper claims this structure is better suited than traditional 2D patch or scalar-depth representations for high-precision manipulation, and achieves higher success rates on LIBERO and SimplerEnv.

Problem

Existing VLA models typically use only 2D image patch tokens and lack explicit 3D geometry, surface orientation, and geometric confidence, so fine-grained tasks such as grasping and insertion fail more often.
Methods that only add dense depth maps still use “one scalar per pixel,” which cannot express local surface normals or curvature, and they spend the token budget uniformly, including on irrelevant regions.
The spatial reasoning process from vision to action is usually entirely implicit, lacking inspectable and supervisable intermediate 3D understanding steps.

Approach

Use a Gaussian Spatial Tokenizer to convert frozen semantic patch features and frozen monocular depth into 3D anisotropic Gaussian primitives; each token contains a 3D mean residual, 3-axis covariance, and opacity, representing position refinement, surface direction/shape, and geometric confidence, respectively.
First construct a Gaussian field from 256 raw spatial tokens, then compress it to 128 tokens through learned spatial attention pooling, concentrating the fixed token budget on geometrically more important regions.
Add Depth-Aware Chain-of-Thought to the VLM to explicitly generate four types of intermediate spatial reasoning outputs: 3D object grounding, grasp contact geometry, pairwise metric distances, and coarse SE(3) waypoints, and train them as supervised targets.
During DA-CoT generation, every VLM layer can cross-attend to the uncompressed 256 raw Gaussian tokens, enabling direct querying of fine-grained geometric regions rather than relying only on the compressed representation.
On the action side, a 300M-parameter flow-matching expert is used, conditioned simultaneously on VLM hidden states and DA-CoT outputs through dual cross-attention, to predict 10-step 7-DoF delta action chunks.
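The tokenizer and pooling steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the pinhole intrinsics, the 8-dimensional toy features, the diagonal (log-scale) covariance, the sigmoid opacity, and the names `backproject`, `make_token`, and `attention_pool` are all assumptions made for the example. Only the token counts (256 raw, 128 pooled) and the three per-token quantities come from the note.

```python
import math
import random

random.seed(0)

GRID = 16                      # 16 x 16 patches -> 256 raw tokens (count from the note)
N_OUT = 128                    # pooled token count (from the note)
FEAT_DIM = 8                   # toy feature width (assumption)


def backproject(u, v, depth, fx=100.0, fy=100.0, cx=8.0, cy=8.0):
    """Pinhole back-projection of pixel (u, v) at a given depth (intrinsics assumed)."""
    return [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]


def make_token(u, v, depth, feat, raw):
    """Build one Gaussian token from hypothetical network outputs `raw`:
    3 mean residuals, 3 log-scales, 1 opacity logit."""
    anchor = backproject(u, v, depth)
    mean = [a + r for a, r in zip(anchor, raw[0:3])]      # position refinement
    scales = [math.exp(s) for s in raw[3:6]]              # per-axis extent, always > 0
    opacity = 1.0 / (1.0 + math.exp(-raw[6]))             # confidence in (0, 1)
    return {"mean": mean, "scales": scales, "opacity": opacity, "feat": feat}


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, queries):
    """Each query attends over all raw tokens and returns a weighted feature sum."""
    scale = 1.0 / math.sqrt(FEAT_DIM)
    pooled, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, t["feat"])) for t in tokens]
        w = softmax(scores)
        pooled.append([sum(wi * t["feat"][d] for wi, t in zip(w, tokens))
                       for d in range(FEAT_DIM)])
        weights.append(w)
    return pooled, weights


# Random stand-ins for frozen depth, frozen semantic features, and tokenizer outputs.
tokens = []
for v in range(GRID):
    for u in range(GRID):
        depth = 0.5 + random.random()
        feat = [random.gauss(0, 1) for _ in range(FEAT_DIM)]
        raw = ([random.gauss(0, 0.01) for _ in range(3)]
               + [random.gauss(-3, 0.1) for _ in range(3)]
               + [random.gauss(0, 1)])
        tokens.append(make_token(u + 0.5, v + 0.5, depth, feat, raw))

# In a trained model these queries would be learned; here they are random.
queries = [[random.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(N_OUT)]
pooled, weights = attention_pool(tokens, queries)
print(len(tokens), len(pooled))  # 256 128
```

Because each pooled token is a softmax-weighted sum, a query can place most of its weight on a few raw tokens, which is one way a fixed budget of 128 tokens could concentrate on a small region instead of averaging uniformly.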
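To make the flow-matching step concrete, here is a hedged sketch of how a 10-step, 7-DoF action chunk could be sampled by integrating a velocity field from noise. The note does not describe the sampler, so the Euler integrator, the step count, and the function names are assumptions. `toy_velocity` is a hand-written stand-in for the learned expert: it is handed the target chunk directly, whereas the real model would infer its velocity from the VLM hidden states and DA-CoT outputs.

```python
import random

HORIZON, ACTION_DIM = 10, 7    # 10-step chunks of 7-DoF deltas (from the note)


def toy_velocity(x, t, target):
    """Stand-in for the learned expert (assumption): the ideal velocity that
    moves x toward `target` along a straight line, arriving at t = 1."""
    return [[(g - xi) / (1.0 - t) for xi, g in zip(row, trow)]
            for row, trow in zip(x, target)]


def sample_chunk(velocity_fn, target, steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise at t=0 toward t=1."""
    rng = random.Random(seed)
    x = [[rng.gauss(0, 1) for _ in range(ACTION_DIM)] for _ in range(HORIZON)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                      # t stays below 1, so 1 - t is never zero
        v = velocity_fn(x, t, target)
        x = [[xi + dt * vi for xi, vi in zip(row, vrow)]
             for row, vrow in zip(x, v)]
    return x


# Hypothetical target chunk standing in for what the conditioning would imply.
target = [[0.01 * (i + j) for j in range(ACTION_DIM)] for i in range(HORIZON)]
chunk = sample_chunk(toy_velocity, target)
max_err = max(abs(a - b) for row, trow in zip(chunk, target)
              for a, b in zip(row, trow))
print(len(chunk), len(chunk[0]))  # 10 7
```

With this idealized velocity the final Euler step lands on the target up to floating-point error; a trained network would only approximate that field.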

Results

The paper claims 96.4% success on LIBERO, an improvement of +2.0 percentage points over the baseline.
It reaches 80.2% on SimplerEnv, an improvement of +5.4 percentage points over the baseline.
Key GST component ablations: removing the residual mean $\mu_k$ loses 1.9 percentage points; changing anisotropic to isotropic covariance loses 1.6 percentage points; fixing opacity at $\alpha=1$ loses 1.5 percentage points; replacing 3D Fourier positional encoding with 2D learned PE loses 2.8 percentage points; replacing spatial attention pooling with average pooling loses 2.1 percentage points.
DA-CoT component ablations: removing 3D object grounding $c_1$ loses 1.9 percentage points; removing the SE(3) waypoint thought $c_4$ loses 2.3 percentage points, the largest effect among the four thoughts.
Action expert ablations: removing the conditioning branch from DA-CoT action tokens loses 3.1 percentage points; replacing the MoE feedforward with a single dense FFN loses 1.7 percentage points.
The authors also claim these gains are more concentrated on high-precision tasks, especially in grasp accuracy, collision avoidance, and overall task success rate, but the excerpt does not provide finer-grained per-task numbers.
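As a quick arithmetic check on the numbers above, the sketch below derives the baseline scores implied by the reported gains and ranks the ablations by how much they cost. The labels are shorthand invented for this example. The note does not say which benchmark the ablation deltas were measured on, so they are compared only with each other.

```python
# Reported score and gain over baseline, in percent / percentage points.
reported = {"LIBERO": (96.4, 2.0), "SimplerEnv": (80.2, 5.4)}

# Implied baseline = reported score minus reported gain.
implied = {name: round(score - gain, 1) for name, (score, gain) in reported.items()}
print(implied)  # {'LIBERO': 94.4, 'SimplerEnv': 74.8}

# Ablation drops in percentage points, as listed in the note.
ablations = {
    "no residual mean": 1.9,
    "isotropic covariance": 1.6,
    "opacity fixed at 1": 1.5,
    "2D learned PE instead of 3D Fourier PE": 2.8,
    "average pooling instead of attention pooling": 2.1,
    "no 3D object grounding thought": 1.9,
    "no SE(3) waypoint thought": 2.3,
    "no DA-CoT conditioning in action expert": 3.1,
    "dense FFN instead of MoE": 1.7,
}

# Largest drop first; ties keep their listed order because sorted() is stable.
ranked = sorted(ablations.items(), key=lambda kv: -kv[1])
for name, drop in ranked[:3]:
    print(f"{drop:.1f}  {name}")
```

On these figures, the DA-CoT conditioning branch and the 3D Fourier positional encoding are the two most costly components to remove.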

Link

http://arxiv.org/abs/2603.09079v1
