Recoleta Item Note

GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models


vision-language-action, 3d-representation, depth-aware-reasoning, robot-manipulation, gaussian-tokens

GST-VLA proposes a VLA framework that compresses monocular depth and semantic features into 3D Gaussian spatial tokens, and adds a supervisable depth-aware chain of thought to explicitly reason about 3D geometry before action generation. The paper claims this structure is better suited than traditional 2D patch or scalar-depth representations for high-precision manipulation, and achieves higher success rates on LIBERO and SimplerEnv.

  • Existing VLA models typically use only 2D image patch tokens and lack explicit 3D geometry, surface orientation, and geometric confidence, causing fine-grained tasks such as grasping and insertion to fail more easily.
  • Methods that only add dense depth maps still carry one scalar per pixel, which cannot express local surface normals or curvature, and they spend the token budget uniformly, including on irrelevant regions.
  • The spatial reasoning process from vision to action is usually entirely implicit, lacking inspectable and supervisable intermediate 3D understanding steps.
  • Use a Gaussian Spatial Tokenizer to convert frozen semantic patch features and frozen monocular depth into 3D anisotropic Gaussian primitives; each token contains a 3D mean residual, 3-axis covariance, and opacity, representing position refinement, surface direction/shape, and geometric confidence, respectively.
  • First construct a Gaussian field from 256 raw spatial tokens, then compress it to 128 tokens through learned spatial attention pooling, concentrating the fixed token budget on geometrically more important regions.
  • Add Depth-Aware Chain-of-Thought to the VLM to explicitly generate four types of intermediate spatial reasoning outputs: 3D object grounding, grasp contact geometry, pairwise metric distances, and coarse SE(3) waypoints, and train them as supervised targets.
  • During DA-CoT generation, every VLM layer can cross-attend to the uncompressed 256 raw Gaussian tokens, enabling direct querying of fine-grained geometric regions rather than relying only on the compressed representation.
  • On the action side, a 300M-parameter flow-matching expert is used, conditioned simultaneously on VLM hidden states and DA-CoT outputs through dual cross-attention, to predict 10-step 7-DoF delta action chunks.
  • The paper claims 96.4% success on LIBERO, an improvement of +2.0 percentage points over the baseline.
  • It reaches 80.2% on SimplerEnv, an improvement of +5.4 percentage points over the baseline.
  • Key GST component ablations: removing the residual mean μ_k loses 1.9 percentage points; changing anisotropic to isotropic covariance loses 1.6 percentage points; fixing opacity at α = 1 loses 1.5 percentage points; replacing 3D Fourier positional encoding with 2D learned PE loses 2.8 percentage points; replacing spatial attention pooling with average pooling loses 2.1 percentage points.
  • DA-CoT component ablations: removing 3D object grounding (thought c_1) loses 1.9 percentage points; removing the SE(3) waypoint thought (c_4) loses 2.3 percentage points, the largest effect among the four thoughts.
  • Action expert ablations: removing the conditioning branch from DA-CoT action tokens loses 3.1 percentage points; replacing the MoE feedforward with a single dense FFN loses 1.7 percentage points.
  • The authors also claim these gains are more concentrated on high-precision tasks, especially in grasp accuracy, collision avoidance, and overall task success rate, but the excerpt does not provide finer-grained per-task numbers.
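
To make the token layout described above concrete, here is a minimal Python sketch of one Gaussian spatial token. It is not the paper's implementation. The pinhole back-projection, the intrinsics, the axis-aligned (diagonal) covariance, and the names `GaussianToken`, `backproject`, and `make_token` are all assumptions made for illustration; the summary only states that each token carries a 3D mean residual, a 3-axis covariance, and an opacity.

```python
import math
from dataclasses import dataclass


@dataclass
class GaussianToken:
    mean: tuple        # refined 3D position (x, y, z)
    covariance: tuple  # per-axis variances (diagonal, axis-aligned; an assumption)
    opacity: float     # geometric confidence in (0, 1)


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth (assumed camera model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def make_token(u, v, depth, residual, log_scales, opacity_logit,
               fx=500.0, fy=500.0, cx=112.0, cy=112.0):
    """Combine a depth-derived anchor with a predicted residual, scales, and opacity.

    The intrinsics defaults are hypothetical values, not taken from the paper.
    """
    anchor = backproject(u, v, depth, fx, fy, cx, cy)
    mean = tuple(a + r for a, r in zip(anchor, residual))
    # exp keeps each axis scale positive; squaring turns it into a variance.
    covariance = tuple(math.exp(s) ** 2 for s in log_scales)
    # Sigmoid maps an unconstrained logit to an opacity in (0, 1).
    opacity = 1.0 / (1.0 + math.exp(-opacity_logit))
    return GaussianToken(mean, covariance, opacity)


# A patch at the image center, 2 m away, flattened along the viewing axis.
token = make_token(u=112.0, v=112.0, depth=2.0,
                   residual=(0.01, -0.02, 0.05),
                   log_scales=(0.0, 0.0, math.log(0.1)),
                   opacity_logit=0.0)
```

In this toy case the third variance is much smaller than the other two, which is how an anisotropic covariance can encode a locally flat surface facing the camera. The isotropic ablation would force all three variances to be equal.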
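
The 256-to-128 compression step can be sketched as attention pooling with one query vector per output token. This is a common way to build such a layer, assumed here for illustration; the summary does not give the exact form of the paper's spatial attention pooling. The random vectors stand in for learned queries and token features, and the feature width `DIM` is a toy value.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, queries):
    """Pool len(tokens) inputs into len(queries) outputs via dot-product attention."""
    dim = len(tokens[0])
    pooled, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
                  for t in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all input tokens.
        pooled.append([sum(w * t[d] for w, t in zip(weights, tokens))
                       for d in range(dim)])
        all_weights.append(weights)
    return pooled, all_weights


rng = random.Random(0)
DIM = 8  # toy feature width, not from the paper
raw_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
# Stand-ins for learned query vectors, one per compressed token.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(128)]

pooled, weights = attention_pool(raw_tokens, queries)
print(len(pooled), len(pooled[0]))  # 128 8
```

Average pooling, the ablated alternative, corresponds to forcing every weight to be uniform over its inputs, so no output token can focus on a particular region.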
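
At inference time, a flow-matching action expert typically produces an action chunk by integrating a learned velocity field from a starting sample at t = 0 to t = 1. The sketch below shows that sampling loop for a 10-step, 7-DoF chunk using fixed-step Euler integration. The solver choice, the step count, and the `toy_velocity` function are assumptions. In the real model the velocity would come from the 300M-parameter expert conditioned on VLM hidden states and DA-CoT outputs, which is not reproduced here.

```python
def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = [row[:] for row in x0]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


HORIZON, DOF = 10, 7  # 10-step chunk of 7-DoF delta actions, as in the summary

# Toy target chunk standing in for what a trained expert would produce.
target = [[0.01 * (step + 1)] * DOF for step in range(HORIZON)]


def toy_velocity(x, t):
    """Hypothetical velocity field: moves x on a straight path to `target`, arriving at t=1."""
    return [[(gi - xi) / (1.0 - t) for xi, gi in zip(xr, gr)]
            for xr, gr in zip(x, target)]


# Fixed start point for reproducibility; a real sampler would draw Gaussian noise.
start = [[0.5] * DOF for _ in range(HORIZON)]
chunk = euler_sample(toy_velocity, start, steps=10)
```

With this particular toy field the Euler iterates land on `target` at the final step, which makes the loop easy to check. A learned field gives no such guarantee, and the number of integration steps becomes a trade-off between accuracy and latency.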