Recoleta Item Note

Structural Action Transformer for 3D Dexterous Manipulation

dexterous-manipulation, cross-embodiment-transfer, 3d-point-clouds, transformer-policy, flow-matching, robot-imitation-learning

Summary

This paper proposes the Structural Action Transformer (SAT) for cross-embodiment imitation learning with high-DoF dexterous hands, rewriting actions from a time-ordered sequence into a joint-ordered 3D structural sequence. This representation allows a single Transformer to naturally handle hands with different numbers of joints, enabling better transfer and sample efficiency on large-scale heterogeneous human/robot datasets.

Problem

The goal is cross-embodiment skill transfer for high-DoF dexterous hands on heterogeneous datasets: hand types, joint counts, and kinematic structures vary significantly across embodiments, making it difficult for traditional imitation learning to share skills.
Existing methods mostly use 2D observations and time-centric action representations $(T, D_a)$, which struggle to express the 3D spatial relationships required for fine manipulation and cannot naturally align action dimensions across different embodiments.
This matters because if data cannot be reused across human hands, robot hands, and simulation platforms, dexterous manipulation policies will be hard to scale into general-purpose high-DoF robot foundation models.

Approach

The core idea is to restructure an action chunk from the traditional time-major $(T, D_a)$ sequence into a joint-major $(D_a, T)$ sequence: each token no longer represents the full-hand action at a given time step, but instead represents the trajectory of a single joint over a future time window.
With this design, different robots differ only in sequence length $D_a$; Transformers natively support variable-length sequences, so they can more naturally handle heterogeneous embodiments and learn correspondences between joint functions.
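The reshaping described above can be sketched in a few lines of NumPy. This is an illustration only, not the authors' code: the function names, the 24-DoF and 16-DoF hands, and the padding-plus-mask batching are assumptions, and any temporal compression the paper applies to each joint trajectory is ignored.

```python
import numpy as np

def to_joint_tokens(action_chunk: np.ndarray) -> np.ndarray:
    """Turn a time-major chunk (T, D_a) into joint-major tokens (D_a, T)."""
    # After the transpose, row j holds joint j's trajectory over the T-step window.
    return action_chunk.T

def pad_joint_tokens(chunks):
    """Batch joint-major sequences of different lengths with a validity mask."""
    tokens = [to_joint_tokens(c) for c in chunks]
    max_joints = max(t.shape[0] for t in tokens)
    horizon = tokens[0].shape[1]  # assumes every chunk shares the same horizon T
    batch = np.zeros((len(tokens), max_joints, horizon), dtype=np.float32)
    mask = np.zeros((len(tokens), max_joints), dtype=bool)
    for i, t in enumerate(tokens):
        batch[i, : t.shape[0]] = t
        mask[i, : t.shape[0]] = True  # True marks real joints, False marks padding
    return batch, mask

T = 16
hand_a = np.random.randn(T, 24).astype(np.float32)  # hypothetical 24-DoF hand
hand_b = np.random.randn(T, 16).astype(np.float32)  # hypothetical 16-DoF hand
batch, mask = pad_joint_tokens([hand_a, hand_b])
print(batch.shape, mask.sum(axis=1))
```

In this sketch the two hands differ only in how many rows of the batch are marked valid, which is the property that lets one sequence model consume both.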
To tell the model "which joint this is, what it does, and how it rotates," the authors design an Embodied Joint Codebook, using a triplet (embodiment id, functional category, rotation axis) to add structural-prior embeddings to each joint.
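One plausible way to realize such a codebook is a lookup table per triplet component, with the three embeddings summed per joint. The sketch below assumes exactly that; the vocabularies, the embedding size, and the choice to sum (rather than concatenate) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # illustrative size only

# Hypothetical vocabularies; the paper's actual categories are not given in this note.
EMBODIMENTS = {"human_hand": 0, "robot_hand_a": 1}
FUNCTIONS = {"flexion": 0, "abduction": 1, "wrist": 2}
AXES = {"x": 0, "y": 1, "z": 2}

# One table per triplet component (random here; they would be learned in a real model).
embodiment_table = rng.normal(size=(len(EMBODIMENTS), EMBED_DIM))
function_table = rng.normal(size=(len(FUNCTIONS), EMBED_DIM))
axis_table = rng.normal(size=(len(AXES), EMBED_DIM))

def joint_embedding(embodiment: str, function: str, axis: str) -> np.ndarray:
    """Sum the three component embeddings for one joint (assumed combination rule)."""
    return (
        embodiment_table[EMBODIMENTS[embodiment]]
        + function_table[FUNCTIONS[function]]
        + axis_table[AXES[axis]]
    )

def embed_joints(joint_specs) -> np.ndarray:
    """Stack per-joint embeddings into an array of shape (D_a, EMBED_DIM)."""
    return np.stack([joint_embedding(*spec) for spec in joint_specs])

joints = [
    ("human_hand", "flexion", "x"),
    ("robot_hand_a", "flexion", "x"),
]
print(embed_joints(joints).shape)
```

Because the two example joints share a functional category and rotation axis, their embeddings differ only by the embodiment term, which is one way a model could learn that joints on different hands play the same role.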
The input uses 3D point cloud history + language instructions: point clouds are processed with FPS + PointNet to extract local/global tokens, and language is encoded with T5; these observation tokens are fed together with structured action tokens into DiT.
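For readers unfamiliar with FPS (farthest point sampling), the following is a minimal NumPy sketch of the standard greedy algorithm. It is a generic illustration, not the paper's preprocessing code; the function name and the `start_index` parameter are assumptions, and the PointNet and T5 encoders are not shown.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, num_samples: int, start_index: int = 0) -> np.ndarray:
    """Greedily pick indices of points that are far apart from each other.

    points: array of shape (N, 3); num_samples must not exceed N.
    """
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    # Squared distance from every point to its nearest already-selected point.
    min_dist = np.full(n, np.inf)
    current = start_index
    for i in range(num_samples):
        selected[i] = current
        diff = points - points[current]
        dist = np.einsum("ij,ij->i", diff, diff)
        min_dist = np.minimum(min_dist, dist)
        # The next centre is the point farthest from everything chosen so far.
        current = int(np.argmax(min_dist))
    return selected

cloud = np.random.default_rng(0).uniform(size=(1024, 3))
centres = cloud[farthest_point_sampling(cloud, 64)]
print(centres.shape)
```

The sampled centres would typically serve as anchors for local feature extraction, while a pooled feature over the whole cloud gives a global token.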
During training, the model does not directly regress actions, but instead uses continuous-time flow matching to learn the velocity field that transports Gaussian noise to action chunks; at inference time, it uses an ODE solver to generate the full action segment, which the paper says can be done with a single network function evaluation (1-NFE).
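The training target and the one-step sampler can be illustrated with a small NumPy sketch. It assumes the common linear interpolation path between noise and data, for which the target velocity is simply `actions - noise`; the note does not say which path the paper uses, and `velocity_fn` stands in for the conditioned DiT.

```python
import numpy as np

def flow_matching_pair(actions: np.ndarray, rng: np.random.Generator):
    """Build one training example: interpolated sample x_t, time t, target velocity."""
    noise = rng.standard_normal(actions.shape)
    # One time value per batch element, broadcast over the remaining axes.
    t = rng.uniform(size=(actions.shape[0],) + (1,) * (actions.ndim - 1))
    x_t = (1.0 - t) * noise + t * actions  # assumed linear path from noise to data
    target_velocity = actions - noise      # time derivative of that path
    return x_t, t, target_velocity

def flow_matching_loss(predicted_velocity: np.ndarray, target_velocity: np.ndarray) -> float:
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, noise: np.ndarray, num_steps: int = 1) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 to t=1; num_steps=1 is a single evaluation."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
actions = rng.standard_normal((4, 24, 16))  # hypothetical batch of (D_a, T) chunks
x_t, t, v = flow_matching_pair(actions, rng)
print(x_t.shape, t.shape, flow_matching_loss(v, v))
```

With a perfectly straight path, a single Euler step from `t=0` already lands on the data sample, which is the intuition behind few-step or one-step generation.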

Results

On 11 simulated dexterous manipulation tasks (3 from Adroit, 4 from DexArt, 4 from Bi-DexHands), SAT achieves an average success rate of 0.71±0.04, outperforming all comparison methods.
Compared with 3D baselines: SAT 0.71±0.04 vs 3D ManiFlow Policy 0.66±0.04 vs 3D Diffusion Policy 0.63±0.06; this is an average success-rate gain of 0.05 and 0.08, respectively.
By dataset, SAT reaches 0.75±0.02 / 0.73±0.03 / 0.67±0.05 on Adroit/DexArt/Bi-DexHands; the strongest corresponding baseline, 3D ManiFlow, achieves 0.70±0.02 / 0.70±0.03 / 0.59±0.07.
Compared with 2D methods, SAT shows an even larger advantage: average success rate 0.71 versus UniAct 0.50, HPT 0.47, and Diffusion Policy 0.42.
It is also highly parameter-efficient: SAT has only 19.36M parameters, yet outperforms 3D ManiFlow (218.9M), 3D Diffusion Policy (255.2M), and UniAct (1053M).
Ablations show that temporal compression dimensions of 32, 64, and 128 all achieve a 0.71 success rate; the 64-dimensional configuration corresponds to 19.36M parameters and 0.99G FLOPs at 1-NFE. The paper also claims better sample efficiency and effective cross-embodiment transfer, but the excerpt does not provide more detailed sample-efficiency curve values.
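As a quick cross-check of the margins quoted in this section, the short script below recomputes the parameter ratios and average success-rate gains from the reported numbers. Only figures stated above are used; the variable names are illustrative.

```python
# Reported parameter counts (millions) and average success rates from this note.
sat_params_m = 19.36
sat_success = 0.71

baseline_params_m = {
    "3D ManiFlow Policy": 218.9,
    "3D Diffusion Policy": 255.2,
    "UniAct": 1053.0,
}
baseline_success = {
    "3D ManiFlow Policy": 0.66,
    "3D Diffusion Policy": 0.63,
    "UniAct": 0.50,
}

# How many times larger each baseline is than SAT, and SAT's success-rate margin.
param_ratios = {name: p / sat_params_m for name, p in baseline_params_m.items()}
success_gains = {name: sat_success - s for name, s in baseline_success.items()}

for name in baseline_params_m:
    print(f"{name}: {param_ratios[name]:.1f}x parameters, +{success_gains[name]:.2f} success rate")
```

The baselines come out at roughly 11x, 13x, and 54x the size of SAT, with average gains of 0.05, 0.08, and 0.21 respectively.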

Link

http://arxiv.org/abs/2603.03960v1
