Recoleta Item Note

Structural Action Transformer for 3D Dexterous Manipulation

dexterous-manipulation, cross-embodiment-transfer, 3d-point-clouds, transformer-policy, flow-matching, robot-imitation-learning

This paper proposes the Structural Action Transformer (SAT) for cross-embodiment imitation learning with high-DoF dexterous hands, rewriting actions from a time-ordered sequence into a joint-ordered 3D structural sequence. This representation allows a single Transformer to naturally handle hands with different numbers of joints, enabling better transfer and sample efficiency on large-scale heterogeneous human/robot datasets.

  • The goal is to solve cross-embodiment skill transfer for high-DoF dexterous hands on heterogeneous datasets: hand types, joint counts, and kinematic structures vary significantly, which makes it difficult for traditional imitation learning to share skills across them.
  • Existing methods mostly use 2D observations and time-centric action representations $(T, D_a)$, which struggle to express the 3D spatial relationships required for fine manipulation and cannot naturally align action dimensions across different embodiments.
  • This matters because if data cannot be reused across human hands, robot hands, and simulation platforms, dexterous manipulation policies will be hard to scale into general-purpose high-DoF robot foundation models.
  • The core idea is to restructure an action chunk from the traditional $(T, D_a)$ time sequence into a $(D_a, T)$ joint sequence: each token no longer represents the full-hand action at a given time step, but instead represents the trajectory of a single joint over a future time window.
  • With this design, different robots differ only in sequence length $D_a$; Transformers natively support variable-length sequences, so they can more naturally handle heterogeneous embodiments and learn correspondences between joint functions.
  • To tell the model "which joint this is, what it does, and how it rotates," the authors design an Embodied Joint Codebook, using a triplet (embodiment id, functional category, rotation axis) to add structural-prior embeddings to each joint.
  • The input uses 3D point cloud history + language instructions: point clouds are processed with FPS + PointNet to extract local/global tokens, and language is encoded with T5; these observation tokens are fed together with structured action tokens into DiT.
  • During training, the model does not regress actions directly; it uses continuous-time flow matching to learn the velocity field that carries Gaussian noise to action chunks. At inference time, an ODE solver generates the full action chunk, which the paper says can be done with a single function evaluation (1-NFE).
  • On 11 simulated dexterous manipulation tasks (3 from Adroit, 4 from DexArt, 4 from Bi-DexHands), SAT achieves an average success rate of 0.71±0.04, outperforming all comparison methods.
  • Against the 3D baselines, SAT scores 0.71±0.04 versus 0.66±0.04 for 3D ManiFlow Policy and 0.63±0.06 for 3D Diffusion Policy, which amounts to average success-rate gains of 0.05 and 0.08, respectively.
  • By dataset, SAT reaches 0.75±0.02 / 0.73±0.03 / 0.67±0.05 on Adroit/DexArt/Bi-DexHands; the strongest corresponding baseline, 3D ManiFlow, achieves 0.70±0.02 / 0.70±0.03 / 0.59±0.07.
  • Compared with 2D methods, SAT shows an even larger advantage: average success rate 0.71 versus UniAct 0.50, HPT 0.47, and Diffusion Policy 0.42.
  • It is also highly parameter-efficient: SAT has only 19.36M parameters, yet it outperforms much larger models such as 3D ManiFlow (218.9M), 3D Diffusion Policy (255.2M), and UniAct (1053M).
  • Ablations show that temporal compression dimensions 32/64/128 all achieve a 0.71 success rate; the 64-dimensional configuration corresponds to 19.36M parameters, 0.99G FLOPs (1-NFE). The paper also claims better sample efficiency and effective cross-embodiment transfer, but the excerpt does not provide more detailed sample-efficiency curve values.
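
The $(T, D_a)$ to $(D_a, T)$ restructuring described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy chunks, and the pad-and-mask batching step are assumptions made for illustration (padding with a validity mask is just one common way to batch sequences of different lengths for a Transformer).

```python
from typing import List, Tuple

Matrix = List[List[float]]  # nested lists standing in for a 2D array


def to_joint_tokens(chunk: Matrix) -> Matrix:
    """Transpose a time-major (T, D_a) chunk into joint-major (D_a, T) tokens.

    Each output row is one token: a single joint's trajectory over T steps.
    """
    if not chunk:
        return []
    n_joints = len(chunk[0])
    if any(len(step) != n_joints for step in chunk):
        raise ValueError("every time step must list the same number of joints")
    return [list(trajectory) for trajectory in zip(*chunk)]


def pad_joint_tokens(
    batch: List[Matrix], horizon: int
) -> Tuple[List[Matrix], List[List[bool]]]:
    """Pad token sequences to the longest D_a in the batch (assumed batching scheme).

    Returns the padded sequences and a mask that is True for real joints.
    """
    max_joints = max(len(tokens) for tokens in batch)
    padded, mask = [], []
    for tokens in batch:
        n_pad = max_joints - len(tokens)
        padded.append(tokens + [[0.0] * horizon for _ in range(n_pad)])
        mask.append([True] * len(tokens) + [False] * n_pad)
    return padded, mask


# Hypothetical chunks with T = 3 time steps: a 2-joint gripper and a 4-joint hand.
gripper_chunk = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
hand_chunk = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1], [1.2, 2.2, 3.2, 4.2]]

gripper_tokens = to_joint_tokens(gripper_chunk)  # 2 tokens, each of length 3
hand_tokens = to_joint_tokens(hand_chunk)        # 4 tokens, each of length 3
padded, mask = pad_joint_tokens([gripper_tokens, hand_tokens], horizon=3)
```

In this layout the two embodiments differ only in how many tokens they contribute, while every token has the same width $T$, which is what lets one model consume both.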
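
The Embodied Joint Codebook idea, a triplet (embodiment id, functional category, rotation axis) mapped to a structural-prior embedding per joint, might look roughly like the sketch below. Everything concrete here is an assumption: the vocabulary entries, the toy embedding width, the random tables standing in for learned embeddings, and the choice to sum the three embeddings before adding them to each joint token.

```python
import random
from typing import Dict, List, Sequence, Tuple

EMBED_DIM = 4  # toy width; the real model dimension is not given in this note
JointKey = Tuple[str, str, str]  # (embodiment id, functional category, rotation axis)


def make_table(keys: Sequence[str], rng: random.Random) -> Dict[str, List[float]]:
    """Random lookup table standing in for a learned embedding table."""
    return {key: [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for key in keys}


rng = random.Random(0)
# Hypothetical vocabularies, chosen only for illustration.
embodiment_table = make_table(["human_hand", "robot_hand"], rng)
function_table = make_table(["thumb_flex", "index_flex"], rng)
axis_table = make_table(["x", "y", "z"], rng)


def joint_prior(joint: JointKey) -> List[float]:
    """Sum the three component embeddings for one joint (summation is an assumption)."""
    embodiment, function, axis = joint
    parts = (embodiment_table[embodiment], function_table[function], axis_table[axis])
    return [sum(values) for values in zip(*parts)]


def add_joint_priors(
    token_embeddings: List[List[float]], joints: List[JointKey]
) -> List[List[float]]:
    """Add each joint's structural prior to its token embedding."""
    if len(token_embeddings) != len(joints):
        raise ValueError("need exactly one joint descriptor per token")
    return [
        [value + prior for value, prior in zip(token, joint_prior(joint))]
        for token, joint in zip(token_embeddings, joints)
    ]


robot_joints = [("robot_hand", "thumb_flex", "z"), ("robot_hand", "index_flex", "x")]
token_embeddings = [[0.0] * EMBED_DIM, [1.0] * EMBED_DIM]
conditioned = add_joint_priors(token_embeddings, robot_joints)
```

Under this factorization, two joints on different hands that share a functional category and rotation axis receive priors that differ only in the embodiment component, which is one plausible way such a codebook could encourage joint correspondences across embodiments.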
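
The flow-matching training target and ODE-based sampling mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a straight-line path between noise and action (one common flow-matching choice; the note does not say which path the paper uses), a plain Euler integrator, and a hand-written `oracle_velocity` in place of the learned DiT. All function names are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(noise: Vector, action: Vector, t: float) -> Vector:
    """Point on the assumed straight-line path: x_t = (1 - t) * noise + t * action."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise: Vector, action: Vector) -> Vector:
    """Velocity of the straight-line path, constant in t: action - noise."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(
    predict_velocity: VelocityFn, noise: Vector, action: Vector, t: float
) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    predicted = predict_velocity(interpolate(noise, action, t), t)
    target = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(predict_velocity: VelocityFn, noise: Vector, num_steps: int = 1) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1; num_steps = 1 is one function evaluation."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        velocity = predict_velocity(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Toy setup: one flattened action chunk and an oracle field that points straight at it.
rng = random.Random(0)
action = [0.2, -0.5, 0.9]
noise = [rng.gauss(0.0, 1.0) for _ in action]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Exact velocity field for a single target action (valid for t < 1)."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


loss = flow_matching_loss(oracle_velocity, noise, action, t=0.3)
one_step = euler_sample(oracle_velocity, noise, num_steps=1)
```

With a perfectly straight velocity field, a single Euler step already lands on the target, which gives some intuition for why 1-NFE generation can be feasible; a learned field is only approximately straight, so the quality of one-step sampling depends on the trained model.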