Recoleta Item Note

Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

vision-language-action dexterous-manipulationbimanual-roboticstactile-force-fusionhierarchical-policyshared-autonomy

Summary

This paper proposes an integrated framework for human-like bimanual dexterous manipulation: RL-trained IMCopilot assists teleoperation and serves as a low-level skill module during execution, while MoDE-VLA robustly incorporates force/tactile sensing into a pretrained VLA. Targeting high-DoF, contact-rich in-hand manipulation, it claims about a 2x success-rate improvement over the baseline across 4 tasks.

Problem

Existing VLAs mostly remain limited to low-DoF grippers and simple pick-and-place, and are difficult to extend to 63-DoF human-like in-hand manipulation and bimanual coordination with two arms and two hands.
High-quality demonstration data are hard to collect: pure teleoperation struggles to stably complete multi-finger coordination and in-hand rotation, especially for contact-rich tasks such as apple peeling.
A single policy has difficulty covering coarse motion, force-control phases such as insertion/cutting, and tactile-driven in-hand adjustment at the same time; meanwhile, directly concatenating force/tactile inputs into a pretrained VLA may also damage its original capabilities.

Approach

Proposes IMCopilot: a set of RL-trained atomic in-hand skills (e.g., stable grasping, rotation around a specified axis). During data collection, they are triggered by the human operator via a foot pedal to help complete the hardest in-hand stages; during autonomous execution, they are likewise called by trigger signals output by the VLA, forming hierarchical control.
Proposes MoDE-VLA: beyond a pretrained OpenPI-0 / PaliGemma-style VLA backbone, it builds separate force and tactile channels instead of simply concatenating inputs.
Uses arm joint torques as the force modality and 10 fingertip 6-DoF tactile/force-torque readings as the tactile modality; after projection into tokens, they interact through self-attention together with backbone context and autoregressive/flow-matching action states.
Uses a sparse Mixture-of-Experts to select experts by token/time step, learning different correction patterns for different contact phases (approach, initial contact, stable grasping, dynamic rotation).
Through residual injection, force mainly corrects arm actions and tactile mainly corrects hand actions; when IMCopilot is triggered, hand actions can be directly taken over by the RL skills.

Results

In the comparison of in-hand manipulation capability, IMCopilot significantly outperforms pure teleoperation: ping-pong ball 3/30→25/30 (10%→83%), tennis ball 20/30→28/30 (67%→93%), apple 8/30→27/30 (27%→90%), overall 31/90→80/90 (34%→89%).
The paper evaluates 4 contact-rich tasks: gear assembling, charger plugging, test tube rearranging, and apple peeling; each method is tested for 20 trials per task, with Success Rate as the main metric, and apple peeling additionally reports Peel Completion Ratio.
The abstract claims that on dexterous contact-rich tasks, compared with the baseline, it achieves "doubled success rate improvement", i.e., success rate improves to roughly 2x the previous level; the explicit baseline is the pretrained (\pi_0).
The paper also claims, to the best of its knowledge, the first autonomous apple peeling with dual dexterous hands, a composite task requiring the joint contribution of vision, force, touch, bimanual coordination, and in-hand rotation.
Limited by the provided excerpt, the complete per-task numeric tables and ablation results are not fully available; the strongest quantitative evidence currently comes mainly from Table I and the abstract’s claim of about 2x success-rate improvement.

Link

http://arxiv.org/abs/2603.08122v1

Built with Recoleta

Run your own research radar

Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.

View repo 5-minute quickstart