$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
This paper proposes π-StepNFT, an online reinforcement learning method for flow-based vision-language-action models, which fine-tunes robot policies in a step-wise, critic-free, and explicit-likelihood-free manner. The…
Summary
This paper proposes π-StepNFT, an online reinforcement learning method for flow-based vision-language-action models, which fine-tunes robot policies in a step-wise, critic-free, and explicit-likelihood-free manner. The core idea is: when the exploration space becomes wider, supervision must become finer, so the method uses noisy SDEs to expand exploration and then uses step-wise ranking signals to stabilize alignment.
Problem
- Flow-based VLAs are powerful for robotic control, but under multi-step sampling their action likelihoods are difficult to compute precisely, making standard PPO/policy-gradient-style online RL hard to apply directly.
- Pure ODE sampling explores too narrowly, so the policy can easily get stuck near expert trajectories; once it deviates at test time, recovery is poor. This matters in real manipulation because small errors can accumulate into failure.
- Directly introducing the more stochastic SDE exploration also creates a supervision mismatch: if correction is applied only coarsely at the final output, accumulated noise can make training unstable and worsen alignment.
Approach
- Use SDE sampling instead of pure ODE for action generation during training, injecting structured noise into the denoising process to actively expand the behavioral space the policy can explore.
- Change the supervision target from the final denoised result x0 to the adjacent one-step transition
x_t -> x_t-, i.e., supervise the next small step step by step rather than looking only at the endpoint; this is more local and lower-variance. - Do not train an additional value/critic network and do not compute explicit action likelihoods; instead, use only the Gaussian form of the SDE one-step transition to compare errors against the observed next-step state.
- Construct two mirrored branches around the old policy (positive/negative perturbations), then use a logistic contrastive ranking loss: for successful trajectories, push the “positive branch explains the transition better than the negative branch,” and for failed trajectories do the opposite, achieving a push-pull update.
- Each optimization step requires only a single forward pass, while trust-region-style mirrored perturbations and an EMA rollout policy keep updates stable.
Results
- On LIBERO, the paper claims that π-StepNFT improves over SFT by 32.9%, and emphasizes that it unlocks the potential of flow-based VLAs in few-shot settings.
- In visually diversified OOD scenarios on ManiSkill, the method improves over critic/value-based baselines by 11.1%; the paper attributes this to avoiding critic overfitting to multimodal features.
- The paper also claims competitive few-shot robustness, but the provided excerpt does not include more detailed task-level numbers, dataset splits, or full tables against each specific baseline.
- Strong method-level claims include: no auxiliary value network needed, no explicit likelihood needed, and only one forward pass per optimization step, with the goal of serving complex real-world robotic applications more scalably.
Link
Run your own research radar
Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.