ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
This paper proposes ReflexiCoder: a code LLM training framework that uses reinforcement learning to internalize the process of “write code first, then self-reflect, then self-correct” into the model parameters, thereby…
Summary
This paper proposes ReflexiCoder: a code LLM training framework that uses reinforcement learning to internalize the process of “write code first, then self-reflect, then self-correct” into the model parameters, thereby reducing reliance on test harnesses, execution environments, or external critics. It targets the single-pass generation bottleneck on complex algorithmic problems, with a focus on autonomous debugging at inference time while using fewer tokens.
Problem
- Existing code LLMs often rely on single forward-pass generation ("System 1"), which tends to produce code that looks plausible but is actually wrong on complex, multi-step algorithmic tasks.
- Existing iterative repair methods usually depend on external oracles, unit tests, execution feedback, or costly multi-round prompting, which are often unavailable in real development settings and incur high latency.
- The key question is: how can a model, during inference without external feedback, identify bugs, reflect, and correct code by itself? This matters because it directly affects the autonomy, deployment cost, and practical usability of coding agents.
Approach
- The core method models code generation as a structured trajectory:
reasoning -> answer -> reflection -> corrected answer, and uses RL to directly optimize the entire “self-debugging trajectory,” not just the first answer, into the model weights. - The authors adopt an RL-zero paradigm: instead of first relying on traditional SFT to teach a fixed reflection template, they let the model autonomously learn reflection-correction patterns suited to its own parameter space through rewards.
- The reward function is compositional: it first requires strict format compliance; it then rewards quality improvement at each round and final test passing, while penalizing too many reflection rounds and additionally encouraging larger improvements in fewer rounds.
- Unlike prior approaches such as CodeRL/PPOCoder that “optimize single-pass generation + execution rewards,” ReflexiCoder optimizes the trajectory of “finding problems and fixing them,” aiming to turn debugging into an intrinsic model capability.
- Training is based on fine-tuning Qwen3-8B into ReflexiCoder-8B, with policy optimization using reflection-aware GRPO; at inference time, it can answer directly in single-pass mode or run in a multi-step reflection mode with a system prompt.
Results
- The authors claim ReflexiCoder-8B establishes a new SOTA in the 1.5B-14B open-source model range; in the single-attempt setting it achieves: HumanEval 94.51%, HumanEval+ 87.20%, MBPP 81.80%, MBPP+ 78.57%, BigCodeBench 35.00%, LiveCodeBench 52.21%, CodeForces 37.34%.
- Under the multi-round reflection setting, results improve further to: HumanEval 95.73%, BigCodeBench 36.84%, LiveCodeBench 54.12%, CodeForces 37.68%; the abstract does not provide multi-round MBPP values.
- In comparison with proprietary models, the paper explicitly reports: on HumanEval+, ReflexiCoder reaches 87.20%, tying GPT-5.1 at 87.20%; on LiveCodeBench, ReflexiCoder at 52.21% exceeds GPT-5.1 at 48.03%; on CodeForces, ReflexiCoder at 37.34% exceeds GPT-5.1 at 34.70%.
- The paper also claims more efficient inference: in multi-round mode, compared with the base model, inference token overhead is reduced by about 40%, with reasoning tokens reduced by nearly 50%.
- At the behavioral level, the authors say the model exhibits a highly disciplined reflection pattern: it almost always performs exactly 1 reflection cycle, implying that its gains do not mainly come from unbounded multi-round sampling, but from learned efficient self-correction.
Link
Run your own research radar
Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.