Recoleta Item Note
ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning
Summary
ExecVerify trains code models with verifiable rewards on intermediate program-execution steps instead of only imitating teacher explanation text. It turns “read code and infer execution” into a reinforcement learning problem where answers can be checked step by step, and shows that this capability can also transfer to code generation.
Problem
- Existing training for code execution reasoning mostly relies on supervised fine-tuning (SFT) on teacher-written explanation chains. Because intermediate execution steps cannot be explicitly verified during training, models tend to learn “text imitation” rather than semantic understanding.
- Training data typically lacks controllable difficulty and structural coverage, mixing in samples that are too easy or nearly unsolvable, which hurts small models’ ability to learn the true execution process.
- Weak code execution reasoning further degrades downstream tasks such as code generation, program repair, and semantic understanding, making it a key bottleneck in code intelligence.
Approach
- First, perform constraint-based data synthesis: automatically generate programs around Python built-in types, methods, and control-flow patterns, and construct curriculum-style data with multiple difficulty levels under structural constraints from simple to complex.
- Execute each program to obtain interpreter traces, then automatically generate two types of white-box verifiable questions: next executed statement prediction (control-flow) and variable value/type prediction (data-flow).
- Train the model with white-box reinforcement learning: rewards consider not only whether the final I/O is correct, but also whether intermediate-step questions are answered correctly; the reward function combines final-state correctness with step-level correctness.
- Add reverse output-to-input (O→I) prediction, which requires the model to find an input that produces a given output, reducing reliance on forward pattern matching alone.
- Use two-stage training: the first stage improves execution reasoning; the second stage applies code-generation RL with unit-test rewards, transferring reasoning ability to generating functionally correct programs.
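The trace, question, and reward steps above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`trace_program`, `build_questions`, `white_box_reward`, `check_reverse`), the question format, and the equal 0.5/0.5 weighting are all assumptions made for illustration, since this note does not give the exact reward formula. The sketch uses the standard `sys.settrace` hook to record which line runs next and what the local variables hold.

```python
import sys


def trace_program(func, *args):
    """Run func(*args); record (line offset, locals snapshot) before each line runs."""
    code = func.__code__
    steps = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            # Store type names and reprs so later mutation cannot alter the record.
            snapshot = {k: (type(v).__name__, repr(v)) for k, v in frame.f_locals.items()}
            steps.append((frame.f_lineno - code.co_firstlineno, snapshot))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, steps


def build_questions(steps):
    """Derive control-flow and data-flow questions from consecutive trace steps."""
    questions = []
    for i in range(len(steps) - 1):
        line, _ = steps[i]
        next_line, state_after = steps[i + 1]
        # Control-flow: which line executes after `line` at step i?
        questions.append({"kind": "control", "step": i, "line": line, "answer": next_line})
        # Data-flow: type and value of each variable once step i has executed.
        for name, typed_value in state_after.items():
            questions.append({"kind": "data", "step": i, "var": name, "answer": typed_value})
    return questions


def white_box_reward(final_correct, step_answers, questions, w_final=0.5, w_step=0.5):
    """Hypothetical reward: weighted final correctness plus fraction of step questions right."""
    if not questions:
        return float(final_correct)
    hits = sum(1 for q, a in zip(questions, step_answers) if a == q["answer"])
    return w_final * float(final_correct) + w_step * hits / len(questions)


def check_reverse(func, proposed_args, target_output):
    """Verify an O->I answer by executing the program on the proposed input."""
    return func(*proposed_args) == target_output


def sample(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x
    return total


result, steps = trace_program(sample, [1, 2])
questions = build_questions(steps)
perfect = [q["answer"] for q in questions]
print(white_box_reward(result == 2, perfect, questions))  # 1.0
print(check_reverse(sample, ([4],), 4))  # True
```

Because every answer comes from the interpreter trace, each question can be checked mechanically, which is what makes the stepwise reward verifiable. A reverse O→I answer is checked the same way: any input that reproduces the target output counts as correct, even if it differs from the original input.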
Results
- On code execution reasoning, the 7B base model improves from an average of 60.8 to 80.8 with + SFT + white-box RL, higher than the 76.3 of + SFT + I/O RL and also above the 77.9 average of Qwen2.5-Coder-32B-Instruct.
- In detailed metrics, + SFT + white-box RL reaches CRUXEval-O 85.6, LiveCodeBench-Exec 82.3, REval State 74.5, and REval Path 73.0; compared with the 7B base’s 61.0 / 58.0 / 51.7 / 49.7, this shows clear gains and suggests especially effective learning of white-box intermediate states.
- On code generation, the best two-stage model, + SFT + white-box RL + UT RL, averages 57.1, higher than pure + UT RL at 53.9, + I/O RL + UT RL at 54.6, and + SFT + I/O RL + UT RL at 54.9; the paper claims up to a 5.9-point pass@1 improvement over strong post-training baselines.
- Specific generation metrics include HumanEval+ 84.8, MBPP+ 75.1, LiveCodeBench Hard 5.9, and BigCodeBench Hard 25.7; all improve over the 7B base’s 84.1 / 71.7 / 3.0 / 18.2.
- For data construction, the authors start from 239,992 original samples and 239,466 mutated samples; execution-based filtering retains 201,537 and 191,463 respectively, and difficulty filtering then yields 119,358 training samples. Stage I uses 30K for SFT and 30K for RL.
- Ablations show the two white-box question types are complementary: the full version averages 80.8, control-flow only gets 79.9, and data-flow only gets 78.1.
- On library-related I/O prediction, the 7B base scores 56.0, SFT + I/O RL scores 62.5, and SFT + white-box RL scores 64.7, narrowing the gap to the 32B model’s 70.4.
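The data-construction funnel reported above can be turned into retention rates with a few lines of arithmetic. This sketch assumes the 119,358 figure counts the pooled original and mutated samples; the note does not say so explicitly, and the variable names are illustrative.

```python
# Counts taken from the data-construction bullet above.
generated = 239_992 + 239_466        # original + mutated samples
after_execution = 201_537 + 191_463  # retained by execution-based filtering
after_difficulty = 119_358           # assumed to be the pooled total after difficulty filtering


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(generated, after_execution)        # 479458 393000
print(pct(after_execution, generated))   # 82.0
print(pct(after_difficulty, after_execution))  # 30.4
print(pct(after_difficulty, generated))  # 24.9
```

Under that assumption, execution-based filtering keeps about 82% of generated samples, while difficulty filtering is the stricter stage and keeps about 30% of what remains.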