Noise budgets and comment routing layers for code review agents
A “comment routing and threshold control” layer could be built for engineering teams that use code review agents: instead of optimizing directly for more review comments, classify agent outputs as Bug Hit, Valid Suggestion, or Noise, and decide dynamically whether to post each comment automatically or hold it as a background suggestion, based on repository risk level, PR size, and historical acceptance rate. This targets the current pain point more directly than further model scaling, because what teams actually lack is production-ready noise governance.
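The routing decision above can be sketched as a small policy function. Everything here is illustrative: the class names, the linear threshold formula, and all coefficients are assumptions for the sketch, not something CR-Bench or any existing product defines.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    BUG_HIT = "bug_hit"
    VALID_SUGGESTION = "valid_suggestion"
    NOISE = "noise"

class Route(Enum):
    POST = "post"              # comment directly on the PR
    BACKGROUND = "background"  # keep as a ranked background suggestion
    DROP = "drop"

@dataclass
class ReviewComment:
    verdict: Verdict
    confidence: float  # classifier confidence in [0, 1]

def route_comment(comment: ReviewComment,
                  repo_risk: float,              # 0 (low) .. 1 (high-risk repo)
                  pr_size_loc: int,              # lines changed in the PR
                  historical_acceptance: float,  # past acceptance rate of posted comments
                  ) -> Route:
    """Decide whether to post a comment, park it, or drop it.

    The threshold formula is a hypothetical example of combining the
    three signals named in the text; real weights would be tuned offline.
    """
    if comment.verdict is Verdict.NOISE:
        return Route.DROP
    # Raise the bar on large PRs (reviewer attention is scarce),
    # lower it on risky repos (a missed bug is costlier).
    threshold = 0.5 + 0.2 * min(pr_size_loc / 1000, 1.0) - 0.2 * repo_risk
    # Teams whose past comments were rarely accepted get a stricter gate.
    threshold += 0.1 * (0.5 - historical_acceptance)
    if comment.verdict is Verdict.BUG_HIT and comment.confidence >= threshold:
        return Route.POST
    return Route.BACKGROUND
```

Valid suggestions default to the background lane here; only confident bug hits are posted automatically, which is one plausible way to protect SNR.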
Real PR-level evaluation has already shown that the main barrier to deploying code review agents is not failing to find more issues but generating so much noise that teams are unwilling to turn them on. This makes a process control layer a better opportunity right now than yet another general-purpose review agent.
Evaluation criteria have shifted from a single detection rate to developer acceptability: CR-Bench incorporates Usefulness Rate and signal-to-noise ratio (SNR) as core metrics, and quantifies the real tradeoff in which Reflexion improves recall but significantly lowers SNR.
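For concreteness, the two metrics can be computed from comment counts roughly as follows. These formulas are plausible definitions for the sketch, not CR-Bench's exact ones; consult the paper for the precise definitions.

```python
def usefulness_rate(useful_comments: int, total_comments: int) -> float:
    """Fraction of posted comments that developers judged useful."""
    return useful_comments / total_comments if total_comments else 0.0

def signal_to_noise(signal_comments: int, noise_comments: int) -> float:
    """Ratio of signal comments (bug hits + valid suggestions) to noise."""
    return signal_comments / noise_comments if noise_comments else float("inf")
```

Under these definitions, a recall-boosting strategy like Reflexion can raise `signal_comments` while raising `noise_comments` even faster, which is exactly how recall and SNR come apart.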
Select a mid-sized engineering team already using a code review agent, replay its most recent 200 PRs offline, and compare three strategies: posting all comments, posting only high-confidence comments, and surfacing comments as a background ranked list. Use comment acceptance rate, change in review duration, and developers’ subjective burden to verify whether routing outperforms the status quo.
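The offline replay could be harnessed as below. The data shape (comment dicts with `confidence` and a historical `accepted` flag) and the strategy predicates are assumptions for the sketch; a real replay would read the team's logged agent output and review history.

```python
def replay_acceptance(prs, post_filter):
    """Replay logged agent comments over past PRs under a posting strategy.

    prs: list of PRs, each a list of comment dicts with keys
         'confidence' (float) and 'accepted' (bool, from historical logs).
    post_filter: predicate deciding whether a comment would have been posted.
    Returns (number of comments posted, acceptance rate among them).
    """
    posted = accepted = 0
    for comments in prs:
        for c in comments:
            if post_filter(c):
                posted += 1
                accepted += c["accepted"]
    rate = accepted / posted if posted else 0.0
    return posted, rate

# The three strategies from the text, as posting predicates:
strategies = {
    "full": lambda c: True,                               # post every comment
    "high_confidence": lambda c: c["confidence"] >= 0.8,  # threshold gate
    "background": lambda c: False,                        # nothing auto-posted
}
```

Running all three over the same 200 replayed PRs makes the tradeoff visible directly: the high-confidence strategy should post fewer comments at a higher acceptance rate, and the gap between the two quantifies the noise the routing layer removes.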
- CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
- CR-Bench shows that code review agents cannot be evaluated by recall alone; Usefulness and SNR must also be considered, and in real PR environments high recall often comes with high noise.
- The paper emphasizes that code review lacks clear pass/fail signals like compilation/testing, and the cost of false positives directly harms developer adoption.