Code agents are entering more realistic software engineering evaluation
Evaluation of code agents is moving beyond the comfort zone of “single-repo bug fixing.” BeyondSWE expands the task set to cross-repo work, tasks that require domain knowledge, dependency migration, and generating repositories from documentation. On it, the best current average performance is only 41.82%, far below the 80%+ commonly seen on traditional SWE benchmarks. SearchSWE further shows that external search does not deliver a stable gain: search and coding are not yet truly integrated.
Representative sources
- BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? — Guoxin Chen; Fanzhe Meng; Jiale Zhao; Minghao Li; Daixuan Cheng; Huatong Song; …