
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Tags: benchmark, code-maintenance, continuous-integration, llm-agents, software-engineering

SWE-CI is a new benchmark for long-term codebase evolution and maintenance capabilities. Rather than only checking whether a one-off fix passes tests, it evaluates whether agents can continuously maintain code quality through multi-round, continuous-integration-style iterations. The paper’s core contribution is turning “maintainability” into a measurable object and constructing 100 tasks from real repositories’ long-term commit histories.

  • Existing code benchmarks mostly use one-shot, snapshot-style evaluation and only measure functional correctness, making them unable to distinguish between a “temporary patch” and a “design that can evolve over the long term.”
  • Real-world software development is primarily about long-term maintenance and requirement iteration, and maintenance costs account for 60%–80% of total software lifecycle costs. Therefore, evaluating only one-off fixes does not reflect industrial reality.
  • There is a lack of repository-level evaluation benchmarks that can explicitly observe the accumulation of technical debt, regression control, and the difficulty of subsequent modifications.
  • Proposes SWE-CI: the first repository-level code maintenance benchmark based on the Continuous Integration loop, extracting long-term evolution segments of base commit → target commit from real GitHub Python repositories.
  • The dataset contains 100 tasks from 68 repositories; each task spans an average of 233 days and 71 consecutive commits, with at least 500 lines of source-code changes (excluding tests), emphasizing non-trivial long-term evolution.
  • Designs an Architect–Programmer two-agent protocol: based on the testing gap between the current code and the target code, the Architect generates no more than 5 high-level incremental requirements; the Programmer then interprets the requirements, plans, and modifies the code, forming up to 20 rounds of CI-style iteration.
  • Introduces two levels of metrics: normalized change maps the current code’s test progress or regression, relative to the baseline and the target, onto the [-1, 1] interval; EvoScore computes a future-weighted average over per-round results, assigning higher weights to later iterations, thereby folding long-term maintainability into the score.
  • To ensure reproducibility, the paper automatically builds Docker environments for samples and adds a self-repair process for missing dependencies; it ultimately narrows 4,923 repositories to 8,311 candidate spans, then to 1,458 runnable candidates, and finally selects the top 100 tasks.
  • The paper conducts a large-scale evaluation of 18 models from 8 providers, consuming more than 10 billion tokens in total; the results show that newer models within the same provider usually outperform older ones, indicating that code maintenance capabilities are steadily improving.
  • In overall performance, the authors claim that the Claude Opus series is “clearly ahead” throughout the observation period, and GLM-5 also performs strongly; however, the excerpt does not provide specific EvoScore values.
  • In terms of long-term maintenance stability, most models have a zero-regression rate (the proportion of tasks with no regression throughout) below 0.25; only two Claude Opus series models exceed 0.5, indicating that current models still generally struggle to avoid regressions reliably in long-term, multi-round maintenance.
  • By adjusting the future-weight parameter γ in EvoScore, the authors find that different providers show different preferences for “short-term gains vs. long-term maintainability”: MiniMax, DeepSeek, and GPT lean more toward long-term gains; Kimi and GLM lean more toward short-term gains; Qwen, Doubao, and Claude are relatively stable.
  • The strongest conclusion is that even though the most advanced models have made clear progress on static repair, they still show a significant gap in long-term, automated, multi-round codebase maintenance; SWE-CI can diagnose this gap better than snapshot-style benchmarks.
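The two metrics in the bullets above can be sketched in code. The exact formulas are not given in this summary, so both functions below are illustrative reconstructions, not the paper’s definitions: `normalized_change` is one plausible way to map test results onto [-1, 1] (progress scaled by the base→target gap, regressions scaled against the baseline), and `evoscore` assumes geometric weights `gamma ** (T - 1 - t)`, which give rounds closer to the end weights closer to 1.

```python
def normalized_change(passed_base, passed_curr, passed_target):
    """Map current test results onto [-1, 1] relative to baseline and target.

    Hypothetical formula: the summary only states the metric's range, so here
    progress is scaled by the base->target gap and regressions by the baseline.
    """
    if passed_curr >= passed_base:
        gap = passed_target - passed_base
        return (passed_curr - passed_base) / gap if gap else 0.0
    return (passed_curr - passed_base) / passed_base if passed_base else -1.0


def evoscore(round_scores, gamma=0.9):
    """Future-weighted average of per-round scores (assumed weighting scheme).

    With gamma < 1, the weight gamma**(T-1-t) approaches 1 for the final round,
    so late-round regressions hurt the score more than early ones.
    """
    T = len(round_scores)
    weights = [gamma ** (T - 1 - t) for t in range(T)]
    return sum(w * s for w, s in zip(weights, round_scores)) / sum(weights)
```

Under this weighting, γ near 1 treats all rounds almost equally, while smaller γ concentrates the score on the final rounds, which matches the paper’s use of γ as a dial between short-term gains and long-term maintainability.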
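The zero-regression rate from the stability bullet can likewise be sketched. The paper’s exact criterion isn’t quoted here; one plausible reading is that a task counts as regression-free when its per-round normalized change never drops below zero:

```python
def zero_regression_rate(per_task_rounds):
    """Fraction of tasks whose per-round normalized change never goes
    negative, i.e. the code never falls below the baseline in any round
    (one plausible reading of "no regression throughout")."""
    if not per_task_rounds:
        return 0.0
    clean = sum(1 for rounds in per_task_rounds if all(s >= 0 for s in rounds))
    return clean / len(per_task_rounds)
```

By this reading, a single dip during the up-to-20-round loop disqualifies the whole task, which is consistent with most models landing below 0.25 even when their final scores are decent.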