
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Tags: benchmark, code-maintenance, continuous-integration, llm-agents, software-engineering

SWE-CI is a new benchmark for long-term codebase evolution and maintenance capabilities. Rather than only checking whether a one-off fix passes tests, it evaluates whether agents can continuously maintain code quality through multi-round, continuous-integration-style iterations. The paper’s core contribution is turning “maintainability” into a measurable object and constructing 100 tasks from real repositories’ long-term commit histories.

  • Existing code benchmarks mostly use one-shot, snapshot-style evaluation and only measure functional correctness, making them unable to distinguish between a “temporary patch” and a “design that can evolve over the long term.”
  • Real-world software development is primarily about long-term maintenance and requirement iteration, and maintenance costs account for 60%–80% of total software lifecycle costs. Therefore, evaluating only one-off fixes does not reflect industrial reality.
  • There is a lack of repository-level evaluation benchmarks that can explicitly observe the accumulation of technical debt, regression control, and the difficulty of subsequent modifications.
  • Proposes SWE-CI: the first repository-level code maintenance benchmark based on the Continuous Integration loop, extracting long-term evolution segments of base commit → target commit from real GitHub Python repositories.
  • The dataset contains 100 tasks from 68 repositories; each task spans an average of 233 days and 71 consecutive commits, with at least 500 lines of source-code changes (excluding tests), emphasizing non-trivial long-term evolution.
  • Designs an Architect–Programmer two-agent protocol: based on the testing gap between the current code and the target code, the Architect generates no more than 5 high-level incremental requirements; the Programmer then interprets the requirements, plans, and modifies the code, forming up to 20 rounds of CI-style iteration.
  • Introduces two levels of metrics: normalized change maps the current code’s test progress or regression, relative to the baseline and the target, onto the [-1, 1] interval; EvoScore computes a future-weighted average over per-round results, assigning higher weights to later iterations, thereby folding long-term maintainability into the score.
  • To ensure reproducibility, the paper automatically builds Docker environments for samples and adds a self-repair process for missing dependencies; it ultimately narrows 4,923 repositories to 8,311 candidate spans, then to 1,458 runnable candidates, and finally selects the top 100 tasks.
  • The paper conducts a large-scale evaluation of 18 models from 8 providers, consuming more than 10 billion tokens in total; the results show that newer models within the same provider usually outperform older ones, indicating that code maintenance capabilities are steadily improving.
  • In overall performance, the authors claim that the Claude Opus series is “clearly ahead” throughout the observation period, and GLM-5 also performs strongly; however, the excerpt does not provide specific EvoScore values.
  • In terms of long-term maintenance stability, most models have a zero-regression rate (the proportion of tasks with no regression throughout) below 0.25; only two Claude Opus series models exceed 0.5, indicating that current models still generally struggle to avoid regressions reliably in long-term, multi-round maintenance.
  • By adjusting the future-weight parameter γ in EvoScore, the authors find that different providers show different preferences for “short-term gains vs. long-term maintainability”: MiniMax, DeepSeek, and GPT lean more toward long-term gains; Kimi and GLM lean more toward short-term gains; Qwen, Doubao, and Claude are relatively stable.
  • The strongest conclusion is that even though the most advanced models have made clear progress on static repair, they still show a significant gap in long-term, automated, multi-round codebase maintenance; SWE-CI can diagnose this gap better than snapshot-style benchmarks.
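The two metrics in the bullets above can be sketched in code. The exact formulas are not given in this summary, so both functions below are illustrative reconstructions, not the paper’s definitions: `normalized_change` is one plausible way to map test results onto [-1, 1] (progress scaled by the base→target gap, regressions scaled against the baseline), and `evoscore` assumes geometric weights `gamma ** (T - 1 - t)`, which give rounds closer to the end weights closer to 1.

```python
def normalized_change(passed_base, passed_curr, passed_target):
    """Map current test results onto [-1, 1] relative to baseline and target.

    Hypothetical formula: the summary only states the metric's range, so here
    progress is scaled by the base->target gap and regressions by the baseline.
    """
    if passed_curr >= passed_base:
        gap = passed_target - passed_base
        return (passed_curr - passed_base) / gap if gap else 0.0
    return (passed_curr - passed_base) / passed_base if passed_base else -1.0


def evoscore(round_scores, gamma=0.9):
    """Future-weighted average of per-round scores (assumed weighting scheme).

    With gamma < 1, the weight gamma**(T-1-t) approaches 1 for the final round,
    so late-round regressions hurt the score more than early ones.
    """
    T = len(round_scores)
    weights = [gamma ** (T - 1 - t) for t in range(T)]
    return sum(w * s for w, s in zip(weights, round_scores)) / sum(weights)
```

Under this weighting, γ near 1 treats all rounds almost equally, while smaller γ concentrates the score on the final rounds, which matches the paper’s use of γ as a dial between short-term gains and long-term maintainability.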
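The zero-regression rate from the stability bullet can likewise be sketched. The paper’s exact criterion isn’t quoted here; one plausible reading is that a task counts as regression-free when its per-round normalized change never drops below zero:

```python
def zero_regression_rate(per_task_rounds):
    """Fraction of tasks whose per-round normalized change never goes
    negative, i.e. the code never falls below the baseline in any round
    (one plausible reading of "no regression throughout")."""
    if not per_task_rounds:
        return 0.0
    clean = sum(1 for rounds in per_task_rounds if all(s >= 0 for s in rounds))
    return clean / len(per_task_rounds)
```

By this reading, a single dip during the up-to-20-round loop disqualifies the whole task, which is consistent with most models landing below 0.25 even when their final scores are decent.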