---
source: arxiv
url: http://arxiv.org/abs/2603.03823v1
published_at: '2026-03-04T08:20:25'
authors:
- Jialong Chen
- Xander Xu
- Hu Wei
- Chuan Chen
- Bing Zhao
topics:
- benchmark
- code-maintenance
- continuous-integration
- llm-agents
- software-engineering
relevance_score: 0.94
run_id: materialize-outputs
language_code: en
---

# SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

## Summary
SWE-CI is a new benchmark for evaluating long-term codebase evolution and maintenance. Rather than only checking whether a one-off fix passes tests, it measures whether agents can continuously maintain code quality across multi-round, continuous-integration-style iterations. The paper’s core contribution is turning “maintainability” into a measurable quantity, constructing 100 tasks from the long-term commit histories of real repositories.

## Problem
- Existing code benchmarks mostly use one-shot, snapshot-style evaluation and only measure functional correctness, making them unable to distinguish between a “temporary patch” and a “design that can evolve over the long term.”
- Real-world software development is primarily about long-term maintenance and requirement iteration, and maintenance costs account for **60%–80%** of total software lifecycle costs. Therefore, evaluating only one-off fixes does not reflect industrial reality.
- There is a lack of repository-level evaluation benchmarks that can explicitly observe the accumulation of technical debt, regression control, and the difficulty of subsequent modifications.

## Approach
- Proposes **SWE-CI**: the first repository-level code maintenance benchmark based on the **Continuous Integration** loop, extracting long-term evolution segments of **base commit → target commit** from real GitHub Python repositories.
- The dataset contains **100 tasks** from **68 repositories**; each task spans an average of **233 days** and **71 consecutive commits**, with at least **500 lines** of source-code changes (excluding tests), emphasizing non-trivial long-term evolution.
- Designs an **Architect–Programmer two-agent protocol**: based on the testing gap between the current code and the target code, the Architect generates no more than **5** high-level incremental requirements; the Programmer then interprets the requirements, plans, and modifies the code, forming up to **20 rounds** of CI-style iteration.
- Introduces two levels of metrics: **normalized change** maps the current code’s test progress/regression, relative to the baseline and target, onto the interval **[-1, 1]**; **EvoScore** computes a future-weighted average over per-round results, assigning higher weights to later iterations, thereby folding long-term maintainability into the score.
- To ensure reproducibility, the paper automatically builds Docker environments for samples and adds a self-repair process for missing dependencies; the pipeline starts from **4,923** repositories, extracts **8,311** candidate evolution spans, filters these down to **1,458** runnable candidates, and finally selects the top **100** tasks.
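The Architect–Programmer loop above can be sketched as follows. This is a hypothetical reading of the protocol, not the paper’s actual API: the `Gap` type, the callable interfaces, and the termination condition are all assumptions; only the round cap (20) and the requirement cap (5) come from the summary.

```python
from dataclasses import dataclass

MAX_ROUNDS = 20        # CI-style iterations per task (from the paper)
MAX_REQUIREMENTS = 5   # high-level requirements the Architect may emit per round


@dataclass
class Gap:
    """Testing gap between the current code and the target commit (assumed shape)."""
    failures: list  # tests that pass at the target commit but fail now


def ci_loop(architect_plan, programmer_apply, run_tests):
    """Run up to MAX_ROUNDS of the two-agent CI loop.

    architect_plan(gap)  -> list of high-level requirement strings
    programmer_apply(reqs) -> edits the codebase in place
    run_tests()          -> Gap for the current state

    Returns the per-round failure counts, ending early once the gap closes.
    """
    history = []
    for _ in range(MAX_ROUNDS):
        gap = run_tests()
        history.append(len(gap.failures))
        if not gap.failures:
            break
        # Architect distills the testing gap into at most 5 requirements;
        # the Programmer interprets them, plans, and modifies the code.
        requirements = architect_plan(gap)[:MAX_REQUIREMENTS]
        programmer_apply(requirements)
    return history
```

In this sketch the loop terminates either when no target tests fail or when the 20-round budget is exhausted, which matches the summary’s description of bounded CI-style iteration.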
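The two metrics can be sketched in Python. The excerpt does not give exact formulas, so both functions below are plausible readings rather than the paper’s definitions: normalized change is assumed to interpolate the current test pass count between baseline and target (positive = progress, negative = regression), and EvoScore is assumed to weight round `t` by `gamma ** (T - 1 - t)` so that later rounds count more.

```python
def normalized_change(passed, baseline, target):
    """Map the current pass count onto [-1, 1] relative to baseline/target.

    Assumed formula: progress above the baseline is scaled by the
    baseline-to-target gap; regression below the baseline is scaled by
    the baseline itself.
    """
    if passed >= baseline:
        span = target - baseline
        return (passed - baseline) / span if span else 1.0
    return (passed - baseline) / baseline if baseline else -1.0


def evo_score(round_scores, gamma=0.9):
    """Future-weighted average of per-round scores (weighting is an assumption).

    With gamma < 1, round t gets weight gamma ** (T - 1 - t), so the last
    round has weight 1 and earlier rounds decay: sustained quality late in
    the CI loop counts more than early wins.
    """
    T = len(round_scores)
    weights = [gamma ** (T - 1 - t) for t in range(T)]
    return sum(w * s for w, s in zip(weights, round_scores)) / sum(weights)
```

For example, with `gamma=0.5` the sequence `[0.2, 0.5, 0.9]` scores about 0.69, well above its plain mean of 0.53, because the strong final round dominates; this is the mechanism by which sweeping γ (as the authors do) shifts the balance between short-term gains and long-term maintainability.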

## Results
- The paper conducts a large-scale evaluation of **18 models from 8 providers**, consuming more than **10 billion tokens** in total; the results show that newer models within the same provider usually outperform older ones, indicating that code maintenance capabilities are steadily improving.
- In overall performance, the authors claim that the **Claude Opus series** is “clearly ahead” throughout the observation period, and **GLM-5** also performs strongly; however, the excerpt does not provide specific EvoScore values.
- In terms of long-term maintenance stability, most models have a **zero-regression rate (the proportion of tasks with no regression in any round) below 0.25**; only **two Claude Opus series models exceed 0.5**, indicating that current models still generally struggle to avoid regressions over long, multi-round maintenance.
- By adjusting the future-weight parameter **γ** in EvoScore, the authors find that different providers show different preferences for “short-term gains vs. long-term maintainability”: **MiniMax, DeepSeek, and GPT** lean more toward long-term gains; **Kimi and GLM** lean more toward short-term gains; **Qwen, Doubao, and Claude** are relatively stable.
- The strongest conclusion is that even though the most advanced models have made clear progress on static repair, they still show a significant gap in **long-term, automated, multi-round codebase maintenance**; SWE-CI can diagnose this gap better than snapshot-style benchmarks.

## Link
- [http://arxiv.org/abs/2603.03823v1](http://arxiv.org/abs/2603.03823v1)
