---
source: arxiv
url: http://arxiv.org/abs/2603.05026v1
published_at: '2026-03-05T10:15:13'
authors:
- Kenan Li
- Rongzhi Li
- Linghao Zhang
- Qirui Jin
- Liao Zhu
- Xiaosong Huang
- Geng Zhang
- Yikai Zhang
- Shilin He
- Chengxing Xie
- Xin Zhang
- Zijian Jin
- Bowen Li
- Chaoyun Zhang
- Yu Kang
- Yufan Huang
- Elsie Nallipogu
- Saravan Rajmohan
- Qingwei Lin
- Dongmei Zhang
topics:
- repository-build
- test-automation
- llm-agents
- software-benchmarking
- cross-platform
- multilingual-code
relevance_score: 0.95
run_id: materialize-outputs
language_code: en
---

# RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

## Summary
RepoLaunch is an automation agent that builds and tests repositories across arbitrary programming languages and operating systems, targeting the labor-intensive task of “getting a code repository running.” It further extends this capability into an automated pipeline for generating SWE benchmarks and training data.

## Problem
- Installing dependencies, compiling, running tests, and handling platform differences vary widely across repositories, and documentation is often incomplete, so setting up an executable environment depends heavily on repeated manual trial and error.
- SWE evaluation and training increasingly require large-scale, executable, reproducible sandboxes, but manually preparing build/test environments for massive numbers of repositories does not scale.
- Previous methods were mostly limited to **Python/Linux** or rule-based templates, whereas real GitHub repositories span many languages, frameworks, and platforms; GitTaskBench reports that in about **65%** of cases the agent cannot even set up the environment.

## Approach
- Proposes a three-stage, multi-agent workflow: **Preparation → Build → Release**. Preparation first scans repository files, selects an appropriate base image, and injects language-specific build/test prompts.
- The **Setup Agent** freely executes shell commands in a container and can use WebSearch to look up external information, attempting to install dependencies, compile the project, and find regression tests; if “most tests pass,” it hands off to the **Verify Agent**.
- The **Verify Agent** reviews command history and test results to avoid Setup Agent hallucinations; if verification fails, it rolls back and retries. After success, it commits the image to form a reusable environment.
- The **Organize Agent** distills minimal reconstruction commands, test commands, and a test log parser from historical execution traces; it prioritizes structured outputs such as JSON/XML and can optionally generate per-test commands and a Dockerfile.
- At the simplest level, its core mechanism is: **first let an agent try to get the repository running like an engineer would, then let another agent verify the result, and finally compress the successful experience into repeatedly executable scripts and parsers**.
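The three-stage loop above can be sketched as plain Python. This is a toy illustration only: in the paper each role is an LLM agent issuing shell commands inside a container, whereas here `run_commands`, `verify_trace`, and `organize_trace` are hypothetical helper names and the executor is a stub.

```python
# Toy sketch of RepoLaunch's Setup -> Verify -> Organize loop.
# All function names are hypothetical stand-ins for the agents
# described above, not APIs from the paper.

MAX_RETRIES = 3

def run_commands(executor, commands):
    """Setup phase: execute build/test commands and record a trace."""
    return [(cmd, executor(cmd)) for cmd in commands]

def verify_trace(trace):
    """Verify phase: inspect the recorded results rather than trusting
    the setup step's own success claim (guards against hallucination)."""
    return all(result["exit_code"] == 0 for _, result in trace)

def organize_trace(trace):
    """Release phase: distill the successful trace into the minimal
    commands needed to rebuild the environment and rerun the tests."""
    return {
        "build_cmds": [cmd for cmd, _ in trace if "test" not in cmd],
        "test_cmds": [cmd for cmd, _ in trace if "test" in cmd],
    }

def launch(executor, commands):
    for _ in range(MAX_RETRIES):
        trace = run_commands(executor, commands)
        if verify_trace(trace):
            return organize_trace(trace)   # reusable recipe
    raise RuntimeError("environment setup failed after retries")

# Stub executor standing in for shell execution inside a container.
def fake_executor(cmd):
    return {"exit_code": 0, "output": f"ran: {cmd}"}

recipe = launch(fake_executor, ["pip install -e .", "pytest -q"])
print(recipe["test_cmds"])   # ['pytest -q']
```

The key design point mirrored here is the separation of concerns: the verification step looks only at the recorded trace, and only a verified trace is compressed into a reusable recipe.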
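For the Organize Agent's test-log parser, a minimal sketch of what such a parser might look like is shown below, assuming JUnit-style XML as the structured format (the paper says structured JSON/XML outputs are preferred over raw logs; the function name and status labels are ours):

```python
# Minimal sketch of a distilled test-log parser for JUnit-style XML.
# The format assumption and field names are illustrative, not from
# the paper.

import xml.etree.ElementTree as ET

def parse_junit(xml_text):
    """Map each test case to PASSED / FAILED / SKIPPED."""
    root = ET.fromstring(xml_text)
    statuses = {}
    for case in root.iter("testcase"):
        name = f"{case.get('classname')}.{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            statuses[name] = "FAILED"
        elif case.find("skipped") is not None:
            statuses[name] = "SKIPPED"
        else:
            statuses[name] = "PASSED"
    return statuses

report = """<testsuite tests="2">
  <testcase classname="pkg.Calc" name="test_add"/>
  <testcase classname="pkg.Calc" name="test_div">
    <failure message="ZeroDivisionError"/>
  </testcase>
</testsuite>"""

print(parse_junit(report))
# {'pkg.Calc.test_add': 'PASSED', 'pkg.Calc.test_div': 'FAILED'}
```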

## Results
- RepoLaunch demonstrates cross-platform capability across **9** language/platform settings, reporting an overall repository build success rate of about **70%** on both **Linux and Windows**.
- Automated dataset creation results (Table 1): Python **906/1200 = 75.5%** build success; C/C++ **297/400 = 74.3%**; C# **269/350 = 76.9%**; Java **267/350 = 76.3%**; JS/TS **483/700 = 69.0%**; Go **211/350 = 60.3%**; Rust **259/350 = 74.0%**; Windows overall **258/400 = 64.5%**.
- Retention in the Release stage, measured on instances that had already built successfully, is also relatively high: C/C++ **261/297 = 87.9%**; C# **206/269 = 76.6%**; Java **203/267 = 76.0%**; JS/TS **422/483 = 87.3%**; Go **182/211 = 86.3%**; Rust **216/259 = 83.4%**; Windows **206/258 = 79.8%**.
- RepoLaunch supported generation of **SWE-bench-Live/MultiLang**: a total of **413 tasks, 234 repositories**, exceeding SWE-bench-Multilingual’s **300 tasks, 41 repositories**; it also built **SWE-bench-Live/Windows**, sampling **400** evaluations from **507** Windows-specific issues.
- On the MultiLang benchmark generated by RepoLaunch, existing agent+LLM combinations still have relatively low overall success rates: on Linux the best average is about **28.4%** (SWE-agent+Claude-4.5, ClaudeCode+GPT-5.2, and ClaudeCode+Claude-4.5 are all around **28.4%**); single-language bests include **44.1%** on Go (ClaudeCode+Claude-4.5) and **43.8%** on C/C++ (SWE-agent/OpenHands + Claude-4.5).
- On the Windows benchmark, the best Win-agent + Claude-4.5 achieves a **30.0%** solve rate, GPT-5.2 **20.0%**, Gemini-3 **16.0%**, and DeepSeek-V3.1 **20.0%**. The paper uses this to emphasize: **cross-platform repository building can now be automated at a practical scale, but truly end-to-end solving of SWE tasks remains difficult**.
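The build-success and retention counts above can be combined into an end-to-end yield (retained instances over attempted repositories). This derived metric is our own arithmetic over the quoted counts, not a figure the paper necessarily reports; Python is omitted because no retention count is listed for it above.

```python
# End-to-end yield from the Table 1 counts quoted above:
# (attempted repos, build successes, Release-stage retained).
table = {
    "C/C++":   (400, 297, 261),
    "C#":      (350, 269, 206),
    "Java":    (350, 267, 203),
    "JS/TS":   (700, 483, 422),
    "Go":      (350, 211, 182),
    "Rust":    (350, 259, 216),
    "Windows": (400, 258, 206),
}

yields = {
    lang: {
        "build": built / attempted,          # build success rate
        "retention": retained / built,       # Release-stage retention
        "end_to_end": retained / attempted,  # combined yield
    }
    for lang, (attempted, built, retained) in table.items()
}

for lang, y in yields.items():
    print(f"{lang:8s} build {y['build']:.1%}  "
          f"retained {y['retention']:.1%}  end-to-end {y['end_to_end']:.1%}")
```

For example, C/C++ ends up at 261/400 = 65.25% end-to-end, i.e. roughly two thirds of attempted repositories survive both the Build and Release stages.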

## Link
- [http://arxiv.org/abs/2603.05026v1](http://arxiv.org/abs/2603.05026v1)
