Recoleta Item Note

RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

repository-buildtest-automation · llm-agents · software-benchmarking · cross-platform · multilingual-code

RepoLaunch is an automation agent for building and testing repositories across arbitrary programming languages and operating systems, aimed at automating the high-human-cost task of “getting a code repository running.” It further extends this capability into an automated pipeline for generating SWE benchmarks/training data.

  • Installing dependencies, compiling, running tests, and handling platform differences vary greatly across code repositories, and documentation is often incomplete, making executable environment setup highly dependent on repeated manual trial and error.
  • Existing SWE evaluation and training increasingly require large-scale, executable, reproducible experimental sandboxes, but manually preparing build/test environments for massive numbers of repositories does not scale.
  • Previous methods were mostly limited to Python/Linux or rule-based templates, whereas real GitHub repositories span many languages, frameworks, and platforms; GitTaskBench reports that in about 65% of cases the agent cannot even set up the environment.
  • The paper proposes a three-stage, multi-agent workflow: Preparation → Build → Release. The Preparation stage first scans repository files, selects an appropriate base image, and injects language-specific build/test prompts.
  • The Setup Agent freely executes shell commands in a container and can use WebSearch to look up external information, attempting to install dependencies, compile the project, and find regression tests; if “most tests pass,” it hands off to the Verify Agent.
  • The Verify Agent reviews command history and test results to avoid Setup Agent hallucinations; if verification fails, it rolls back and retries. After success, it commits the image to form a reusable environment.
  • The Organize Agent distills minimal reconstruction commands, test commands, and a test log parser from historical execution traces; it prioritizes structured outputs such as JSON/XML and can optionally generate per-test commands and a Dockerfile.
  • At the simplest level, its core mechanism is: first let an agent try to get the repository running like an engineer would, then let another agent verify the result, and finally compress the successful experience into repeatedly executable scripts and parsers.
  • RepoLaunch demonstrates cross-platform capability across 9 languages/settings; the paper reports an overall repository build success rate of about 70%, covering both Linux and Windows.
  • Automated dataset creation results (Table 1): Python 906/1200 = 75.5% build success; C/C++ 297/400 = 74.3%; C# 269/350 = 76.9%; Java 267/350 = 76.3%; JS/TS 483/700 = 69.0%; Go 211/350 = 60.3%; Rust 259/350 = 74.0%; Windows overall 258/400 = 64.5%.
  • The retention rate in the Release stage on instances that had already built successfully is also relatively high: C/C++ 261/297 = 87.9%; C# 206/269 = 76.6%; Java 203/267 = 76.0%; JS/TS 422/483 = 87.3%; Go 182/211 = 86.3%; Rust 216/259 = 83.4%; Windows 206/258 = 79.8%.
  • RepoLaunch supported generation of SWE-bench-Live/MultiLang: 413 tasks across 234 repositories, exceeding SWE-bench-Multilingual’s 300 tasks across 41 repositories; it also built SWE-bench-Live/Windows, sampling 400 instances for evaluation from 507 Windows-specific issues.
  • On the MultiLang benchmark generated by RepoLaunch, existing agent+LLM combinations still have relatively low overall success rates: on Linux the best average is about 28.4% (SWE-agent+Claude-4.5, ClaudeCode+GPT-5.2, and ClaudeCode+Claude-4.5 are all around 28.4%); single-language bests include 44.1% on Go (ClaudeCode+Claude-4.5) and 43.8% on C/C++ (SWE-agent/OpenHands + Claude-4.5).
  • On the Windows benchmark, the best combination, Win-agent + Claude-4.5, achieves a 30.0% solve rate; GPT-5.2 reaches 20.0%, Gemini-3 16.0%, and DeepSeek-V3.1 20.0%. The paper’s takeaway: cross-platform repository building can now be automated at practical scale, but truly end-to-end solving of SWE tasks remains difficult.
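The Preparation stage’s base-image selection can be sketched roughly as follows. This is an illustrative assumption, not the paper’s actual implementation: the extension-to-language and language-to-image tables here are invented for the example.

```python
# Hypothetical sketch of RepoLaunch's Preparation step: scan repository files,
# guess the dominant language by extension counts, and map it to a base image.
from collections import Counter
from pathlib import Path

# Assumed lookup tables (illustrative, not from the paper).
EXT_LANG = {".py": "python", ".go": "go", ".rs": "rust", ".java": "java",
            ".cs": "csharp", ".c": "c", ".cpp": "c", ".ts": "js", ".js": "js"}
LANG_IMAGE = {"python": "python:3.11", "go": "golang:1.22", "rust": "rust:1.79",
              "java": "eclipse-temurin:21",
              "csharp": "mcr.microsoft.com/dotnet/sdk:8.0",
              "c": "gcc:13", "js": "node:20"}

def pick_base_image(repo_root: str) -> str:
    """Count source files by language and return a matching container image."""
    counts = Counter(
        EXT_LANG[p.suffix]
        for p in Path(repo_root).rglob("*")
        if p.is_file() and p.suffix in EXT_LANG
    )
    if not counts:
        return "ubuntu:22.04"  # generic fallback when no known sources found
    lang, _ = counts.most_common(1)[0]
    return LANG_IMAGE[lang]
```

In the real system this decision would also feed language-specific build/test prompts into the Setup Agent; here only the image choice is shown.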
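The Setup → Verify handoff with rollback-and-retry reduces, at its core, to a retry scaffold like the one below. All names are hypothetical; in the actual system “setup” drives shell commands in a container, “verify” re-reads the command history and test results, and “rollback” restores a container snapshot.

```python
# Illustrative sketch (not the paper's code) of the Setup Agent / Verify Agent
# loop: run setup until the verifier accepts the evidence; on rejection, roll
# back to a clean snapshot and try again, up to a bounded number of attempts.
from typing import Callable, Optional

def launch_with_verification(
    setup: Callable[[], dict],       # runs build/test commands, returns a trace
    verify: Callable[[dict], bool],  # independently re-checks the trace
    rollback: Callable[[], None],    # restores the container snapshot
    max_attempts: int = 3,
) -> Optional[dict]:
    for _ in range(max_attempts):
        trace = setup()
        if verify(trace):
            return trace             # success: the image would be committed here
        rollback()                   # discard partial state before retrying
    return None                      # give up after max_attempts failures
```

The point of the second agent is captured by the separate `verify` callable: it never trusts the setup step’s own claim of success, only the recorded evidence.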
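The Organize Agent’s “prefer structured output” idea for test-log parsing can be illustrated with a minimal parser. The JSON schema and the `PASS`/`FAIL` plain-text format below are invented for the example; the paper only says structured JSON/XML output is prioritized over free-form text.

```python
# Sketch of a distilled test-log parser: try a structured JSON report first,
# then fall back to regex scanning of free-form text output.
import json
import re

def parse_test_log(log: str) -> dict:
    # Preferred path: a JSON report with a "tests" list (assumed schema).
    try:
        report = json.loads(log)
        return {t["name"]: t["status"] for t in report["tests"]}
    except (ValueError, KeyError, TypeError):
        pass  # not structured output; fall through to text scanning
    # Fallback path: lines of the form "PASS test_name" / "FAIL test_name".
    results = {}
    for status, name in re.findall(r"^(PASS|FAIL)\s+(\S+)", log, re.MULTILINE):
        results[name] = "passed" if status == "PASS" else "failed"
    return results
```

A parser distilled this way can be replayed against future test runs of the same repository without an LLM in the loop.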
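As a quick arithmetic check, the Table 1 percentages quoted above can be recomputed from the raw counts. Note that Python’s `round` uses banker’s rounding, so an exact half such as 297/400 = 74.25% may land 0.1 away from a paper that rounds halves up.

```python
# Recompute build success rates from the (successes, attempts) counts
# reported in the summary's Table 1 figures.
TABLE1 = {
    "Python": (906, 1200), "C/C++": (297, 400), "C#": (269, 350),
    "Java": (267, 350), "JS/TS": (483, 700), "Go": (211, 350),
    "Rust": (259, 350), "Windows": (258, 400),
}

def success_rate(ok: int, total: int) -> float:
    """Percentage of successful builds, rounded to one decimal place."""
    return round(100 * ok / total, 1)

rates = {lang: success_rate(*counts) for lang, counts in TABLE1.items()}
```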