Trend brief · 2026-03-03

Code agents are shifting from “can write” to “can verify, collaborate, and ship”



Today’s software engineering work is highly concentrated: people are no longer just comparing who can write code better, but are filling in the gaps that keep code agents from handling real tasks, closed-loop verification, and production deployment.

Main observations

  • Evaluation is getting harder. BeyondSWE expands tasks from local fixes in a single repository to cross-repo work, domain knowledge, dependency migration, and generating repositories from documentation. The results show that current models still have relatively low success rates on more realistic tasks.
  • Verification is moving earlier. From compilable skeletons and probabilistic regression testing to change-aware differential GUI testing, the research focus is shifting from “generate an answer” to “prove it isn’t broken.”

Code agents are entering more realistic software engineering evaluation

Evaluation of code agents is clearly moving out of the comfort zone of single-repo bug fixing. BeyondSWE expands tasks to cross-repo work, domain knowledge, dependency migration, and generating repositories from documentation, and the best average performance there is only about 41.82%, far below the 80%+ commonly seen on traditional SWE benchmarks. SearchSWE likewise shows that external search is not a reliable gain: search and coding have not yet been truly integrated.


Closed-loop verification is becoming the main battleground in agent development

Several works this period focus on “making systems verifiable first, then making generation stronger.” His2Trans first recovers build context and sets up a compilable skeleton before translating functions incrementally; AgentAssay turns testing for nondeterministic agents into probabilistic regression testing with statistical guarantees; RippleGUItester performs differential GUI exploratory testing around code changes. The shared signal is that verification, compilation, and regression detection are becoming core infrastructure for agent development.
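AgentAssay’s actual procedure is not reproduced here, but the core move of probabilistic regression testing — treating a nondeterministic agent’s pass rate as a binomial sample and flagging a regression only when the observed failures are improbable under the baseline rate — can be sketched in a few lines. Function names, the 0.9 baseline, and the run counts below are all illustrative:

```python
import math

def binom_tail(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p): the chance of seeing at most
    k passes in n runs if the true pass rate were still p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def regression_suspected(passes, runs, baseline_rate, alpha=0.05):
    """One-sided test: flag a regression only when the observed pass
    count would be improbably low (< alpha) under the baseline rate."""
    return binom_tail(passes, runs, baseline_rate) < alpha

# Baseline pass rate 0.9; a change passes only 12 of 20 reruns.
print(regression_suspected(12, 20, 0.9))   # True: likely a real regression
# 18 of 20 is within normal run-to-run noise at a 0.9 pass rate.
print(regression_suspected(18, 20, 0.9))   # False
```

The point of the statistical gate is exactly what the paragraph describes: a single failing rerun of a nondeterministic agent proves nothing, so the regression signal has to come with a controlled false-alarm rate.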


Multi-model programming is shifting from piling on workflow steps to optimizing interaction order

In code generation, more complex pipelines are not necessarily better. Review Beats Planning finds that in dual-model collaboration, “review-then-fix” outperforms “plan-then-code,” reaching 90.2% pass@1 on HumanEval+, while plan-then-code actually falls below the single code-model baseline. This suggests multi-model system design is shifting from “more steps means stronger” to getting the interaction order right.
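The two interaction orders compared above can be sketched as control flow. Review Beats Planning’s actual prompts and protocol are not reproduced here; `coder`, `reviewer`, and `planner` are stand-ins for LLM calls, stubbed so the skeleton is runnable:

```python
# Illustrative sketch of the two dual-model interaction orders.

def plan_then_code(task, planner, coder):
    """Planner decomposes the task first; coder implements the plan."""
    plan = planner(task)
    return coder(f"{task}\n# plan:\n{plan}")

def review_then_fix(task, coder, reviewer):
    """Coder drafts first; reviewer critiques; coder revises once."""
    draft = coder(task)
    feedback = reviewer(task, draft)
    if not feedback:          # reviewer found nothing to fix
        return draft
    return coder(f"{task}\n# previous draft:\n{draft}\n# feedback:\n{feedback}")

# Stub "models", just to exercise the control flow:
coder = lambda prompt: f"def solve():  # from prompt of {len(prompt)} chars"
reviewer = lambda task, draft: "handle the empty-input case"
planner = lambda task: "1. parse input 2. compute 3. return result"

print(review_then_fix("sum a list of ints", coder, reviewer))
```

The structural difference is where the second model’s signal lands: review-then-fix conditions the revision on a concrete draft, while plan-then-code conditions generation on an unverified plan.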


Agent deployment is shifting toward environment isolation, permission control, and remote execution

Both engineering practice articles and systems architecture work are emphasizing that deployed agents need isolated execution environments, stable permission boundaries, and faster verification infrastructure. Worktree-based parallel development, remote Bazel runners, and tool-level authorization based on user intent map onto three implementation concerns: concurrent development, build verification, and security control. Few of these have a unified benchmark yet, but the direction is consistent: moving agents from “can write” to “can run safely and deliver continuously.”
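The standard mechanism behind worktree-based parallel development is `git worktree`: each agent task gets its own checkout and branch while all of them share a single object store, so concurrent edits never collide in the working directory. A minimal sketch (the throwaway repo, paths, and branch names are illustrative):

```shell
# Set up a throwaway repo, then give each agent task an isolated worktree.
cd "$(mktemp -d)" && git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@local commit -q --allow-empty -m init

git worktree add ../agent-task-1 -b agent/task-1   # agent A edits here
git worktree add ../agent-task-2 -b agent/task-2   # agent B edits here
git worktree list                                  # one line per active checkout
git worktree remove ../agent-task-2                # clean up after merging
```

Because the object store is shared, spawning a worktree per task is far cheaper than a full clone, which is what makes one-checkout-per-agent practical.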



