---
kind: trend
trend_doc_id: 422
granularity: day
period_start: '2026-03-11T00:00:00'
period_end: '2026-03-12T00:00:00'
topics:
- code-reasoning
- software-engineering-agents
- evaluation
- security
- agent-auditing
run_id: materialize-outputs
aliases:
- recoleta-trend-422
tags:
- recoleta/trend
- topic/code-reasoning
- topic/software-engineering-agents
- topic/evaluation
- topic/security
- topic/agent-auditing
language_code: en
---

# Code intelligence moves toward process learning, while software agents shift toward realistic evaluation and auditable execution

## Overview
Today’s research focus is concentrated: code and software engineering remain active, but the discussion is no longer just about “models writing better code.” It is about whether the process can be learned, whether the result can be verified, and whether execution can be audited.

The strongest thread is process supervision. One class of work questions the idea that static repository snapshots can represent real development. *Understanding by Reconstruction* unfolds repositories backward into trajectories of requirements, planning, reading, writing, and debugging, then uses those trajectories for continued pretraining. Another class of work directly rewards intermediate execution states. *ExecVerify* trains code execution reasoning with verifiable step-level rewards, letting smaller models approach larger ones in code understanding and transferring the gains to code generation.

The second thread is software engineering agents becoming more like engineering systems. iSWE Agent does not simply maximize tool freedom. It specializes in Java repository issue fixing with a dedicated division of labor: first localize, then edit, while constraining the process with read-only static analysis tools and rule-based sanitization.

## Evolution

Compared with the historical windows, this period shows three clear changes. First, evaluation continues moving closer to real environments, and across a wider set of targets: no longer limited to code review or GUI agents, it now reaches RTL synthesis and deployment stability. Second, code model training is moving beyond structured representations toward process learning, with reconstructed development trajectories and rewards for intermediate execution steps as representative examples. Third, security governance has not receded, but today it emphasizes verifiable artifacts, such as independently verifiable browser-operation evidence, rather than only high-level principles.

### Real-world engineering evaluation continues to deepen

- Change: Continuing
- History windows: [Software engineering agents shift toward real-wo… (2026-03-10)](day--2026-03-10--trend--378.md), [Code agents move toward verifiable closed loops… (2026-03-09)](day--2026-03-09--trend--330.md)

Compared with CR-Bench and SpecOps in [Software engineering agents shift toward real-wo… (2026-03-10)](day--2026-03-10--trend--378.md), today’s main thread of “more realistic evaluation” has not cooled down; instead, it has expanded from software agents to hardware generation and verification methods. *Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation* places 32 models into a synthesis-in-the-loop pipeline over 202 Verilog tasks with 5 samples per task, and finds that the best-of-5 pass rate is on average 7.5 points higher than Global HQI, with GPT-4.1 even 13.9 points higher. This goes a step beyond the evaluations in [Software engineering agents shift toward real-wo… (2026-03-10)](day--2026-03-10--trend--378.md) that emphasized real PR and GUI scenarios, and begins directly measuring “can it synthesize, what is the quality, and is a single run stable.”

### Code model training shifts from structural representation to process learning

- Change: Shifting
- History windows: [Code agents move toward verifiable closed loops… (2026-03-09)](day--2026-03-09--trend--330.md), [Structured code intelligence, long-running agent… (2026-03-08)](day--2026-03-08--trend--284.md)

Compared with the “structured code intelligence” work in earlier windows, namely SWE-Fuse in [Code agents move toward verifiable closed loops… (2026-03-09)](day--2026-03-09--trend--330.md) and KCoEvo in [Structured code intelligence, long-running agent… (2026-03-08)](day--2026-03-08--trend--284.md), today’s training focus shifts further from structural representation to process supervision. *Understanding by Reconstruction* no longer relies only on repository snapshots: it reverse-generates about 4B tokens of development trajectories from roughly 300k repositories and performs 20B tokens of continued pretraining. *ExecVerify* directly assigns verifiable rewards to intermediate execution steps, raising a 7B model’s average reasoning score from 60.8 to 80.8. The key change is that models are no longer just looking at structure and outcomes, but are beginning to explicitly learn planning, reading, execution, and intermediate states.

### Security governance moves from principles to verifiable execution evidence

- Change: Continuing
- History windows: [Software engineering agents shift toward real-wo… (2026-03-10)](day--2026-03-10--trend--378.md), [Code agents move toward verifiable closed loops… (2026-03-09)](day--2026-03-09--trend--330.md), [Structured code intelligence, long-running agent… (2026-03-08)](day--2026-03-08--trend--284.md)

Security and governance remain an ongoing theme, but today they lean more toward operational artifacts. [Structured code intelligence, long-running agent… (2026-03-08)](day--2026-03-08--trend--284.md) discussed moving agent security upstream into dataflow governance, and [Software engineering agents shift toward real-wo… (2026-03-10)](day--2026-03-10--trend--378.md) mentioned protocolized connections evolving toward security and governance design; today, Conduit records each browser-agent action into a SHA-256 hash chain and signs the session at the end with Ed25519, generating a proof bundle containing an action log, hash chain, signature, and public key. Compared with earlier governance discussions focused more on architecture and protocol layers, this now introduces an audit component that can be directly integrated into MCP workflows.
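The hash-chain part of such an audit trail can be sketched in a few lines. This is a minimal illustration only, not Conduit's implementation: the action fields, the all-zero genesis value, and the JSON serialization are assumptions, and the final Ed25519 signature is omitted because it needs a third-party library.

```python
import hashlib
import json

# Assumed starting value for the chain; a real tool may use something else.
GENESIS = "0" * 64


def chain_actions(actions):
    """Return one SHA-256 digest per action, each covering the previous digest."""
    digests = []
    prev = GENESIS
    for action in actions:
        # Sorted keys and fixed separators keep the serialization deterministic.
        payload = json.dumps(action, sort_keys=True, separators=(",", ":"))
        prev = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        digests.append(prev)
    return digests


def verify_chain(actions, digests):
    """Recompute the chain from the log and compare it with the recorded digests."""
    return chain_actions(actions) == digests


# Hypothetical action log for illustration.
log = [
    {"step": 1, "type": "navigate", "url": "https://example.com"},
    {"step": 2, "type": "click", "selector": "#submit"},
]
chain = chain_actions(log)

# Changing any earlier action changes every later digest.
tampered = [dict(log[0], url="https://example.org"), log[1]]
print(verify_chain(log, chain), verify_chain(tampered, chain))
```

Because each digest covers its predecessor, signing only the final digest at session end would be enough to commit to the whole log, which is presumably why a single end-of-session signature suffices.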

## Clusters

### Code intelligence shifts toward process supervision and verifiable reasoning

Code and software engineering research is continuing to shift its focus from “final code” to “process trajectories.” *Understanding by Reconstruction* reverse-synthesizes roughly 4B tokens of development trajectories from about 300k GitHub repositories, then performs 20B tokens of continued pretraining on Llama-3-8B-Instruct. The results show that this kind of data, which includes traces of planning, reading, writing, and debugging, can simultaneously improve long-context understanding, code generation, and some agentic tasks. On another front, *ExecVerify* breaks code execution reasoning into verifiable intermediate steps, using white-box reinforcement learning to raise a 7B model’s average score from 60.8 to 80.8, and transfers the gains to code generation. Together, these two works show that code intelligence is moving from “learning outcomes” to “learning processes.”
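To make the idea of a verifiable step-level reward concrete, here is a minimal sketch. It is not ExecVerify's reward function: the function names, the per-step state format, and the simple match-fraction scoring are all assumptions chosen for illustration. The point is only that intermediate states can be checked against a real execution trace.

```python
def trace_running_sum(values):
    """Ground-truth trace: the value of `total` after each loop iteration."""
    total = 0
    states = []
    for v in values:
        total += v
        states.append({"total": total})
    return states


def stepwise_reward(predicted_states, actual_states):
    """Fraction of intermediate steps whose predicted state matches the real one."""
    if not actual_states:
        return 0.0
    matches = sum(1 for p, a in zip(predicted_states, actual_states) if p == a)
    return matches / len(actual_states)


# The real execution yields totals 2, 5, 10.
actual = trace_running_sum([2, 3, 5])

# A hypothetical model prediction that gets the last step wrong.
predicted = [{"total": 2}, {"total": 5}, {"total": 9}]

reward = stepwise_reward(predicted, actual)
print(round(reward, 3))
```

An outcome-only reward would score this prediction as a plain failure, while the step-level score still credits the two correct intermediate states.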

#### Representative sources
- [Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining](../Inbox/2026-03-11--understanding-by-reconstruction-reversing-the-software-development-process-for-llm-pretraining.md) — Zhiyuan Zeng; Yichi Zhang; Yong Shan; Kai Hua; Siyuan Fang; Zhaiyu Liu; …
- [ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning](../Inbox/2026-03-11--execverify-white-box-rl-with-verifiable-stepwise-rewards-for-code-execution-reasoning.md) — Lingxiao Tang; He Ye; Zhaoyang Chu; Muyang Ye; Zhongxin Liu; Xiaoxue Ren; …


### Software engineering agents move toward language-specific, low-side-effect repair

The software engineering agent line continues to move toward more concrete repository operations, but today it puts more emphasis on language-specific toolchains. *Resolving Java Code Repository Issues with iSWE Agent* splits issue fixing into two sub-agents, one for localization and one for editing, and connects 7 read-only Java static analysis tools to the localization stage. It reports near-best or state-of-the-art results on the 128-instance Multi-SWE-bench Java subset and the 165-instance SWE-PolyBench Java subset, while cutting API cost by a factor of 2 to 3 relative to other leading systems using the same base model. Compared with relying only on general-purpose bash/code execution, this kind of “rules + model” design looks better suited to an enterprise repository setting.
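The localize-then-edit split with read-only access in the first stage can be illustrated with a toy sketch. Everything here is hypothetical: the keyword-matching localizer stands in for real static analysis, and the function names, the in-memory repository, and the "match exactly once" rule are assumptions rather than iSWE Agent's design.

```python
from types import MappingProxyType


def localize(repo_view, issue_text):
    """Stage 1: read-only search for files that mention a keyword from the issue."""
    keywords = {w.strip(".,").lower() for w in issue_text.split() if len(w) > 4}
    return sorted(
        path
        for path, source in repo_view.items()
        if any(k in source.lower() for k in keywords)
    )


def apply_edit(repo, allowed_paths, path, old, new):
    """Stage 2: edit only a localized file, and only if the target matches once."""
    if path not in allowed_paths:
        raise PermissionError(f"{path} was not localized")
    if repo[path].count(old) != 1:
        raise ValueError("edit target must match exactly once")
    repo[path] = repo[path].replace(old, new)


# Hypothetical in-memory repository and issue text.
repo = {
    "src/Parser.java": "int parse(String s) { return Integer.parseInt(s); }",
    "src/Main.java": "public static void main(String[] a) {}",
}
issue = "Parser crashes: parseInt fails on padded input"

# The localizer only ever sees a read-only view of the repository.
candidates = localize(MappingProxyType(repo), issue)
apply_edit(repo, candidates, "src/Parser.java", "parseInt(s)", "parseInt(s.trim())")
print(candidates)
```

Passing a `MappingProxyType` view means the first stage cannot write to the repository even by accident, and the second stage refuses to touch files the first stage did not name.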

#### Representative sources
- [Resolving Java Code Repository Issues with iSWE Agent](../Inbox/2026-03-11--resolving-java-code-repository-issues-with-iswe-agent.md) — Jatin Ganhotra; Sami Serhan; Antonio Abu Nassar; Avraham Shinnar; Ziv Nevo; Martin Hirzel


### Evaluation shifts from pass rates to deployment quality and low-cost verification

Another strong signal today is that evaluation is moving much closer to real deployment. *Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation* no longer looks only at simulation pass rates, but connects syntax, synthesis, functionality, and hardware quality into one pipeline. Across 32 models, 202 Verilog tasks, and 5 samples per task, the authors find that the best-of-5 pass rate is on average 7.5 points higher than Global HQI, showing that “it runs” does not mean “it is deployable.” Similarly, *From Verification to Herding* reframes software verification as approaching optimality with fewer samples: EZR reaches an average of 90% optimality with 32 samples across 63 tasks. Both push evaluation from single-point success rates toward a more complete loop of quality and cost.
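The gap between a best-of-k number and single-run reliability is easy to see with a small calculation. The sketch below does not compute the paper's Global HQI, whose definition is not given here. It only contrasts a best-of-5 pass rate with the average per-sample pass rate on made-up results, and the task names and outcomes are hypothetical.

```python
def best_of_k_rate(results):
    """Share of tasks where at least one of the k samples passed."""
    return sum(any(samples) for samples in results.values()) / len(results)


def single_run_rate(results):
    """Share of all individual samples that passed."""
    total = sum(len(samples) for samples in results.values())
    passed = sum(sum(samples) for samples in results.values())
    return passed / total


# Hypothetical pass/fail outcomes: 4 tasks, 5 samples each.
results = {
    "adder": [True, True, True, True, True],
    "fifo": [False, True, False, False, False],
    "uart": [False, False, False, False, False],
    "alu": [True, False, True, True, False],
}

best = best_of_k_rate(results)
single = single_run_rate(results)
print(best, single)
```

In this toy data a task that passes once in five tries counts fully toward best-of-5 but contributes little to the per-sample rate, which is the kind of instability a single headline pass rate hides.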

#### Representative sources
- [Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes](../Inbox/2026-03-11--synthesis-in-the-loop-evaluation-of-llms-for-rtl-generation-quality-reliability-and-failure-modes.md) — Weimin Fu; Zeng Wang; Minghao Shao; Ramesh Karri; Muhammad Shafique; Johann Knechtel; …
- [From Verification to Herding: Exploiting Software's Sparsity of Influence](../Inbox/2026-03-11--from-verification-to-herding-exploiting-software-s-sparsity-of-influence.md) — Tim Menzies; Kishan Kumar Ganguly


### Security capability evaluation and agent auditing rise in parallel

Security and governance remain present today, but the framing is more engineering-oriented. *TOSSS* turns software security capability into a choice task between the real pre-patch and post-patch code of CVE fixes, covering 14 models with 500 C/C++ and 500 Java samples per model. Scores range from about 0.48 to 0.89, and explicit security prompting still brings an average gain of +0.021 to +0.029. On the other side, Conduit turns browser agent behavior into a proof bundle with SHA-256 hash chains and Ed25519 signatures, giving web actions verifiable audit evidence. The former measures security judgment, while the latter adds execution auditing. Together they reflect the idea that “security is not only about model outputs, but also about process accountability.”

#### Representative sources
- [Show HN:Conduit–Headless browser with SHA-256 hash chain - Ed25519 audit trails](../Inbox/2026-03-11--show-hn-conduit-headless-browser-with-sha-256-hash-chain-ed25519-audit-trails.md) — TaxFix
- [FP-Predictor - False Positive Prediction for Static Analysis Reports](../Inbox/2026-03-11--fp-predictor-false-positive-prediction-for-static-analysis-reports.md) — Tom Ohlmer; Michael Schlichtig; Eric Bodden