---
source: arxiv
url: http://arxiv.org/abs/2603.06858v1
published_at: '2026-03-06T20:22:36'
authors:
- Zheng Yu
- Wenxuan Shi
- Xinqian Sun
- Zheyun Feng
- Meng Xu
- Xinyu Xing
topics:
- automated-vulnerability-repair
- patch-validation
- llm-for-code
- benchmark-dataset
- software-security
relevance_score: 0.9
run_id: materialize-outputs
language_code: en
---

# Patch Validation in Automated Vulnerability Repair

## Summary
This paper points out that the patch validation methods commonly used in existing Automated Vulnerability Repair (AVR) research overestimate repair success rates because they ignore the new tests developers add alongside official patches. The authors propose using PoC[+] tests for stricter validation and construct the PVBench benchmark, containing 209 real vulnerability cases, to quantify this issue.

## Problem
- Existing AVR evaluations typically only check whether the original functional tests pass and whether the PoC no longer triggers the vulnerability; however, passing these checks does not establish that a patch aligns with developer intent.
- When submitting real patches, developers often add new tests. These tests check more than the absence of a crash: they also encode **root cause location, correct fix strategy, program specifications, and style constraints**.
- Therefore, if these newly added tests are ignored, the “repair success rate” of AVR systems will be systematically overestimated, which can mislead judgments about LLM repair capability.

## Approach
- The authors propose **PoC[+] tests**: tests newly added by developers around the official patch, using them to replace or supplement the traditional validation method of “original test suite + PoC.”
- They build **PVBench**: covering **20 open-source projects, 209 vulnerability cases, and 12 CWE categories**, with each case including both basic tests (original functional tests + PoC) and PoC[+] tests.
- They re-evaluate 3 SOTA AVR systems: **PatchAgent, San2Patch, SWE-Agent**, comparing the gap between “judged correct by basic tests” and “judged correct by PoC[+].”
- The authors also categorize PoC[+] tests into three types by mechanism: **output-checking, intermediate-checking, self-checking**, illustrating how these tests capture richer program semantics.
- They manually analyze patches that pass or fail PoC[+], finding that failures concentrate in **root-cause analysis, program specification adherence, and developer intention capture**.
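The two validation regimes above can be sketched as follows. This is a minimal illustration, not the paper's harness: the toy `div` bug, the `shallow_patch`/`dev_patch` names, and the assumption that the spec demands a `ValueError` are all invented for the example. The point it shows is that a patch which merely masks the crash satisfies "functional tests + PoC" but fails a developer-added PoC[+] test.

```python
# Toy vulnerable operation: integer division that crashes on b == 0.
# (All names and the bug itself are illustrative, not from the paper.)

def shallow_patch(a, b):
    # Masks the crash: silently returns 0 on the bad input.
    return 0 if b == 0 else a // b

def dev_patch(a, b):
    # Matches the assumed spec: reject the bad input explicitly.
    if b == 0:
        raise ValueError("divisor must be nonzero")
    return a // b

def poc(div):
    """PoC check: True if the original crash still triggers."""
    try:
        div(1, 0)
        return False
    except ZeroDivisionError:
        return True
    except ValueError:
        return False  # crash replaced by the intended error

# Original functional tests: behavior on normal inputs is preserved.
functional_tests = [lambda div: div(6, 3) == 2]

def poc_plus(div):
    """Developer-added PoC[+] test: bad input must raise ValueError,
    not be silently masked. Encodes the fix strategy, not just no-crash."""
    try:
        div(1, 0)
        return False
    except ValueError:
        return True
    except Exception:
        return False

def basic_verdict(div):
    # Traditional validation: functional tests pass AND PoC no longer triggers.
    return all(t(div) for t in functional_tests) and not poc(div)

def strict_verdict(div):
    # PoC[+] validation: additionally require the developer-added tests.
    return basic_verdict(div) and poc_plus(div)

print(basic_verdict(shallow_patch), strict_verdict(shallow_patch))  # True False
print(basic_verdict(dev_patch), strict_verdict(dev_patch))          # True True
```

The shallow patch is exactly the kind of result the paper argues is counted as "correct" by basic tests but overturned under PoC[+].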

## Results
- The authors release **PVBench**, containing **209** real vulnerability cases from **20** projects and covering **12** CWE categories; examples include **CWE-476 with 52 cases, CWE-617 with 40, CWE-122 with 34, CWE-416 with 32, and CWE-190 with 26**.
- Re-evaluation on **PatchAgent, San2Patch, SWE-Agent** finds that **more than 40%** of patches judged correct by traditional basic tests fail under **PoC[+]** testing, showing that existing evaluation methods substantially **overestimate success rates** (a high false-discovery rate among "correct" patches).
- To validate the reliability of PoC[+], the authors manually compare patches that pass PoC[+] with the corresponding developer patches and find that **more than 70%** are **semantically equivalent**; the remainder mainly have performance or code-quality issues, indicating that PoC[+] captures developer intent reasonably well, though not perfectly.
- The per-project distribution is concentrated in a few large codebases: **php with 43 vulnerabilities, cpython with 33, llvm with 26, v8 with 24, libxml2 with 19, and icu with 15**.
- The paper excerpt does not provide precise per-tool success-rate numbers for the three AVR systems, nor a complete table broken down by dataset/tool; the most important quantitative conclusion is that **>40% of patches deemed “correct” by basic tests are overturned by PoC[+]**.
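The size of the overestimation implied by the ">40%" figure can be worked out directly. The numbers below are assumptions chosen only to make the arithmetic concrete; the paper excerpt gives the overturn rate but not per-tool counts.

```python
# Illustrative arithmetic (the 100-patch count is an assumption, not from
# the paper): if basic tests judge 100 patches "correct" and 40% of those
# fail PoC[+], the reported success count overstates the true one by 1/0.6.
basic_correct = 100          # patches passing functional tests + PoC
overturn_rate = 0.40         # fraction overturned by PoC[+] (">40%" above)
true_correct = basic_correct * (1 - overturn_rate)
overestimation_factor = basic_correct / true_correct
print(true_correct, round(overestimation_factor, 2))  # 60.0 1.67
```

In other words, at a 40% overturn rate a traditionally reported success rate is inflated by roughly two thirds relative to the PoC[+]-validated one, and more if the overturn rate exceeds 40%.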

## Link
- [http://arxiv.org/abs/2603.06858v1](http://arxiv.org/abs/2603.06858v1)
