---
source: arxiv
url: http://arxiv.org/abs/2603.07520v1
published_at: '2026-03-08T08:18:42'
authors:
- Quanjun Zhang
- Chunrong Fang
- Haichuan Hu
- Yuan Zhao
- Weisong Sun
- Yun Yang
- Tao Zheng
- Zhenyu Chen
topics:
- automated-program-repair
- patch-correctness-assessment
- code-representation
- graph-neural-networks
- code-intelligence
relevance_score: 0.86
run_id: materialize-outputs
language_code: en
---

# On the Effectiveness of Code Representation in Deep Learning-Based Automated Patch Correctness Assessment

## Summary
This paper systematically evaluates the effectiveness of different code representations in deep learning-based automated patch correctness assessment (APCA). The core finding is that graph representations, especially the code property graph (CPG), are the most stable and perform best overall at determining whether a patch is overfitting. The study also shows that appropriately combining two types of representations can further improve performance, but combining more representations is not necessarily better.

## Problem
- The problem to solve is: **how to more accurately determine whether an APR-generated patch that appears to pass tests is truly correct, rather than an incorrect patch overfitting to the test suite**.
- This matters because APR has long been plagued by patch overfitting; prior work cited in the paper reports that developers typically spend about **50%** of their time on debugging and fixing, and incorrect-but-plausible patches increase manual inspection costs and reduce the practical usability of APR.
- Although learning-based APCA methods already exist, the most fundamental component—**code representation**—lacks systematic comparison, and the community does not clearly understand the respective strengths and weaknesses of heuristic / sequence / tree / graph representations or whether they are suitable for fusion.

## Approach
- The authors build and use a large-scale patch benchmark: **2,274** labeled plausible patches from **Defects4J**, generated by **30+** repair tools, for unified evaluation of APCA.
- They systematically compare **4 categories and 15 code representations**: heuristic-based, sequence-based, tree-based, graph-based; and evaluate them with **11** classification models, training **500+** APCA models in total.
- Put simply, the method is: convert the “buggy code + patched code” into different forms of representation (hand-crafted features, token sequences, ASTs, program graphs), then train a binary classifier to determine whether the patch is correct or overfitting.
- The graph representations cover control-flow, control-dependence, data-dependence, program-dependence, and code property graphs (**CFG/CDG/DDG/PDG/CPG**), learned with GNNs (GCN, GAT, GGNN) to capture patch semantics; the authors also analyze the respective roles of textual information vs. type information in graph node embeddings.
- They further conduct representation fusion experiments to examine whether combining two or more categories of representations can predict patch correctness better than a single representation.
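The pipeline in the bullets above can be sketched end to end. The sketch below is a minimal, stdlib-only illustration and not the paper's implementation: it uses a bag-of-tokens diff as a stand-in for a sequence-based representation and a perceptron as a stand-in binary classifier (the paper evaluates 15 representations and 11 model types). All function names are hypothetical.

```python
from collections import Counter

def token_diff_features(buggy: str, patched: str) -> Counter:
    """A toy sequence-style representation: per-token frequency deltas
    between the buggy and patched code (positive = token added)."""
    b, p = Counter(buggy.split()), Counter(patched.split())
    feats = Counter()
    for tok in set(b) | set(p):
        delta = p[tok] - b[tok]
        if delta:
            feats[tok] = delta
    return feats

def train(samples, labels, epochs=20, lr=0.1):
    """Tiny perceptron over sparse feature dicts.
    Label 1 = correct patch, 0 = overfitting patch."""
    w, bias = Counter(), 0.0
    for _ in range(epochs):
        for feats, y in zip(samples, labels):
            score = bias + sum(w[t] * v for t, v in feats.items())
            pred = 1 if score > 0 else 0
            if pred != y:  # mistake-driven update
                for t, v in feats.items():
                    w[t] += lr * (y - pred) * v
                bias += lr * (y - pred)
    return w, bias

def predict(model, feats) -> int:
    w, bias = model
    return 1 if bias + sum(w[t] * v for t, v in feats.items()) > 0 else 0
```

In the paper's setting the classifier instead receives TF-IDF vectors, token embeddings, AST encodings, or GNN graph embeddings, but the overall shape — represent the buggy/patched pair, then binary-classify — is the same.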

## Results
- In RQ1, among the four categories, **graph representations perform best**. The accuracy of the best combination in each category is: **XGBoost+TF-IDF 80.41%**, **Transformer+sequence 82.48%**, **TreeLSTM+AST 82.94%**, **GGNN+CPG 83.73%**.
- The overall graph-representation result reported in the abstract shows that **CPG** achieves an **average accuracy of 82.69%** across three GNN models, indicating that this representation, which has previously received relatively little systematic study, is the most stable.
- In comparison with existing SOTA APCA methods, the average overall performance of the four categories is **80.55% / 82.90% / 83.03% / 83.81%** (corresponding to heuristic / sequence / tree / graph). Among them, **CPG+GGNN** improves over **Tian et al.'s BERT+SVM** by **9.34% / 14.96% / 8.83%** on **accuracy / recall / F1**, respectively.
- The paper reports that these representation-based classifiers can match or surpass existing APCA approaches: for example, **TreeLSTM+AST** filters out **87.09%** of overfitting patches.
- For representation fusion, **integrating sequence-based representations into heuristic-based representations** can bring an **average improvement of 13.58%** across **5 metrics**, making it one of the most notable gains.
- But more fusion is not always better: adding **tree-based** representations to the **heuristic+sequence** combination causes an **average decline of 3.34%**, showing that multi-representation fusion still has clear limitations and room for further research.
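The fusion experiments in the last two bullets amount to concatenating feature spaces before classification. A minimal, stdlib-only sketch of that idea, assuming illustrative placeholder feature sets (not the paper's actual heuristic or sequence features):

```python
from collections import Counter

def sequence_features(buggy: str, patched: str) -> dict:
    """Token-frequency deltas: a toy sequence-based representation."""
    b, p = Counter(buggy.split()), Counter(patched.split())
    return {t: p[t] - b[t] for t in set(b) | set(p) if p[t] != b[t]}

def heuristic_features(buggy: str, patched: str) -> dict:
    """Hand-crafted patch statistics (illustrative, not the paper's set).
    Keys use reserved '__' names so they cannot collide with code tokens."""
    bl, pl = buggy.splitlines(), patched.splitlines()
    return {
        "__line_delta__": len(pl) - len(bl),
        "__only_deletes__": int(set(pl) <= set(bl)),
    }

def fuse(*feature_dicts: dict) -> dict:
    """Fusion = concatenation of disjoint feature spaces into one vector,
    which then feeds a single downstream classifier."""
    fused = {}
    for d in feature_dicts:
        fused.update(d)
    return fused
```

The paper's finding can be read through this lens: adding a second, complementary feature space (heuristic + sequence) enriches the classifier's input, while piling on a third can add redundant or conflicting signal and hurt average performance.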

## Link
- [http://arxiv.org/abs/2603.07520v1](http://arxiv.org/abs/2603.07520v1)
