Recoleta Item Note

Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes

rtl-generation, verilog, llm-evaluation, hardware-synthesis, code-generation

Summary

This paper proposes a “synthesis-in-the-loop” evaluation framework for RTL generation. It uses the HQI metric, which combines post-synthesis area, delay, and warnings, to systematically evaluate the real hardware usability of 32 LLMs on 202 Verilog tasks. The main conclusion is that simulation pass rates alone significantly overestimate model capability; while frontier models are approaching expert-level quality, single-run deployment stability and synthesis failure modes remain the main bottlenecks.

Problem

Most existing RTL generation evaluations check only syntax or simulation pass/fail, so they cannot tell whether the code is synthesizable or how good the resulting hardware is after synthesis.
For chip design, functional correctness alone is not enough; if area/delay degrades severely, or the design cannot be mapped to gate-level circuits at all, the generated result cannot be used in a real production flow.
Therefore, a unified evaluation method is needed that covers “syntax → synthesis → functionality → QoR,” avoiding misjudgment of hardware generation capability by software-code evaluation paradigms.

Approach

The authors build a staged evaluation pipeline: first checking Verilog syntax (Icarus Verilog), then performing synthesis (Yosys + Nangate45 45nm), and finally running testbenches to verify functional correctness.
They propose HQI (Hardware Quality Index), ranging from 0 to 100; only designs that pass all three checks (syntax, synthesis, and functionality) receive a score, and quality is computed from area, delay, and warning count relative to expert reference designs.
They evaluate 32 models on 202 tasks (from VerilogEval and RTLLM), with 5 independent samples for each model-task pair, while also reporting complexity-weighted Coverage, Global HQI (best-of-5), and ExpHQI (single-attempt expected quality).
They design a tool-adjudicated taxonomy of synthesis failures, classifying samples that pass parsing but fail at the Yosys stage into nine categories to analyze systematic failure mechanisms across models.
They additionally re-synthesize across 3 technology libraries to verify whether model rankings are robust to process variation.
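The gating rule and the two HQI aggregations described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not from the paper: the `Sample` record, the function names, and the per-sample `quality` values are hypothetical; the expected single-attempt score is taken here as the plain mean over the five samples, with failed samples counted as 0; and complexity weighting is left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One generated design (hypothetical record, not the paper's schema)."""
    syntax_ok: bool   # passed the syntax check
    synth_ok: bool    # passed synthesis
    func_ok: bool     # passed the functional testbench
    quality: float    # assumed 0-100 quality relative to a reference design


def gated_score(s: Sample) -> float:
    """Score a sample only if it passes all three stages; otherwise 0."""
    if s.syntax_ok and s.synth_ok and s.func_ok:
        return s.quality
    return 0.0


def best_of_k(samples: list[Sample]) -> float:
    """Capability ceiling: the best gated score among the samples."""
    return max(gated_score(s) for s in samples)


def expected_single(samples: list[Sample]) -> float:
    """Assumed single-attempt expectation: mean gated score over samples."""
    return mean(gated_score(s) for s in samples)


# Five hypothetical samples for one model-task pair.
task_samples = [
    Sample(True, True, True, 90.0),    # fully passing
    Sample(True, True, False, 95.0),   # synthesizes but fails the testbench
    Sample(True, False, False, 0.0),   # parses but fails synthesis
    Sample(True, True, True, 70.0),    # fully passing, lower quality
    Sample(False, False, False, 0.0),  # syntax error
]

best = best_of_k(task_samples)
expected = expected_single(task_samples)
gap = best - expected
print(best, expected, gap)
```

In this toy case the second sample has the highest raw quality but contributes nothing because it fails the functional stage, which is the behavior the gating rule is meant to enforce. The difference between `best` and `expected` is the kind of gap the note refers to as deployment stability.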

Results

In the evaluation of 32 models, 202 tasks, and 5 attempts per task, the models form a three-tier structure: Tier 1 has 13 models (Global HQI >71), Tier 2 has 11 (53–68), and Tier 3 has 8 (<53).
The strongest model is Gemini-3-Pro, reaching 87.5% Coverage and 85.1 Global HQI; it is followed by GPT-5.4-Pro (81.3), Gemini-3-Flash (81.2), GPT-5.3-Codex (80.8), and GPT-5-Pro (80.5). The weakest model, Mistral-Nemo, achieves only 18.1 Global HQI; the paper states that the gap in hardware implementation quality between the strongest and weakest is about 4.7×.
Looking only at simulation overestimates hardware readiness: across all models, the best-of-5 pass rate is on average 7.5 points higher than Global HQI; for example, GPT-4.1 has 76.7% pass vs. 62.8 HQI (a gap of 13.9), and Gemini-2.0-Flash has 54.5% pass vs. 39.6 HQI (a gap of 14.9).
There is a clear deployment stability gap: the difference between best-of-5 and single-attempt expected quality is 3.8–22.1 HQI points; even within Tier 1, the median gap is 8.2, indicating that even frontier models often fail to reach their capability ceiling in a single call.
Out of 32,320 total generations, 195 were genuine synthesis failures; the top three failure modes account for 76.6%: late syntax errors (59 cases, 30.0%), undefined module references (50 cases, 25.4%), and non-synthesizable constructs (41 cases, 20.8%).
Failure modes diverge significantly by model type: proprietary models more often “fail late,” with 46% of their failures being late syntax errors at the elaboration stage, and synthesis timeouts (12%) appear only in proprietary models. Open-weight models more often “fail early,” with undefined modules (40%), non-synthesizable constructs (29%), and simulation-only system tasks (13%), which points to training data biased toward simulation-grade rather than synthesis-grade RTL.
Model rankings are also robust: they are almost unchanged across the 3 technology libraries, with Spearman ρ > 0.99.
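To make the ranking-robustness check concrete, here is a small sketch of Spearman's rank correlation in plain Python. The score lists are made-up illustrative values, not data from the paper, and the function names are hypothetical. The closed-form formula used here, ρ = 1 − 6Σd² / (n(n² − 1)), assumes there are no tied scores.

```python
def ranks(values: list[float]) -> list[int]:
    """Rank values from 1 (highest) to n (lowest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation via the no-ties closed form."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical per-model scores under three technology libraries.
lib_a = [85.0, 81.0, 62.0, 40.0, 18.0]
lib_b = [84.0, 82.5, 60.0, 41.0, 19.0]  # same model ordering as lib_a
lib_c = [81.0, 85.0, 62.0, 40.0, 18.0]  # top two models swapped

rho_ab = spearman_rho(lib_a, lib_b)
rho_ac = spearman_rho(lib_a, lib_c)
print(rho_ab, rho_ac)
```

With only five models, a single swap of adjacent ranks already lowers ρ to 0.9, so a value above 0.99 across 32 models indicates that very few rank positions change between libraries.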

Link

http://arxiv.org/abs/2603.11287v1
