Recoleta Item Note

Quality-Driven Agentic Reasoning for LLM-Assisted Software Design: Questions-of-Thoughts (QoT) as a Time-Series Self-QA Chain

llm-agents, software-design, inference-time-reasoning, self-verification, code-quality

Summary

This paper proposes QoT (Questions-of-Thoughts), a quality-oriented reasoning scaffold for software design in which an LLM first decomposes the task into ordered engineering steps and then runs self-questioning checks on each step. It aims to reduce omissions, improve modularity and security, and leave behind a reusable, lightweight reasoning record.

Problem

Existing LLM-assisted software development can often generate code that "looks usable," but it is frequently inadequate in completeness, modularity, and security, especially in multi-module, long-chain tasks where key constraints are easily missed.
Checking only whether functionality runs is not enough, because real software systems also need to be maintainable, auditable, and deployable; this is especially important for backend systems, enterprise workflows, and compliance scenarios.
Existing CoT/ToT/self-correction methods usually focus more on "generate first, then revise," and lack a front-loaded constraint-organizing and step-by-step verification mechanism centered on software quality attributes.

Approach

QoT first breaks the user goal into ordered engineering steps (Sequential Process Chain), for example, user modules first, then business modules, then routing and integration, to avoid missing dependency relationships during one-shot generation.
For each step, the model automatically raises a set of self-check questions (Question-Answer Chain). Put simply, the model questions itself while it works: "Is there access control? Is there error handling? Are there concurrency/consistency issues?"
The system continuously writes intermediate conclusions into a Reasoning Knowledge Base, which serves as context for later steps and helps subsequent design stay consistent with earlier constraints.
This method is an inference-time enhancement rather than training a new model: the base model remains unchanged, and only a quality-driven agentic scaffold is wrapped around the inference process.
In evaluation, the authors use an ISO/IEC-inspired quality rubric to score Scalability, Completeness, Modularity, and Security on a 1–4 scale, and compare differences among QoT, NoQoT, and CoT.
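The three components described above (Sequential Process Chain, Question-Answer Chain, Reasoning Knowledge Base) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_qot`, `KnowledgeBase`, and `stub_llm`, the prompt wording, and the one-item-per-line output format are all hypothetical, and a deterministic stub stands in for a real model so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class KnowledgeBase:
    """Accumulates intermediate conclusions so later steps can see earlier ones."""
    entries: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.entries.append(note)

    def as_context(self) -> str:
        return "\n".join(self.entries)


def _lines(text: str) -> List[str]:
    """Split a completion into non-empty, stripped lines (assumed output format)."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def run_qot(goal: str, llm: LLM) -> KnowledgeBase:
    kb = KnowledgeBase()
    # 1. Sequential Process Chain: break the goal into ordered engineering steps.
    steps = _lines(llm(f"PLAN\nGoal: {goal}\nList ordered steps, one per line."))
    for step in steps:
        # 2. Question-Answer Chain: raise self-check questions for this step.
        questions = _lines(
            llm(f"QUESTIONS\nStep: {step}\nKnown so far:\n{kb.as_context()}")
        )
        for question in questions:
            answer = llm(
                f"ANSWER\nStep: {step}\nQuestion: {question}\n"
                f"Known so far:\n{kb.as_context()}"
            )
            # 3. Reasoning Knowledge Base: record the conclusion for later steps.
            kb.add(f"[{step}] {question} -> {answer.strip()}")
    return kb


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model (assumption for this sketch)."""
    if prompt.startswith("PLAN"):
        return "user module\nbusiness module\nrouting and integration"
    if prompt.startswith("QUESTIONS"):
        return "Is there access control?\nIs there error handling?"
    return "yes, addressed in this step"


kb = run_qot("Design a small REST backend", stub_llm)
print(len(kb.entries))
```

Because the base model is only called through `llm`, swapping the stub for a real client leaves the scaffold unchanged, which mirrors the inference-time (no retraining) framing of the method.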

Results

In the QoT vs CoT comparison, llama3.1_70b shows the clearest improvement: API Design +5.8±1.30, Data Communication +6.6±0.89, File Systems +3.2±1.48.
In QoT vs CoT, llama3.3_70b is also positive across all three domains: API Design +2.2±2.28, Data Communication +4.8±2.17, File Systems +2.2±3.90.
Smaller models also benefit but are less stable: llama3.1_8b improves over CoT by API Design +2.0±1.73, Data Communication +2.4±3.05, File Systems +1.2±2.77; llama3.2_3b achieves API Design +3.6±2.51, Data Communication +1.4±1.67, File Systems +1.4±5.86.
In QoT vs NoQoT, the results depend on both model capacity and task: for example, llama3.1_70b gets API Design +3.4±1.34 and Data Communication +5.4±1.67, but File Systems -2.8±1.10; llama3.3_70b also shows -3.0±3.46 on File Systems, which the authors interpret as possible "overthinking/over-engineering."
The percentage summary shown in the figure indicates that llama3.2_3b reaches a total improvement of 101.49% under QoT vs NoQoT, while llama3.1_70b is 23.08%, llama3.1_8b is 23.81%, and llama3.3_70b is 2.80%.
The paper's central claim is that QoT can significantly improve software design quality through "stepwise planning + self-check Q&A + cumulative memory" without changing model parameters, and that in some scenarios it allows smaller models to approach the one-shot generation quality of larger models.
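To make the "delta ± spread" and percentage figures above easier to read, here is a small sketch of how such summaries could be computed from per-run rubric totals. Several things are assumptions rather than details confirmed by this note: that the four 1–4 dimension scores are summed into one total per run, that the ± value is a sample standard deviation over per-run differences, and that the percentage is the relative change in summed totals. The helper names and the input numbers are made up for illustration and are not taken from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

DIMENSIONS = ("Scalability", "Completeness", "Modularity", "Security")


def rubric_total(scores: Dict[str, int]) -> int:
    """Sum the four 1-4 dimension scores (assumed aggregation)."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 4:
            raise ValueError(f"{dim} score must be in 1..4, got {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)


def summarize(
    qot_totals: List[int], baseline_totals: List[int]
) -> Tuple[float, float, float]:
    """Return (mean delta, sample std of deltas, percent change in summed totals)."""
    deltas = [q - b for q, b in zip(qot_totals, baseline_totals)]
    pct = 100.0 * (sum(qot_totals) - sum(baseline_totals)) / sum(baseline_totals)
    return mean(deltas), stdev(deltas), pct


# Made-up per-run totals for illustration only; NOT data from the paper.
qot_runs = [12, 14, 13, 15, 13]
baseline_runs = [10, 11, 12, 11, 11]

delta_mean, delta_std, pct_change = summarize(qot_runs, baseline_runs)
print(f"{delta_mean:+.1f}±{delta_std:.2f}, {pct_change:.2f}%")
```

Under these assumptions, a large standard deviation relative to the mean (as reported for some of the smaller models) indicates that the sign of the improvement is not consistent from run to run.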

Link

http://arxiv.org/abs/2603.11082v1
