From Verification to Herding: Exploiting Software's Sparsity of Influence
Summary
This paper argues for shifting software testing/verification from “model first, then prove” to sampling-based herding, based on the key premise that software is often controlled by only a small number of critical variables. The authors propose the black-box optimization algorithm EZR and claim that it can approach optimal performance on a large set of software engineering tasks using very few samples.
Problem
- Software verification and validation (V&V) is extremely costly, reportedly accounting for 60% of project effort, and exhaustive verification is becoming increasingly infeasible in the presence of AI components, concurrency, and distributed systems.
- Traditional methods such as symbolic execution, model checking, ASP, and probabilistic programming depend on complex models; the authors argue that “building another model of the software” may itself be more expensive and more brittle than testing directly.
- The key question is: without understanding the system’s full internal mechanism, can we identify the small set of “control variables” that govern system behavior using only limited input/output sampling, and thereby steer the system toward a target state?
Approach
- The core idea is sparsity of influence: although software state spaces are huge, the key variables that actually determine outcomes are few in number, often fewer than 10, so exhaustive search is unnecessary.
- The authors generalize testing as a black-box optimization/search problem: instead of proving that “all inputs are correct,” the goal is to find some input configuration that brings the system to an acceptable or even near-ideal target (called Heaven in the paper).
- They propose EZR (Efficient Zero-knowledge Ranker): first randomly sample a small number of configurations, rank them by loss relative to the “ideal point,” and split the samples into BEST and REST groups.
- EZR discretizes each input attribute, looks for attribute ranges that occur frequently in BEST and infrequently in REST, and uses a simple two-class Bayesian-style score to select rules that best distinguish good outcomes.
- It then uses these high-scoring rules to constrain the next round of sampling, continually “herding” the search toward better regions. Compared with methods like TPE/SMAC, it avoids rebuilding complex models and instead uses incremental updates, making it lighter-weight.
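The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the authors' implementation: the "distance to heaven" loss, the sqrt-sized BEST split, the number of discretization bins, and the exact scoring formula `b²/(b+r)` are all assumptions chosen here for concreteness.

```python
import random

def distance_to_heaven(ys, heaven):
    """Loss = normalized Euclidean distance from a config's objectives
    to the ideal point ('Heaven'). Lower is better."""
    return (sum((y - h) ** 2 for y, h in zip(ys, heaven)) / len(ys)) ** 0.5

def ezr_round(configs, evaluate, heaven, bins=5):
    """One 'herding' round (sketch): rank sampled configs by loss,
    split them into BEST/REST, and score discretized attribute ranges
    that are frequent in BEST but rare in REST."""
    ranked = sorted(configs, key=lambda x: distance_to_heaven(evaluate(x), heaven))
    n_best = max(2, int(len(ranked) ** 0.5))        # sqrt-sized BEST group (assumption)
    best, rest = ranked[:n_best], ranked[n_best:]

    rules = []
    for d in range(len(configs[0])):                # discretize each attribute
        lo = min(x[d] for x in configs)
        hi = max(x[d] for x in configs)
        width = (hi - lo) / bins or 1.0
        for b in range(bins):
            a, z = lo + b * width, lo + (b + 1) * width
            nb = sum(a <= x[d] < z + 1e-12 for x in best) / len(best)
            nr = sum(a <= x[d] < z + 1e-12 for x in rest) / max(1, len(rest))
            score = nb * nb / (nb + nr + 1e-12)     # two-class Bayes-style score
            rules.append((score, d, a, z))
    rules.sort(reverse=True)
    return rules[0]                                 # highest-scoring (dim, range) rule

# Toy use: minimize both coordinates of a 2-D config sampled from [0, 10]^2.
random.seed(1)
evaluate = lambda x: x                              # objectives are the raw coordinates
heaven = (0.0, 0.0)
pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(32)]
score, dim, a, z = ezr_round(pts, evaluate, heaven)
# A next round would resample with dimension `dim` clamped to [a, z] (herding step).
```

Each round narrows one attribute to its most promising range; repeating this with a fresh constrained sample is what "herds" the search toward better regions without ever fitting a surrogate model.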
Results
- The paper’s central experimental result is that on 63 tasks from the MOOT repository, EZR reaches an average of 90% optimality (relative to a reference optimum) using only 32 samples.
- Performance as a function of sampling budget: 8 samples → 62%, 16 samples → 80%, 32 samples → 90%, 64 samples → 91%, 128 samples → 92%; the authors emphasize a clear diminishing-returns knee at 32.
- The dataset covers many categories of software engineering tasks, including system configuration, performance tuning, cloud deployment, Scrum planning, risk modeling, and test-case reduction; input dimensionality ranges from 3 to 88 for most tasks, rising to 124 and even 1000 decision variables for the SPL tasks.
- In Table 2, the average Opt% at N=32 for each task family is approximately: system configuration 91%, performance tuning 89%, cloud 92%, Scrum 88%, risk 94%, health 90%, SPL 85%, finance 93%, talent 89%, security 91%, sales 87%, control 95%, and test reduction 92%.
- The paper also claims that EZR "matches or outperforms" SMAC, OPTUNA, DEHB, random search, and KPP; Figure 1 suggests its statistical performance is usually no worse than SMAC's, but the main text does not provide a single unified win-rate table.
- On speed, the authors make a qualitative claim: over 20 runs on the full dataset, EZR finishes in minutes whereas SMAC requires days; the excerpt does not include precise absolute timings.
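The diminishing-returns knee at 32 samples follows directly from the marginal gains per doubling of the budget, using the Opt% curve reported above:

```python
# Budget -> Opt% curve as reported in the paper's results.
curve = {8: 62, 16: 80, 32: 90, 64: 91, 128: 92}
budgets = sorted(curve)
gains = {b: curve[b] - curve[a] for a, b in zip(budgets, budgets[1:])}
print(gains)  # {16: 18, 32: 10, 64: 1, 128: 1}
```

Doubling the budget past 32 buys roughly one percentage point per doubling, versus 10–18 points per doubling below it, which is why the authors treat 32 as the practical operating point.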