The first benchmark explicitly designed to evaluate LLMs on compositional specification generation, revealing a striking 92-percentage-point gap between syntax correctness and verification success.
Despite rapid advances in code generation, current Large Language Models (LLMs) still lack an essential capability for reliable and verifiable code generation: compositional reasoning across multi-function programs.
To probe this important gap, we introduce DafnyComp, a benchmark designed to systematically evaluate LLMs on the generation of compositional specifications in Dafny. Unlike prior benchmarks that primarily target single-function annotation, DafnyComp focuses on programs composed of multiple interacting functions with necessary data dependencies.
Contract propagation failures: 39.2%
Independent generation pathways: 21.7%
Inductive chain breakdown: 14.1%
Mechanically verified multi-function Dafny programs
Chain-based compositions with data dependencies
Spanning dynamic programming, graphs, strings, and more
Median loop invariants per program (3.5× DafnyBench)
Systematic combination of functions from LeetCodeDataset with controlled data flow
Incremental AST-guided transformation to verified Dafny
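As a minimal sketch of a chain-based composition (the methods below are illustrative and not taken from the benchmark), the downstream precondition can only be discharged through the upstream postcondition:

// Illustrative example; not an actual DafnyComp task.
method Abs(x: int) returns (y: int)
  ensures y >= 0
{
  if x < 0 { y := -x; } else { y := x; }
}

method HalveNonNegative(n: int) returns (h: int)
  requires n >= 0
  ensures 0 <= h <= n
{
  h := n / 2;
}

// The chain verifies only because Abs's postcondition propagates
// the fact that HalveNonNegative's precondition needs.
method Chain(x: int) returns (r: int)
  ensures r >= 0
{
  var a := Abs(x);
  r := HalveNonNegative(a);
}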
// Input: implementation without specifications.
method digitSum(n: int) returns (sum: int)
{
var temp := n;
sum := 0;
while temp > 0 {
sum := sum + (temp % 10);
temp := temp / 10;
}
}
// Target: the same method annotated with contracts, loop invariants, and a termination measure.
method digitSum(n: int) returns (sum: int)
requires n >= 0
ensures sum >= 0
ensures sum == /* digit sum of n */
{
var temp := n;
sum := 0;
while temp > 0
invariant temp >= 0
invariant sum >= 0
decreases temp
{
sum := sum + (temp % 10);
temp := temp / 10;
}
}
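One way to make the elided postcondition above concrete is to name the digit sum with a recursive specification function and tie the loop to it with an invariant; this is a sketch, and DigitSumSpec is a helper we introduce here, not part of the benchmark task.

// DigitSumSpec is an illustrative helper, not from the benchmark.
function DigitSumSpec(n: int): int
  requires n >= 0
  decreases n
{
  if n == 0 then 0 else (n % 10) + DigitSumSpec(n / 10)
}

method digitSum(n: int) returns (sum: int)
  requires n >= 0
  ensures sum >= 0
  ensures sum == DigitSumSpec(n)
{
  var temp := n;
  sum := 0;
  while temp > 0
    invariant temp >= 0
    invariant sum >= 0
    // Digits counted so far plus digits still to process equal the total.
    invariant sum + DigitSumSpec(temp) == DigitSumSpec(n)
    decreases temp
  {
    sum := sum + (temp % 10);
    temp := temp / 10;
  }
}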
Models must regenerate requires, ensures, reads, modifies, and decreases clauses to enable verification across function boundaries.
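For readers less familiar with Dafny, the toy declarations below show where each clause kind lives (the names are ours and purely illustrative): reads frames what a function may read, modifies frames what a method may write, and decreases justifies termination.

// Illustrative only; not taken from DafnyComp.
class Counter {
  var count: int

  // Functions must declare the heap state they read.
  function Value(): int
    reads this
  {
    count
  }

  // Methods must declare the heap state they may modify.
  method Add(k: nat)
    requires k > 0
    modifies this
    ensures Value() == old(Value()) + k
  {
    count := count + k;
  }
}

// Recursion (and loops) need a decreases clause to prove termination.
method CountDown(n: nat)
  decreases n
{
  if n > 0 { CountDown(n - 1); }
}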
Despite achieving near-perfect syntax correctness (99%), all evaluated models exhibit catastrophic verification failure, with the best model reaching only 7% verification success.
Moderate verification success on isolated functions
Catastrophic failure on compositional tasks
92-percentage-point gap between syntax and verification success across all models, independent of scale, specialization, or architecture.
Super-linear complexity scaling suggests combinatorial specification dependencies that models fail to capture.
Verification performance plateaus by Pass@4, indicating architectural rather than search limitations.
Specialized reasoning models show no improvement, confirming architectural barriers rather than optimization issues.
Through systematic analysis of 900 verification failures, we identified three primary failure modes that reveal fundamental limitations in how transformers process compositional specifications.
The Domino Effect: Missing contract propagation causes cascading failures.
A digitSum function missing ensures result >= 0 appears correct in isolation, but invalidates the entire pipeline when its output feeds a downstream function expecting non-negative input.
Key Insight: LLMs learn specifications as local patterns rather than global contracts, lacking architectural machinery for cross-boundary reasoning.
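A minimal sketch of this cascade (method names are ours): each method verifies in isolation, but the composition does not.

// Upstream: verifies on its own, but exposes nothing about its result.
method DigitSum(n: int) returns (sum: int)
  requires n >= 0
  // Missing: ensures sum >= 0
{
  var temp := n;
  sum := 0;
  while temp > 0
    invariant temp >= 0 && sum >= 0
    decreases temp
  {
    sum := sum + temp % 10;
    temp := temp / 10;
  }
}

// Downstream: accepts only non-negative inputs.
method Double(s: int) returns (d: int)
  requires s >= 0
  ensures d == 2 * s
{
  d := 2 * s;
}

method Pipeline(n: int) returns (d: int)
  requires n >= 0
{
  var s := DigitSum(n);
  // Fails to verify: without the missing ensures clause, the verifier
  // cannot establish s >= 0 for Double's precondition.
  d := Double(s);
}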
The Independence Assumption: Code and specifications are generated as independent tasks.
Loop invariants like forall k :: 0 <= k < i ==> cnt[k] >= 0 appear plausible, but fail verification because of the implementation's actual array access patterns.
Key Insight: Transformer attention mechanisms excel at local dependencies but struggle with bidirectional constraints between specifications and implementations.
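A sketch of this mismatch (the code is ours): the commented-out invariant follows a familiar counting pattern, but this body decrements its counters, so the guessed specification and the implementation never agree.

// Illustrative only; not taken from DafnyComp.
method Tally(a: array<int>) returns (cnt: array<int>)
  requires forall j :: 0 <= j < a.Length ==> 0 <= a[j] < 10
  ensures cnt.Length == 10
{
  cnt := new int[10](x => 0);
  var i := 0;
  while i < a.Length
    invariant 0 <= i <= a.Length
    // Plausible-looking but wrong for this body:
    //   invariant forall k :: 0 <= k < cnt.Length ==> cnt[k] >= 0
    decreases a.Length - i
  {
    cnt[a[i]] := cnt[a[i]] - 1;  // the implementation counts downward
    i := i + 1;
  }
}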
Induction as Achilles' Heel: Failure to maintain inductive reasoning chains.
Loop invariants break because models cannot track how program state evolves through iterations. Properties proven for base cases fail to extend inductively.
Key Insight: Statistical approximation is insufficient for formal verification, which demands precise inductive proofs rather than pattern matching.
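A small sketch of why base-case reasoning is not enough (the code is ours): an invariant that merely holds on entry, such as p == 1, is not preserved; the invariant has to state how p evolves with i.

// Illustrative only; not taken from DafnyComp.
function Pow2(n: nat): nat
{
  if n == 0 then 1 else 2 * Pow2(n - 1)
}

method ComputePow2(n: nat) returns (p: nat)
  ensures p == Pow2(n)
{
  p := 1;
  var i := 0;
  while i < n
    invariant 0 <= i <= n
    // `invariant p == 1` holds at the base case but breaks after one iteration;
    // the inductive invariant relates p to the loop counter:
    invariant p == Pow2(i)
    decreases n - i
  {
    p := 2 * p;
    i := i + 1;
  }
}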
Current transformer architectures lack the inductive biases necessary for compositional reasoning in formal verification.
Existing training methods fail to enforce consistency between code and specifications, treating them as independent generation tasks.
Simply increasing model size or compute does not address the fundamental compositional reasoning deficit.
@article{dafnycomp2025,
  title={Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification},
  author={Xu Xu and Xin Li and Xingwei Qu and Jie Fu and Binhang Yuan},
  journal={ArXiv},
  year={2025}
}