Local Success Does Not Compose

Benchmarking Large Language Models for Compositional Formal Verification

The first benchmark explicitly designed to evaluate LLMs on compositional specification generation, revealing a striking 92-percentage-point gap between syntax correctness and verification success.

300 Verified Programs
13 SOTA Models
92% Performance Gap
The Compositional Gap: 99% syntax correctness vs. 7% verification success

Abstract

Despite rapid advances in code generation, current Large Language Models (LLMs) still lack an essential capability for reliable and verifiable code generation: compositional reasoning across multi-function programs.

To probe this important gap, we introduce DafnyComp, a benchmark designed to systematically evaluate LLMs on the generation of compositional specifications in Dafny. Unlike prior benchmarks that primarily target single-function annotation, DafnyComp focuses on programs composed of multiple interacting functions with data dependencies that cross function boundaries.

Key Findings

  • Syntax vs. Semantics Gap: 99% syntax correctness but only 7% verification success
  • Universal Failure: All 13 evaluated SOTA models show similar compositional breakdown
  • Three Failure Modes: Specification fragility (39.2%), implementation-proof misalignment (21.7%), and reasoning instability (14.1%)

  • Specification Fragility (39.2%): contract propagation failures
  • Implementation-Proof Misalignment (21.7%): independent generation pathways
  • Reasoning Instability (14.1%): inductive chain breakdown

The DafnyComp Benchmark

300 Programs

Mechanically verified multi-function Dafny programs

2-5 Functions

Chain-based compositions with data dependencies

15 Categories

Spanning dynamic programming, graphs, strings, and more

7 Invariants

Median loop invariants per program (3.5× DafnyBench)

Two-Stage Synthesis Pipeline

Stage 1: Program Assembly

Systematic combination of functions from LeetCodeDataset with controlled data flow

  • McCabe complexity filtering (>5)
  • Chain-based composition
  • Type constraint propagation
  • Test validation pipeline

Stage 2: Formal Translation

Incremental AST-guided transformation to verified Dafny

  • Fragment-by-fragment translation
  • Immediate verification
  • Up to 10 refinement iterations
  • 47% overall success rate

Evaluation Task Format

Given (Stripped Contracts)

method digitSum(n: int) returns (sum: int)
{
    var temp := n;
    sum := 0;
    while temp > 0 {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

Required (Complete Specification)

method digitSum(n: int) returns (sum: int)
    requires n >= 0
    ensures sum >= 0
    ensures sum == /* digit sum of n */
{
    var temp := n;
    sum := 0;
    while temp > 0
        invariant temp >= 0
        invariant sum >= 0
        decreases temp
    {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

Models must regenerate requires, ensures, reads, modifies, and decreases clauses, along with the loop invariants shown above, to enable verification across function boundaries.
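For concreteness, here is one way the placeholder ensures clause above can be made checkable: name the mathematical digit sum with a helper function. This is a minimal sketch, not the benchmark's reference annotation, and the helper DigitSum is an assumption introduced only for illustration.

// Illustrative completion; DigitSum is a hypothetical helper, not part of the task input.
function DigitSum(n: int): int
    requires n >= 0
    decreases n
{
    if n == 0 then 0 else (n % 10) + DigitSum(n / 10)
}

method digitSum(n: int) returns (sum: int)
    requires n >= 0
    ensures sum >= 0
    ensures sum == DigitSum(n)
{
    var temp := n;
    sum := 0;
    while temp > 0
        invariant 0 <= temp <= n
        invariant sum >= 0
        // Binding invariant: digits already summed plus digits remaining
        // equal the specified total.
        invariant sum + DigitSum(temp) == DigitSum(n)
        decreases temp
    {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

The third invariant is what ties the loop to the postcondition; the first two are maintained on their own but say nothing about the final value of sum.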

Experimental Results

The Verification Collapse

Despite achieving near-perfect syntax correctness (99%), all evaluated models exhibit catastrophic verification failure, with the best model reaching only 7% verification success.

13 State-of-the-Art Models Evaluated

OpenAI

  • GPT-4o
  • GPT-4.1
  • o4-mini

Anthropic

  • Claude-3.5-Sonnet
  • Claude-4-Sonnet

Google

  • Gemini-2.5-Pro
  • Gemini-2.5-Flash

DeepSeek

  • DeepSeek-R1
  • DeepSeek-V3
  • DeepSeek-V3.1

Alibaba

  • Qwen2.5-Coder-32B
  • Qwen3-Coder-480B
  • QwQ-32B

Performance Comparison: Single vs. Multi-Function

DafnyBench (Single Function): 58%, moderate verification success on isolated functions

DafnyComp (Multi-Function): 7%, catastrophic failure on compositional tasks

A 3.2× increase in function count is accompanied by a 14.4× drop in performance: super-linear scaling that exposes the compositional breakdown.

Four Key Observations

1. Universal Verification Failure

92-percentage-point gap between syntax and verification success across all models, independent of scale, specialization, or architecture.

2. Non-linear Scaling

Super-linear complexity scaling suggests combinatorial specification dependencies that models fail to capture.

3. Sampling Saturation

Verification performance plateaus by Pass@4, indicating architectural rather than search limitations.

4. No Reasoning Advantage

Specialized reasoning models show no improvement, confirming architectural barriers rather than optimization issues.

Failure Analysis & Insights

Through systematic analysis of 900 verification failures, we identified three primary failure modes that reveal fundamental limitations in how transformers process compositional specifications.

Specification Fragility

39.2% (353/900 cases)

The Domino Effect: Missing contract propagation causes cascading failures.

Example Scenario

A digitSum function missing ensures result >= 0 appears correct in isolation, but it invalidates the entire pipeline when its output feeds a downstream function that expects a non-negative input.

Key Insight: LLMs learn specifications as local patterns rather than global contracts, lacking architectural machinery for cross-boundary reasoning.
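A minimal sketch of the domino effect (hypothetical code, not drawn from the benchmark; isSingleDigit and pipeline are invented consumers, and the missing contract is written ensures sum >= 0 to match the digitSum signature shown earlier). The chain below verifies only because digitSum exports sum >= 0; deleting that single clause leaves every method locally well-formed yet makes the downstream call unprovable.

method digitSum(n: int) returns (sum: int)
    requires n >= 0
    ensures sum >= 0   // the propagated contract; omit it and pipeline below fails
{
    var temp := n;
    sum := 0;
    while temp > 0
        invariant temp >= 0
        invariant sum >= 0
        decreases temp
    {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

method isSingleDigit(m: int) returns (b: bool)
    requires m >= 0   // hypothetical downstream obligation
{
    b := m < 10;
}

method pipeline(n: int) returns (b: bool)
    requires n >= 0
{
    var s := digitSum(n);
    b := isSingleDigit(s);   // discharged only via digitSum's ensures clause
}

In this sketch the failure is non-local: the error Dafny reports points at the call inside pipeline, one function away from the clause that is actually missing.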

Implementation-Proof Misalignment

21.7% (195/900 cases)

The Independence Assumption: Code and specifications generated as independent tasks.

Example Scenario

Loop invariants such as forall k :: 0 <= k < i ==> cnt[k] >= 0 look plausible but fail verification because they do not match the implementation's actual array access pattern.

Key Insight: Transformer attention mechanisms excel at local dependencies but struggle with bidirectional constraints between specifications and implementations.
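The sketch below (hypothetical, not taken from the benchmark; countDigits is an invented example) makes the mismatch concrete: the commented-out invariant quantifies over the loop counter i, while the body actually writes cnt[s[i]]; only the invariant that ranges over the cells the implementation really touches is provable.

method countDigits(s: seq<int>) returns (cnt: array<int>)
    requires forall k :: 0 <= k < |s| ==> 0 <= s[k] < 10
    ensures cnt.Length == 10
    ensures forall k :: 0 <= k < 10 ==> cnt[k] >= 0
{
    cnt := new int[10](j => 0);
    var i := 0;
    while i < |s|
        invariant 0 <= i <= |s|
        // Plausible-looking but misaligned: ranges over the loop counter
        // rather than the cells the body writes, and does not verify:
        //   invariant forall k :: 0 <= k < i ==> cnt[k] >= 0
        // Aligned with the implementation's access pattern:
        invariant forall k :: 0 <= k < 10 ==> cnt[k] >= 0
        decreases |s| - i
    {
        cnt[s[i]] := cnt[s[i]] + 1;
        i := i + 1;
    }
}

Deriving the invariant from the surface pattern of the loop header (a bound on k tied to i) instead of from the assignment cnt[s[i]] := ... is exactly the independence this failure mode describes.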

Reasoning Instability

14.1% (127/900 cases)

Induction as Achilles' Heel: Failure to maintain inductive reasoning chains.

Example Scenario

Loop invariants break because models cannot track how program state evolves through iterations. Properties proven for base cases fail to extend inductively.

Key Insight: Statistical approximation is insufficient for formal verification, which demands precise inductive proofs rather than pattern matching.
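As a small illustration of what an inductive chain has to carry (hypothetical code, not from the benchmark; findIndex is an invented example): the quantified invariant below accumulates a fact about every iteration seen so far, and the postcondition follows only because that fact is re-established on each pass. An invariant that merely restates the base case, such as 0 <= idx, is maintained but proves nothing at exit.

method findIndex(s: seq<int>, x: int) returns (idx: int)
    ensures 0 <= idx <= |s|
    ensures idx < |s| ==> s[idx] == x
    ensures forall k :: 0 <= k < idx ==> s[k] != x
{
    idx := 0;
    while idx < |s|
        invariant 0 <= idx <= |s|
        // The inductive link: everything scanned so far is not x.
        invariant forall k :: 0 <= k < idx ==> s[k] != x
        decreases |s| - idx
    {
        if s[idx] == x { return; }
        idx := idx + 1;
    }
}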

Broader Implications

Architectural Limitation

Current transformer architectures lack the inductive biases necessary for compositional reasoning in formal verification.

Training Paradigm

Existing training methods fail to enforce consistency between code and specifications, treating them as independent generation tasks.

Scaling Challenges

Simply increasing model size or compute does not address the fundamental compositional reasoning deficit.

Citation

@article{dafnycomp2025,
    title={Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification},
    author={Xu Xu and Xin Li and Xingwei Qu and Jie Fu and Binhang Yuan},
    journal={ArXiv},
    year={2025},
}