Local Success Does Not Compose

Benchmarking Large Language Models for Compositional Formal Verification

The first benchmark explicitly designed to evaluate LLMs on compositional specification generation, revealing a striking 92-percentage-point gap between syntax correctness and verification success.

300 Verified Programs
13 SOTA Models
92% Performance Gap
The Compositional Gap: 99% syntax correctness vs. 7% verification success

Abstract

Despite rapid advances in code generation, current Large Language Models (LLMs) still lack an essential capability for reliable and verifiable code generation: compositional reasoning across multi-function programs.

To probe this important gap, we introduce DafnyComp, a benchmark designed to systematically evaluate LLMs on the generation of compositional specifications in Dafny. Unlike prior benchmarks that primarily target single-function annotation, DafnyComp focuses on programs composed of multiple interacting functions with data dependencies that cross function boundaries.

Key Findings

  • Syntax vs. Semantics Gap: 99% syntax correctness but only 7% verification success
  • Universal Failure: All 13 evaluated SOTA models show similar compositional breakdown
  • Three Failure Modes: Specification fragility (39.2%), implementation-proof misalignment (21.7%), and reasoning instability (14.1%)

  • Specification Fragility (39.2%): contract propagation failures
  • Implementation-Proof Misalignment (21.7%): independent generation pathways
  • Reasoning Instability (14.1%): inductive chain breakdown

The DafnyComp Benchmark

300 Programs

Mechanically verified multi-function Dafny programs

2-5 Functions

Chain-based compositions with data dependencies

15 Categories

Spanning dynamic programming, graphs, strings, and more

7 Invariants

Median loop invariants per program (3.5× DafnyBench)

Two-Stage Synthesis Pipeline

Stage 1: Program Assembly

Systematic combination of functions from LeetCodeDataset with controlled data flow

  • McCabe complexity filtering (>5)
  • Chain-based composition
  • Type constraint propagation
  • Test validation pipeline

Stage 2: Formal Translation

Incremental AST-guided transformation to verified Dafny

  • Fragment-by-fragment translation
  • Immediate verification
  • Up to 10 refinement iterations
  • 47% overall success rate

Evaluation Task Format

Given (Stripped Contracts)

method digitSum(n: int) returns (sum: int)
{
    var temp := n;
    sum := 0;
    while temp > 0 {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

Required (Complete Specification)

method digitSum(n: int) returns (sum: int)
    requires n >= 0
    ensures sum >= 0
    ensures sum == /* digit sum of n */
{
    var temp := n;
    sum := 0;
    while temp > 0
        invariant temp >= 0
        invariant sum >= 0
        decreases temp
    {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

Models must regenerate requires, ensures, reads, modifies, and decreases clauses, along with the loop invariants shown above, to enable verification across function boundaries.
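For concreteness, here is one way the placeholder ensures clause above can be made checkable: name the mathematical digit sum with a helper function. This is a minimal sketch, not the benchmark's reference annotation, and the helper DigitSum is an assumption introduced only for illustration.

// Illustrative completion; DigitSum is a hypothetical helper, not part of the task input.
function DigitSum(n: int): int
    requires n >= 0
    decreases n
{
    if n == 0 then 0 else (n % 10) + DigitSum(n / 10)
}

method digitSum(n: int) returns (sum: int)
    requires n >= 0
    ensures sum >= 0
    ensures sum == DigitSum(n)
{
    var temp := n;
    sum := 0;
    while temp > 0
        invariant 0 <= temp <= n
        invariant sum >= 0
        // Binding invariant: digits already summed plus digits remaining
        // equal the specified total.
        invariant sum + DigitSum(temp) == DigitSum(n)
        decreases temp
    {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

The third invariant is what ties the loop to the postcondition; the first two are maintained on their own but say nothing about the final value of sum.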

Experimental Results

The Verification Collapse

Despite achieving near-perfect syntax correctness (99%), all evaluated models exhibit catastrophic verification failure, with the best model reaching only 7% verification success.

13 State-of-the-Art Models Evaluated

OpenAI

  • GPT-4o
  • GPT-4.1
  • o4-mini

Anthropic

  • Claude-3.5-Sonnet
  • Claude-4-Sonnet

Google

  • Gemini-2.5-Pro
  • Gemini-2.5-Flash

DeepSeek

  • DeepSeek-R1
  • DeepSeek-V3
  • DeepSeek-V3.1

Alibaba

  • Qwen2.5-Coder-32B
  • Qwen3-Coder-480B
  • QwQ-32B

Performance Comparison: Single vs. Multi-Function

DafnyBench (Single Function): 58%, moderate verification success on isolated functions

DafnyComp (Multi-Function): 7%, catastrophic failure on compositional tasks

A 3.2× increase in function count is accompanied by a 14.4× drop in performance: super-linear scaling that exposes the compositional breakdown.

Four Key Observations

1. Universal Verification Failure

92-percentage-point gap between syntax and verification success across all models, independent of scale, specialization, or architecture.

2. Non-linear Scaling

Super-linear complexity scaling suggests combinatorial specification dependencies that models fail to capture.

3. Sampling Saturation

Verification performance plateaus by Pass@4, indicating architectural rather than search limitations.

4. No Reasoning Advantage

Specialized reasoning models show no improvement, confirming architectural barriers rather than optimization issues.

Failure Analysis & Insights

Through systematic analysis of 900 verification failures, we identified three primary failure modes that reveal fundamental limitations in how transformers process compositional specifications.

Specification Fragility

39.2% (353/900 cases)

The Domino Effect: Missing contract propagation causes cascading failures.

Example Scenario

A digitSum function missing ensures result >= 0 appears correct in isolation, but it invalidates the entire pipeline when its output feeds a downstream function that expects a non-negative input.

Key Insight: LLMs learn specifications as local patterns rather than global contracts, lacking architectural machinery for cross-boundary reasoning.
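A minimal sketch of the domino effect (hypothetical code, not drawn from the benchmark; isSingleDigit and pipeline are invented consumers, and the missing contract is written ensures sum >= 0 to match the digitSum signature shown earlier). The chain below verifies only because digitSum exports sum >= 0; deleting that single clause leaves every method locally well-formed yet makes the downstream call unprovable.

method digitSum(n: int) returns (sum: int)
    requires n >= 0
    ensures sum >= 0   // the propagated contract; omit it and pipeline below fails
{
    var temp := n;
    sum := 0;
    while temp > 0
        invariant temp >= 0
        invariant sum >= 0
        decreases temp
    {
        sum := sum + (temp % 10);
        temp := temp / 10;
    }
}

method isSingleDigit(m: int) returns (b: bool)
    requires m >= 0   // hypothetical downstream obligation
{
    b := m < 10;
}

method pipeline(n: int) returns (b: bool)
    requires n >= 0
{
    var s := digitSum(n);
    b := isSingleDigit(s);   // discharged only via digitSum's ensures clause
}

In this sketch the failure is non-local: the error Dafny reports points at the call inside pipeline, one function away from the clause that is actually missing.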

Implementation-Proof Misalignment

21.7% (195/900 cases)

The Independence Assumption: Code and specifications generated as independent tasks.

Example Scenario

Loop invariants such as forall k :: 0 <= k < i ==> cnt[k] >= 0 look plausible but fail verification because they do not match the implementation's actual array access pattern.

Key Insight: Transformer attention mechanisms excel at local dependencies but struggle with bidirectional constraints between specifications and implementations.
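The sketch below (hypothetical, not taken from the benchmark; countDigits is an invented example) makes the mismatch concrete: the commented-out invariant quantifies over the loop counter i, while the body actually writes cnt[s[i]]; only the invariant that ranges over the cells the implementation really touches is provable.

method countDigits(s: seq<int>) returns (cnt: array<int>)
    requires forall k :: 0 <= k < |s| ==> 0 <= s[k] < 10
    ensures cnt.Length == 10
    ensures forall k :: 0 <= k < 10 ==> cnt[k] >= 0
{
    cnt := new int[10](j => 0);
    var i := 0;
    while i < |s|
        invariant 0 <= i <= |s|
        // Plausible-looking but misaligned: ranges over the loop counter
        // rather than the cells the body writes, and does not verify:
        //   invariant forall k :: 0 <= k < i ==> cnt[k] >= 0
        // Aligned with the implementation's access pattern:
        invariant forall k :: 0 <= k < 10 ==> cnt[k] >= 0
        decreases |s| - i
    {
        cnt[s[i]] := cnt[s[i]] + 1;
        i := i + 1;
    }
}

Deriving the invariant from the surface pattern of the loop header (a bound on k tied to i) instead of from the assignment cnt[s[i]] := ... is exactly the independence this failure mode describes.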

Reasoning Instability

14.1% (127/900 cases)

Induction as Achilles' Heel: Failure to maintain inductive reasoning chains.

Example Scenario

Loop invariants break because models cannot track how program state evolves through iterations. Properties proven for base cases fail to extend inductively.

Key Insight: Statistical approximation is insufficient for formal verification, which demands precise inductive proofs rather than pattern matching.
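As a small illustration of what an inductive chain has to carry (hypothetical code, not from the benchmark; findIndex is an invented example): the quantified invariant below accumulates a fact about every iteration seen so far, and the postcondition follows only because that fact is re-established on each pass. An invariant that merely restates the base case, such as 0 <= idx, is maintained but proves nothing at exit.

method findIndex(s: seq<int>, x: int) returns (idx: int)
    ensures 0 <= idx <= |s|
    ensures idx < |s| ==> s[idx] == x
    ensures forall k :: 0 <= k < idx ==> s[k] != x
{
    idx := 0;
    while idx < |s|
        invariant 0 <= idx <= |s|
        // The inductive link: everything scanned so far is not x.
        invariant forall k :: 0 <= k < idx ==> s[k] != x
        decreases |s| - idx
    {
        if s[idx] == x { return; }
        idx := idx + 1;
    }
}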

Broader Implications

Architectural Limitation

Current transformer architectures lack the inductive biases necessary for compositional reasoning in formal verification.

Training Paradigm

Existing training methods fail to enforce consistency between code and specifications, treating them as independent generation tasks.

Scaling Challenges

Simply increasing model size or compute does not address the fundamental compositional reasoning deficit.

Citation

@article{dafnycomp2025,
    title={Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification},
    author={Xu Xu and Xin Li and Xingwei Qu and Jie Fu and Binhang Yuan},
    journal={ArXiv},
    year={2025},
}