DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Researchers introduced DeFAb, a benchmark for defeasible abduction in foundation models that converts public knowledge bases into 372,648+ logically verifiable instances. The best frontier language model reached 65% accuracy at best, dropping to 23.5% under rendering-robust evaluation (the worst case over four surface renderings), while a rule-based symbolic solver reached 100% in under 50 microseconds per instance. The benchmark is designed to measure disciplined theory revision rather than fluent prose, and its verifier can also serve as an exact reward signal for preference optimization.
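The robust evaluation figure is described in the abstract as the worst case over four surface renderings. The sketch below shows one plausible way such a score could be computed. It assumes the worst case is taken per instance, so an instance counts only if the model is correct under every rendering; the source does not spell out the aggregation. The function names and the toy outcomes are hypothetical and are not taken from the DeFAb release.

```python
from typing import Dict, List


def mean_accuracy(results: Dict[str, List[bool]]) -> float:
    """Plain accuracy: fraction of (instance, rendering) pairs answered correctly."""
    total = sum(len(outcomes) for outcomes in results.values())
    if total == 0:
        raise ValueError("no outcomes to score")
    correct = sum(sum(outcomes) for outcomes in results.values())
    return correct / total


def rendering_robust_accuracy(results: Dict[str, List[bool]]) -> float:
    """Worst-case accuracy (assumed per-instance): an instance counts only
    if every rendering of it was answered correctly."""
    if not results:
        raise ValueError("no instances to score")
    robust_hits = sum(1 for outcomes in results.values() if all(outcomes))
    return robust_hits / len(results)


# Hypothetical outcomes: one list of four booleans per instance,
# one boolean per surface rendering of that instance.
toy_results = {
    "inst-1": [True, True, True, True],
    "inst-2": [True, False, True, True],
    "inst-3": [False, False, False, False],
    "inst-4": [True, True, True, True],
}

print(mean_accuracy(toy_results))              # 0.6875
print(rendering_robust_accuracy(toy_results))  # 0.5
```

Under this reading, a single failed rendering removes the instance from the robust score, which is why the robust number can sit well below the plain accuracy.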

Published Jun 18, 2026

arXiv:2606.18557v1

Abstract: A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, with a judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.
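To make the three checks named in the abstract concrete, here is a toy sketch of what verifying a hypothesis for derivation, conservativity, and minimality might look like. It is not the DeFAb verifier: the theory encoding (one-step default rules plus "defeater" exceptions), every function name, and the bird/penguin data are assumptions made for illustration. The minimality check brute-forces all proper subsets, which is fine for a toy but is not the polynomial-time procedure the paper refers to.

```python
from itertools import combinations
from typing import FrozenSet, Set, Tuple

Atom = Tuple[str, str]      # (predicate, individual), e.g. ("bird", "opus")
Default = Tuple[str, str]   # (premise predicate, expected predicate)
Defeater = Tuple[str, str]  # (exception predicate, blocked expected predicate)


def expectations(facts: Set[Atom], defaults: Set[Default],
                 defeaters: FrozenSet[Defeater]) -> Set[Atom]:
    """One-step default reasoning: apply each default unless a defeater blocks it."""
    derived = set()
    for premise, conclusion in defaults:
        for pred, individual in facts:
            if pred != premise:
                continue
            blocked = any(target == conclusion and (exc, individual) in facts
                          for exc, target in defeaters)
            if not blocked:
                derived.add((conclusion, individual))
    return derived


def explains_and_conserves(facts, defaults, hypothesis, anomaly) -> bool:
    """The anomalous expectation must disappear; all other expectations must survive."""
    before = expectations(facts, defaults, frozenset())
    after = expectations(facts, defaults, hypothesis)
    explains = anomaly in before and anomaly not in after
    conservative = (before - {anomaly}) <= after
    return explains and conservative


def verify(facts, defaults, hypothesis, anomaly) -> bool:
    """Accept only hypotheses that explain, conserve, and have no smaller working subset."""
    if not explains_and_conserves(facts, defaults, hypothesis, anomaly):
        return False
    for size in range(len(hypothesis)):  # proper subsets only (toy brute force)
        for subset in combinations(hypothesis, size):
            if explains_and_conserves(facts, defaults, frozenset(subset), anomaly):
                return False
    return True


def reward(facts, defaults, hypothesis, anomaly) -> float:
    """Exact 0/1 reward derived from the verifier, usable to rank candidate answers."""
    return 1.0 if verify(facts, defaults, hypothesis, anomaly) else 0.0


# Hypothetical toy theory: birds are expected to fly, but opus is observed not to.
facts = {("bird", "tweety"), ("bird", "opus"), ("penguin", "opus")}
defaults = {("bird", "flies")}
anomaly = ("flies", "opus")

good = frozenset({("penguin", "flies")})                           # targeted exception
too_broad = frozenset({("bird", "flies")})                         # also grounds tweety
bloated = frozenset({("penguin", "flies"), ("penguin", "swims")})  # not minimal

print(reward(facts, defaults, good, anomaly))       # 1.0
print(reward(facts, defaults, too_broad, anomaly))  # 0.0
print(reward(facts, defaults, bloated, anomaly))    # 0.0
```

In a preference-optimization setting, a candidate scored 1.0 could serve as the chosen response and one scored 0.0 as the rejected response for the same prompt; how DeFAb actually builds such pairs is not described in the abstract.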

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.18557
