"A benchmark for catching when code doesn't do what its documentation claims"

wpnews.pro

cd /news/ai-safety/a-benchmark-for-catching-when-code-d… · home › topics › ai-safety › article

[ARTICLE · art-27168] src=github.com ↗ pub=2026-06-14T18:25Z topic=ai-safety verified=true sentiment=· neutral

"A benchmark for catching when code doesn't do what its documentation claims"

A new open-source benchmark, Truth Benchmark, automatically detects when code does not match its documentation claims. The project provides a dataset of 52 labeled examples across multiple programming languages and baseline models, including rule-based, semantic, and LLM approaches, with the best accuracy reaching 75%. It aims to catch documentation drift that can lead to security vulnerabilities and AI training data inaccuracies.

read4 min views23 publishedJun 14, 2026

Does the code actually do what it says?

Software documentation makes claims. Code tells the truth.

This project builds tools to catch the gap — automatically.

Every piece of software has two parts:

What it says it does— documentation, README, comments, AI-generated descriptions** What it actually does**— the code

These drift apart constantly. A developer fixes the code, forgets the docs. A security claim stays in the README after the protection was removed. An AI describes what code should do, not what it does.

This causes real harm. Security teams audit documentation, not code. AI models trained on documentation inherit the lies.

This project builds a benchmark to catch it.

Not a developer? See

[PLAIN_ENGLISH.md]— the problem explained without code.

git clone https://github.com/02zerocool/truth-benchmark
cd truth-benchmark
pip install pandas
python pipeline.py

Test a single claim:

python predict.py "deletes files" "os.remove(path)"

python predict.py "encrypts password before storing" "db.save(user, password)"

Verified on the full 52-example dataset. These are the numbers to beat.

Baseline	Accuracy	MATCH F1	LIE F1	Notes
Rules (no ML)	63.5%	0.72	0.46	Zero dependencies. The floor.
Semantic (all-MiniLM-L6-v2)	71.2%	0.74	0.68	80MB model, CPU-friendly.
LLM (llama3:8b via Ollama)	75.0%	0.79	0.68	Best accuracy. Misses subtle numeric lies.

The gap between rules and LLM is 11.5 points. The gap between LLM and perfect is 25 points.

The hardest cases: wrong constants, off-by-one errors, wrong sort direction, missing auth checks.

python baseline_rules.py

Heuristic pattern matching. No ML. Runs anywhere. This is the floor — beat it.

pip install sentence-transformers
python baseline_semantic.py

Embeds claim and code with all-MiniLM-L6-v2

, measures cosine similarity.

Downloads model automatically on first run.

ollama pull llama3.1:8b
python baseline_llm.py

LLM_API_URL=https://api.openai.com/v1 LLM_API_KEY=your_key LLM_MODEL=gpt-4o-mini python baseline_llm.py

Plug in your own model with two lines:

from evaluate import evaluate, print_report

def my_verifier(claim: str, code: str) -> str:
    return "MATCH" or "LIE"

results = evaluate(my_verifier)
print_report(results, name="My Model")

Output:

Total examples : 52 Correct : 38 Accuracy : 73.1%

[MATCH] precision=0.71 recall=0.89 f1=0.79 [LIE] precision=0.79 recall=0.54 f1=0.64

Failures: truth=LIE predicted=MATCH claim : returns unique items preserving order


`dataset.csv`

— 52 labeled examples across Python, JavaScript, Go, Rust, Java, C#, SQL.

| claim | code | label |
|---|---|---|
| deletes files | `os.remove(path)` |
MATCH |
| deletes files | `open(path,'w').write('')` |
LIE |
| encrypts password before storing | `db.save(user, bcrypt.hash(password, 12))` |
MATCH |
| encrypts password before storing | `db.save(user, password)` |
LIE |
| sends data over encrypted connection | `requests.get('https://' + url)` |
MATCH |
| sends data over encrypted connection | `requests.get('http://' + url)` |
LIE |

Coverage includes:

- Wrong operators and constants
- Missing encryption, validation, authentication
- Wrong sort direction, wrong protocol
- Subtle traps: byte length vs character length,
`set()`

vs`dict.fromkeys()`

for order-preserving uniqueness, soft-delete vs hard-delete,`indexOf > 0`

missing index 0

Add examples. Any language. Any domain. All pull requests welcome.

**What makes a good contribution:**

- Both a MATCH and a LIE for the same claim
- The LIE is plausible — something a real developer might write
- Claim is plain English, code is a short snippet

**Most wanted — subtle lies:**

Cases where the code almost does the right thing but doesn't quite.

These are the hardest to catch and most dangerous in production.

**Most wanted — security examples:**

Missing encryption, missing validation, wrong protocol, missing auth check.

Documentation that lies about security is a vulnerability.

See [CONTRIBUTING.md](/02zerocool/truth-benchmark/blob/main/CONTRIBUTING.md) for format details.

truth-benchmark/ ├── dataset.csv # Labeled examples ├── pipeline.py # Bare minimum: load and inspect the dataset ├── evaluate.py # Evaluation framework — plug in any verifier ├── baseline_rules.py # Baseline 1: heuristic rules, no ML ├── baseline_semantic.py # Baseline 2: semantic similarity ├── baseline_llm.py # Baseline 3: LLM-based (Ollama or API) ├── predict.py # CLI: verify a single claim/code pair ├── requirements.txt # Dependencies (pandas required, rest optional) ├── PLAIN_ENGLISH.md # Explanation for non-developers └── CONTRIBUTING.md # How to add examples


The same gap exists everywhere:

| What they say | What actually happens |
|---|---|
| "Your data is deleted on request" | Retained in backups for years |
| "This AI has no bias" | Trained on curated data with known gaps |
| "This system is independently audited" | Audited by a subsidiary |
| "Encrypted end-to-end" | Encrypted in transit, plain text at rest |

Code is verifiable. Documentation is a claim.

We are building a tool that checks claims against evidence.

That is a universal need.

MIT — free to use, fork, extend, deploy. No restrictions.

**Built with Seven (Claude Sonnet 4.6)**

The framework design, dataset construction, all three baselines, and documentation were developed in active collaboration with Seven. The problem statement and direction came from the human. The implementation was built together.

source & further reading

github.com — original article

~/api · this article 200

$curl api.wpnews.pro/v1/news/a-benchmark-for-catching…

Read original on github.com → github.com/02zerocool/truth-benchmark

mentioned entities

Truth Benchmark

Ollama

all-MiniLM-L6-v2

llama3:8b

gpt-4o-mini

Python

JavaScript

SQL

metadata

sluga-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims

topic#ai-safety

secondary3 topics

sentimentneutral

canonicalgithub.com

navigation

← prevWelcome to the AGI era of AI gov…

next →Guardian Angels: LLM Personaliza…

── more in #ai-safety 4 stories · sorted by recency

github.com · 30 Jul · #ai-safety

OpenMetaHarness

dev.to · 30 Jul · #ai-safety

How to Audit Your MCP Servers for Security Risks

dev.to · 30 Jul · #ai-safety

Beyond Vibe Coding: Engineering Resilient AI Agents with FSMs, Privacy, and Cost Controls

promptcube3.com · 29 Jul · #ai-safety

Continue extension setup, Qwen Coder local setup

── more on @truth benchmark 3 stories trending now

wpnews · 29 Jul · #ai-safety

News Summary for July 29, 2026

wpnews · 29 Jul · #artificial-intelligence

Investors are selling Meta as it heads to its earnings report

wpnews · 28 Jul · #large-language-models

How to Download and Run Kimi K3 Open Weights

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required