"A benchmark for catching when code doesn't do what its documentation claims"

A new open-source project, Truth Benchmark, measures how well tools detect code that does not match what its documentation claims. It provides a dataset of 52 labeled examples across multiple programming languages, plus three baselines (rule-based, semantic, and LLM), the best of which reaches 75% accuracy. It aims to catch documentation drift that can lead to security vulnerabilities and inaccurate AI training data.

Does the code actually do what it says?

Software documentation makes claims. Code tells the truth. This project builds tools to catch the gap automatically.

Every piece of software has two parts:

- What it says it does: documentation, README, comments, AI-generated descriptions
- What it actually does: the code

These drift apart constantly. A developer fixes the code, forgets the docs. A security claim stays in the README after the protection was removed. An AI describes what code should do, not what it does.

This causes real harm. Security teams audit documentation, not code. AI models trained on documentation inherit the lies. This project builds a benchmark to catch it.

Not a developer? See PLAIN_ENGLISH.md, which explains the problem without code.

```bash
git clone https://github.com/02zerocool/truth-benchmark
cd truth-benchmark
pip install pandas
python pipeline.py
```

Test a single claim:

```bash
python predict.py "deletes files" "os.remove(path)"
# Result: MATCH

python predict.py "encrypts password before storing" "db.save(user, password)"
# Result: LIE
```

Verified on the full 52-example dataset. These are the numbers to beat.

| Baseline | Accuracy | MATCH F1 | LIE F1 | Notes |
|---|---|---|---|---|
| Rules (no ML) | 63.5% | 0.72 | 0.46 | Zero dependencies. The floor. |
| Semantic (all-MiniLM-L6-v2) | 71.2% | 0.74 | 0.68 | 80MB model, CPU-friendly. |
| LLM (llama3:8b via Ollama) | 75.0% | 0.79 | 0.68 | Best accuracy. Misses subtle numeric lies. |

The gap between rules and LLM is 11.5 points. The gap between LLM and perfect is 25 points. The hardest cases: wrong constants, off-by-one errors, wrong sort direction, missing auth checks.

```bash
python baseline_rules.py
```

Heuristic pattern matching. No ML. Runs anywhere. This is the floor: beat it.

```bash
pip install sentence-transformers
python baseline_semantic.py
```

Embeds claim and code with `all-MiniLM-L6-v2` and measures cosine similarity. Downloads the model automatically on first run.
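The semantic approach described above (embed both texts, compare by cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the code in `baseline_semantic.py`: the function names, the `embed` parameter, and the `0.5` threshold are all assumptions made for the example.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_verdict(
    claim: str,
    code: str,
    embed: Callable[[str], Vector],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the project
) -> str:
    """Label a pair MATCH when the embeddings are similar enough, otherwise LIE."""
    score = cosine_similarity(embed(claim), embed(code))
    return "MATCH" if score >= threshold else "LIE"
```

With `sentence-transformers` installed, `embed` could be `SentenceTransformer("all-MiniLM-L6-v2").encode`. Passing the embedding function as a parameter keeps the similarity logic testable without downloading a model.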
With Ollama (local, free, offline):

```bash
ollama pull llama3.1:8b
python baseline_llm.py
```

With any OpenAI-compatible API:

```bash
LLM_API_URL=https://api.openai.com/v1 \
LLM_API_KEY=your_key \
LLM_MODEL=gpt-4o-mini \
python baseline_llm.py
```

Plug in your own model with two lines:

```python
from evaluate import evaluate, print_report

def my_verifier(claim: str, code: str) -> str:
    # your model here
    return "MATCH"  # or "LIE"

results = evaluate(my_verifier)
print_report(results, name="My Model")
```

Output:

```
==================================================
  My Model
==================================================
  Total examples : 52
  Correct        : 38
  Accuracy       : 73.1%

  MATCH  precision=0.71  recall=0.89  f1=0.79
  LIE    precision=0.79  recall=0.54  f1=0.64

  Failures:
    truth=LIE  predicted=MATCH
    claim : returns unique items preserving order
    code  : return list(set(items))
==================================================
```

`dataset.csv`: 52 labeled examples across Python, JavaScript, Go, Rust, Java, C, SQL.

| claim | code | label |
|---|---|---|
| deletes files | `os.remove(path)` | MATCH |
| deletes files | `open(path,'w').write('')` | LIE |
| encrypts password before storing | `db.save(user, bcrypt.hash(password, 12))` | MATCH |
| encrypts password before storing | `db.save(user, password)` | LIE |
| sends data over encrypted connection | `requests.get('https://' + url)` | MATCH |
| sends data over encrypted connection | `requests.get('http://' + url)` | LIE |

Coverage includes:

- Wrong operators and constants
- Missing encryption, validation, authentication
- Wrong sort direction, wrong protocol
- Subtle traps: byte length vs character length, `set` vs `dict.fromkeys` for order-preserving uniqueness, soft-delete vs hard-delete, `indexOf(...) > 0` missing index 0

Add examples. Any language. Any domain. All pull requests welcome.
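To make one of the subtle traps from the coverage list concrete, here is a MATCH/LIE pair for the claim "returns unique items preserving order". The function names are illustrative only and are not taken from the dataset.

```python
def unique_unordered(items):
    # LIE for the claim "returns unique items preserving order":
    # a set has no defined order, so the original order can be lost.
    return list(set(items))


def unique_ordered(items):
    # MATCH: dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))
```

Both functions return the same set of elements, which is why the LIE is plausible: only the ordering guarantee differs, and a quick manual test on small inputs may not expose it.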
What makes a good contribution:

- Both a MATCH and a LIE for the same claim
- The LIE is plausible: something a real developer might write
- Claim is plain English, code is a short snippet

**Most wanted: subtle lies.** Cases where the code almost does the right thing but doesn't quite. These are the hardest to catch and most dangerous in production.

**Most wanted: security examples.** Missing encryption, missing validation, wrong protocol, missing auth check. Documentation that lies about security is a vulnerability.

See [CONTRIBUTING.md](/02zerocool/truth-benchmark/blob/main/CONTRIBUTING.md) for format details.

```
truth-benchmark/
├── dataset.csv            Labeled examples
├── pipeline.py            Bare minimum: load and inspect the dataset
├── evaluate.py            Evaluation framework: plug in any verifier
├── baseline_rules.py      Baseline 1: heuristic rules, no ML
├── baseline_semantic.py   Baseline 2: semantic similarity
├── baseline_llm.py        Baseline 3: LLM-based (Ollama or API)
├── predict.py             CLI: verify a single claim/code pair
├── requirements.txt       Dependencies (pandas required, rest optional)
├── PLAIN_ENGLISH.md       Explanation for non-developers
└── CONTRIBUTING.md        How to add examples
```

The same gap exists everywhere:

| What they say | What actually happens |
|---|---|
| "Your data is deleted on request" | Retained in backups for years |
| "This AI has no bias" | Trained on curated data with known gaps |
| "This system is independently audited" | Audited by a subsidiary |
| "Encrypted end-to-end" | Encrypted in transit, plain text at rest |

Code is verifiable. Documentation is a claim. We are building a tool that checks claims against evidence. That is a universal need.

MIT: free to use, fork, extend, deploy. No restrictions.

Built with Seven (Claude Sonnet 4.6)

The framework design, dataset construction, all three baselines, and documentation were developed in active collaboration with Seven. The problem statement and direction came from the human. The implementation was built together.