{"slug": "a-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims", "title": "\"A benchmark for catching when code doesn't do what its documentation claims\"", "summary": "A new open-source benchmark, Truth Benchmark, automatically detects when code does not match its documentation claims. The project provides a dataset of 52 labeled examples across multiple programming languages and baseline models, including rule-based, semantic, and LLM approaches, with the best accuracy reaching 75%. It aims to catch documentation drift that can lead to security vulnerabilities and AI training data inaccuracies.", "body_md": "**Does the code actually do what it says?**\n\nSoftware documentation makes claims. Code tells the truth.\n\nThis project builds tools to catch the gap — automatically.\n\nEvery piece of software has two parts:\n\n- **What it says it does**: documentation, README, comments, AI-generated descriptions\n- **What it actually does**: the code\n\nThese drift apart constantly. A developer fixes the code, forgets the docs. A security claim stays in the README after the protection was removed. An AI describes what code *should* do, not what it *does*.\n\nThis causes real harm. Security teams audit documentation, not code. AI models trained on documentation inherit the lies.\n\n**This project builds a benchmark to catch it.**\n\nNot a developer? See [PLAIN_ENGLISH.md] for the problem explained without code.\n\n```\ngit clone https://github.com/02zerocool/truth-benchmark\ncd truth-benchmark\npip install pandas\npython pipeline.py\n```\n\nTest a single claim:\n\n```\npython predict.py \"deletes files\" \"os.remove(path)\"\n# Result: MATCH\n\npython predict.py \"encrypts password before storing\" \"db.save(user, password)\"\n# Result: LIE\n```\n\nVerified on the full 52-example dataset. These are the numbers to beat.\n\n| Baseline | Accuracy | MATCH F1 | LIE F1 | Notes |\n|---|---|---|---|---|\n| Rules (no ML) | 63.5% | 0.72 | 0.46 | Zero dependencies. 
The floor. |\n| Semantic (all-MiniLM-L6-v2) | 71.2% | 0.74 | 0.68 | 80MB model, CPU-friendly. |\n| LLM (llama3:8b via Ollama) | 75.0% | 0.79 | 0.68 | Best accuracy. Misses subtle numeric lies. |\n\nThe gap between rules and LLM is 11.5 points. The gap between LLM and perfect is 25 points.\n\nThe hardest cases: wrong constants, off-by-one errors, wrong sort direction, missing auth checks.\n\n```\npython baseline_rules.py\n```\n\nHeuristic pattern matching. No ML. Runs anywhere. This is the floor — beat it.\n\n```\npip install sentence-transformers\npython baseline_semantic.py\n```\n\nEmbeds the claim and the code with `all-MiniLM-L6-v2` and measures cosine similarity.\n\nDownloads the model automatically on first run.\n\n```\n# With Ollama (local, free, offline):\nollama pull llama3.1:8b\npython baseline_llm.py\n\n# With any OpenAI-compatible API:\nLLM_API_URL=https://api.openai.com/v1 LLM_API_KEY=your_key LLM_MODEL=gpt-4o-mini python baseline_llm.py\n```\n\nPlug in your own model with two lines:\n\n```python\nfrom evaluate import evaluate, print_report\n\ndef my_verifier(claim: str, code: str) -> str:\n    # your model here; return exactly one of the two labels\n    return \"MATCH\"  # or \"LIE\"\n\nresults = evaluate(my_verifier)\nprint_report(results, name=\"My Model\")\n```\n\nOutput:\n\n```\n==================================================\n  My Model\n==================================================\n  Total examples : 52\n  Correct        : 38\n  Accuracy       : 73.1%\n\n  [MATCH]  precision=0.71  recall=0.89  f1=0.79\n  [LIE]    precision=0.79  recall=0.54  f1=0.64\n\n  Failures:\n    truth=LIE predicted=MATCH\n      claim : returns unique items preserving order\n      code  : return list(set(items))\n==================================================\n```\n\n`dataset.csv`: 52 labeled examples across Python, JavaScript, Go, Rust, Java, C#, SQL.\n\n| claim | code | label |\n|---|---|---|\n| deletes files | `os.remove(path)` | MATCH |\n| deletes files | `open(path,'w').write('')` | LIE |\n| encrypts password before storing | `db.save(user, bcrypt.hash(password, 12))` | MATCH |\n| encrypts password before storing | `db.save(user, password)` | LIE |\n| sends data over encrypted connection | `requests.get('https://' + url)` | MATCH |\n| sends data over encrypted connection | `requests.get('http://' + url)` | LIE |\n\nCoverage includes:\n\n- Wrong operators and constants\n- Missing encryption, validation, authentication\n- Wrong sort direction, wrong protocol\n- Subtle traps: byte length vs character length, `set()` vs `dict.fromkeys()` for order-preserving uniqueness, soft-delete vs hard-delete, `indexOf > 0` missing index 0\n\nAdd examples. Any language. Any domain. All pull requests welcome.\n\n**What makes a good contribution:**\n\n- Both a MATCH and a LIE for the same claim\n- The LIE is plausible — something a real developer might write\n- Claim is plain English, code is a short snippet\n\n**Most wanted — subtle lies:**\n\nCases where the code almost does the right thing but doesn't quite.\n\nThese are the hardest to catch and most dangerous in production.\n\n**Most wanted — security examples:**\n\nMissing encryption, missing validation, wrong protocol, missing auth check.\n\nDocumentation that lies about security is a vulnerability.\n\nSee [CONTRIBUTING.md](/02zerocool/truth-benchmark/blob/main/CONTRIBUTING.md) for format details.\n\n```\ntruth-benchmark/\n├── dataset.csv          # Labeled examples\n├── pipeline.py          # Bare minimum: load and inspect the dataset\n├── evaluate.py          # Evaluation framework — plug in any verifier\n├── baseline_rules.py    # Baseline 1: heuristic rules, no ML\n├── baseline_semantic.py # Baseline 2: semantic similarity\n├── baseline_llm.py      # Baseline 3: LLM-based (Ollama or API)\n├── predict.py           # CLI: verify a single claim/code pair\n├── requirements.txt     # Dependencies (pandas required, rest optional)\n├── PLAIN_ENGLISH.md     # Explanation for non-developers\n└── CONTRIBUTING.md      
# How to add examples\n```\n\nThe same gap exists everywhere:\n\n| What they say | What actually happens |\n|---|---|\n| \"Your data is deleted on request\" | Retained in backups for years |\n| \"This AI has no bias\" | Trained on curated data with known gaps |\n| \"This system is independently audited\" | Audited by a subsidiary |\n| \"Encrypted end-to-end\" | Encrypted in transit, plain text at rest |\n\nCode is verifiable. Documentation is a claim.\n\nWe are building a tool that checks claims against evidence.\n\nThat is a universal need.\n\nMIT — free to use, fork, extend, deploy. No restrictions.\n\n**Built with Seven (Claude Sonnet 4.6)**\n\nThe framework design, dataset construction, all three baselines, and documentation were developed in active collaboration with Seven. The problem statement and direction came from the human. The implementation was built together.", "url": "https://wpnews.pro/news/a-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims", "canonical_source": "https://github.com/02zerocool/truth-benchmark", "published_at": "2026-06-14 18:25:20+00:00", "updated_at": "2026-06-14 18:42:33.264080+00:00", "lang": "en", "topics": ["ai-safety", "developer-tools", "machine-learning", "natural-language-processing"], "entities": ["Truth Benchmark", "Ollama", "all-MiniLM-L6-v2", "llama3:8b", "gpt-4o-mini", "Python", "JavaScript", "SQL"], "alternates": {"html": "https://wpnews.pro/news/a-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims", "markdown": "https://wpnews.pro/news/a-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims.md", "text": "https://wpnews.pro/news/a-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims.txt", "jsonld": "https://wpnews.pro/news/a-benchmark-for-catching-when-code-doesn-t-do-what-its-documentation-claims.jsonld"}}