Does the code actually do what it says?
Software documentation makes claims. Code tells the truth.
This project builds tools to catch the gap automatically.
Every piece of software has two parts:
- **What it says it does**: documentation, README, comments, AI-generated descriptions
- **What it actually does**: the code
These drift apart constantly. A developer fixes the code, forgets the docs. A security claim stays in the README after the protection was removed. An AI describes what code should do, not what it does.
This causes real harm. Security teams audit documentation, not code. AI models trained on documentation inherit the lies.
This project builds a benchmark to catch it.
Not a developer? See [PLAIN_ENGLISH.md] for the problem explained without code.
git clone https://github.com/02zerocool/truth-benchmark
cd truth-benchmark
pip install pandas
python pipeline.py
Test a single claim:
python predict.py "deletes files" "os.remove(path)"
python predict.py "encrypts password before storing" "db.save(user, password)"
Verified on the full 52-example dataset. These are the numbers to beat.
| Baseline | Accuracy | MATCH F1 | LIE F1 | Notes |
|---|---|---|---|---|
| Rules (no ML) | 63.5% | 0.72 | 0.46 | Zero dependencies. The floor. |
| Semantic (all-MiniLM-L6-v2) | 71.2% | 0.74 | 0.68 | 80MB model, CPU-friendly. |
| LLM (llama3:8b via Ollama) | 75.0% | 0.79 | 0.68 | Best accuracy. Misses subtle numeric lies. |
The gap between rules and LLM is 11.5 points. The gap between LLM and perfect is 25 points.
The hardest cases: wrong constants, off-by-one errors, wrong sort direction, missing auth checks.
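To make these categories concrete, here is a small sketch of what such lies can look like. The three functions are hypothetical illustrations written for this README, not rows from `dataset.csv`; each docstring states the claim and the body quietly breaks it.

```python
def first_n(items: list, n: int) -> list:
    """Claim: returns the first n items."""
    return items[: n - 1]  # off-by-one: returns n - 1 items


def top_scores(scores: list[int]) -> list[int]:
    """Claim: returns scores from highest to lowest."""
    return sorted(scores)  # wrong direction: ascending, not descending


def is_adult(age: int) -> bool:
    """Claim: returns True for ages 18 and over."""
    return age > 18  # wrong boundary: excludes exactly 18
```

Each body shares almost all of its tokens with a correct implementation, which is probably why surface-level matching struggles with this kind of case.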
python baseline_rules.py
Heuristic pattern matching. No ML. Runs anywhere. This is the floor: beat it.
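As a rough illustration of what heuristic pattern matching can look like, the sketch below maps a few claim keywords to code patterns that should be present. It is not the contents of `baseline_rules.py`; the `RULES` table, the regexes, and the `rules_verifier` name are all assumptions made for this example.

```python
import re

# Hypothetical rule table: claim keyword -> regex the code is expected to contain.
RULES = {
    "deletes": r"\b(os\.remove|os\.unlink|shutil\.rmtree)\b",
    "encrypts": r"\b(bcrypt|hashlib|encrypt)\b",
    "encrypted connection": r"https://",
}


def rules_verifier(claim: str, code: str) -> str:
    """Return "MATCH" if every triggered rule finds its pattern, else "LIE"."""
    claim_lower = claim.lower()
    triggered = [pattern for keyword, pattern in RULES.items() if keyword in claim_lower]
    if not triggered:
        # No rule applies: default to MATCH (an arbitrary choice for this sketch).
        return "MATCH"
    return "MATCH" if all(re.search(p, code) for p in triggered) else "LIE"
```

A verifier with this signature can be passed straight to `evaluate()`. Rules like these only see surface tokens, so they cannot tell a wrong constant from a right one.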
pip install sentence-transformers
python baseline_semantic.py
Embeds claim and code with `all-MiniLM-L6-v2` and measures cosine similarity.
Downloads model automatically on first run.
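The similarity step itself is small. The sketch below computes cosine similarity between two vectors and applies a cutoff. The toy vectors stand in for the embeddings the model would produce, and the `THRESHOLD` value and the `similarity_verdict` name are assumptions, not values taken from `baseline_semantic.py`.

```python
import math

THRESHOLD = 0.5  # assumed cutoff for this sketch; tune on real data


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def similarity_verdict(claim_vec: list[float], code_vec: list[float]) -> str:
    """Map a similarity score to the benchmark's two labels."""
    return "MATCH" if cosine_similarity(claim_vec, code_vec) >= THRESHOLD else "LIE"
```

A single threshold is a blunt instrument: a claim and a lying snippet often share most of their vocabulary, so their embeddings can sit close together.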
ollama pull llama3.1:8b
python baseline_llm.py
LLM_API_URL=https://api.openai.com/v1 LLM_API_KEY=your_key LLM_MODEL=gpt-4o-mini python baseline_llm.py
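The command above passes its configuration through three environment variables. The sketch below shows one way a script could read them, phrase the question, and normalize the reply, without making any network call. The prompt wording, the function names, and the handling of unset variables are assumptions for illustration; how `baseline_llm.py` does this is not shown here.

```python
import os


def load_llm_config(env=os.environ) -> dict:
    """Read the three variables from the command above; None means 'not set'."""
    return {
        "api_url": env.get("LLM_API_URL"),
        "api_key": env.get("LLM_API_KEY"),
        "model": env.get("LLM_MODEL"),
    }


def build_prompt(claim: str, code: str) -> str:
    """Hypothetical prompt wording that asks for a one-word verdict."""
    return (
        "Does the code do what the claim says? Answer MATCH or LIE.\n"
        f"Claim: {claim}\n"
        f"Code: {code}\n"
        "Answer:"
    )


def parse_verdict(reply: str) -> str:
    """Normalize a free-text reply; anything not starting with MATCH counts as LIE."""
    return "MATCH" if reply.strip().upper().startswith("MATCH") else "LIE"
```

Treating every non-`MATCH` reply as `LIE` is a design choice that keeps the verifier's output inside the two labels the benchmark expects.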
Plug in your own model with two lines:
from evaluate import evaluate, print_report
def my_verifier(claim: str, code: str) -> str:
    # Placeholder: return "MATCH" or "LIE" based on your model's prediction.
    return "MATCH"
results = evaluate(my_verifier)
print_report(results, name="My Model")
Output:
Total examples : 52
Correct        : 38
Accuracy       : 73.1%

[MATCH] precision=0.71 recall=0.89 f1=0.79
[LIE]   precision=0.79 recall=0.54 f1=0.64

Failures:
  truth=LIE predicted=MATCH
  claim : returns unique items preserving order
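For readers unfamiliar with the per-label numbers in the report, the sketch below shows the standard definitions of precision, recall, and F1 computed from parallel lists of true and predicted labels. `label_metrics` is a hypothetical helper written for illustration and is not taken from `evaluate.py`; the toy labels are not drawn from `dataset.csv`.

```python
def label_metrics(truths: list[str], preds: list[str], label: str) -> dict:
    """Precision, recall and F1 for one label, from parallel label lists."""
    tp = sum(t == label and p == label for t, p in zip(truths, preds))
    fp = sum(t != label and p == label for t, p in zip(truths, preds))
    fn = sum(t == label and p != label for t, p in zip(truths, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration only.
truths = ["MATCH", "MATCH", "LIE", "LIE"]
preds = ["MATCH", "MATCH", "MATCH", "LIE"]
match = label_metrics(truths, preds, "MATCH")
lie = label_metrics(truths, preds, "LIE")
```

F1 is the harmonic mean of precision and recall. Applied to the `[MATCH]` line in the report, 2 × 0.71 × 0.89 / (0.71 + 0.89) ≈ 0.79, which agrees with the printed value.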
`dataset.csv`: 52 labeled examples across Python, JavaScript, Go, Rust, Java, C#, SQL.
| claim | code | label |
|---|---|---|
| deletes files | `os.remove(path)` | MATCH |
| deletes files | `open(path,'w').write('')` | LIE |
| encrypts password before storing | `db.save(user, bcrypt.hash(password, 12))` | MATCH |
| encrypts password before storing | `db.save(user, password)` | LIE |
| sends data over encrypted connection | `requests.get('https://' + url)` | MATCH |
| sends data over encrypted connection | `requests.get('http://' + url)` | LIE |
Coverage includes:
- Wrong operators and constants
- Missing encryption, validation, authentication
- Wrong sort direction, wrong protocol
- Subtle traps: byte length vs character length, `set()` vs `dict.fromkeys()` for order-preserving uniqueness, soft-delete vs hard-delete, `indexOf > 0` missing index 0
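Three of these traps can be reproduced in a few lines. The sketch below is illustrative only: the variable names and sample values are made up, and `str.find()` is used as a Python stand-in for JavaScript's `indexOf`.

```python
items = ["b", "a", "b", "c", "a"]

# dict.fromkeys() keeps first-seen order; set() makes no ordering promise.
unique_ordered = list(dict.fromkeys(items))
unique_unordered = set(items)

# Character length and UTF-8 byte length differ for non-ASCII text.
text = "h\u00e9llo"  # "héllo"
char_len = len(text)
byte_len = len(text.encode("utf-8"))

# Python analogue of the `indexOf > 0` bug: str.find() returns 0 for a
# match at the start, so `> 0` wrongly reports "not found".
haystack = "admin:root"
buggy_found = haystack.find("admin") > 0
correct_found = haystack.find("admin") >= 0
```

In each case the wrong version passes a casual test: `set()` often happens to look ordered on small inputs, ASCII-only strings have equal byte and character lengths, and `> 0` works whenever the match is not at position 0.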
Add examples. Any language. Any domain. All pull requests welcome.
**What makes a good contribution:**
- Both a MATCH and a LIE for the same claim
- The LIE is plausible: something a real developer might write
- Claim is plain English, code is a short snippet
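Before opening a pull request, the first guideline can be checked mechanically. The sketch below groups rows by claim and reports any claim that lacks one of the two labels. It assumes rows are dictionaries keyed by the `claim`, `code`, and `label` columns shown in the dataset table; `claims_missing_a_pair` is a hypothetical helper, not part of the repository.

```python
from collections import defaultdict


def claims_missing_a_pair(rows: list[dict]) -> list[str]:
    """Return claims that lack either a MATCH or a LIE example."""
    labels_by_claim = defaultdict(set)
    for row in rows:
        labels_by_claim[row["claim"]].add(row["label"])
    return sorted(
        claim for claim, labels in labels_by_claim.items()
        if not {"MATCH", "LIE"} <= labels
    )


rows = [
    {"claim": "deletes files", "code": "os.remove(path)", "label": "MATCH"},
    {"claim": "deletes files", "code": "open(path,'w').write('')", "label": "LIE"},
    # Hypothetical incomplete contribution: only one side of the pair.
    {"claim": "sorts ascending", "code": "sorted(xs, reverse=True)", "label": "LIE"},
]
incomplete = claims_missing_a_pair(rows)
```

Rows loaded with `csv.DictReader` would have the same shape, so the helper could be pointed at a local copy of the dataset.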
**Most wanted: subtle lies**
Cases where the code almost does the right thing but doesn't quite.
These are the hardest to catch and most dangerous in production.
**Most wanted: security examples**
Missing encryption, missing validation, wrong protocol, missing auth check.
Documentation that lies about security is a vulnerability.
See [CONTRIBUTING.md](https://github.com/02zerocool/truth-benchmark/blob/main/CONTRIBUTING.md) for format details.
truth-benchmark/
├── dataset.csv            # Labeled examples
├── pipeline.py            # Bare minimum: load and inspect the dataset
├── evaluate.py            # Evaluation framework: plug in any verifier
├── baseline_rules.py      # Baseline 1: heuristic rules, no ML
├── baseline_semantic.py   # Baseline 2: semantic similarity
├── baseline_llm.py        # Baseline 3: LLM-based (Ollama or API)
├── predict.py             # CLI: verify a single claim/code pair
├── requirements.txt       # Dependencies (pandas required, rest optional)
├── PLAIN_ENGLISH.md       # Explanation for non-developers
└── CONTRIBUTING.md        # How to add examples
The same gap exists everywhere:
| What they say | What actually happens |
|---|---|
| "Your data is deleted on request" | Retained in backups for years |
| "This AI has no bias" | Trained on curated data with known gaps |
| "This system is independently audited" | Audited by a subsidiary |
| "Encrypted end-to-end" | Encrypted in transit, plain text at rest |
Code is verifiable. Documentation is a claim.
We are building a tool that checks claims against evidence.
That is a universal need.
MIT: free to use, fork, extend, deploy. No restrictions.
**Built with Seven (Claude Sonnet 4.6)**
The framework design, dataset construction, all three baselines, and documentation were developed in active collaboration with Seven. The problem statement and direction came from the human. The implementation was built together.