"Prove your AI-written code — or get the exact input that breaks it"

A developer released ishvacerto, an open-source Python tool that verifies AI-generated code by running it against doctests, user tests, or a reference implementation, returning a counterexample if the code is wrong or abstaining if it cannot be verified. On the HumanEval benchmark, it produced zero false alarms on correct solutions and flagged a bug in the benchmark itself. The tool runs entirely locally with no dependencies and is available on GitHub.

tags: python, opensource, ai, devtools

AI coding assistants are fast, and they ship confident bugs. The output looks right, the explanation sounds right, and the failing case turns up in production. The missing piece isn't a smarter generator; it's something that can check the generated code and refuse to bluff when it can't.

ishvacerto (https://github.com/ishvaproducts-png/ishvacerto) is that gate. Give it a function and a way to check it (its own doctests, your tests, or a reference implementation) and it returns exactly one of three answers: the code is verified, it is refuted with a concrete counterexample, or the tool abstains because it has nothing it can check. A refutation looks like this:

    REFUTED doctest fn=square counterexample: square(3) got 6, expected 9

The whole promise lives in that third answer. Never wrong, sometimes silent. It verifies what it can check and abstains on the rest, which is exactly why it never false-alarms on correct code.

Install it and call it from Python:

    pip install ishvacerto

    from ishvacerto import verify, verify_against_reference

    verify(open("f.py").read())                      # uses the code's own doctests
    verify(code, tests=[("f(3)", "9")])              # against your tests
    verify_against_reference(ai_code, ref, "f")      # where does it diverge from a reference?

From the command line (exits 1 on REFUTED, so it gates CI directly):

    ishvacerto my_function.py
    ishvacerto --ref reference.py --entry my_func ai_generated.py   # differential
    ishvacerto --json my_function.py                                # machine-readable

You can reproduce the headline numbers yourself; there's a script in the repo:

    python benchmarks/humaneval_gate.py

On the real HumanEval benchmark (164 problems), the gate produces 0 false alarms on the canonical correct solutions, captures a checkable doctest spec on 76/164 (~46%) of problems, and abstains on the rest. It even flags HumanEval's own wrong doctest (problem 47) as a spec/code conflict rather than a false alarm: it caught a benchmark bug instead of blaming the code. Coverage grows with the spec or reference you give it.
The roadmap is a reference proposer that retrieves a same-task verified reference for code that ships with no tests, widening reach while keeping false alarms at zero.

The differential mode is the fun part: it generates inputs, runs the candidate and the reference, and shows the first input where they disagree. Input generation is signature-agnostic: it produces generic argument tuples, lets the reference filter the valid ones, and abstains if it can't exercise at least one.

Pure Python standard library, zero dependencies, 13/13 tests, CI green on Python 3.9 / 3.11 / 3.12, MIT. It runs entirely on your machine: no account, no cloud, no telemetry, your code never leaves the box. There's also a VS Code extension that shows the counterexample inline.

It verifies what it can check and abstains on the rest; coverage is a function of the spec or reference you give it, never a guess. And the subprocess timeout guards against hangs; it is not a security sandbox, so verify code whose source you trust (your own assistant's output) or run it in a container.

It doesn't compete with your AI coder; it makes its output safe to ship.

⭐ MIT, free, and the measurements are reproducible: https://github.com/ishvaproducts-png/ishvacerto

    pip install ishvacerto
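The differential mode described above (generic argument tuples, the reference filtering out invalid ones, abstaining when nothing can be exercised) can be sketched roughly as follows. This is an assumption-laden illustration rather than the tool's code: `POOL`, `generic_inputs`, `first_divergence`, and the `AGREED` and `ABSTAIN` labels are hypothetical names, and the sketch calls both functions in-process instead of in a subprocess with a timeout.

```python
import copy
import itertools

# Hypothetical pool of generic values; a real generator would likely be richer.
POOL = [0, 1, -1, 3, "", "ab", [], [1, 2]]


def generic_inputs(max_arity=2):
    """Yield argument tuples of every arity up to max_arity, ignoring signatures."""
    for arity in range(max_arity + 1):
        yield from itertools.product(POOL, repeat=arity)


def first_divergence(candidate, reference, inputs):
    """Return (label, detail); detail is (args, got, expected) on REFUTED."""
    exercised = 0
    for args in inputs:
        try:
            # Deep-copy so a function that mutates its arguments cannot
            # contaminate the shared pool or the other call.
            expected = reference(*copy.deepcopy(args))
        except Exception:
            continue  # the reference rejects this input, so it is filtered out
        exercised += 1
        try:
            got = candidate(*copy.deepcopy(args))
        except Exception as exc:
            return "REFUTED", (args, repr(exc), expected)
        if got != expected:
            return "REFUTED", (args, got, expected)
    if exercised == 0:
        # No input was valid for the reference: nothing was checked.
        return "ABSTAIN", None
    return "AGREED", None


def reference_square(x):
    return x * x


def candidate_square(x):
    return x * 2  # bug: doubles instead of squaring


print(first_divergence(candidate_square, reference_square, generic_inputs()))
```

Letting the reference decide which tuples are valid is what keeps the generator signature-agnostic: wrong arities and wrong types simply raise in the reference and are skipped, so they can never be reported as counterexamples against the candidate.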