# "Prove your AI-written code — or get the exact input that breaks it"

> Source: <https://dev.to/ishvatheguru/prove-your-ai-written-code-or-get-the-exact-input-that-breaks-it-5bon>
> Published: 2026-06-24 04:54:39+00:00

tags: python, opensource, ai, devtools

AI coding assistants are fast, and they ship confident bugs. The output looks right, the explanation sounds right, and the failing case turns up in production. The missing piece isn't a smarter generator — it's something that can *check* the generated code and refuse to bluff when it can't.

[ishvacerto](https://github.com/ishvaproducts-png/ishvacerto) is that gate. Give it a function and a way to check it (its own doctests, your tests, or a reference implementation) and it returns exactly one of three answers:

`REFUTED [doctest] fn=square counterexample: square(3) (got 6, expected 9)`

The whole promise lives in that third answer, the abstention. **Never wrong, sometimes silent.** It verifies what it can check and abstains on the rest, which is exactly why it never raises a false alarm on correct code.

```
pip install ishvacerto
```

```python
from ishvacerto import verify, verify_against_reference

verify(open("f.py").read())                    # uses the code's own doctests
verify(code, tests=[("f(3)", "9")])            # against your tests
verify_against_reference(ai_code, ref, "f")    # where does it diverge from a reference?
```
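To make the doctest mode concrete, here is a minimal sketch of how a function's own doctests can serve as a checkable spec, built only on Python's standard `doctest` module. It is not ishvacerto's implementation: `check_doctests`, `_FailureCollector`, and the `VERIFIED` and `ABSTAIN` labels are hypothetical names chosen for illustration (only `REFUTED` appears in the tool output quoted above).

```python
import doctest
import inspect


class _FailureCollector(doctest.DocTestRunner):
    """DocTestRunner that records failing examples instead of printing them."""

    def __init__(self):
        super().__init__(verbose=False)
        self.failures_seen = []

    def report_failure(self, out, test, example, got):
        self.failures_seen.append(
            (example.source.strip(), got.strip(), example.want.strip()))

    def report_unexpected_exception(self, out, test, example, exc_info):
        self.failures_seen.append(
            (example.source.strip(), repr(exc_info[1]), example.want.strip()))


def check_doctests(source, filename="<candidate>"):
    """Classify `source` by running the doctests of the functions it defines.

    Returns (verdict, detail). "VERIFIED" and "ABSTAIN" are assumed labels.
    """
    namespace = {}
    try:
        exec(compile(source, filename, "exec"), namespace)  # trusted code only
    except Exception as exc:
        return "ABSTAIN", f"could not load code: {exc!r}"

    parser = doctest.DocTestParser()
    runner = _FailureCollector()
    attempted = 0
    for name, obj in list(namespace.items()):
        # Only consider functions defined by `source` itself, not imports.
        if not inspect.isfunction(obj) or obj.__globals__ is not namespace:
            continue
        if not obj.__doc__:
            continue
        test = parser.get_doctest(obj.__doc__, dict(namespace), name, filename, 0)
        attempted += runner.run(test).attempted

    if attempted == 0:
        return "ABSTAIN", "no doctest examples to check"
    if runner.failures_seen:
        src, got, want = runner.failures_seen[0]
        return "REFUTED", f"counterexample: {src} (got {got}, expected {want})"
    return "VERIFIED", f"{attempted} doctest example(s) passed"


buggy = '''
def square(x):
    """
    >>> square(3)
    9
    """
    return x * 2
'''
print(check_doctests(buggy))
# ('REFUTED', 'counterexample: square(3) (got 6, expected 9)')
```

The key design point is the middle branch: code with no doctest examples yields an abstention rather than a pass, so a missing spec is never mistaken for a verified one.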

From the command line (exits `1` on REFUTED, so it gates CI directly):

```
ishvacerto my_function.py
ishvacerto --ref reference.py --entry my_func ai_generated.py   # differential
ishvacerto --json my_function.py                                # machine-readable
```

You can reproduce the headline numbers yourself — there's a script in the repo:

```
python benchmarks/humaneval_gate.py
```

On the real **HumanEval** benchmark (164 problems), the gate produces **0 false alarms** on the canonical correct solutions, captures a checkable doctest spec on **76/164 (~46%)** of problems, and abstains on the rest. It even flags HumanEval's *own* wrong doctest (problem 47) as a spec/code conflict rather than a false alarm — it caught a benchmark bug instead of blaming the code.

Coverage grows with the spec or reference you give it. The roadmap is a **reference proposer** that retrieves a same-task verified reference for code that ships with no tests, widening reach while keeping false alarms at zero.

The differential mode is the fun part: it generates inputs, runs the candidate and the reference, and shows the **first input where they disagree**. Input generation is signature-agnostic — it produces generic argument tuples, lets the reference filter the valid ones, and abstains if it can't exercise at least one.
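The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual generator: `first_divergence`, the fixed `POOL` of sample values, and the `AGREE` and `ABSTAIN` labels are all hypothetical, and unlike the real tool this sketch runs in-process with no timeout.

```python
import copy
import itertools

# Generic values tried in every argument position (an assumed, tiny pool).
POOL = [0, 1, -1, 2, 3, 10, "", "a", "ab", [], [1, 2], [3, 1, 2], None, True, 0.5]


def first_divergence(candidate, reference, max_arity=2):
    """Return the first generated input on which candidate and reference differ.

    Results: ("REFUTED", args, got, expected), ("ABSTAIN", reason),
    or ("AGREE", number_of_inputs_exercised).
    """
    exercised = 0
    for arity in range(max_arity + 1):
        for args in itertools.product(POOL, repeat=arity):
            try:
                # Deep-copy so a mutating function cannot corrupt the pool.
                expected = reference(*copy.deepcopy(args))
            except Exception:
                continue  # the reference rejects this input, so it is not a valid case
            exercised += 1
            try:
                got = candidate(*copy.deepcopy(args))
            except Exception as exc:
                return ("REFUTED", args, repr(exc), expected)
            if got != expected:
                return ("REFUTED", args, got, expected)
    if exercised == 0:
        return ("ABSTAIN", "no generated input was accepted by the reference")
    return ("AGREE", exercised)


print(first_divergence(lambda x: x * 2, lambda x: x * x))
# ('REFUTED', (1,), 2, 1)
```

Letting the reference act as the filter is what makes the generator signature-agnostic: wrong arity or wrong types simply raise in the reference and are skipped. Note that `AGREE` only means no disagreement was found on the sampled inputs, which is weaker than a proof of equivalence.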

Pure Python **standard library**, **zero dependencies**, **13/13** tests, CI green on Python **3.9 / 3.11 / 3.12**, MIT. It **runs entirely on your machine** — no account, no cloud, no telemetry, your code never leaves the box. There's also a VS Code extension that shows the counterexample inline.

It verifies what it can check and **abstains on the rest**: coverage is a function of the spec or reference you give it, never a guess. The subprocess timeout guards against hangs, but it is **not** a security sandbox, so only verify code whose source you trust (your own assistant's output), or run the tool in a container.
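For readers wondering what a timeout guard of this kind looks like, here is a minimal sketch using the standard library's `subprocess.run` with its `timeout` parameter. The helper name `run_with_timeout` and the status labels are assumptions made for illustration, not ishvacerto's internals.

```python
import subprocess
import sys


def run_with_timeout(source, seconds=2.0):
    """Execute `source` in a fresh interpreter and return (status, stdout).

    The timeout stops runaway loops, but the child still has full access to
    the file system and network: this is a hang guard, not a sandbox.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=seconds,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return ("TIMEOUT", "")
    status = "OK" if completed.returncode == 0 else "ERROR"
    return (status, completed.stdout)


print(run_with_timeout("print(3 * 3)"))
print(run_with_timeout("while True: pass", seconds=0.5))
```

The first call finishes normally, while the second is cut off after half a second instead of hanging the checker.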

It doesn't compete with your AI coder — it makes its output **safe to ship**.

⭐ MIT, free, and the measurements are reproducible: [https://github.com/ishvaproducts-png/ishvacerto](https://github.com/ishvaproducts-png/ishvacerto)

```
pip install ishvacerto
```