Your LLM Lies Confidently. I Built an Engine That Doesn’t.

A developer built PromptProof, an engine that breaks text into atomic claims, checks each against live web evidence, and returns supported, refuted, or unverifiable verdicts with sources. The system uses chaining, gate checks, and feedback loops to ensure reliability, addressing the problem of LLMs confidently producing incorrect or unsourced answers.

New here? This is Part Two. Read Part One first, then come back for the build.

In part one I made an argument. Clever prompts are cheap now, and the real skill has moved on to building reliable prompting systems. I covered the ideas, from role prompting and Chain of Thought to ReAct and prompt refinement, and then the workflow patterns that turn a single call into something dependable. Those patterns were chaining, gate checks, and feedback loops. I closed by promising that the next article would take all of it and turn it into a real, working engine in Python.

This is that engine. It is called PromptProof, and the whole point of it is the title of this piece. Same model, completely different value. The model underneath does not change. What changes is everything around it. The work is split into focused steps, each step is checked before it is trusted, the messy outside world is kept at arm’s length, and the system reviews its own output and tries again when it falls short.

The star here is the prompting engine, and the task it performs is cargo. The star is the chaining, the gates, the tool handling, the feedback loop, and the plumbing that makes all of it observable and testable. The cargo is the task the engine performs, which I kept deliberately small so that it never steals the show. We will look at the task first, briefly, and then spend the rest of our time inside the machine.

You paste in a short paragraph of prose. The engine breaks it into atomic claims, checks each one against live web evidence, and returns a verdict for every claim, either Supported, Refuted, or Unverifiable, each with a source. That is the entire product surface.

Take the example I used in part one.

The Sydney Opera House was designed by a Danish architect and opened in 1973. It has over 2,000 rooms and was funded entirely by the Australian federal government.
A single clever prompt will happily answer this in one breath, and it will sound confident, and it will be partly wrong, and it will cite nothing. PromptProof does something slower and more honest. It pulls the paragraph apart into four separate claims, goes looking for evidence on each, and rules on them one at a time with a citation attached.

This task is good cargo for one reason. It genuinely needs every mechanism from part one. The extraction, the searching, and the judging are three different jobs, so chaining is required rather than decorative. The output has to be a clean, typed structure, so validation has a real purpose. The web search is the part most likely to break, so a gate-checked tool call earns its keep. And the finished report can be incomplete in ways a single step cannot see, so a feedback loop has something useful to do. The contrast is the whole pleasure of it: a task that is simple to describe but that genuinely benefits from the reliability machinery.

Here is the engine in one breath. Extract the claims, search for evidence on each, judge each claim against its evidence, then evaluate the whole report and revise it if it falls short.

Figure 1. System architecture

Every arrow between steps is a gate. Underneath it all, a run trace records each step’s attempts, tokens, and timing, and any failure surfaces as a typed error the engine can reason about rather than a crash.

That base layer matters as much as the steps. Before we built a single prompt, we built two unglamorous things. A RunTrace object that every step writes to, and a small family of typed errors. Neither does anything clever. Together they are the difference between a script you hope works and a system you can actually inspect when it does not. We will come back to them, because they are what the reliability story rests on.

Now to the chain itself. Extraction, searching, and judging are three genuinely different jobs.
Ask one prompt to do all three and it does each of them badly, because you are asking the model to hold the whole problem in its head at once. Chaining is the simple, powerful idea of giving each job its own focused prompt and letting the output of one become the input of the next. In PromptProof the chain reads almost like prose.

```python
def run_chain(paragraph, llm, *, transport=tavily_transport, max_iterations=2):
    trace = RunTrace()

    # Step 1. Extract (halt the run if the gate cannot get a clean list of claims).
    extracted = extract_claims(paragraph, llm, trace=trace)
    if extracted.failure is not None:
        return Report(paragraph=paragraph, claims=[], verdicts=[],
                      trace=trace, failure=extracted.failure)

    # Step 2. Retrieve evidence for every claim (concurrently).
    items = gather_evidence(extracted.claims, transport=transport, trace=trace)

    # Steps 3 and 4. Judge, then evaluate and revise until it passes or hits the cap.
    judged = judge_claims(items, llm, trace=trace)
    iteration = 1
    while True:
        report = Report(paragraph=paragraph, claims=extracted.claims,
                        verdicts=judged.verdicts, evidence=items,
                        trace=trace, failure=judged.failure)
        if judged.failure is not None:
            return report
        evaluation = evaluate_report(report)
        report.evaluation = evaluation
        if evaluation.passed or iteration >= max_iterations:
            return report
        iteration += 1
        judged = judge_claims(items, llm, trace=trace,
                              feedback=issues_as_feedback(evaluation.issues))
```

What to notice is how little the orchestrator knows. It does not parse JSON, it does not handle a timeout, it does not know what a verdict looks like. It just moves clean results from one step to the next, and bails out the moment a step reports that it could not produce something clean.

That bailing-out is the part that took discipline. Chaining lets you handle complexity, but it introduces a new failure mode that part one called the falling dominoes. A slip in an early step does not stay in that step, it topples everything downstream.
If extraction returns rubbish and the chain shrugs and carries on, the search step searches for rubbish, the judge rules on rubbish, and the final report is confidently, uselessly wrong. The chain is only as trustworthy as the weakest joint between its steps. Which is exactly why the next section exists.

A gate check is a programmatic validation placed between two steps. It is a quality-control point that confirms an intermediate output is well formed before it is allowed to flow downstream. In Python the natural tool for this is Pydantic, which turns a type hint into a runtime contract. Here is the contract for a single judged claim.

```python
from typing import Literal

from pydantic import BaseModel, field_validator

VerdictLabel = Literal["Supported", "Refuted", "Unverifiable"]


class VerdictModel(BaseModel):
    claim: str
    verdict: VerdictLabel  # any other value is rejected automatically
    reason: str
    source: str  # the citation, may be "" when Unverifiable

    @field_validator("claim", "reason")
    @classmethod
    def non_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("must not be empty")
        return v
```

The Literal is the star of this little class. If the model returns a verdict of "Maybe", or "True", or "Probably refuted", Pydantic rejects it for us, with no extra code. The schema is not documentation that we hope the model reads. It is an enforced boundary that the model's output has to pass through.

Now compare two versions of the step that turns the model’s text into claims. Here is the early, ungated version, the one that creates dominoes.

```python
import json


def naive_parse(text):
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # the domino, nothing downstream knows anything went wrong
    return data if isinstance(data, list) else []
```

If the model wraps its answer in a sentence, or returns an object instead of a list, this quietly hands back an empty list and the run sails on into nonsense. No error, no signal, no way to know. The gated version refuses to do that.
It validates, and when validation fails it raises a small, typed error carrying a human-readable reason, and then it does something that turns a failure into a recovery. It asks the model again, with the reason for the last failure stitched into the prompt.

```python
def generate_and_validate(*, llm, system, user, parse, step_name, max_attempts=3):
    feedback = ""
    last_reason = ""
    for attempt in range(1, max_attempts + 1):
        full_user = user if not feedback else f"{user}\n\n{feedback}"
        resp = llm.complete(system=system, user=full_user)
        try:
            value = parse(resp.text)  # raises GateError on a bad shape
        except GateError as e:
            last_reason = e.reason
            # ...record the retry in the trace...
            feedback = (
                f"Your previous response failed validation: {e.reason}. "
                "Return only valid JSON matching the required schema, and nothing else."
            )
            continue
        # ...record the success in the trace...
        return value, None
    return None, GateFailure(step=step_name, reason=last_reason, raw=resp.text)
```

This single function gives us all three of the responses part one described. It can halt, by giving up after a maximum number of attempts and returning a typed GateFailure instead of crashing. It can retry, by simply asking again. And, most usefully, it can retry with feedback, by handing the model the exact reason its last answer was rejected. The third option is the one that earns its keep, because a model that is told precisely why its JSON was malformed is far more likely to get it right on the second try than a model asked the same question twice.

Figure 2. The gate-check decision flow

One design decision here is easy to get wrong. Modern model APIs offer structured outputs. You hand them a schema and they more or less guarantee valid JSON. It is tempting to lean on that and delete all of this. I chose not to, for a teaching reason and a practical one.
The teaching reason is that if the validation happens invisibly inside the SDK, the gate-check pattern disappears from view, and the gate check is the most important idea in the whole system. The practical reason is that structured outputs solve the syntactic problem of valid JSON, but they cannot solve the semantic one. A schema can promise you a string in the source field. It cannot promise you that the string is a real citation for a claim the model marked Supported. We want both kinds of safety, so we keep the gate visible and let it do the syntactic work, and we add a second layer, later, for the semantic work.

So far everything has happened inside the model. The judge, at this point in the build, was ruling on each claim purely from its own training data, with no evidence and no sources. That is a deliberate weakness, and it sets up the most important reliability lesson in the project.

The lesson is plain. In a real agentic system, the external tools fail far more often than the model does. The model is a fairly reliable component. The web search sitting next to it is not. It times out. It rate-limits you. It returns a payload with a field missing, or an HTML error page where you expected JSON, or a perfectly valid response whose shape changed last Tuesday. Agents fail at the seams, where they reach out and touch something they do not control.

So the search step is wrapped in exactly the same gate discipline as the model steps, with one extra concern. The call itself can throw before you ever get a payload to validate.

```python
from pydantic import ValidationError


def search_evidence(query, *, transport=tavily_transport, max_attempts=3):
    last_reason = ""
    for attempt in range(1, max_attempts + 1):
        try:
            raw = transport(query, max_results=3)  # Action, call the tool
        except Exception as e:  # timeout, rate limit, network
            last_reason = f"transport error: {type(e).__name__}: {e}"
            # ...record the retry...
            continue
        try:
            parsed = RawSearchResponse.model_validate(raw)  # the gate on the raw result
        except ValidationError as e:
            last_reason = reason_from_validation(e)
            # ...record the retry...
            continue
        return [Evidence(title=r.title, url=r.url, snippet=r.content)
                for r in parsed.results], None
    return None, ToolFailure(tool="tavily", reason=last_reason, attempts=max_attempts)
```

The gate sits exactly where part one said it should, between the Action and the Observation. The agent reasons that it needs evidence, it calls the tool, and before that result is allowed to become an observation that the judge will reason about, it has to pass a schema check. A timeout, a rate limit, or a malformed payload triggers a controlled retry. And if every attempt fails, the step does not throw a tantrum up the stack. It returns a typed ToolFailure, and the chain treats that claim as simply having no evidence, which steers the judge towards an honest Unverifiable rather than a guess.

Figure 3. The gate-checked ReAct loop

With evidence flowing, the judge is rewritten to rule only on what it was given, and to cite the source it relied on. The instruction is blunt. Judge using only the supplied evidence, not your own prior knowledge. For Supported or Refuted, put the URL you relied on in the source. If the evidence does not settle it, say Unverifiable. That single change is what turns the engine from a confident guesser into something that grounds every claim it can and admits the ones it cannot.

The searches, by the way, run concurrently. They are independent of one another, so there is no reason to make the user wait for them in single file. The engine fans them out, and the cost of the search phase becomes the slowest single search rather than the sum of all of them.

Even with grounded, cited verdicts, a finished report can be wrong as a whole in ways that no single step can see. A claim might be left unjudged. A verdict might say Supported but carry an empty source.
The per-item gate cannot catch these, because it only ever looks at one item at a time and that item is individually well formed. You need something that stands back and reviews the whole picture.

That something is an evaluator, and it is the practical face of the LLM-as-judge idea that has gone from novelty to standard practice over the last year. In PromptProof the evaluator is deliberately rule-based rather than another model call, because the current best practice for self-correction is iteration against explicit, external criteria, not asking a model to grade its own homework and hoping it is harsh enough.

```python
def evaluate_report(report) -> Evaluation:
    issues = []
    if report.failure is not None:
        f = report.failure
        issues.append(f"chain did not complete: {f.kind} at {f.step} ({f.reason})")
        return Evaluation(passed=False, issues=issues)
    if len(report.verdicts) != len(report.claims):
        issues.append(f"coverage mismatch: {len(report.verdicts)} verdicts "
                      f"for {len(report.claims)} claims")
    for i, v in enumerate(report.verdicts):
        if v.verdict in ("Supported", "Refuted") and not v.source.strip():
            issues.append(f"verdict {i} '{v.claim[:40]}' is {v.verdict} but cites no source")
    return Evaluation(passed=not issues, issues=issues)
```

When the evaluator finds problems, the chain does not give up and it does not silently ship a flawed report. It feeds the list of issues straight back into the judge as feedback and asks for a revision. This is the same retry-with-feedback idea as the gate, lifted up from the level of a single step to the level of the whole report. It loops like this until the report passes or it hits a maximum number of iterations, so it always terminates. This is the student revising an essay after the teacher’s comments, except the teacher is a handful of explicit rules and the student never gets tired.

Figure 4. The feedback loop

What just happened is the cleanest idea in the whole engine. We now have two layers of safety, and they are different in kind.
The gate is narrow and syntactic. It looks at one item and asks whether it has the right shape. The evaluator is wide and semantic. It looks at the whole report and asks whether it has the right substance. A Supported verdict with an empty source sails straight through the gate, because the JSON is perfectly valid, and then trips the evaluator, because a Supported verdict with no citation is not actually finished. Neither layer is enough on its own. Together they are why the engine is reliable rather than merely functional.

Figure 5. Two layers of safety

The four mechanisms are the body of the engine. The reason I would trust it, though, comes down to three less glamorous things that were built first and have been working underneath everything you have just read.

The first is observability. Every step writes to a RunTrace as it goes, recording which step ran, which attempt it was, whether it passed, the reason for any retry, the tokens it used, and how long it took. The result is that the engine narrates itself. Here is a real run, checking two claims about the Eiffel Tower.

```
1. extract_claims attempt=1 outcome=ok tokens(in/out)=315/30
2. search_evidence attempt=1 outcome=ok tokens(in/out)=0/0
3. search_evidence attempt=1 outcome=ok tokens(in/out)=0/0
4. judge_claims attempt=1 outcome=ok tokens(in/out)=1734/156
5. evaluate_report attempt=1 outcome=ok tokens(in/out)=0/0
TOTAL tokens(in/out) = 2049/186
```

When something goes slow or strange, you read the trace instead of guessing. Per-step timing was added to exactly that end, so a slow run tells you immediately whether the cost is in the model calls or the searches.

The second is that failures are data, not tracebacks. A GateFailure and a ToolFailure are small typed objects that flow back through the chain and end up on the report, where the evaluator and the interface can reason about them. The engine never falls over. It reports, in a structured way, that it could not finish and why.
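The first two of these are easy to picture in code. What follows is a minimal sketch under stated assumptions, not the project's actual implementation. The fields on `GateFailure` and `ToolFailure` mirror the keyword arguments used in the earlier snippets, and `kind` is assumed because the evaluator reads `f.kind`. `StepRecord` and the `record`, `totals`, and `lines` methods are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class GateFailure:
    """A model step never produced output that passed its gate."""
    step: str
    reason: str
    raw: str = ""
    kind: str = "gate"  # assumed field, see lead-in


@dataclass(frozen=True)
class ToolFailure:
    """An external tool call failed on every attempt."""
    tool: str
    reason: str
    attempts: int
    kind: str = "tool"  # assumed field, see lead-in


@dataclass
class StepRecord:  # hypothetical: one row of the trace
    step: str
    attempt: int
    outcome: str  # "ok" or "retry"
    reason: str = ""
    tokens_in: int = 0
    tokens_out: int = 0
    seconds: float = 0.0


@dataclass
class RunTrace:
    records: List[StepRecord] = field(default_factory=list)

    def record(self, step: str, attempt: int, outcome: str, *, reason: str = "",
               tokens_in: int = 0, tokens_out: int = 0, seconds: float = 0.0) -> None:
        """Append one attempt of one step to the trace."""
        self.records.append(StepRecord(step, attempt, outcome, reason,
                                       tokens_in, tokens_out, seconds))

    def totals(self) -> Tuple[int, int]:
        """Sum input and output tokens across every recorded attempt."""
        return (sum(r.tokens_in for r in self.records),
                sum(r.tokens_out for r in self.records))

    def lines(self) -> List[str]:
        """Render each record in the same style as the trace shown earlier."""
        return [f"{i}. {r.step} attempt={r.attempt} outcome={r.outcome} "
                f"tokens(in/out)={r.tokens_in}/{r.tokens_out}"
                for i, r in enumerate(self.records, start=1)]


if __name__ == "__main__":
    trace = RunTrace()
    trace.record("extract_claims", 1, "ok", tokens_in=315, tokens_out=30)
    trace.record("judge_claims", 1, "ok", tokens_in=1734, tokens_out=156)
    print("\n".join(trace.lines()))
    print("TOTAL tokens(in/out) = %d/%d" % trace.totals())
```

Because the failures are frozen dataclasses rather than exceptions, a step can return one alongside a `None` result and the orchestrator can attach it to the report without any try/except of its own.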
The third, and the one that makes the whole thing maintainable, is that it is tested without ever calling a model. The single seam that allows this is a thin model abstraction with an injectable client. In production the engine builds a real client. In the tests it is handed a mock that returns scripted responses. The same trick is used for the search transport. The upshot is that the entire chain is verified in continuous integration with zero live calls, no API key, and no flakiness. That covers the gates, the retries, the tool failures, and the feedback loop, every one of them.

This is the shape the industry is converging on, where evals are treated like unit tests and the thing you are testing is the orchestration around the model rather than the model itself. A small catalogue of golden examples runs the full pipeline end to end and asserts on the verdicts, and it does it in milliseconds.

None of these three is exciting. All three are the reason the four exciting mechanisms can be trusted.

The engine ships with two thin interfaces, because the engine is the point and the interface should stay out of its way. There is a command-line tool that prints the report and, with a --trace flag, the run trace you saw above. And there is a minimal browser front end built with Streamlit. You paste a paragraph, press a button, and watch the verdicts come back with their sources.

PromptProof checking four claims about Australia. “Sydney is the capital of Australia” and “Australia is the largest continent on Earth” come back Refuted, each with a source. The Great Barrier Reef and kangaroo claims come back Supported.

Look closely at that screenshot, because it is exactly the behaviour the title of this piece promised. Two of those four claims are confident, common, and wrong, and a chatbot will often wave them straight through. The engine refuses.
“Sydney is the capital of Australia” comes back Refuted, with a citation noting that Sydney is the capital of New South Wales while Canberra is the national capital. “Australia is the largest continent on Earth” comes back Refuted too, against evidence that Asia is the largest and Australia is in fact the smallest. The two true claims, the Great Barrier Reef and the kangaroos, come back Supported, each with its own source. The whole report passed evaluation, and all four claims were checked in under thirteen seconds.

Notice what every verdict carries. A source. Nothing here is asserted on the engine’s own authority. Everything is grounded in evidence it actually retrieved. And when the evidence does not settle a claim, the engine says so. It returns Unverifiable rather than bluff, which is the same honesty viewed from the other side. A clever prompt answers confidently whether or not it knows. The engine answers only as far as it can show its working. That restraint is not a bug to be tuned away. It is the entire reason the system exists.

Look at what actually changed between a clever one-off prompt and this engine. The model did not. The same model that would have answered the Sydney Opera House paragraph in a single confident, half-wrong breath is the model sitting inside PromptProof. What changed is the system around it. The work is split into focused steps, every step is checked before it is trusted, the unreliable outside world is held behind a gate, and the finished report is reviewed against explicit criteria and sent back for revision when it falls short. Same model, completely different value.

That is the argument from part one, made concrete. Reliable prompting is not a trick you perform on the model. It is a system you build around it. The clever prompt is cheap, and getting cheaper, and the engineering that turns it into something dependable is the skill that is actually worth having.
The full project is on GitHub at https://github.com/HelloJahid/PromptProof. It is the chain, the gates, the gate-checked tool, the feedback loop, the trace, the tests, the CLI, and the GUI, small enough to read in an afternoon and structured so that each mechanism lives in its own file.

A closing note of honesty, since the engine is built on it. A live agentic pipeline like this makes several real network round-trips for every paragraph, so it is not instant. Reliability and speed pull against each other, and this project deliberately chose reliability. That is the right trade for a fact-checker, and the trace will always tell you exactly what you are paying for.

In the next entry I will take this engine somewhere new. For now, the thing I most want you to take away is the one in the title. The model is a commodity. The system you build around it is not.

“Your LLM Lies Confidently. I Built an Engine That Doesn’t.” (https://pub.towardsai.net/your-llm-lies-confidently-i-built-an-engine-that-doesnt-2154b4857d59) was originally published in Towards AI (https://pub.towardsai.net) on Medium.