Why Your LLM Agent Gives a Different P-Value Every Time (And What to Build Instead)

wpnews.pro

Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: "These are the same subjects measured before and after an intervention. Did their scores change significantly?"

Four of the five runs return p = 0.009 from a paired t-test.

The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports p = 0.000018.

All five reach the same conclusion (significant). But notice what happened: only one run out of five thought to check an assumption you'd want it to check. The other four skipped it. The choice of method — and the test statistic, and the p-value — depended on whether the LLM happened to run an assumption check that time. On borderline data, this is the difference between reject and don't reject.
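To make the point concrete, here is a minimal sketch of what "always run the assumption check" looks like when it is a rule in code rather than something the model may or may not think of. The function name paired_comparison, the 0.05 normality threshold, and the synthetic data are assumptions made for illustration; this is not the dataset or the prompt from the runs above.

```python
import numpy as np
from scipy import stats

ALPHA_NORMALITY = 0.05  # assumed threshold for the Shapiro-Wilk check


def paired_comparison(before, after):
    """Check normality of the differences on every call, then pick the test by rule."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    diffs = after - before

    # The assumption check is unconditional, so method choice cannot vary by run.
    shapiro_p = stats.shapiro(diffs).pvalue

    if shapiro_p >= ALPHA_NORMALITY:
        method = "paired_t"
        result = stats.ttest_rel(after, before)
    else:
        method = "wilcoxon_signed_rank"
        result = stats.wilcoxon(after, before)

    return {
        "method": method,
        "shapiro_p": float(shapiro_p),
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "n": int(diffs.size),
    }


# Synthetic illustration data (hypothetical, not the article's dataset).
rng = np.random.default_rng(0)
before = rng.normal(50, 10, size=25)
after = before + rng.normal(3, 5, size=25)

first = paired_comparison(before, after)
second = paired_comparison(before, after)
print(first["method"], first == second)
```

Calling the function twice on the same data returns the same method and the same p-value, which is the property the five chat runs lacked.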

If you're using LLMs for exploratory data analysis on a weekend project, you might shrug. If you're using them for anything that gets cited, gets submitted to a regulator, or gets handed to a clinician, this is a problem. It's a known problem — Cui & Alexander (2026) documented exactly this kind of method-divergence empirically; AIRepr (Zeng et al., 2025) shows the same thing across reproducibility metrics. The current answer in the literature is to constrain the agent so its execution is replayable. But replayability fixes "did we run the same code." It doesn't fix "did we run the right analysis."

I've spent the last two months building a different fix. The more interesting half is the architecture. Let me walk through it.

The first reflex is "set temperature=0." It's not enough.

temperature=0 doesn't make a tool-using agent deterministic across runs. Three reasons:

The deeper issue is that LLM agents try to do two jobs at once: choosing which analysis to run, and running it. The first is a judgment problem the LLM is reasonably good at. The second is a computation problem the LLM is bad at, because its output is inherently stochastic and you can't verify the results by inspection.

Natural reaction: stop using the LLM for the computation. Write the scipy code yourself.

This is right — but it throws out the half that's actually useful. When a researcher says "compare the post-treatment scores between cohorts and tell me if the intervention worked," the value of the LLM is mapping that informal request to (a) the right columns in the dataframe, (b) the right method given assumptions, (c) the right multiple-comparison correction, (d) a plain-English summary at the end. That mapping is genuinely hard to encode as a fixed program. Throwing the whole LLM out is overcorrecting.

What you actually want: keep the LLM for the routing decision, but pin the computation to a fixed, validated implementation that cannot vary across runs.

That's the architecture:

natural-language request
        │
        ▼
   LLM Supervisor ─────────► chooses ONE next action at a time
        │                    (a tool call, or a final answer)
        ▼
 Deterministic plugin ─────► runs a hardcoded statistical method,
        │                    cross-validated against scipy/statsmodels
        ▼
 Claims ledger + gate ─────► verifies that every reported number came
        │                    from an actual plugin run
        ▼
   Auditable report

This pattern — let the LLM choose tools, but pin the computation — isn't novel. Variants of it show up in domains as different as devops automation and financial reporting. What I think is specific to applying it to statistical inference is the anti-fabrication discipline below: a generic deterministic tool ecosystem still allows the LLM to paraphrase or round the numbers it received. The claims ledger pattern makes that structurally impossible.

I built this as StatGuard Agent. The supervisor LLM (currently gpt-4o) picks one of 27 hardcoded analysis plugins per step. The plugins do all numerical work; the LLM never emits a number. Given the same plugin and the same arguments, the output is byte-identical across runs. The variability that remains is in plugin selection, which is what the validation framework below targets.
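A minimal sketch of that split, using only the standard library. Everything named here (mean_difference, PLUGINS, stub_supervisor, run_step) is hypothetical and stands in for the real components; in particular, stub_supervisor replaces the LLM call with a fixed answer so the example runs offline.

```python
import hashlib
import json
import statistics
from typing import Callable, Dict, List


def mean_difference(before: List[float], after: List[float]) -> dict:
    """Hypothetical plugin: a fixed computation with no model involvement."""
    diffs = [a - b for a, b in zip(after, before)]
    return {
        "kind": "mean_difference",
        "value": statistics.fmean(diffs),
        "sd": statistics.stdev(diffs),
        "n": len(diffs),
    }


# Registry of hardcoded plugins; the supervisor may only name one of these.
PLUGINS: Dict[str, Callable[..., dict]] = {"mean_difference": mean_difference}


def stub_supervisor(request: str) -> dict:
    """Stand-in for the LLM: returns one action and never a number."""
    return {"action": "call_plugin", "plugin": "mean_difference"}


def run_step(request: str, data: dict) -> bytes:
    action = stub_supervisor(request)
    if action["plugin"] not in PLUGINS:
        raise ValueError(f"unknown plugin: {action['plugin']}")
    output = PLUGINS[action["plugin"]](**data)
    # Canonical serialization so identical results give identical bytes.
    return json.dumps(output, sort_keys=True).encode("utf-8")


data = {"before": [10.0, 12.0, 9.0, 11.0], "after": [12.0, 13.0, 11.0, 12.0]}
run_a = run_step("did scores change?", data)
run_b = run_step("did scores change?", data)
print(hashlib.sha256(run_a).hexdigest() == hashlib.sha256(run_b).hexdigest())
```

The supervisor's only output is a plugin name, so the one thing that can differ between runs is which registry entry gets called.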

The interesting design choice was not "LLM picks tools" — that's standard agent stuff now. The interesting choice was making sure the LLM never gets to emit a number.

Here's the failure mode I really wanted to prevent. Take the opening example: a paired t-test on the n = 25 dataset returns p = 0.009. Now the LLM produces a final summary for the user. The most likely failure isn't that the wrong test was chosen; we can catch that in routing tests. The most likely failure is that the LLM, in its summary, writes "p = 0.01", or "p < 0.01", or hallucinates a confidence interval that nobody computed. Over a multi-step analysis, what got computed and what got reported can drift apart silently.

The pattern that fixes this:

1. Every plugin result is recorded in a ledger as a claim with its own ID: claim_42 = {value: 0.009, kind: "p_value", method: "paired_t", n: 25, ...}.
2. The LLM writes its summary with claim references instead of numbers: "The intervention shows {claim_42}, suggesting...".
3. A renderer that the LLM cannot touch substitutes each reference with the recorded value: "...shows p = 0.009 (paired t-test, n = 25)...".

The result: the LLM cannot insert a number that wasn't computed. It cannot round. It cannot round-trip. It cannot paraphrase a statistic into something subtly different. It can only point at claims. A coverage gate also enforces that every required piece of evidence (for a group comparison: test statistic, p-value, effect size, assumption check) has been produced before a final answer is allowed.
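The ledger-and-renderer idea can be sketched in a few dozen lines. This is an illustration under assumptions, not StatGuard's code: the class ClaimsLedger, the functions format_claim and render, and the rule "any digit outside a placeholder is rejected" are all hypothetical choices.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

PLACEHOLDER = re.compile(r"\{(claim_\d+)\}")


@dataclass
class ClaimsLedger:
    claims: Dict[str, dict] = field(default_factory=dict)
    counter: int = 0

    def record(self, **claim) -> str:
        """Store a plugin result and return its claim ID."""
        self.counter += 1
        claim_id = f"claim_{self.counter}"
        self.claims[claim_id] = claim
        return claim_id


def format_claim(claim: dict) -> str:
    # Only p-values are formatted in this sketch.
    if claim["kind"] == "p_value":
        return f"p = {claim['value']} ({claim['method']}, n = {claim['n']})"
    raise ValueError(f"no formatter for kind {claim['kind']!r}")


def render(draft: str, ledger: ClaimsLedger) -> str:
    """Substitute claim IDs; reject drafts that carry their own digits."""
    without_placeholders = PLACEHOLDER.sub("", draft)
    if re.search(r"\d", without_placeholders):
        raise ValueError("draft contains a literal number outside a claim")

    def substitute(match):
        claim_id = match.group(1)
        if claim_id not in ledger.claims:
            raise KeyError(f"unknown claim: {claim_id}")
        return format_claim(ledger.claims[claim_id])

    return PLACEHOLDER.sub(substitute, draft)


ledger = ClaimsLedger()
cid = ledger.record(value=0.009, kind="p_value", method="paired t-test", n=25)
report = render(f"The intervention shows {{{cid}}}, suggesting a change.", ledger)
print(report)
```

A draft that contains "p = 0.01" raises instead of rendering, and so does a draft that points at a claim ID the ledger never issued. A production version would need a less blunt digit rule (for example, to allow "two groups" style counts), which is left out here.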

I'd argue this pattern should be standard for any agent that produces structured numerical output, not just statistics ones. The principle: LLMs are pointers, not values. Numbers, dates, quotes from documents, monetary amounts — anything where "almost right" is wrong — should be produced by a deterministic tool, given a claim ID, and stitched into the final text by a renderer that the LLM cannot touch.
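The coverage gate mentioned earlier fits the same pointer-based model, because it only needs to look at which kinds of claims exist. A minimal sketch, assuming claims are dicts with a kind field; the names REQUIRED_EVIDENCE, missing_evidence, and allow_final_answer are hypothetical, and the required set is the one listed above for a group comparison.

```python
from typing import Dict, Iterable, Set

# Evidence that must exist before a final answer is allowed, per analysis type.
REQUIRED_EVIDENCE: Dict[str, Set[str]] = {
    "group_comparison": {
        "test_statistic",
        "p_value",
        "effect_size",
        "assumption_check",
    },
}


def missing_evidence(analysis_type: str, claims: Iterable[dict]) -> Set[str]:
    """Return the required claim kinds that no plugin run has produced yet."""
    produced = {claim["kind"] for claim in claims}
    return REQUIRED_EVIDENCE[analysis_type] - produced


def allow_final_answer(analysis_type: str, claims: Iterable[dict]) -> bool:
    return not missing_evidence(analysis_type, list(claims))


# A ledger where the effect size was never computed.
partial = [
    {"kind": "test_statistic"},
    {"kind": "p_value"},
    {"kind": "assumption_check"},
]
complete = partial + [{"kind": "effect_size"}]

print(missing_evidence("group_comparison", partial))
print(allow_final_answer("group_comparison", complete))
```

The gate inspects claim kinds only, so it can run before the renderer without ever reading a value.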

Two layers of validation.

Layer 1: plugin carpet benchmark. For every plugin, generate scenarios with fixed seeds and known ground truth, then check the plugin's output against an independent scipy/statsmodels computation of the same quantity. The current carpet is 362 cases, all passing. This validates the plugins as plugins, with the LLM out of the picture.
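A scaled-down sketch of what one cell of such a benchmark might look like. The plugin here (paired_t_plugin), the scenario generator, the grid of seeds, sizes, and effects, and the tolerances are all assumptions made for illustration; the real carpet covers many more plugins and cases. Note that this toy plugin still uses scipy's t distribution for the p-value, so only the statistic is computed fully independently of the reference.

```python
import math

import numpy as np
from scipy import stats


def paired_t_plugin(before, after):
    """Hypothetical plugin: paired t statistic computed from first principles."""
    diffs = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    n = diffs.size
    t_stat = diffs.mean() / (diffs.std(ddof=1) / math.sqrt(n))
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)
    return {"statistic": float(t_stat), "p_value": float(p_value)}


def make_scenario(seed, n, effect):
    """Fixed-seed synthetic data with a known mean shift."""
    rng = np.random.default_rng(seed)
    before = rng.normal(0.0, 1.0, size=n)
    after = before + rng.normal(effect, 1.0, size=n)
    return before, after


def run_carpet(seeds=range(5), sizes=(10, 25, 100), effects=(0.0, 0.5)):
    """Return the list of (seed, n, effect) cases where plugin and reference disagree."""
    failures = []
    for seed in seeds:
        for n in sizes:
            for effect in effects:
                before, after = make_scenario(seed, n, effect)
                got = paired_t_plugin(before, after)
                ref = stats.ttest_rel(after, before)
                ok = math.isclose(
                    got["statistic"], float(ref.statistic), rel_tol=1e-9, abs_tol=1e-12
                ) and math.isclose(
                    got["p_value"], float(ref.pvalue), rel_tol=1e-6, abs_tol=1e-15
                )
                if not ok:
                    failures.append((seed, n, effect))
    return failures


failures = run_carpet()
print(f"{len(failures)} failing cases")
```

Because every scenario is seeded, a failure is a reproducible (seed, n, effect) triple that can go straight into a regression test.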

Layer 2: end-to-end agent benchmark. Drive the full LLM-supervised pipeline on a representative 42-case subset of the same matrix. Each case is judged on four dimensions: (a) the LLM picked the right plugin (routing), (b) the agent reached a final answer (no-error), (c) the claims ledger is clean, meaning every reported number is traceable to a plugin run (honesty), (d) the final numerical output is within tolerance of the ground truth (accuracy). Current pass rate: 42/42 on all four.
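The four dimensions map naturally onto four boolean checks per case. A minimal sketch, assuming a hypothetical CaseResult record and a relative tolerance of 1e-6 (the actual tolerance is not stated here):

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class CaseResult:
    expected_plugin: str
    chosen_plugin: str
    final_answer: Optional[str]      # None if the agent errored out
    reported_claim_ids: Set[str]     # claim IDs referenced in the final text
    ledger_claim_ids: Set[str]       # claim IDs actually produced by plugin runs
    reported_value: float
    ground_truth: float


def judge(case: CaseResult, rel_tol: float = 1e-6) -> Dict[str, bool]:
    """Score one end-to-end case on the four dimensions."""
    return {
        "routing": case.chosen_plugin == case.expected_plugin,
        "no_error": case.final_answer is not None,
        "honesty": case.reported_claim_ids <= case.ledger_claim_ids,
        "accuracy": math.isclose(
            case.reported_value, case.ground_truth, rel_tol=rel_tol
        ),
    }


def passed(case: CaseResult) -> bool:
    return all(judge(case).values())


good = CaseResult(
    expected_plugin="paired_t",
    chosen_plugin="paired_t",
    final_answer="The intervention shows {claim_1}.",
    reported_claim_ids={"claim_1"},
    ledger_claim_ids={"claim_1", "claim_2"},
    reported_value=0.009,
    ground_truth=0.009,
)
misrouted = CaseResult(
    expected_plugin="bootstrap_inference",
    chosen_plugin="paired_t",
    final_answer="The intervention shows {claim_1}.",
    reported_claim_ids={"claim_1"},
    ledger_claim_ids={"claim_1"},
    reported_value=0.009,
    ground_truth=0.009,
)
print(passed(good), judge(misrouted))
```

Keeping the four verdicts separate, rather than collapsing them into one pass/fail, makes it visible when a case is numerically right but reached through the wrong plugin.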

Plus 764 deterministic unit/integration tests for everything else.

The most useful experience I had was during e2e validation. The first run had 36/38 routing pass; two cases failed because, on prompts framed for FDA submission or audit-grade contexts, the LLM didn't reach for the more rigorous bootstrap mode it should have. That kind of failure isn't a computation bug, it's a judgment bug, and it only surfaces in an e2e benchmark, not a plugin-layer one. I tightened the plugin's use_when specification with explicit triggers ("FDA", "audit-grade", "clinical", "third-party re-run"), re-ran, and got 38/38. The pattern: e2e benchmarks find specification gaps; plugin benchmarks find code gaps.
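One way to keep a fix like this from regressing is to store the triggers next to the plugin spec and derive routing expectations from them. The sketch below is an assumption about shape, not the real spec format: only the field name use_when and the four trigger strings come from the text above, while rigorous_triggers, expects_rigorous_mode, and the example prompts are hypothetical. The keyword match stands in for the expectation side of a routing test; the LLM still makes the actual choice.

```python
BOOTSTRAP_SPEC = {
    "name": "bootstrap_inference",
    "use_when": (
        "Confidence intervals for paired-difference statistics. "
        "Prefer the more rigorous mode when the request mentions a trigger."
    ),
    # Explicit triggers added after the routing failures.
    "rigorous_triggers": ["FDA", "audit-grade", "clinical", "third-party re-run"],
}


def expects_rigorous_mode(prompt: str, spec: dict = BOOTSTRAP_SPEC) -> bool:
    """Expectation used by a routing test: does the prompt contain a trigger?"""
    lowered = prompt.lower()
    return any(trigger.lower() in lowered for trigger in spec["rigorous_triggers"])


# Hypothetical prompts paired with the mode a routing test would expect.
ROUTING_CASES = [
    ("Prepare this comparison for an FDA submission.", True),
    ("We need audit-grade intervals for a third-party re-run.", True),
    ("Quick look at whether scores changed.", False),
]

for prompt, expected in ROUTING_CASES:
    print(expects_rigorous_mode(prompt) == expected, prompt)
```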

The bootstrap_inference plugin produces confidence intervals for paired-difference statistics under percentile, basic, and BCa methods, all cross-validated against scipy.stats.bootstrap. It also has an opt-in Sequential Bootstrap mode (Peng 2025) for cases where the bootstrap CI itself needs to be more stable across RNG seeds, such as regulated submissions and audit reports. Every call emits a cross-seed CI endpoint-stability diagnostic so you can compare the two modes on your data.
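For readers who want to see what a cross-seed endpoint-stability check measures, here is a rough sketch using scipy.stats.bootstrap directly. It covers only the three standard methods; the Sequential Bootstrap mode is not shown. The helper names, the seed list, the resample count, and the synthetic differences are assumptions, and the diagnostic (range of each endpoint across seeds) is one plausible definition rather than the plugin's actual one.

```python
import numpy as np
from scipy import stats


def bootstrap_ci(diffs, method, seed, n_resamples=2000):
    """95% CI for the mean paired difference under one bootstrap method."""
    res = stats.bootstrap(
        (diffs,),
        np.mean,
        n_resamples=n_resamples,
        confidence_level=0.95,
        method=method,
        random_state=seed,  # fixed seed makes a single call reproducible
    )
    ci = res.confidence_interval
    return float(ci.low), float(ci.high)


def endpoint_spread(diffs, method, seeds=(0, 1, 2, 3, 4)):
    """How far each CI endpoint moves when only the RNG seed changes."""
    cis = np.array([bootstrap_ci(diffs, method, seed) for seed in seeds])
    lows, highs = cis[:, 0], cis[:, 1]
    return {
        "low_range": float(lows.max() - lows.min()),
        "high_range": float(highs.max() - highs.min()),
    }


# Synthetic paired differences (hypothetical, n = 25).
rng = np.random.default_rng(42)
diffs = rng.normal(3.0, 5.0, size=25)

for method in ("percentile", "basic", "BCa"):
    low, high = bootstrap_ci(diffs, method, seed=0)
    spread = endpoint_spread(diffs, method)
    print(method, round(low, 3), round(high, 3), spread)
```

With a fixed seed each interval is reproducible; the spread shows how much of the reported interval depends on the seed rather than on the data.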

Up front:

Concrete things on the roadmap:

A %%statguard compare cohort_A vs cohort_B cell magic returning a reproducible report in the next cell is more useful than the current Streamlit-only entry point.

If you build agents that produce structured numerical output and want to talk about the claims-ledger pattern, I'd love to hear from you. If you're a statistician with an opinion on what's missing from the plugin set, file an issue. If you're hiring for ML / data engineering / AI applications roles in the US, I'm currently looking; reach out if you're sourcing.

The repo:

An auditable statistical analysis framework that pairs LLM orchestration with a deterministic, cross-validated statistics engine.

StatGuard Agent turns a natural-language analysis request into an end-to-end, reproducible statistical report. It is built on a deliberate separation of concerns:

scipy / statsmodels, never by the LLM itself.

This division is the core design principle. A general-purpose LLM asked to "compare these groups" may silently pick the wrong test, skip an assumption check, or report a number it did not actually compute, and may do so differently every time it is run. A traditional tool like SPSS is reproducible but cannot interpret an open-ended request. StatGuard Agent aims for both: as adaptable as…

Stars, issues, and adversarial test cases all welcome.

source & further reading

dev.to (original article)
