Your LLM Judge Costs More Than the Agent. Gate It in 40 Lines.

wpnews.pro

LLM judge cost is the share of your eval bill spent grading agent output instead of producing it. To control it, run a 40-line offline pre-gate that triages every span with four deterministic rules and escalates only the uncertain tail to the expensive judge. On one trace this cut judge cost share from 50% to 16%.

LLM judge cost is the line item nobody puts on the FinOps dashboard. You add an LLM-as-judge to grade every agent span, you sleep better, and three weeks later the eval layer is quietly billing a third of what the agent itself costs. This post measures that share of your bill spent judging instead of doing, with a 40-line offline meter, and shows the one move that drops it from 50% to 16% on the same trace.

AI disclosure: I drafted this with an AI writing assistant. The tool, both fixtures, and every number below come from a real local run of judge_gate.py on Python 3.13.5, no network, no API key. I ran it, checked the exit codes, hashed the output twice to confirm it's deterministic, and edited every line myself before publishing.

Here's the sentence that set me off. Sattyam Jain wrote it on Dev.to on June 12, in a post arguing you should stop running an LLM judge on every agent call: "if your monitor exceeds ~20–25% of production cost, you built the wrong monitor." (Dev.to) That's a great rule of thumb. It's also unfalsifiable until you can put a number on your monitor. His post sketches the tiered architecture (cheap deterministic heuristics first, expensive judge last) but ships no code you can run against your own trace. So I wrote the missing 40 lines.

The timing isn't an accident. The token bill is coming due across the whole industry right now. TechCrunch reported on June 5 that "Uber blew through its entire 2026 AI coding budget by April," and that a Priceline employee saw "a routine Cursor contract renewal came back 4–5x more expensive." (TechCrunch) Two days earlier the Linux Foundation announced its intent to launch the Tokenomics Foundation — open standards for AI cost management, because, in Jim Zemlin's words, "tokens have become the new unit of technology spend." (Linux Foundation) Everyone's auditing what the agent spends. Almost nobody's auditing what the watchdog spends.

And the watchdog is an LLM call too. You priced the agent. Did you price the thing watching the agent?

judge_gate.py is a 40-line, offline, keyless, zero-network script. Feed it a JSONL trace; four deterministic rules triage each span as OK / BAD / UNCERTAIN, and only UNCERTAIN ones would reach the expensive judge. The judge is priced through the --judge-price and --prod-cost flags. Substitute your own rates; I ship neutral placeholder units.

This is the next piece in a series on controlling agents before they execute, not after. The pre-execution gate gates the agent's action. The success gate decides what to verify in a result. This one is a level up the stack: it doesn't gate the agent at all. It gates the judge, and asks how much that judge is allowed to cost.

Here's the failure mode I keep seeing. Someone reads that agents silently fail (true) and bolts on an LLM-as-judge to grade every step. Every span: a second model call, often a frontier model, sometimes with a chunky rubric prompt. It works. It catches things. Then the finance person asks why the eval bill is the same order of magnitude as the agent bill, and the honest answer is "because we run a full second model over every single thing the first one does."

The number that matters is a ratio. Call it judge cost share: the cost of the judging layer divided by the cost of the production run it's judging.

judge_cost_share = (judge_calls × judge_price) / prod_cost

If that's 8%, fine: cheap insurance. If it's 50%, you didn't add a monitor, you added a co-pilot you're paying full freight for and calling overhead. The whole game is shrinking judge_calls: the number of spans that actually need a human-grade judgment, versus the spans a dumb deterministic rule can settle for free.
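To make the ratio concrete, here is a minimal sketch of the formula as a function. The function name is my own illustration, not something taken from judge_gate.py. The call counts (16 and 50) and the placeholder units (judge_price=1, prod_cost=100) are the ones used in the run reported in this post; they are units, not real prices.

```python
def judge_cost_share(judge_calls, judge_price, prod_cost):
    """Fraction of the production cost spent on the judging layer."""
    if prod_cost <= 0:
        raise ValueError("prod_cost must be positive")
    return (judge_calls * judge_price) / prod_cost


# Placeholder units only: substitute your own per-call price and run cost.
gated = judge_cost_share(16, judge_price=1.0, prod_cost=100.0)
naive = judge_cost_share(50, judge_price=1.0, prod_cost=100.0)

print(f"gated trace: {gated:.1%}")  # 16.0%
print(f"naive trace: {naive:.1%}")  # 50.0%
```

Only judge_calls changes between the two lines, which is why the rest of the post concentrates on that one term.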

Most spans don't need a judge. A tool either got called or it didn't. A JSON output either parses or it doesn't. A 200 with an empty body is wrong no matter how confident the prose around it sounds. You don't need a frontier model to know [] is not a successful invoice send. You need an if statement.

The pre-gate is a function. It looks at one span and returns one of three verdicts: OK, BAD, or UNCERTAIN.

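Four rules carry almost all the weight. They're the deterministic heuristics Sattyam Jain pointed at ("did the claimed gate execute?") turned into code:

- Not an object: the output isn't a JSON object → BAD.
- Claim without evidence: the span claims it called send_email, but tools_called doesn't contain send_email → BAD. (This is the same idea as the …)
- 200 with an empty payload: the status is 200 but the output is empty → BAD.
- Duplicate span: arg_hash equals prev_arg_hash, a byte-identical retry of the prior call → BAD.

If none of those fire and the span has a clean ok: true with a 200, it's OK. Otherwise the rules abstain and it's UNCERTAIN: escalate. Here's the whole triage: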

def triage(span):
    """Return (verdict, rule). UNCERTAIN means 'a human-grade LLM judge is needed'."""
    out = span.get("output")
    if not isinstance(out, dict):                      # output not valid JSON object
        return "BAD", "schema:not-an-object"
    if span.get("claimed_tool") and span["claimed_tool"] not in span.get("tools_called", []):
        return "BAD", "claim-without-evidence"         # said it called X, trace has no X
    if span.get("status") == 200 and not out:          # 200 OK with empty payload
        return "BAD", "200-empty-payload"
    if span.get("arg_hash") and span["arg_hash"] == span.get("prev_arg_hash"):
        return "BAD", "duplicate-span"                 # byte-identical retry of prior call
    if out.get("ok") is True and span.get("status") == 200:
        return "OK", "clean-success"                   # explicit ok + 200, no contradiction
    return "UNCERTAIN", "needs-judge"                  # cheap rules abstain -> escalate

That's it. No network, no key, no model. The judge layer is priced, not called: I count the UNCERTAIN spans and multiply by a price you supply on the command line. I refuse to hardcode a vendor rate — those go stale in a month and I'd rather be honestly empty than confidently wrong about someone's bill.
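The counting step described above can be sketched as follows. This is not the shipped judge_gate.py, only my reading of the paragraph: the triage function is copied from the post, while meter, its return shape, the three demo spans, and the "share <= budget means PASS" comparison are assumptions made for illustration.

```python
import json


def triage(span):
    """Return (verdict, rule), copied from the triage function in the post."""
    out = span.get("output")
    if not isinstance(out, dict):
        return "BAD", "schema:not-an-object"
    if span.get("claimed_tool") and span["claimed_tool"] not in span.get("tools_called", []):
        return "BAD", "claim-without-evidence"
    if span.get("status") == 200 and not out:
        return "BAD", "200-empty-payload"
    if span.get("arg_hash") and span["arg_hash"] == span.get("prev_arg_hash"):
        return "BAD", "duplicate-span"
    if out.get("ok") is True and span.get("status") == 200:
        return "OK", "clean-success"
    return "UNCERTAIN", "needs-judge"


def meter(lines, judge_price, prod_cost, budget=0.25):
    """Tally verdicts over JSONL lines and price the UNCERTAIN tail.

    Assumes every non-blank line is a JSON object. No judge is called.
    """
    counts = {"OK": 0, "BAD": 0, "UNCERTAIN": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        verdict, _rule = triage(json.loads(line))
        counts[verdict] += 1
    share = counts["UNCERTAIN"] * judge_price / prod_cost
    return counts, share, share <= budget  # assumed: at-or-under budget passes


# Three hypothetical spans: a clean success, an unbacked claim, a free-text log.
demo = [
    '{"status": 200, "claimed_tool": "send_email", "tools_called": ["send_email"], "output": {"ok": true}}',
    '{"status": 200, "claimed_tool": "send_email", "tools_called": [], "output": {"ok": true}}',
    '{"status": 200, "output": {"text": "email sent"}}',
]
counts, share, within_budget = meter(demo, judge_price=1.0, prod_cost=10.0)
print(counts, f"{share:.1%}", "PASS" if within_budget else "FAIL")
```

A real CLI would presumably turn the boolean into an exit code, as the run log in this post shows, but the flag parsing is left out here on purpose.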

I built two traces of the same 50-span agent — a support-desk bot doing searches, record updates, email sends, classifications, and reply drafts.

The first, trace_gated.jsonl, is well-instrumented: each span logs the tool it claimed, the tools actually called, a structured output (an ok flag where the verdict is clear-cut, a confidence value or label where it isn't), and an argument hash. The second, trace_naive.jsonl, is the same agent logging only free-text outputs like {"text": "email sent"}, the way a lot of agents actually log in the wild. Same work. Different telemetry.
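To show what "different telemetry" means for a rule, here is a small sketch that lists which rule inputs a span actually carries. The helper rule_inputs_present and both example spans are hypothetical; the real fixtures are not reproduced in this post, and I am assuming the free-text style logs nothing beyond the output.

```python
# Fields the deterministic rules read, per the triage function in this post.
RULE_FIELDS = ("status", "claimed_tool", "tools_called", "arg_hash", "prev_arg_hash")


def rule_inputs_present(span):
    """Return the names of the rule-readable fields present on a span."""
    present = [field for field in RULE_FIELDS if field in span]
    out = span.get("output")
    if isinstance(out, dict) and "ok" in out:
        present.append("output.ok")
    return present


# Hypothetical well-instrumented span (values invented for illustration).
gated_span = {
    "status": 200,
    "claimed_tool": "send_email",
    "tools_called": ["send_email"],
    "output": {"ok": True},
    "arg_hash": "a1",
    "prev_arg_hash": "9f",
}

# Hypothetical free-text span, in the style of {"text": "email sent"}.
naive_span = {"output": {"text": "email sent"}}

print(rule_inputs_present(gated_span))
print(rule_inputs_present(naive_span))
```

The second call returns an empty list: with nothing for a rule to read, every span in that style can only escalate.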

Here's the verbatim output. I didn't touch it:

$ python3 judge_gate.py fixtures/trace_gated.jsonl --judge-price 1 --prod-cost 100
spans total:        50
resolved by gate:   34 (68.0%)  [OK=29 BAD=5]
sent to LLM judge:  16 (32.0%)
judge cost share:   16.0% of prod cost (judge_price=1.0, prod_cost=100.0, budget=25%)
verdict: PASS - judge layer within budget
$ echo $?
0

$ python3 judge_gate.py fixtures/trace_naive.jsonl --judge-price 1 --prod-cost 100
spans total:        50
resolved by gate:   0 (0.0%)  [OK=0 BAD=0]
sent to LLM judge:  50 (100.0%)
judge cost share:   50.0% of prod cost (judge_price=1.0, prod_cost=100.0, budget=25%)
verdict: FAIL - judge layer over budget
$ echo $?
1

Read the two side by side. Same agent, same fifty spans, same --judge-price 1 --prod-cost 100. The well-instrumented trace sends 16 spans to the judge and lands at 16% cost share: a PASS, exit 0. The free-text trace can't resolve a single span cheaply, sends all 50, and lands at 50%: a FAIL, exit 1, tripping Sattyam Jain's "wrong monitor" line by a mile.

The lever isn't a fancier judge. It's whether your trace carries the four cheap facts a rule can read. Of the 16 spans that did escalate in the gated run, most are genuinely subjective: ambiguous contract summaries (confidence: 0.45), hedged reply drafts ("I cannot find the order, but it is probably fine."), borderline intent labels. A handful escalate for a humbler reason: they carry no ok flag for a cheap rule to confirm, so the gate abstains instead of guessing. Either way, that's the tail you want a human-grade judge on. The other 34? Five were provably broken (one duplicate retry, two claims with no matching tool call, one 200 with an empty body, one non-object output) and the rest were clean successes. None of those needed a model to adjudicate.

I want to be precise about a number I almost fudged. The cost figures are placeholder units (judge_price=1, prod_cost=100). I am not telling you a judge call costs a dollar or that your run costs a hundred of anything. Plug in your real per-call judge price and your real run cost. The rate, 32% vs 100% of spans escalating, is the part that's mine: measured, reproducible. The dollars are yours.

Fair objection, and it's the one I'd raise. If the cheap rules are wrong, you've replaced a 50% judge bill with a 16% bill and a stack of bad verdicts. So: how good can a cheap layer actually be?

Two recent papers say: surprisingly good, on the parts that matter. In Cheap Reward Hacking Detection (arXiv:2606.08893, June 8), Belenky, Itria and Johns put a linear probe on a small transformer encoder and detected reward hacking at AUC 0.9467, TPR 0.8296 at 5% FPR, at "roughly four orders of magnitude lower per-trajectory cost" than an LLM-as-judge baseline. And Goal-Autopilot (arXiv:2606.11688) reports a gated finite-state machine that "forbids any terminal 'done' claim whose falsifiable gate did not actually execute and pass," cutting fabrication on SWE-bench Lite from 33.7% to 0.67%. Those are their numbers on their setups, not mine. I'm citing them as evidence that a cheap deterministic layer catches most of what a dear one catches, not as my own result.

My four if statements are cruder than a trained probe. They don't need to be clever. They need to be right when they're confident and silent when they're not, which is the whole point of the UNCERTAIN bucket. A rule that isn't sure doesn't guess. It escalates. The judge still grades the hard 32%. You just stopped paying it to rubber-stamp the easy 68%.

The gate does not read confidence. One span in the fixture says confidence: 0.95, "no ambiguity" and still got escalated, because I refuse to trust a model's own confidence as a cheap signal: that's the kind of self-assessment that lies. If you trust yours, add a fifth rule. I didn't.

Export 40–60 spans of a real agent run to JSONL with six fields per span (status, claimed_tool, tools_called, output, arg_hash, and prev_arg_hash carrying the previous span's hash so the duplicate-retry rule can fire), point judge_gate.py at it, and pass your real --judge-price and --prod-cost. If your judge cost share comes back under 10%, ignore me; your monitor's fine. If it comes back at 40%, you've found a line item.
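The export step can be sketched like this. Everything about the input is an assumption: I'm pretending your tracer hands you dicts with tool and args keys, and I'm hashing the tool name plus canonical JSON arguments with SHA-256. The post does not say how arg_hash is computed, so swap in whatever your stack already records.

```python
import hashlib
import json


def arg_hash(tool, args):
    """Hash a call's tool name and arguments (assumed scheme, not the post's)."""
    canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_jsonl(raw_spans):
    """Convert raw spans into JSONL lines carrying the six fields the gate reads."""
    lines, prev = [], None
    for raw in raw_spans:
        current = arg_hash(raw.get("tool"), raw.get("args", {}))
        lines.append(json.dumps({
            "status": raw.get("status"),
            "claimed_tool": raw.get("claimed_tool"),
            "tools_called": raw.get("tools_called", []),
            "output": raw.get("output"),
            "arg_hash": current,
            "prev_arg_hash": prev,  # previous span's hash; None on the first span
        }))
        prev = current
    return lines


# Two hypothetical raw spans: the second is an identical retry of the first.
call = {"status": 200, "tool": "send_email", "claimed_tool": "send_email",
        "tools_called": ["send_email"], "args": {"to": "a@example.com"},
        "output": {"ok": True}}
exported = [json.loads(line) for line in to_jsonl([call, dict(call)])]
print(exported[1]["arg_hash"] == exported[1]["prev_arg_hash"])  # True
```

Write the returned lines to a file, one per line, and that file is the trace the meter expects.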

One thing I genuinely don't know yet and would put real money on being argued in the comments: where the honest threshold is. Sattyam Jain says 20–25%. I shipped a default of 25%. But for a low-stakes summarizer, even 10% might be waste, and for an agent that moves money, maybe 40% is cheap. The budget is a --flag precisely because I don't think there's one right answer.
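One way to reason about the threshold is to invert the ratio: given a budget, how many spans may reach the judge at all? A minimal sketch, with a hypothetical helper name and the same placeholder units used throughout this post:

```python
import math


def max_judge_calls(budget, prod_cost, judge_price):
    """Largest number of judge calls that keeps cost share at or under budget."""
    if judge_price <= 0:
        raise ValueError("judge_price must be positive")
    # Small epsilon guards against float rounding just below a whole number.
    return math.floor(budget * prod_cost / judge_price + 1e-9)


# Placeholder units: prod_cost=100, judge_price=1, as in the runs reported here.
for budget in (0.10, 0.25, 0.40):
    calls = max_judge_calls(budget, prod_cost=100.0, judge_price=1.0)
    print(f"budget {budget:.0%}: at most {calls} judge calls")
```

Under those units a 25% budget allows 25 escalations on a 50-span trace, so the gated run (16) fits and the free-text run (50) does not.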

So I'll ask you: what's the judge cost share on a real eval pipeline you've shipped — and where would you set the budget before it counts as the wrong monitor?

I publish one runnable FinOps tool for AI agents at a time, with the real run log attached. Follow for the next number from the next trace — and drop your judge cost share in the comments, I read every one.

Source: dev.to (original article)