# Frontier LLMs Get 2 of 3 Tax Returns Wrong - Stop Letting Them Decide

> Source: <https://dev.to/webramos/frontier-llms-get-2-of-3-tax-returns-wrong-stop-letting-them-decide-191n>
> Published: 2026-06-30 14:15:06+00:00

Everyone is wiring LLMs into checkout flows right now. I want to make the unpopular case that for the decisions which actually move money - tax, discounts, eligibility, pricing - the model should never have the final say. Not because the models are bad, but because I have the benchmark data showing exactly what happens when they do, and a pattern that fixes it without throwing the LLM out.

**TL;DR.** The common advice - "put an AI agent in the loop" - is most dangerous exactly where it sounds most useful: the money path. Tax, promo eligibility, discount stacking, cart rules, pricing. On those decisions a probabilistic model is the wrong final authority. The pattern that holds up is a division of labour: the LLM *formalizes* the natural-language rules into a small, auditable specification, and a *sound deterministic engine* executes it. You stop reviewing code you cannot read and start approving a contract you can. I ran this across 113 experiments and nine model families, and the shape of the result is consistent. Here is the evidence, the one hard condition, and where it actually pays off.

Hand today's best AI models a stack of real tax returns and they get most of them wrong. That is not a thought experiment - it is a benchmark. TaxCalcBench took 51 real 2024 US tax returns with official IRS answers and asked the frontier models to file them. The best performer, Gemini 2.5 Pro, got **32%** right under strict scoring. Claude Opus 4 managed **27%**, Sonnet 4 just **23%**. These were not rounding-error misses - they were wrong returns: wrong tax tables, arithmetic slips, eligibility mistakes, and, worst of all, *different answers every time you asked*. The benchmark's authors put it bluntly: there is a "continued need for deterministic tax calculation engines," because this output is "not acceptable for a task which needs consistently correct results with clear auditability."

Tax is the cleanest example because someone built the benchmark, but the shape generalizes to every money-path decision a merchant runs. Does this cart qualify for the promo? Can these two discounts stack? Which tax jurisdiction applies? What does this customer's loyalty tier unlock? These are not open-ended questions. Each has a *correct* answer defined by rules you already wrote down somewhere. Handing them to a model that pattern-matches probabilistically means accepting an error rate on decisions that move money and that a regulator, an auditor, or an angry customer can ask you to explain.

The reflex fix is "use a bigger model" or "prompt it harder." TaxCalcBench is the counter-argument: the frontier is already here and it is still at 23-32%. The problem is not model size. It is that you are using a probabilistic system as a deterministic one.

The fix is not to remove the LLM. It is to move it. Language models are extraordinary at one thing that traditional software is terrible at: turning messy natural language into structure. They are unreliable at the next thing: executing multi-step logic without drifting. So split the job along that exact seam.

Think of the LLM as a courtroom translator and the engine as the judge. The translator turns the contract from human language into something precise; the judge applies it the same way every time. You would never let the translator also decide the verdict - and you should not let the model that reads your promo terms also rule on whether this cart qualifies.

The payoff is not just accuracy. It is *where the control point sits*. When an LLM decides directly, the thing you would have to audit is a probability distribution over tokens - you cannot. When the LLM instead emits a small set of rules, the thing you audit is the rules. A human confirms "this specification matches our intent." The engine guarantees "the answer follows from the specification." Completeness of the spec does not vanish - it moves to a place a person can actually inspect.

This is not a new idea in the research literature, and the literature is worth knowing because it tells you the gain is real and not a quirk of one setup. Logic-LM - delegating the reasoning step to a symbolic solver - reported **+39.2%** over standard prompting and **+18.4%** over chain-of-thought. CLOVER, which translates to first-order logic and post-verifies with the Z3 solver, pushed harder still: 62.8% versus 42.4% on AR-LSAT, 75.4% versus 45.4% on ZebraLogic. Different teams, different tasks, same direction: when a sound engine does the executing, accuracy on rule-shaped problems jumps.

If you delegate execution to an engine, which engine? In my lab the engine was Prolog - specifically SWI-Prolog, with a constraint solver (CLP(FD)) for the search-heavy problems.

If you have never touched it: Prolog is a language where you do not write *how* to find the answer - you state the facts and the rules and let the engine find it for you. You declare things like "a discount applies if the cart total is over X and the customer is in group Y," and the engine works out the rest, trying possibilities and throwing away the ones that break a rule. A program reads like a list of statements about the world, not a sequence of steps. It reads like a contract and runs like logic.
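To make the "list of statements" idea concrete, here is a minimal sketch of that discount rule as named conditions plus a tiny evaluator. It is Python rather than Prolog, so it only imitates the declarative style; the threshold, the group name, and every identifier in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cart:
    total_cents: int      # money as integer cents to avoid float rounding
    customer_group: str


# The rule is a set of named conditions, not a sequence of steps.
# Both the threshold (X) and the group (Y) are made-up illustrative values.
DISCOUNT_RULE = {
    "total_over_threshold": lambda cart: cart.total_cents > 10_000,
    "customer_in_group": lambda cart: cart.customer_group == "loyalty",
}


def discount_applies(cart: Cart) -> bool:
    """The discount applies only if every named condition holds."""
    return all(check(cart) for check in DISCOUNT_RULE.values())


def explain(cart: Cart) -> dict:
    """Report which conditions held, so a reviewer can see why."""
    return {name: check(cart) for name, check in DISCOUNT_RULE.items()}


print(discount_applies(Cart(15_000, "loyalty")))  # True
print(explain(Cart(15_000, "guest")))
```

The part a reviewer approves is the `DISCOUNT_RULE` table; the evaluator never changes from rule to rule.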

The obvious objection from any engineer is: I could write that in Python with a few `if` statements - why drag in a logic engine? The answer is not "Prolog the language beats Python the language." Python can express anything Prolog can - you *could* write the solver by hand. The point is **what you are asking the LLM to produce**, and that difference is concrete:

**The search is the runtime, and it is complete.** Ask "is there a valid configuration?" In Python you hand-write the search - loops, recursion, your own backtracking - and every line is a place to introduce a bug, with no guarantee you explored the whole space. In Prolog, backtracking *is* the engine. The checkable consequence: the engine can return UNSAT with a guarantee - "no such configuration exists" - while a hand-written Python script can only tell you "I didn't find one." Absence of evidence is not evidence of absence, and on a refund or a tax edge case that gap is the whole game.

**Correctness is not tied to execution order.** A declarative rule holds or it does not. There is no mutation, no early return, no off-by-one, no "the rules were right but the code that applied them was wrong." That entire bug class is gone.

**A declarative ruleset is a verifiable target; imperative code is not.** This is the one that matters. The LLM's output for a rule is a handful of lines that map one-to-one onto the business contract, and you can *mechanically* check it against probes. Arbitrary Python you can only test - which is precisely the stochastic, "hope the cases cover it" surface you were trying to escape. The Prolog artifact *is* the contract. Python is an implementation of a contract you still have to take on trust.
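The first of the three points above, UNSAT as a guarantee, can be sketched in a few lines. This is not how a Prolog or CLP(FD) engine works internally: it is a brute-force stand-in that enumerates an entire finite domain, which is the simplest way to show why "no answer" can count as a proof. The task names, slots, and constraints are hypothetical.

```python
from itertools import product


def solve(tasks, slots, constraints):
    """Return the first assignment meeting all constraints, or None.

    Every combination in the finite domain is enumerated, so None means
    "no assignment exists", not merely "none was found".
    """
    for values in product(slots, repeat=len(tasks)):
        assignment = dict(zip(tasks, values))
        if all(constraint(assignment) for constraint in constraints):
            return assignment
    return None


def all_different(assignment):
    # No two tasks may share a slot.
    return len(set(assignment.values())) == len(assignment)


def a_before_b(assignment):
    # Task A must be scheduled strictly before task B.
    return assignment["A"] < assignment["B"]


rules = [all_different, a_before_b]
print(solve(["A", "B", "C"], [1, 2, 3], rules))  # {'A': 1, 'B': 2, 'C': 3}
print(solve(["A", "B", "C"], [1, 2], rules))     # None: three tasks cannot fit two slots
```

A real engine reaches the same verdict with backtracking and pruning instead of brute force, and the LLM only has to supply the constraints, not the search.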

One honest caveat: the engine does not have to be Prolog. Any sound declarative engine fits the same role - CLP(FD) or MiniZinc for scheduling and search, SAT/SMT solvers like Z3 for policy checks, Datalog or ASP for closed-world rules. There is even tooling (the MCP Solver) that wires LLMs to MiniZinc, PySAT, and Z3 directly. Prolog is what I validated on, not a requirement of the idea.

As for running it: the LLM generates the program, and it executes in an isolated SWI-Prolog subprocess with timeouts and a stack limit - no model in the loop at execution time, only the deterministic engine. That isolation matters for the auditability claim: the answer comes from the engine or it does not come at all.
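A minimal sketch of that isolation, assuming the `swipl` binary is on `PATH`. The helper names, the default timeout, and the default stack size are assumptions, not the settings used in the experiments; the goal and file name are whatever the generated program defines.

```python
import subprocess
import sys


def swipl_command(program_path, goal, stack_limit="256m"):
    """Build a SWI-Prolog command line (assumes `swipl` is on PATH).

    -q suppresses the banner, -g names the goal to run, and -t halt exits
    once the goal finishes instead of dropping into the interactive prompt.
    """
    return ["swipl", f"--stack-limit={stack_limit}", "-q",
            "-g", goal, "-t", "halt", program_path]


def run_isolated(command, timeout_s=5.0):
    """Run a command in a child process and return its stdout.

    Returns None on timeout, missing binary, or non-zero exit, so the caller
    gets an engine answer or no answer at all, never a guess.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_s, check=False)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout


# Stand-in child process so the sketch runs without SWI-Prolog installed.
print(run_isolated([sys.executable, "-c", "print('ok')"]))
```

In a real pipeline the call would be something like `run_isolated(swipl_command("rules.pl", "main"))`, with `None` treated as "no decision" rather than falling back to the model.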

I tested this across 113 experiments and nine model families - Claude Haiku, Sonnet, and Opus; llama-3.3-70b; mistral-large; deepseek; gpt-oss; gemini; qwen - logging every run, successful or not, to an append-only journal. The point of that breadth is to make sure I was looking at a pattern, not one lucky configuration.

The strongest, most replicated result is on constraint problems - the "zebra puzzle" family, where you satisfy a web of interlocking conditions. On these, the hybrid beat the LLM-only baseline by **+63.3 percentage points** with a cheap llama-3.3-70b and **+19.2 points** with Claude Haiku, at N=120, with 95% confidence intervals that exclude zero. I controlled for the obvious confound - that the gain came from leaking structure into the prompt - by re-running with a neutral, off-task example; the advantage held.

The most striking version of the result: models that solve roughly **0%** of these puzzles on their own reach roughly **100%** once they are formalizing for an engine instead of answering directly. A cheap model with a sound back end performs like a frontier model on this class. That tracks an independent finding too - ChatLogic reported that the gain from delegating to a solver concentrates in weaker models and deeper inference, which is exactly what you would expect if the engine is supplying the rigor the model lacks.

There is a deeper lesson hiding in the failures. When the hybrid was wrong, it was never the engine that was wrong. The engine executes the contract exactly, every time. Every failure lived in the formalization step - the model wrote a specification that did not match the intent. Which leads straight to the one condition you have to respect.

The honest boundary is this: the method works when a capable model does the formalizing. Below a certain capability floor, a weak model writes rules that *look* right and are wrong - and because the engine faithfully executes whatever contract it is handed, a wrong contract produces a confident wrong answer. In my runs, every hybrid failure traced to this step, not to execution.

That is not "it works sometimes." It is "it works given a strong enough translator" - which is a measurable, controllable condition rather than a roll of the dice. It also reframes the model-selection question. You are not buying a model to *be* the decision engine. You are buying it to *write the rules once*, correctly. That is a narrower, more testable bar, and it is the right place to spend capability budget.

Not every task belongs in this pipeline. Before you reach for it, check three things. The method applies when:

Open-world tasks fail the gate: free-form natural language, forecasts with no spec, anything that turns on *interpreting* intent rather than *applying* a rule. That is not a weakness of the method - it is the edge of its class, and knowing the edge is what keeps you from misapplying it.

Rank the opportunities by how badly the current approach hurts and how much auditability is demanded, and a clear order emerges.

**Tax and payroll calculation** sit at the top. The pain is measured (TaxCalcBench), the class is a clean fit, and people already build it this way: SARA, an academic system, translates statutes and facts into Prolog and runs them on SWI-Prolog so that auditors can inspect the exact reasoning path; OpenFisca is an open-source "rules as code" engine for tax and benefit law used by several national governments, sold explicitly on algorithmic transparency.

**Credit underwriting and adverse-action decisions** are the strongest case on the regulatory axis. Under the US Equal Credit Opportunity Act and its Regulation B, a lender must state the *specific* reasons for a denial - not "the algorithm said no." A verifiable rules engine satisfies that by construction; a black-box model does not.
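What "by construction" can look like is sketched below: each rule carries its own reason text, so a denial cannot be produced without the specific rules that caused it. The thresholds and field names are purely illustrative. They are not drawn from Regulation B or from any lender's policy.

```python
# Each rule pairs a human-readable reason with a predicate that is True
# when the applicant FAILS that rule. All thresholds are hypothetical.
DENIAL_RULES = [
    ("debt-to-income ratio above 43%",
     lambda a: a["debt_to_income"] > 0.43),
    ("credit history shorter than 12 months",
     lambda a: a["history_months"] < 12),
    ("delinquency reported in the last 24 months",
     lambda a: a["recent_delinquency"]),
]


def decide(applicant):
    """Approve only if no rule fails; otherwise list every failed rule."""
    reasons = [text for text, fails in DENIAL_RULES if fails(applicant)]
    return {"approved": not reasons, "reasons": reasons}


applicant = {"debt_to_income": 0.50, "history_months": 30,
             "recent_delinquency": False}
print(decide(applicant))
# {'approved': False, 'reasons': ['debt-to-income ratio above 43%']}
```

The reasons are a by-product of evaluating the rules, not an explanation generated after the fact.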

**E-commerce - promo, discounts, eligibility, cart rules, pricing, tax** - is the same decidable class, and it is where most merchants will actually meet this problem as they wire agents into checkout. I will be honest about a gap here: there is no TaxCalcBench for e-commerce, no published benchmark that measures pure-LLM failure on promo logic specifically. The case rests on class-transfer from the measured domains, not on an external industry number. The architecture argument is strong; the industry-measured "pain" number, for now, is not there. Below that sit KYC/AML policy checks, fee and commission calculation, regulatory reporting, and returns policy - the class fits, and there are industry signals (Amazon's VeRAFI work on "neurosymbolic policy generation") even if no one has published a clean measurement.

There is a cost objection: capable models are not free, and if you call one on every decision the bill adds up. The answer is to stop thinking per-call.

When the output you want is not a single answer but a *solver* for a whole class of decisions, the right cost model is amortization. The LLM runs once - as a one-time compiler that turns the natural-language spec into a single verified, parametrized engine. After that, the engine runs on CPU at essentially zero marginal token cost; you inject each instance's data as facts and it computes. In my experiments the total-cost break-even landed around two to three instances - past that, the compiled engine is cheaper *and* more accurate than calling the model each time. The certified engines generalized cleanly to held-out instances: 32 of 32 on a scheduling class (including correctly proving the unsatisfiable ones), 30 of 30 on a promo-stacking class. This is the same shape as the emerging "Compiled AI" idea - call the model once at build time, then run the workflow as static code with zero tokens per transaction.
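The amortization argument reduces to one inequality: compiling once wins as soon as the one-time cost plus n engine runs is no more than n model calls. A small sketch follows, with made-up cost units that are not the measured figures from the experiments.

```python
import math


def break_even_instances(compile_cost, per_call_cost, per_run_cost=0):
    """Smallest n where compile_cost + n * per_run_cost <= n * per_call_cost.

    Returns None when a model call is no more expensive than an engine run,
    because compiling then never pays for itself.
    """
    saving_per_instance = per_call_cost - per_run_cost
    if saving_per_instance <= 0:
        return None
    return math.ceil(compile_cost / saving_per_instance)


# Hypothetical units: compiling costs 250, each direct model call costs 100,
# and an engine run costs nothing in tokens.
print(break_even_instances(250, 100))  # 3
```

The shape matters more than the numbers: the break-even point scales with the one-time compile cost and shrinks as per-call cost grows.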

Two caveats keep this honest. First, certifying that engine is itself capability-gated: of the non-Anthropic model families I tested as the compiler, only one converged reliably; others started at the same place but could not act on the engine's counter-examples to repair their own rules. The gate is the ability to *use* feedback, not raw one-shot quality. Second, an LLM asked to *re-solve* the same problem on every call - rather than compile a solver once - is the wrong tool and the data says so. The win is the compiler framing, not the model as a repeated calculator.
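The first caveat describes a loop: propose rules, run them against probes, hand back the counter-examples, and repeat. The sketch below shows that loop with two stub "models" standing in for the LLM, one that acts on feedback and one that does not. The function names, the probes, and the round limit are all hypothetical.

```python
def certify(propose, probes, max_rounds=3):
    """Ask for a rule, test it on probes, feed counter-examples back.

    Returns (rule, rounds_used) once every probe passes, or
    (None, max_rounds) if the proposer never converges.
    """
    feedback = []
    for round_no in range(1, max_rounds + 1):
        rule = propose(feedback)
        feedback = []
        for value, expected in probes:
            actual = rule(value)
            if actual != expected:
                feedback.append((value, expected, actual))
        if not feedback:
            return rule, round_no
    return None, max_rounds


def stub_uses_feedback(feedback):
    # First draft has a boundary bug; the repair makes "over 50.00" strict.
    if not feedback:
        return lambda total_cents: total_cents >= 5_000
    return lambda total_cents: total_cents > 5_000


def stub_ignores_feedback(feedback):
    # Same first draft, but it never changes whatever the engine reports.
    return lambda total_cents: total_cents >= 5_000


PROBES = [(4_000, False), (5_000, False), (6_000, True)]

print(certify(stub_uses_feedback, PROBES)[1])     # 2
print(certify(stub_ignores_feedback, PROBES)[0])  # None
```

Both stubs start from the same flawed draft; only the one that reads the counter-example gets certified, which is the capability gate described above.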

Most "add an AI agent to your store" advice quietly assumes the agent should *make* the decision. For the parts of commerce that move money, that is the one thing it should not do. The agent's real job there is to read the rules and write them down formally; a deterministic engine should decide. You get the model's fluency at the fuzzy edge and the engine's guarantees at the core - accuracy, completeness, and an answer you can defend line by line to an auditor or a customer.

That is a layer the agent-commerce conversation keeps skipping, and it is the one that decides whether any of this is safe to put near a checkout. If you are wiring agents into the money path, the question to ask is not "which model decides?" It is "what does the model *formalize*, and what executes the contract?" Get that seam right and the rest of the architecture has somewhere solid to stand.

*I am currently building an e-commerce project of my own around cart-rule and discount logic - the money-path surface this whole article is about. Real promo rules are where the formalize-don't-decide seam gets stress-tested, and the first place I would scrutinize before letting any agent near a checkout. Designing and building this kind of deterministic layer for merchants is the work I do - if that is the seam you are wrestling with, that is where I can help.*