Never Let the LLM Write the Joins

wpnews.pro

The subject graph, and the one architectural line that made me comfortable putting “talk to your database” in front of real users with real permissions.

Part 3 of 4 on building a conversational analytics engine. ~11 min read.

Part 2 built a map: three databases introspected and enriched into a domain graph. Tables, columns, join paths, business meaning, and the security rules, all sitting in a graph with a fast in-memory snapshot in front of it.

A map answers nothing on its own.

Nobody asks “describe the schema.” They ask “how many orders did Bike World place last quarter, in my territory,” and they expect a number, computed from data they are actually allowed to see. This article is the machine that does that.

Back in Part 1, I listed seven walls that break text-to-SQL on real data. This is where four of them fall: the fan-out trap (Wall 3), ambiguity (Wall 4), security (Wall 6), reproducibility (Wall 7), plus the cross-source half of Wall 5. If Part 2 was the “why a map,” this is the “why a particular shape of pipeline.”

And I want to lead with that shape, because it is the single most important idea in this entire series.

The big idea: a query is not “hand the question to an LLM and let it call some tools.” I split it in two.

Phase A is deterministic planning. It decides what to do (tables, entities, filters), scoped to who is asking. One LLM call, for reading intent. Everything else is code.

Phase B is LLM-driven execution. It does the plan (writes SQL, runs it, recovers, answers).

The line between them is the product. Security is never the model’s job. Joins are never the model’s job. The model does the creative parts; code does the parts that must be safe and reproducible.

TL;DR

The subject graph resolves the specific things a user names (“Bike World” becomes StoreID 620). It is a cache, not a copy.

Phase A plans deterministically and produces a frozen plan. Phase B is a bounded ReAct loop that executes it.

The LLM writes only the SELECT and WHERE. Joins come from the graph. Security is injected by rewriting the SQL. Cardinality is fixed by a planner. None of those are left to the model.

Honest limit: today the conversation is single-shot. Follow-up memory is built but not wired.

Throughout, one example: a sales rep I will call Dana, scoped to the Southwest territory, allowed to see sales data, asks “Show orders for customer Bike World.” Watch what each stage does to that sentence.

The subject graph: concepts vs. actual things

The domain graph from Part 2 knows about types. It knows CUSTOMER is a concept that lives in a table with a name column.

It does not know that “Bike World” is a specific store with the key StoreID 620. That is an instance, and instances live in the subject graph.

A canonical entity in the subject graph has:

a stable id (something like entity:organization:bikeworld_a1b2c3d4) a name and type

a set of aliases

a list of source links, each pointing at exactly where the entity lives: StoreID 620, in the StoreID column, of Sales.Store, in the SQL Server source

an access level, so resolved entities are protected by the same machinery as tables

The defining property: it is a cache, not a copy. It is not pre-loaded with every customer. That would be copying the data, the one thing I refuse to do. It starts essentially empty and fills lazily, remembering only the names real users actually ask about. The rows stay in the source database. The subject graph just remembers the mapping from “a name a human typed” to “a key the database understands.”

Resolving a name (the cascade) When the pipeline needs to turn “Bike World” into a key, it runs a cascade, ordered from cheapest-and-most-certain to most-expensive-and-least-certain. It stops as soon as it is confident.

How to read it (with the real numbers, because the numbers are the design):

Cache keyed on the name. Stores unfiltered results; RBAC applies after, so users with different permissions can share an entry.

Exact match: canonical name scores 0.95, an alias 0.90. One clean hit above the 0.70 threshold short-circuits the whole thing.

Fuzzy match: the rapidfuzz ratio scorer at a threshold of 80/100, with tiered confidence so it never pretends to be as sure as an exact match (0.65 to 0.85).

Semantic match: I will be straight with you. It is a placeholder. It sits in the right position in the cascade and returns nothing today. The plan is embedding similarity; the wiring is not done. Wherever you read “exact, fuzzy, semantic, source,” know the semantic rung is a stub.

Source-DB lookup: hit the real table. An exact name scores 0.95, and a success gets written back into the subject graph so the next person skips to a cache hit. That is the lazy cache filling.

Two design choices I would defend in any review:

“Ambiguous” is a first-class outcome. Two close candidates returns “did you mean the Seattle or the Dallas store”, not a coin flip. (This is the fix for Wall 4.)

Resolution is fail-closed. The cascade runs and caches unfiltered, but nothing returns until it passes RBAC: can this user see the entity, its links, and any restricted properties. No scope and no trusted-caller flag means deny, not leak.

Phase A: planning, stage by stage

Dana’s question comes in. Phase A turns it into a plan without writing one character of SQL.

User context assembly. Load who Dana is: role (sales rep), territory (Southwest), permissions (sales schema), saved vocabulary and preferences. Everything downstream is scoped by this.

Intent classification (the one LLM call). Returns a structured verdict: intent type (a SQL query), entity types (CUSTOMER and the order entity), confidence, and a default join type. A second focused call pulls “Bike World” out as a value tied to CUSTOMER. A type gate then throws out any extracted value that does not fit the column’s type.

Semantic resolution (the 2-pass vector search from Part 2). Turns “orders” into the order entity by meaning, not spelling, and flags ambiguity when two concepts score within a hair of each other.

Scope resolution. Turns words like “my team” into row-filter hints, on pre-derived rules, in a few milliseconds. One subtlety I am careful about: this produces relevance hints, not security. Convenience and access control are different things, and conflating them is how you build something convenient and insecure.

Routing (the deterministic core). Maps entity types to real tables, assigns each entity a role (the order entity is the subject, CUSTOMER is a filter) using a four-rule scoring system with no LLM, finds the foreign-key join path by traversing the domain graph, grounds “Bike World” to StoreID 620, and detects whether the query spans sources.

RBAC Checkpoint 1. The first of two security gates. Coarse and early: drop any table the user may not see before the model ever sees the schema. Dana keeps her sales tables. Had she asked about HR salaries, that table would vanish here, and the model would never learn its column names.

The output of Phase A is a frozen plan plus a warm entity cache. No SQL exists yet. Phase A decides everything that has to be decided safely, and hands the model a plan it would have to work hard to misuse.

How to read it: the question flows top to bottom through Phase A (all blue, deterministic, except the one purple intent call), produces a frozen plan, and only then enters the Phase B loop where the model actually works. The two red boxes are the security gates, one per phase.

Phase B: where the model actually works

Phase B is a small state machine with three nodes:

agent node: the LLM reasons and decides the next action

tools node: executes exactly one tool, appends the result

synthesize node: writes the final answer

It loops between thinking and acting, then synthesizes. And it is bounded: at most 10 reasoning iterations and 15 tool calls per query, with a 30-second timeout per tool.

Those limits are not decoration. They are the difference between “the agent retries a failed query once and succeeds” and “the agent burns your budget in an infinite loop of slightly-wrong SQL.” (This is part of the fix for Wall 7.)

The agent has 11 tools but most queries touch two or three. A nice detail: resolve_entity looks like a live tool, but because Phase A already resolved Bike World into the warm cache, it usually just returns the cached answer if the confidence is above 0.5. The same operation is both "a tool the agent can call" and "something Phase A already did."

Generating SQL without letting the model touch the dangerous parts

This is where the two-phase philosophy pays off most concretely.

How to read it: the model (the one purple box) writes only a slice of the query. Everything else is deterministic.

The joins come from the graph, not the model. A join-tree builder traverses the foreign-key edges. No path between two tables means a loud error, never a guessed join. (This is the fix for Wall 2.)

The model gets a whitelist of real columns and writes only the SELECT, WHERE, GROUP BY, and limit. Every column it returns is validated. A made-up or unqualified column rejects the whole generation. Unvalidated SQL never reaches the database.

Filters get merged in by code, not the model.

The load-bearing rule, again: the LLM never writes the joins. A made-up join is a silently wrong answer, and silently wrong is the worst failure an analytics system has.

The fan-out trap (Wall 3, finally solved)

Let me tell you about the bug that taught me to respect cardinality.

Ask for total revenue per customer. The query joins customers to orders to line-items. The moment you SUM across a one-to-many join, the value repeats once per child row.

A customer with one $100 order and five line items reports $500.

The SQL is perfect. The number is garbage. Nobody notices.

So I built a deterministic planner that rewrites the SQL to prevent it. It keys entirely on the cardinality_class metadata from Part 2 (the PRESERVING vs MULTIPLYING tags), touches no table names, never calls the model, and fails open (if it cannot reason safely, it returns the original).

Its strategies, picked by what the query actually references:

Prune a multiplying join that is joined but never used (transitively, so chains collapse).

Semi-join: a child used only in the WHERE becomes an EXISTS subquery, which filters without multiplying.

Per-grain subqueries: two measures at two different grains get computed separately and joined back with a full outer join and null-safe matching.

Count anchoring so a plain row count does not lie either.

Why is this in deterministic code and not the prompt? Because “be careful about fan-out” is right most of the time, and “most of the time” is not a number you put on a financial dashboard. Cardinality correctness is provable from graph metadata, so I prove it.

The second security gate (Wall 6) When the agent calls execute_sql, the SQL is correct but not yet safe. RBAC Checkpoint 2 runs here, and it is the fine-grained one.

Where Checkpoint 1 decided which tables, Checkpoint 2 decides which rows and columns.

It rewrites the SQL’s syntax tree: it injects mandatory and row-level filters straight into the WHERE. Dana's query silently gains a Southwest-territory constraint.

A failed access check injects WHERE 1=0, a silent deny that returns zero rows rather than leaking the existence of restricted data.

Restricted columns are masked on the way out.

The entire RBAC path is deterministic. There is no LLM anywhere in it, by design. Access control you cannot reproduce is access control you cannot defend.

Then the agent synthesizes: it writes “Bike World placed 8 orders last quarter,” with any caveats (like “limited to your territory”) and a confidence, plus the SQL and rows underneath.

The validation mesh

Threaded through both phases is a layer I think of as a mesh, not a stage: eight validators, forty-four checks, firing at hook points all over the pipeline.

How to read it:

Three validators run in Phase A, and two of them (route, entity) can hard-abort the query before the agent loop ever starts. No point generating SQL for a query that was never going to work.

The rest run in Phase B, guarding execution and the answer. Most checks are deterministic; a few are LLM calls (does the answer follow from the data), and those get a strict budget of three LLM validation calls per query.

Two choices that matter:

Every validator is wrapped in try/except. A validator that crashes, or is disabled, never blocks a query. Safety code that can take down the system it protects is a liability, so it is built to fail safe.

Checks have severity tiers: block, warn (becomes a caveat, lowers confidence), or log-only. Not every problem should stop a query. Some just attach a footnote.

Shipped vs. next (an honest naming note): the “memory validator” in this mesh has nothing to do with conversation memory. It checks whether a cached result went stale. The actual conversation is single-shot today. The read side of memory is wired (the system loads your saved vocabulary and scope on every query), but the write-back that would let Dana ask “and the quarter before that?” and have it understand “that” is built and not yet connected. Ask a follow-up today and it is treated as a brand-new question. I would rather say that than imply a multi-turn experience I have not finished.

When the data lives in two databases at once (Wall 5)

Some questions span sources: SQL Server sales joined to Postgres production. There is no cross-database JOIN here, and no copying one source into the other.

A federated executor decomposes the query by source.

It generates and runs a separate, independently RBAC-enforced query against each source, pulling only the rows that survive each source’s own rules.

It joins the results in the application layer, in memory, by matching on entity keys.

The requested join type (inner, left, right, full outer) is honored by deciding which side’s unmatched rows to keep.

Each database only ever sees a query for its own data, scoped to what the user may see. The rows are stitched together only after each has been cleared. The data never leaves its home system. That property is non-negotiable.

Why keep both a plan and a loop?

Fair question: if Phase A plans everything, what is the loop for? They fail in different ways, and I wanted coverage on both.

LLM? (Deterministic plan (Phase A): only intent classification; ReAct loop (Phase B): yes: SQL text + prose)

Gives you (Deterministic plan (Phase A): safety, reproducibility, auditability, speed; ReAct loop (Phase B): execution, self-correction, tool choice)

Costs you (Deterministic plan (Phase A): adaptability (the plan is frozen); ReAct loop (Phase B): nondeterminism (must be fenced))

The loop earns its keep when the model generates SQL with a wrong column, the database errors, and the agent reads the error, regenerates, and succeeds. A straight pipeline cannot self-heal.

The plan earns its keep by making sure the model can retry SQL all it wants but can never retry its way into a forbidden table or an invented join.

The honest current limit: today the agent is locked into Phase A’s plan. It can fix a bad query, but not a bad plan. If routing picked the wrong table, retrying SQL will not help, because the agent cannot re-route mid-flight. The roadmap is to make the planning steps themselves into tools the agent can re-call inside the loop. That is the difference between a system shaped like an agent and one that is fully agentic, and it is the next real piece of work.

Key takeaways

Split the stochastic from the deterministic, and be ruthless about which is which. The model understands questions and writes creative SQL well. It is unaccountable at deciding access and inventing joins. Opposite sides of a hard line.

Resolve the things a user names through a cheap-to-expensive cascade, and let “ambiguous” ask a question instead of guessing. A confident wrong answer is worse than an honest “which one?”

Prove what you can prove with metadata instead of asking the model to be careful. Cardinality and access control are both provable from the graph, so they live in code, never in a prompt.

Make your safety layer unable to take down the thing it protects. Wrap it, budget it, tier it.

In Part 4, I pick up the thread I keep deferring: documents. Everything here was structured data. Real questions often need a number and a paragraph from a PDF. The subject graph is meant to be the hub that ties the two together, and that is the most forward-looking, and most honest, story of the four.

Next up, Part 4: Two Pipelines, Three Graphs, on fusing structured and unstructured retrieval with RAG.

Never Let the LLM Write the Joins was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net — original article The Day I Stopped Babysitting My AI and Started Building Loops The Sonnet 5 Price is Not What You Think It Is Embodied AI Agent Architecture: Build Physical-World AI Without Treating Robots Like Chatbots

Never Let the LLM Write the Joins

Run your AI side-project on zahid.host