Building a Natural Language Query Interface for Your Database: A Developer's Blueprint

A developer outlines a blueprint for building a natural language query interface for databases using LLMs. The system architecture includes schema retrieval via RAG, prompt assembly, SQL generation, validation, and safe execution. The approach aims to replace one-off SQL queries with plain English questions, improving accessibility for non-technical team members.

Every product team eventually hits the same wall. Marketing wants to know which signups came from last week's campaign. Support wants a list of customers on the Pro plan who opened a ticket in the last 48 hours. The founder wants MRR by cohort, broken down by acquisition channel, by yesterday at 9am. And every time, someone pings the nearest engineer to write yet another one-off SQL query. The dream is obvious: let people ask questions in plain English and get answers from the database. For years, that dream lived in expensive enterprise BI tools. In 2026, with capable LLMs available behind an API call, you can build a credible natural language query interface yourself — or use something like Draxlr https://www.draxlr.com/features/AI/ that ships with AI-powered SQL generation already built in. If you want to understand how it works under the hood, or roll your own, this post walks through what such a system actually looks like: the architecture, the SQL plumbing, the prompt design, and the production failure modes you'll want to plan for before your first user types a question. At its simplest, a text-to-SQL system is a function: php question string - SQL query string - result rows The naive implementation is one prompt to GPT-style model: "Here's my schema, here's the question, write a SQL query." That works for toy demos and falls apart the moment you point it at a real database with 80 tables, 14 of which are named some variation of users . A production-grade interface looks more like this: question | v schema retrieval <-- pull only relevant tables | v prompt assembly <-- schema + examples + guardrails | v SQL generation <-- LLM call | v validation <-- parse, lint, dry-run | v safe execution <-- read-only role, row limits, timeouts | v result + the SQL it ran always show this Each of those stages is a thing you build. Let's walk through them. The single biggest accuracy lever is the schema context you feed the model. Dumping your entire schema into the prompt sounds tempting and is almost always wrong: it blows your context window, costs more, and — counterintuitively — makes the model less accurate because it has to pick the right tables out of a haystack. The fix is retrieval. Treat your schema like a knowledge base, embed it, and pull only what's relevant to the question. Start by capturing a clean description of every table: SELECT c.table name, c.column name, c.data type, pgd.description AS column comment FROM information schema.columns c LEFT JOIN pg catalog.pg statio all tables st ON st.schemaname = c.table schema AND st.relname = c.table name LEFT JOIN pg catalog.pg description pgd ON pgd.objoid = st.relid AND pgd.objsubid = c.ordinal position WHERE c.table schema = 'public' ORDER BY c.table name, c.ordinal position; For each table, build a small text document that the LLM can actually read: Table: subscriptions Description: One row per customer subscription. A user can have at most one active subscription at a time. status is one of active, paused, cancelled. Columns: - id uuid, PK - user id uuid, FK - users.id - plan id uuid, FK - plans.id - status text - mrr cents integer -- monthly recurring revenue in cents - started at timestamptz - cancelled at timestamptz, nullable Sample rows: id=a1.., user id=u3.., plan id=p pro, status=active, mrr cents=4900 Embed each of those documents with any embedding model, store the vectors, and at query time embed the user's question and pull the top 5–10 most similar tables. This is plain RAG, applied to schema instead of documents. The payoff is dramatic. A question like "how many users upgraded to Pro last month?" will retrieve users , subscriptions , and plans — and leave the 70 other tables out of the prompt entirely. Once you have the relevant tables, the prompt itself follows a predictable shape: Here's a stripped-down version in pseudocode: python def build prompt question, tables, examples : return f"""You are a PostgreSQL expert. Generate a single SELECT query to answer the user's question. Rules: - Use only the tables and columns shown below. - Always include LIMIT 1000. - Use ISO date literals e.g. '2026-05-01' . - Never write INSERT, UPDATE, DELETE, DROP, or DDL. - If the question is ambiguous, return a JSON object {{"clarify": "..."}} instead of SQL. Schema {format tables tables } Examples {format examples examples } Question {question} Return only the SQL, no explanation.""" The two underrated pieces here are the examples and the clarification escape hatch . A handful of question/SQL pairs from your domain teaches the model your conventions — that mrr cents is in cents, that "active users" means status = 'active' AND last seen at now - interval '30 days' , that you always join on tenant id . Three good examples often beat a thousand words of instruction. The clarification hatch is the difference between a tool that hallucinates confidently and one that admits when a question is too vague. Ambiguous natural language is the 1 source of bad SQL, and giving the model a way to ask back is far better than letting it guess. Never run an LLM-generated query straight against your database. There are three layers worth wiring up. Parse it. Run the SQL through a parser like sqlglot or pgsql-parser . If it doesn't parse, you have a clean signal to either retry or report the error — no need to wait for the database to reject it. Lint it. Walk the parsed AST and reject anything that isn't a SELECT . Reject queries that reference tables outside your allowed list. Reject queries without a LIMIT . This is your defence against a creative model that decides DELETE FROM users is a reasonable answer. python from sqlglot import parse one, exp def is safe sql, allowed tables : tree = parse one sql, read="postgres" if not isinstance tree, exp.Select : return False, "only SELECT allowed" for t in tree.find all exp.Table : if t.name not in allowed tables: return False, f"table {t.name} not allowed" if not tree.args.get "limit" : return False, "missing LIMIT clause" return True, None Dry-run it. PostgreSQL has EXPLAIN . Run the query under EXPLAIN not EXPLAIN ANALYZE — that executes it to confirm the planner accepts it. If EXPLAIN returns an estimated cost above some threshold, refuse to run it or warn the user. The query has parsed, linted, and been planned. Now run it — but not as your application's regular DB user. Create a dedicated read-only role with the minimum privileges needed: CREATE ROLE nl query runner LOGIN PASSWORD '...'; REVOKE ALL ON ALL TABLES IN SCHEMA public FROM nl query runner; GRANT SELECT ON users, subscriptions, plans, events TO nl query runner; ALTER ROLE nl query runner SET statement timeout = '10s'; ALTER ROLE nl query runner SET default transaction read only = on; Statement timeouts catch runaway queries. default transaction read only is belt-and-braces protection in case your lint layer ever has a bug. Granting SELECT only on the specific tables you've exposed means even a perfectly-crafted injection can't touch your secrets table. For multi-tenant apps, layer Postgres row-level security on top, so even a query like SELECT FROM subscriptions only sees the calling tenant's rows. I wrote about this in a previous post on row-level security for embedded dashboards. Imagine a SaaS analytics app where a user types: "What was our MRR from Pro customers in April?" Here's what happens: subscriptions , plans , and users . SELECT SUM s.mrr cents / 100.0 AS mrr dollars FROM subscriptions s JOIN plans p ON p.id = s.plan id WHERE p.name = 'Pro' AND s.status = 'active' AND s.started at <= '2026-04-30' AND s.cancelled at IS NULL OR s.cancelled at '2026-04-30' LIMIT 1000; LIMIT .| mrr dollars | |---| | 48,372.00 | The UI shows the answer and the SQL that produced it. Always show the SQL. Users learn to trust the system faster when they can see its work, and your power users will start tweaking the SQL directly. A few traps you'll hit if you don't plan for them: Trusting the model on dates. LLMs are weirdly bad at "last week" vs. "the last 7 days" vs. "the previous calendar week". Resolve relative time expressions in your code before generating SQL, and inject explicit dates into the prompt. Ambiguous column names. If you have users.created at and subscriptions.created at , a question like "how many created this month?" is genuinely ambiguous. Detect this and fall back to the clarification hatch instead of guessing. Pre-aggregated columns. If you have a daily metrics rollup table, the model may not know whether to query it or recompute from raw events. Document this explicitly in the table description: "Prefer this table for date-based aggregations; events table only for event-level analysis." Joins that explode. A model can produce a cartesian join with a missing condition and pull back a billion rows. The EXPLAIN cost check and the statement timeout are your safety net. Currency, units, and time zones. mrr cents is not mrr dollars . started at might be UTC; the user's "April" might mean Pacific Time. Encode these in your schema documentation and your few-shot examples, or expect surprising answers. Showing only the answer, not the SQL. This destroys trust. The first time a user gets a number that looks off and has no way to inspect it, they'll stop using your tool. Always show the query. A practical natural-language-to-SQL interface is not a single LLM call. It's a small pipeline: retrieve the relevant schema, assemble a tight prompt with examples and rules, generate the SQL, validate it before it ever touches the database, and run it under a tightly-scoped role. The LLM is the interesting part, but the boring infrastructure around it — schema retrieval, parsing, linting, read-only roles, timeouts — is what separates a demo from a product. Build for the failure cases from day one. Add a clarification path for ambiguous questions. Always show the generated SQL. Log every question and query so you can mine them for new few-shot examples. The systems that work in production are the ones that treat the LLM as a junior analyst whose output always gets reviewed, not as an oracle. Have you built a natural language interface for your own database, or are you using one in a product? What broke first when real users started typing into it? Drop your war stories — and the tools you reached for — in the comments. I'm especially curious which retrieval strategies have worked for people with very large or very messy schemas.