cd /news/large-language-models/shipping-a-local-llm-api-with-fastap… · home topics large-language-models article
[ARTICLE · art-38188] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

Shipping a Local LLM API with FastAPI and Ollama

A developer built a production-grade API for a 3B text-to-SQL model using FastAPI and Ollama, enabling natural language queries against SQLite databases at zero inference cost. The API, part of the de-swarm project, includes six endpoints for health checks, schema exploration, query generation, and read-only execution. In a demo, the model correctly generated a complex multi-table SQL query for NPS scores by plan, running on a $5 VPS with no GPU.

read12 min views1 publishedJun 24, 2026

Phase 2 of the de-swarm project — how I turned a 3B text-to-SQL model into a production API for $0.

Three weeks ago, I distilled a 120B+ text-to-SQL pipeline into a 3B QLoRA fine-tune of Qwen2.5-Coder-3B-Instruct. The model hit 90% in-domain accuracy and 55.5% on Spider, ran on a laptop CPU via Ollama, and cost $0 to train and $0 to inference. I wrote about it in Phase 1.

But "I have a model that runs on my laptop" is a different category of deliverable than "I have an API anyone can call." The first is a research artifact. The second is a product.

Phase 2 was about crossing that gap. This post is the story.

A FastAPI gateway around the Ollama model with six endpoints:

Method Path What it does
GET /health
Liveness + Ollama reachability + available schemas
GET /schemas
List of SQLite DBs the API can execute against
GET /schemas/{name}
Tables, columns, and sample values for a given schema
POST /query
Natural language → SQL (no execution)
POST /execute
SQL → rows (read-only, sandboxed)
POST /query-and-execute
NL → SQL → rows (the magic endpoint)

The full code is at github.com/nurahmad-data/de-swarm-api. About 800 lines of Python across 7 files, 15 tests, Dockerized, deployable to a $5 VPS.

Let me start with the most impressive thing the API does, then work backwards into how it works.

I have a SaaS schema with 16 tables — organizations, users, plans, subscriptions, invoices, support tickets, NPS surveys, API usage logs, webhooks, feature flags, etc. Standard B2B SaaS data model.

I asked the API:

"Count of NPS promoters, passives, and detractors by plan"

The model generated this SQL:

SELECT p.plan_name,
  COUNT(CASE WHEN n.score >= 9 THEN 1 END) AS promoters,
  COUNT(CASE WHEN n.score BETWEEN 7 AND 8 THEN 1 END) AS passives,
  COUNT(CASE WHEN n.score <= 6 THEN 1 END) AS detractors
FROM organizations o
JOIN subscriptions s ON o.org_id = s.org_id
JOIN plans p ON s.plan_id = p.plan_id
JOIN nps_surveys n ON o.org_id = n.org_id
WHERE n.survey_date >= DATE('now', '-30 days')
  AND s.status = 'active'
GROUP BY p.plan_name
ORDER BY p.plan_name ASC;

And returned:

plan_name promoters passives detractors
Enterprise 1 0 0
Pro 4 1 0
Starter 6 3 0

Total time: 31.7 seconds. Total cost: $0.

Let's unpack what the model did:

Knew NPS thresholds from training. The standard NPS scoring is promoters (9-10), passives (7-8), detractors (0-6). This wasn't in the schema context. The model learned it from the training data.

Navigated a 4-table JOIN chain. nps_surveys

organizations

subscriptions

plans

. Foreign keys all the way down.

Added an intelligent filter. WHERE s.status = 'active'

— excluding churned accounts from the NPS breakdown. This wasn't asked for, but it's what an analyst would actually want.

Used CASE WHEN correctly. Three CASE expressions with the right thresholds, wrapped in COUNT. Textbook SQL.

Ran on a 3B model on CPU. No GPU, no cloud API, no OpenAI bill.

That's the moment I knew Phase 2 was working. A 3B model on a laptop wrote SQL that an experienced analyst would write, in 31 seconds, for free.

The API is two containers in production:

┌──────────────────────────────────────────────────┐
│  $5 VPS (Hetzner CX22, 2 vCPU, 4 GB RAM + swap)  │
│                                                  │
│  ┌────────────┐         ┌────────────────────┐  │
│  │   Caddy    │────────▶│   de-swarm-api     │  │
│  │  (TLS 443) │  :8000  │  (FastAPI + uvicorn)│  │
│  └────────────┘         └─────────┬──────────┘  │
│                                   │              │
│                          http://ollama:11434     │
│                                   │              │
│                         ┌─────────▼──────────┐  │
│                         │      Ollama         │  │
│                         │  de-sql-3b-q8       │  │
│                         │  (3.3 GB q8_0 GGUF) │  │
│                         └────────────────────┘  │
└──────────────────────────────────────────────────┘

Locally, it's just uvicorn app.main:app --reload

talking to ollama serve

on localhost.

The API itself is structured into 7 modules:

config.py

— Pydantic settings, all env-drivenauth.py

— X-API-Key middleware with constant-time compareollama.py

— async HTTP client wrapping Ollama's /api/generate

schema_fetcher.py

— read-only SQLite schema introspection with smart samplingexecutor.py

— read-only SQL executor with 5-layer safetymodels.py

— Pydantic request/response contractsmain.py

— FastAPI app, 6 routes, lifespan hooksNo LangGraph, no LangChain, no orchestrator. The training pipeline uses all of that; the API doesn't need to. It's just an HTTP server that calls another HTTP server.

The hardest part of shipping a SQL-generating API wasn't the model — it was the safety story. If the model writes DROP TABLE customers;

, the API needs to refuse to execute it. Even if the model is having a bad day.

I ended up with 5 layers, in order:

The first thing executor.py

does is regex-scan the SQL for forbidden keywords:

_FORBIDDEN_PATTERNS = [
    (r"\bDROP\b",     "DROP statement detected"),
    (r"\bDELETE\b",   "DELETE statement detected"),
    (r"\bUPDATE\b",   "UPDATE statement detected"),
    (r"\bINSERT\b",   "INSERT statement detected"),
    (r"\bALTER\b",    "ALTER statement detected"),
    (r"\bTRUNCATE\b", "TRUNCATE statement detected"),
    (r"\bCREATE\b",   "CREATE statement detected"),
    (r"\bATTACH\b",   "ATTACH statement detected"),
    (r"\bDETACH\b",   "DETACH statement detected"),
    (r"\bPRAGMA\b",   "PRAGMA statement detected"),
    (r"\bVACUUM\b",   "VACUUM statement detected"),
    (r"\bREINDEX\b",  "REINDEX statement detected"),
]

Word boundaries (\b

) prevent false positives on identifiers like updated_at

or delete_flag

. If anything matches, the request is rejected with HTTP 422.

Even if the SQL is benign, the executor splits on ;

and keeps only the first non-empty statement. This kills the classic SELECT 1; DROP TABLE customers;

injection pattern — the DROP never runs.

if ";" in sql:
    first = sql.split(";", 1)[0].strip()
    if first:
        sql = first + ";"

This is the belt-and-suspenders layer. Even if regex misses something and the single-statement enforcement fails, the SQLite connection itself is opened with mode=ro

:

conn = sqlite3.connect(
    f"file:{db_path}?mode=ro",
    uri=True,
    timeout=self._timeout_s,
)

SQLite will physically refuse to execute any write operation on a mode=ro

connection, regardless of what the SQL says. The filesystem layer enforces it.

PRAGMA busy_timeout = 5000

plus a connection timeout. Prevents a slow query from hanging the API.

If the query lacks a LIMIT

clause, the executor applies a post-hoc cap (default 100 rows, hard max 10,000). Prevents the model from accidentally writing SELECT * FROM huge_table

and shipping 10 million rows back to the client.

I ran 22 production queries plus 6 adversarial tests (DROP, DELETE, UPDATE, multi-statement injection, unknown schema, empty SQL). Zero security violations. The 5-layer model held.

Here's a debugging story that taught me something fundamental about small-model inference.

The SaaS schema has 16 tables, 135 columns. Initially, my schema fetcher would sample up to 10 distinct values per TEXT column, to give the model concrete examples of categorical values (statuses, plan names, countries, etc.). Total schema context: 14,800 characters, 536 sample values.

Every SaaS query timed out. 60 seconds. 120 seconds. 300 seconds. Nothing worked.

I tried cutting samples to 3 per column. Still timed out. The schema was still 9,800 characters.

I tried cutting samples to 0. That would have worked, but it would have gutted the model's accuracy — the samples are how it knows status = 'delivered'

instead of status = 'completed'

.

The actual fix was smart sampling — two-stage filtering:

Stage 1: Name-based blocklist. Columns whose names match known free-text patterns are never sampled, regardless of cardinality. The blocklist includes: email

, name

, url

, description

, feedback

, comment

, user_agent

, ip_address

, uuid

, token

, secret

, hash

, json

, metadata

, payload

, body

, content

, text

, subject

, message

, note

, label

, tag

, invoice_number

, order_number

, reference

, key_name

, key_prefix

, scopes

, version

, event_triggers

, triggers

.

Stage 2: Cardinality check. For remaining TEXT columns, fetch COUNT(DISTINCT col)

. If it's > 20, skip — it's free text even though the name didn't match the blocklist.

Result: SaaS schema dropped from 9,800 chars → 8,200 chars, sample count from 202 → 96. Every sample the model actually uses (statuses, plan types, regions, priorities) was preserved. Every useless sample (emails, names, URLs, IPs, JSON blobs) was dropped.

After the fix, every SaaS query completed in 10-32 seconds.

The lesson: on CPU inference, prompt size is the dominant cost. A 3B model on a GPU can chew through 5,000 input tokens in a second. The same model on CPU takes 100-150ms per token — 5000 tokens = 500-750 seconds just for prompt evaluation. Trimming the prompt isn't a micro-optimization; it's the difference between the API working and not working.

I ran 22 queries across three schemas (ecommerce, retail, SaaS), spanning simple 1-table GROUP BYs up to 4-table JOINs with CASE expressions and NOT EXISTS subqueries.

Metric Value
Textbook-perfect SQL 16 / 22 (73%)
Partial (returned data, missed nuance) 3 / 22 (14%)
Wrong (over-filtered, 0 rows) 1 / 22 (5%)
Hit 3B model ceiling (window functions) 3 / 22 (14%)
Timeouts 0
Syntax errors 0
Security violations 0
Median latency 17s
Range 9s – 61s
Cost per query $0
Range Count Notes
<15s 5 Simple 1-2 table queries, warm cache
15-25s 10 Most queries — the sweet spot
25-35s 4 Complex 4-table JOINs, CASE expressions
35-60s 2 Cold cache or very large output
>60s 1 Cold-cache outlier

For a 3B model on CPU with no GPU, this is genuinely usable. You wouldn't put it behind a chatbot — 17 seconds is too long for a conversational UX. But for an analytics API where the user clicks "Generate Report" and waits, 17 seconds is fine. Tableau queries take longer.

strftime()

  • DATE('now', '-N days')

usagefact_sales

  • dim_store

correctlyThree queries asked for window functions. The model dodged all three.

Query: "Running total of revenue by month for the last 12 months"

Asked for: SUM(amount) OVER (ORDER BY month)

— running total

Model generated: SELECT month, SUM(amount) FROM ... GROUP BY month

— monthly totals, no running total

Query: "Top revenue-generating organization in each industry"

Asked for: ROW_NUMBER() OVER (PARTITION BY industry ORDER BY revenue DESC)

— top per group

Model generated: SELECT org_name, SUM(amount) FROM ... GROUP BY org_name ORDER BY revenue DESC

— all orgs ranked, not top-per-industry

This is a well-documented limitation of 3B-parameter models. Even 7B models struggle with window functions without explicit training data. The model recognizes what's being asked (it generates SQL in the right shape) but avoids the window-function syntax.

The honest fix is on the roadmap: a 7B validator fine-tuned on window-function training data, planned for Phase 2.5.

On a GPU, a 3B model is fast — sub-second responses regardless of prompt size. On CPU, prompt evaluation is 100-150ms per token. A 5,000-token prompt takes 500-750 seconds just for prefill, before the model generates a single output token.

This means schema context size isn't a nice-to-have optimization — it's the difference between the API working and not working. Smart sampling (cutting useless samples while preserving useful ones) took my SaaS schema from "every query times out" to "every query completes in 15-30 seconds."

Five layers of defense feels excessive until you watch the model actually try to generate DROP TABLE

(it never did, but it could have). Each layer catches a different failure mode. The regex catches obvious attacks. The single-statement enforcement catches injection patterns. The mode=ro

connection catches anything the regex missed. Together, they make the API safe to expose publicly.

I was tempted to only show the queries that worked. But the window-function failures and the over-eager date filters are more interesting than the perfect queries. They show that I understand the model's ceiling — and that I have a plan to push it.

A hiring manager who reads "the model writes perfect SQL every time" thinks "marketing BS." A hiring manager who reads "the model handles 4-table JOINs and CASE expressions flawlessly but dodges window functions — here's why, here's the fix on the roadmap" thinks "this person actually understands what they built."

The Phase 1 deliverable was a model on HuggingFace. Impressive, but it requires the visitor to install Ollama, download 3.3 GB, copy a Modelfile, and run a CLI command to see anything.

The Phase 2 deliverable is curl https://api.yourdomain.com/query-and-execute -d '{"question":"..."}'

. Anyone can try it in 5 seconds. That's a categorical difference in accessibility — for recruiters, for hiring managers, for anyone who might want to use it.

Phase 2.3 — Schema RAG. Instead of dumping all 16 tables into the prompt, retrieve only the 2-4 tables relevant to the question. For "Total users by plan," the model only needs users

, organizations

, subscriptions

, plans

— not webhooks

, deployments

, audit_log

. Brings SaaS inference down to 5-10s and enables any-DB support.

Phase 2.5 — 7B validator. Fine-tune Qwen2.5-Coder-7B on RCA-labeled failed prompts. Build a complexity router: simple queries go to the 3B, complex ones (window functions, multi-step aggregations) go to the 7B + 3B + validator.

VPS deploy. The Docker compose file is ready. Next weekend I'll spin up a Hetzner CX22, point a domain at it, and have a live demo URL for the portfolio.

The code is at github.com/nurahmad-data/de-swarm-api. The model is at huggingface.co/nurahmad-data/de-sql-3b-v2-gguf. The full test results (22 queries with SQL, latency, and verdicts) are in TEST_RESULTS.md.

If you want to reproduce the benchmark locally:

git clone https://github.com/<your-username>/de-swarm-api.git
cd de-swarm-api
pip install -r requirements.txt
ollama pull hf.co/nurahmad-data/de-sql-3b-v2-gguf
ollama cp hf.co/nurahmad-data/de-sql-3b-v2-gguf de-sql-3b-q8
uvicorn app.main:app --port 8000

curl -X POST http://localhost:8000/query-and-execute \
  -H "Content-Type: application/json" \
  -d '{"question": "Count of NPS promoters, passives, and detractors by plan", "schema": "saas"}'

31 seconds later, you'll have your answer. For $0.

This is the second post in the de-swarm series. Phase 1 covered distilling the 120B teacher pipeline into the 3B student model. Phase 3 will cover scaling the dataset to 140 schemas via Spider. Follow along on [LinkedIn] or GitHub for updates.

── more in #large-language-models 4 stories · sorted by recency
── more on @fastapi 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/shipping-a-local-llm…] indexed:0 read:12min 2026-06-24 ·