{"slug": "shipping-a-local-llm-api-with-fastapi-and-ollama", "title": "Shipping a Local LLM API with FastAPI and Ollama", "summary": "A developer built a production-grade API for a 3B text-to-SQL model using FastAPI and Ollama, enabling natural language queries against SQLite databases at zero inference cost. The API, part of the de-swarm project, includes six endpoints for health checks, schema exploration, query generation, and read-only execution. In a demo, the model correctly generated a complex multi-table SQL query for NPS scores by plan, running on a $5 VPS with no GPU.", "body_md": "*Phase 2 of the de-swarm project — how I turned a 3B text-to-SQL model into a production API for $0.*\n\nThree weeks ago, I distilled a 120B+ text-to-SQL pipeline into a 3B QLoRA fine-tune of Qwen2.5-Coder-3B-Instruct. The model hit 90% in-domain accuracy and 55.5% on Spider, ran on a laptop CPU via Ollama, and cost $0 to train and $0 to run inference. I wrote about it in [Phase 1](https://medium.com/@ahmad.dtranslator/how-i-distilled-a-120b-text-to-sql-pipeline-into-a-3b-model-that-runs-on-a-consumer-laptop-2af0004e8a6d).\n\nBut \"I have a model that runs on my laptop\" is a different category of deliverable than \"I have an API anyone can call.\" The first is a research artifact. The second is a product.\n\nPhase 2 was about crossing that gap. This post is the story.\n\nThe result is a FastAPI gateway around the Ollama model with six endpoints:\n\n| Method | Path | What it does |\n|---|---|---|\n| GET | `/health` | Liveness + Ollama reachability + available schemas |\n| GET | `/schemas` | List of SQLite DBs the API can execute against |\n| GET | `/schemas/{name}` | Tables, columns, and sample values for a given schema |\n| POST | `/query` | Natural language → SQL (no execution) |\n| POST | `/execute` | SQL → rows (read-only, sandboxed) |\n| POST | `/query-and-execute` | NL → SQL → rows (the magic endpoint) |\n\nThe full code is at github.com/nurahmad-data/de-swarm-api. 
About 800 lines of Python across 7 files, 15 tests, Dockerized, deployable to a $5 VPS.\n\nLet me start with the most impressive thing the API does, then work backwards into how it works.\n\nI have a SaaS schema with 16 tables — organizations, users, plans, subscriptions, invoices, support tickets, NPS surveys, API usage logs, webhooks, feature flags, etc. Standard B2B SaaS data model.\n\nI asked the API:\n\n\"Count of NPS promoters, passives, and detractors by plan\"\n\nThe model generated this SQL:\n\n```\nSELECT p.plan_name,\n  COUNT(CASE WHEN n.score >= 9 THEN 1 END) AS promoters,\n  COUNT(CASE WHEN n.score BETWEEN 7 AND 8 THEN 1 END) AS passives,\n  COUNT(CASE WHEN n.score <= 6 THEN 1 END) AS detractors\nFROM organizations o\nJOIN subscriptions s ON o.org_id = s.org_id\nJOIN plans p ON s.plan_id = p.plan_id\nJOIN nps_surveys n ON o.org_id = n.org_id\nWHERE n.survey_date >= DATE('now', '-30 days')\n  AND s.status = 'active'\nGROUP BY p.plan_name\nORDER BY p.plan_name ASC;\n```\n\nAnd returned:\n\n| plan_name | promoters | passives | detractors |\n|---|---|---|---|\n| Enterprise | 1 | 0 | 0 |\n| Pro | 4 | 1 | 0 |\n| Starter | 6 | 3 | 0 |\n\nTotal time: **31.7 seconds.** Total cost: **$0.**\n\nLet's unpack what the model did:\n\n**Knew NPS thresholds from training.** The standard NPS scoring is promoters (9-10), passives (7-8), detractors (0-6). This wasn't in the schema context. The model learned it from the training data.\n\n**Navigated a 4-table JOIN chain.** `nps_surveys` → `organizations` → `subscriptions` → `plans`. Foreign keys all the way down.\n\n**Added an intelligent filter.** `WHERE s.status = 'active'` — excluding churned accounts from the NPS breakdown. This wasn't asked for, but it's what an analyst would actually want.\n\n**Used CASE WHEN correctly.** Three CASE expressions with the right thresholds, wrapped in COUNT. 
Textbook SQL.\n\n**Ran on a 3B model on CPU.** No GPU, no cloud API, no OpenAI bill.\n\nThat's the moment I knew Phase 2 was working. A 3B model on a laptop wrote SQL that an experienced analyst would write, in 31 seconds, for free.\n\nThe API is two containers in production:\n\n```\n┌──────────────────────────────────────────────────┐\n│  $5 VPS (Hetzner CX22, 2 vCPU, 4 GB RAM + swap)  │\n│                                                  │\n│  ┌────────────┐         ┌────────────────────┐  │\n│  │   Caddy    │────────▶│   de-swarm-api     │  │\n│  │  (TLS 443) │  :8000  │  (FastAPI + uvicorn)│  │\n│  └────────────┘         └─────────┬──────────┘  │\n│                                   │              │\n│                          http://ollama:11434     │\n│                                   │              │\n│                         ┌─────────▼──────────┐  │\n│                         │      Ollama         │  │\n│                         │  de-sql-3b-q8       │  │\n│                         │  (3.3 GB q8_0 GGUF) │  │\n│                         └────────────────────┘  │\n└──────────────────────────────────────────────────┘\n```\n\nLocally, it's just `uvicorn app.main:app --reload` talking to `ollama serve` on localhost.\n\nThe API itself is structured into 7 modules:\n\n- `config.py` — Pydantic settings, all env-driven\n- `auth.py` — X-API-Key middleware with constant-time compare\n- `ollama.py` — async HTTP client wrapping Ollama's `/api/generate`\n- `schema_fetcher.py` — read-only SQLite schema introspection with smart sampling\n- `executor.py` — read-only SQL executor with 5-layer safety\n- `models.py` — Pydantic request/response contracts\n- `main.py` — FastAPI app, 6 routes, lifespan hooks\n\nNo LangGraph, no LangChain, no orchestrator. The training pipeline uses all of that; the API doesn't need to. It's just an HTTP server that calls another HTTP server.\n\nThe hardest part of shipping a SQL-generating API wasn't the model — it was the safety story. 
If the model writes `DROP TABLE customers;`, the API needs to refuse to execute it. Even if the model is having a bad day.\n\nI ended up with 5 layers, in order:\n\nThe first thing `executor.py` does is regex-scan the SQL for forbidden keywords:\n\n```\n_FORBIDDEN_PATTERNS = [\n    (r\"\\bDROP\\b\",     \"DROP statement detected\"),\n    (r\"\\bDELETE\\b\",   \"DELETE statement detected\"),\n    (r\"\\bUPDATE\\b\",   \"UPDATE statement detected\"),\n    (r\"\\bINSERT\\b\",   \"INSERT statement detected\"),\n    (r\"\\bALTER\\b\",    \"ALTER statement detected\"),\n    (r\"\\bTRUNCATE\\b\", \"TRUNCATE statement detected\"),\n    (r\"\\bCREATE\\b\",   \"CREATE statement detected\"),\n    (r\"\\bATTACH\\b\",   \"ATTACH statement detected\"),\n    (r\"\\bDETACH\\b\",   \"DETACH statement detected\"),\n    (r\"\\bPRAGMA\\b\",   \"PRAGMA statement detected\"),\n    (r\"\\bVACUUM\\b\",   \"VACUUM statement detected\"),\n    (r\"\\bREINDEX\\b\",  \"REINDEX statement detected\"),\n]\n```\n\nWord boundaries (`\\b`) prevent false positives on identifiers like `updated_at` or `delete_flag`. If anything matches, the request is rejected with HTTP 422.\n\nEven if the SQL is benign, the executor splits on `;` and keeps only the first non-empty statement. This kills the classic `SELECT 1; DROP TABLE customers;` injection pattern — the DROP never runs.\n\n```\nif \";\" in sql:\n    first = sql.split(\";\", 1)[0].strip()\n    if first:\n        sql = first + \";\"\n```\n\nThis is the belt-and-suspenders layer. Even if regex misses something and the single-statement enforcement fails, the SQLite connection itself is opened with `mode=ro`:\n\n```\nconn = sqlite3.connect(\n    f\"file:{db_path}?mode=ro\",\n    uri=True,\n    timeout=self._timeout_s,\n)\n```\n\nSQLite will physically refuse to execute any write operation on a `mode=ro` connection, regardless of what the SQL says. 
The filesystem layer enforces it.\n\n`PRAGMA busy_timeout = 5000` plus a connection timeout. Both bound how long the executor waits on a locked database, so lock contention cannot hang the API; neither interrupts a query that is simply slow to run.\n\nIf the query lacks a `LIMIT` clause, the executor applies a post-hoc cap (default 100 rows, hard max 10,000). Prevents the model from accidentally writing `SELECT * FROM huge_table` and shipping 10 million rows back to the client.\n\nI ran 22 production queries plus 6 adversarial tests (DROP, DELETE, UPDATE, multi-statement injection, unknown schema, empty SQL). **Zero security violations.** The 5-layer model held.\n\nHere's a debugging story that taught me something fundamental about small-model inference.\n\nThe SaaS schema has 16 tables, 135 columns. Initially, my schema fetcher would sample up to 10 distinct values per TEXT column, to give the model concrete examples of categorical values (statuses, plan names, countries, etc.). Total schema context: **14,800 characters, 536 sample values.**\n\nEvery SaaS query timed out. 60 seconds. 120 seconds. 300 seconds. Nothing worked.\n\nI tried cutting samples to 3 per column. Still timed out. The schema was still 9,800 characters.\n\nI tried cutting samples to 0. That would have worked, but it would have gutted the model's accuracy — the samples are how it knows `status = 'delivered'` instead of `status = 'completed'`.\n\nThe actual fix was **smart sampling** — two-stage filtering:\n\n**Stage 1: Name-based blocklist.** Columns whose names match known free-text patterns are never sampled, regardless of cardinality. 
The blocklist includes: `email`, `name`, `url`, `description`, `feedback`, `comment`, `user_agent`, `ip_address`, `uuid`, `token`, `secret`, `hash`, `json`, `metadata`, `payload`, `body`, `content`, `text`, `subject`, `message`, `note`, `label`, `tag`, `invoice_number`, `order_number`, `reference`, `key_name`, `key_prefix`, `scopes`, `version`, `event_triggers`, `triggers`.\n\n**Stage 2: Cardinality check.** For remaining TEXT columns, fetch `COUNT(DISTINCT col)`. If it's > 20, skip — it's free text even though the name didn't match the blocklist.\n\nResult: SaaS schema dropped from **9,800 chars → 8,200 chars**, sample count from **202 → 96**. Every sample the model actually uses (statuses, plan types, regions, priorities) was preserved. Every useless sample (emails, names, URLs, IPs, JSON blobs) was dropped.\n\nAfter the fix, every SaaS query completed in 10-32 seconds.\n\nThe lesson: **on CPU inference, prompt size is the dominant cost.** A 3B model on a GPU can chew through 5,000 input tokens in a second. The same model on CPU takes 100-150ms per token — 5000 tokens = 500-750 seconds just for prompt evaluation. 
Trimming the prompt isn't a micro-optimization; it's the difference between the API working and not working.\n\nI ran 22 queries across three schemas (ecommerce, retail, SaaS), spanning simple 1-table GROUP BYs up to 4-table JOINs with CASE expressions and NOT EXISTS subqueries.\n\n| Metric | Value |\n|---|---|\n| Textbook-perfect SQL | 16 / 22 (73%) |\n| Partial (returned data, missed nuance) | 3 / 22 (14%) |\n| Wrong (over-filtered, 0 rows) | 1 / 22 (5%) |\n| Hit 3B model ceiling (window functions) | 3 / 22 (14%) |\n| Timeouts | 0 |\n| Syntax errors | 0 |\n| Security violations | 0 |\n| Median latency | 17s |\n| Range | 9s – 61s |\n| Cost per query | $0 |\n\n| Range | Count | Notes |\n|---|---|---|\n| <15s | 5 | Simple 1-2 table queries, warm cache |\n| 15-25s | 10 | Most queries — the sweet spot |\n| 25-35s | 4 | Complex 4-table JOINs, CASE expressions |\n| 35-60s | 2 | Cold cache or very large output |\n| >60s | 1 | Cold-cache outlier |\n\nFor a 3B model on CPU with no GPU, this is genuinely usable. You wouldn't put it behind a chatbot — 17 seconds is too long for a conversational UX. But for an analytics API where the user clicks \"Generate Report\" and waits, 17 seconds is fine. Tableau queries take longer.\n\n- `strftime()` + `DATE('now', '-N days')` usage\n- `fact_sales` + `dim_store` correctly\n\nThree queries asked for window functions. The model dodged all three.\n\n**Query:** \"Running total of revenue by month for the last 12 months\"\n\n**Asked for:** `SUM(amount) OVER (ORDER BY month)` — running total\n\n**Model generated:** `SELECT month, SUM(amount) FROM ... GROUP BY month` — monthly totals, no running total\n\n**Query:** \"Top revenue-generating organization in each industry\"\n\n**Asked for:** `ROW_NUMBER() OVER (PARTITION BY industry ORDER BY revenue DESC)` — top per group\n\n**Model generated:** `SELECT org_name, SUM(amount) FROM ... 
GROUP BY org_name ORDER BY revenue DESC` — all orgs ranked, not top-per-industry\n\nThis is a well-documented limitation of 3B-parameter models. Even 7B models struggle with window functions without explicit training data. The model *recognizes* what's being asked (it generates SQL in the right shape) but *avoids* the window-function syntax.\n\nThe honest fix is on the roadmap: a 7B validator fine-tuned on window-function training data, planned for Phase 2.5.\n\nOn a GPU, a 3B model is fast — sub-second responses regardless of prompt size. On CPU, prompt evaluation is 100-150ms per token. A 5,000-token prompt takes 500-750 seconds just for prefill, before the model generates a single output token.\n\nThis means schema context size isn't a nice-to-have optimization — it's the difference between the API working and not working. Smart sampling (cutting useless samples while preserving useful ones) took my SaaS schema from \"every query times out\" to \"every query completes in 15-30 seconds.\"\n\nFive layers of defense feels excessive until you watch the model actually try to generate `DROP TABLE` (it never did, but it could have). Each layer catches a different failure mode. The regex catches obvious attacks. The single-statement enforcement catches injection patterns. The `mode=ro` connection catches anything the regex missed. Together, they make the API safe to expose publicly.\n\nI was tempted to only show the queries that worked. But the window-function failures and the over-eager date filters are more interesting than the perfect queries. 
They show that I understand the model's ceiling — and that I have a plan to push it.\n\nA hiring manager who reads \"the model writes perfect SQL every time\" thinks \"marketing BS.\" A hiring manager who reads \"the model handles 4-table JOINs and CASE expressions flawlessly but dodges window functions — here's why, here's the fix on the roadmap\" thinks \"this person actually understands what they built.\"\n\nThe Phase 1 deliverable was a model on HuggingFace. Impressive, but it requires the visitor to install Ollama, download 3.3 GB, copy a Modelfile, and run a CLI command to see anything.\n\nThe Phase 2 deliverable is `curl https://api.yourdomain.com/query-and-execute -d '{\"question\":\"...\"}'`. Anyone can try it in 5 seconds. That's a categorical difference in accessibility — for recruiters, for hiring managers, for anyone who might want to use it.\n\n**Phase 2.3 — Schema RAG.** Instead of dumping all 16 tables into the prompt, retrieve only the 2-4 tables relevant to the question. For \"Total users by plan,\" the model only needs `users`, `organizations`, `subscriptions`, `plans` — not `webhooks`, `deployments`, `audit_log`. Brings SaaS inference down to 5-10s and enables any-DB support.\n\n**Phase 2.5 — 7B validator.** Fine-tune Qwen2.5-Coder-7B on RCA-labeled failed prompts. Build a complexity router: simple queries go to the 3B, complex ones (window functions, multi-step aggregations) go to the 7B + 3B + validator.\n\n**VPS deploy.** The Docker compose file is ready. Next weekend I'll spin up a Hetzner CX22, point a domain at it, and have a live demo URL for the portfolio.\n\nThe code is at github.com/nurahmad-data/de-swarm-api. The model is at huggingface.co/nurahmad-data/de-sql-3b-v2-gguf. 
The full test results (22 queries with SQL, latency, and verdicts) are in TEST_RESULTS.md.\n\nIf you want to reproduce the benchmark locally:\n\n```\ngit clone https://github.com/<your-username>/de-swarm-api.git\ncd de-swarm-api\npip install -r requirements.txt\nollama pull hf.co/nurahmad-data/de-sql-3b-v2-gguf\nollama cp hf.co/nurahmad-data/de-sql-3b-v2-gguf de-sql-3b-q8\n# add your SQLite DBs to ./data/\nuvicorn app.main:app --port 8000\n\ncurl -X POST http://localhost:8000/query-and-execute \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"question\": \"Count of NPS promoters, passives, and detractors by plan\", \"schema\": \"saas\"}'\n```\n\n31 seconds later, you'll have your answer. For $0.\n\n*This is the second post in the de-swarm series. Phase 1 covered distilling the 120B teacher pipeline into the 3B student model. Phase 3 will cover scaling the dataset to 140 schemas via Spider. Follow along on [LinkedIn] or GitHub for updates.*", "url": "https://wpnews.pro/news/shipping-a-local-llm-api-with-fastapi-and-ollama", "canonical_source": "https://dev.to/nur_ahmad_data/shipping-a-local-llm-api-with-fastapi-and-ollama-386l", "published_at": "2026-06-24 17:42:44+00:00", "updated_at": "2026-06-24 18:09:14.700286+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "developer-tools", "ai-products", "ai-infrastructure"], "entities": ["FastAPI", "Ollama", "Qwen2.5-Coder-3B-Instruct", "Hetzner", "GitHub", "SQLite", "Caddy", "de-swarm"], "alternates": {"html": "https://wpnews.pro/news/shipping-a-local-llm-api-with-fastapi-and-ollama", "markdown": "https://wpnews.pro/news/shipping-a-local-llm-api-with-fastapi-and-ollama.md", "text": "https://wpnews.pro/news/shipping-a-local-llm-api-with-fastapi-and-ollama.txt", "jsonld": "https://wpnews.pro/news/shipping-a-local-llm-api-with-fastapi-and-ollama.jsonld"}}