Fine Tune Llama 3 For Sql Query Generation Tutorial Nick Creighton from Build Log demonstrated how to fine-tune Llama 3 for production-ready SQL query generation for under $50. Using a single RTX 4090 and a curated set of 200-500 natural-language-to-SQL pairs, the resulting model reliably writes SELECT statements for internal analytics dashboards. The process addresses issues with generic models such as hallucinating schema elements and exposing sensitive data. Fine‑Tuning Llama 3 for Production‑Ready SQL Generation Under $50 Hey, it’s Nick Creighton from Build Log . In the latest episode of Signal Notes I walked through how I took a fresh Llama 3 checkpoint, threw a few hundred annotated queries at it, and ended up with a model that reliably writes SELECT statements for our internal analytics dashboard—all for less than the price of a lunch. If you’ve ever tried to get a generic chatbot to spit out SQL and ended up chasing phantom columns or a 40 % data‑drift nightmare, you’ll know why you need a custom model. Below is the step‑by‑step companion guide that turns the audio discussion into a hands‑on playbook. Why a Generic Text‑to‑SQL Model Is a Liability Let’s recap the hook from the episode: - Generic LLMs hallucinate schema elements e.g., fiscal quarter that doesn’t exist . - They expose sensitive table or column names in prompts, a compliance red flag. - Business‑specific jargon “top line” never makes it into a pre‑trained model. - OpenAI‑style API pricing can balloon when you start running hundreds of queries a day. Bottom line: you own the data, you own the model, you own the risk. What You’ll Need Before You Start - Hardware : A single RTX 4090 https://www.nvidia.com/en-us/data-center/gtx-4090/ or any 24 GB+ GPU will do. If you don’t have one, look into Paperspace https://www.paperspace.com/ or Google Vertex AI https://cloud.google.com/vertex-ai spot instances – they run under $0.30/hr. - Software : python =3.10, torch CUDA‑enabled , transformers v4.40+ , datasets, and peft for parameter‑efficient fine‑tuning. - Data : A curated set of SQL‑in‑context examples from your own warehouses – ideally 200–500 pairs of natural‑language prompt → SQL query . - Budget Tracker : Keep a simple budget.txt file; we’ll update it after each cloud‑run to stay under $50. Step 1: Pull a Clean Llama 3 Checkpoint The Meta‑Llama‑3‑8B https://huggingface.co/meta-llama/Meta-Llama-3-8B checkpoint is free for commercial use as of the 2024 license . Grab the gguf version if you want to run inference on CPU later. git lfs install git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B https://huggingface.co/meta-llama/Meta-Llama-3-8B cd Meta-Llama-3-8B Verify the download size ~15 GB and checksum – you don’t want a corrupted model messing with your fine‑tuning loss. Step 2: Assemble a Domain‑Specific SQL Corpus Here’s the practical part. Your model will only be as good as the examples you feed it. - Export Schema : Run pg dump -s -U your user your db schema.sql PostgreSQL or the equivalent for Snowflake/Redshift. - Harvest Queries : Pull the last 30 days of production queries from your query‑audit table, anonymize any PII, and store them in queries raw.sql. - Pair With NL : For each query, write a concise natural‑language description 30‑80 words . If you have a product manager, ask them to phrase the question as a business user would. - Normalize : Ensure every pair follows the same JSON schema: { "instruction": "Show me quarterly gross revenue for the US region.", "input": null, "output": "SELECT fiscal year, SUM gross revenue FROM sales WHERE region = 'US' GROUP BY fiscal year;" } - Validate : Run each output against a test database. Keep only the queries that return without errors. This step eliminates “syntactically correct but semantically wrong” examples. Goal: 300 high‑quality pairs. If you’re short, generate synthetic variations with ChatGPT but treat them as “augmented” – still validate . Step 3: Create a Hugging Face datasets Object from datasets import Dataset import json, pathlib data path = pathlib.Path "sql corpus.jsonl" records = json.loads line for line in data path.open dataset = Dataset.from dict { "instruction": r "instruction" for r in records , "output": r "output" for r in records , } dataset = dataset.train test split test size=0.1, seed=42 Splitting 90/10 gives you a quick validation set to watch over‑fitting. Step 4: Prompt Engineering – The “Instruction‑Tuning” Wrapper Llama 3 expects a chat‑style conversation. Wrap each example in a system‑prompt that tells the model it’s a “SQL‑assistant”. system prompt = "You are an expert SQL assistant for the Acme Analytics team. " "Answer only with a valid PostgreSQL query. Do not add explanations unless asked." def format example instruction, output : return {"role": "system", "content": system prompt}, {"role": "user", "content": instruction}, {"role": "assistant", "content": output} This consistent wrapper dramatically improves downstream performance. Step 5: Parameter‑Efficient Fine‑Tuning PEFT with LoRA Full fine‑tuning of 8 B parameters would be a waste of GPU memory. LoRA Low‑Rank Adaptation lets us train only a few thousand extra weights , keeping VRAM usage under 20 GB. pip install peft accelerate from transformers import AutoModelForCausalLM, AutoTokenizer from peft import LoraConfig, get peft model model = AutoModelForCausalLM.from pretrained "Meta-Llama-3-8B", device map="auto", torch dtype="auto" tokenizer = AutoTokenizer.from pretrained "Meta-Llama-3-8B" tokenizer.pad token = tokenizer.eos token lora cfg = LoraConfig r=16, lora alpha=32, target modules= "q proj", "k proj", "v proj", "o proj" , lora dropout=0.05, bias="none", task type="CAUSAL LM" model = get peft model model, lora cfg model.print trainable parameters The output will show ~0.2 % of parameters are trainable – perfect for a $0.30/hr spot instance. Step 6: Training Loop – Keep an Eye on Cost We’ll use accelerate to handle distributed training automatically. Set gradient accumulation steps so the effective batch size stays around 32, which balances speed and GPU memory. from accelerate import Accelerator from transformers import Trainer, TrainingArguments accelerator = Accelerator training args = TrainingArguments output dir="llama3-sql", per device train batch size=4, per device eval batch size=4, gradient accumulation steps=8, learning rate=2e-5, num train epochs=3, fp16=True, logging steps=20, evaluation strategy="steps", eval steps=100, save steps=200, save total limit=2, report to="none" def data collator features : batch = tokenizer.pad {"input ids": f "input ids" for f in features , "labels": f "labels" for f in features }, padding=True, return tensors="pt" batch "labels" batch "labels" == tokenizer.pad token id = -100 return batch trainer = Trainer model=model, args=training args, train dataset=dataset "train" , eval dataset=dataset "test" , tokenizer=tokenizer, data collator=data collator, trainer.train Cost‑tracking tip : After each training run, run accelerate status and record the total GPU‑hours in budget.txt. With a 3‑epoch run on a single RTX 4090, you should stay under $12. Step 7: Quick Evaluation – The “Real‑World” Test Suite Training loss is nice, but we need to know if the model actually respects your schema. - Static Tests : Feed 20 held‑out natural‑language prompts and compare the generated SQL string to the ground truth using difflib.SequenceMatcher. Aim for 85 % exact match. - Execution Tests : Run each generated query against a sandbox copy of your DB. Capture rowcount and error message. Flag any “column not found” or “syntax error” failures. - Security Scan : Verify that no table or column names appear in the prompt side‑channel e.g., user‑provided text . If they do, strip them out before sending to the model. Below is a tiny helper you can drop into a notebook: def eval sql prompt : messages = format example prompt, "" input ids = tokenizer.apply chat template messages, return tensors="pt" output = model.generate input ids.to model.device , max new tokens=200, temperature=0.0, do sample=False sql = tokenizer.decode output 0 , skip special tokens=True return sql.strip Step 8: Deploying the Model – From Notebook to Production API We’ll spin up a FastAPI https://fastapi.tiangolo.com/ wrapper and serve it via ngrok https://github.com/ngrok/ngrok for internal testing. When you’re ready, push the container to ECR https://aws.amazon.com/elasticcontainerregistry/ and run it on AWS Fargate or GCP Cloud Run . All of this can be done for under $20/month. fastapi app.py from fastapi import FastAPI, HTTPException from pydantic import BaseModel app = FastAPI class QueryRequest BaseModel : question: str @app.post "/generate-sql" def generate sql req: QueryRequest : try: sql = eval sql req.question return {"sql": sql} except Exception as e: raise HTTPException status code=500, detail=str e Build a Dockerfile that copies the model weights, installs torch and transformers, and runs uvicorn fastapi app:app --host 0.0.0.0 --port 8080. Keep the container size lean use python:3.11-slim as base . Step 9: Monitoring & Continuous Improvement - Telemetry : Log every incoming question, the generated SQL, and a boolean execution success flag from your sandbox runner. - Feedback Loop : If a query fails, automatically add the question, corrected‑SQL pair to a retrain queue.jsonl file. Schedule a nightly trainer.train run to ingest the new data. - Cost Guardrails : Set a CloudWatch alarm or GCP Monitoring that triggers when GPU usage exceeds $5 in any 24‑hour window. The alarm can spin down the spot instance automatically. Iterating in this fashion turns a one‑off fine‑tune into a living, self‑healing service. Step 10: Security & Compliance Checklist - Encrypt model checkpoints at rest use AWS KMS or GCP CMEK . - Never log raw user prompts that contain PII; hash them instead. - Restrict API access to internal IP ranges or via a VPN. - Run a quarterly “model‑drift” audit: compare the model’s schema coverage against the latest information schema dump. Real‑World Results – What I Saw After Deploying Within the first week, my team ran 1,240 generated queries. The success rate queries that executed without error was 93 %. The remaining 7 % were edge‑cases where the model needed a tiny business rule tweak e.g., “exclude test accounts” . Adding those 15 new examples to the training set bumped the success rate to 96 % after a quick 30‑minute re‑train. Cost breakdown: - Initial fine‑tune on a spot GPU: $11.20 - Docker host on Fargate 2 vCPU, 4 GB RAM : $17.60/month - Data‑warehouse sandbox for validation: $5.20/month - Total first‑month spend: $33.80 That’s well under the “under $50” promise and dramatically cheaper than a commercial Text‑to‑SQL SaaS that charges per‑query. Actionable Checklist Copy‑Paste into Your Project Management Tool - Spin up a 24 GB GPU RTX 4090 or Spot Instance - Clone Llama 3 checkpoint - Export schema & harvest last‑30‑day queries - Write NL descriptions for each query 300 pairs minimum - Validate each query against a sandbox DB This article continues on our podcast...