# Fine Tune Llama 3 For Sql Query Generation Tutorial

> Source: <https://dev.to/samchenreviews/fine-tune-llama-3-for-sql-query-generation-tutorial-35hi>
> Published: 2026-06-15 14:01:53+00:00

##
Fine‑Tuning Llama 3 for Production‑Ready SQL Generation (Under $50) Hey, it’s Nick Creighton from *Build Log*. In the latest episode of **Signal Notes** I walked through how I took a fresh Llama 3 checkpoint, threw a few hundred annotated queries at it, and ended up with a model that reliably writes SELECT statements for our internal analytics dashboard—all for less than the price of a lunch. If you’ve ever tried to get a generic chatbot to spit out SQL and ended up chasing phantom columns or a 40 % data‑drift nightmare, you’ll know why you need a custom model. Below is the step‑by‑step companion guide that turns the audio discussion into a hands‑on playbook. ### Why a Generic Text‑to‑SQL Model Is a Liability Let’s recap the hook from the episode: - Generic LLMs hallucinate schema elements (e.g., fiscal_quarter that doesn’t exist). - They expose sensitive table or column names in prompts, a compliance red flag. - Business‑specific jargon (“top line”) never makes it into a pre‑trained model. - OpenAI‑style API pricing can balloon when you start running hundreds of queries a day. Bottom line: you *own* the data, you *own* the model, you *own* the risk. ### What You’ll Need Before You Start - **Hardware**: A single [RTX 4090](https://www.nvidia.com/en-us/data-center/gtx-4090/) (or any 24 GB+ GPU) will do. If you don’t have one, look into [Paperspace](https://www.paperspace.com/) or [Google Vertex AI](https://cloud.google.com/vertex-ai) spot instances – they run under $0.30/hr. - **Software**: python>=3.10, torch (CUDA‑enabled), transformers (v4.40+), datasets, and peft for parameter‑efficient fine‑tuning. - **Data**: A curated set of **SQL‑in‑context** examples from your own warehouses – ideally 200–500 pairs of *natural‑language prompt → SQL query*. - **Budget Tracker**: Keep a simple budget.txt file; we’ll update it after each cloud‑run to stay under $50. ### Step 1: Pull a Clean Llama 3 Checkpoint The [Meta‑Llama‑3‑8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) checkpoint is free for commercial use (as of the 2024 license). Grab the gguf version if you want to run inference on CPU later. git lfs install git clone [https://huggingface.co/meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) cd Meta-Llama-3-8B Verify the download size (~15 GB) and checksum – you don’t want a corrupted model messing with your fine‑tuning loss. ### Step 2: Assemble a Domain‑Specific SQL Corpus Here’s the practical part. Your model will only be as good as the examples you feed it. - **Export Schema**: Run pg_dump -s -U your_user your_db > schema.sql (PostgreSQL) or the equivalent for Snowflake/Redshift. - **Harvest Queries**: Pull the last 30 days of production queries from your query‑audit table, anonymize any PII, and store them in queries_raw.sql. - **Pair With NL**: For each query, write a concise natural‑language description (30‑80 words). If you have a product manager, ask them to phrase the question as a business user would. - **Normalize**: Ensure every pair follows the same JSON schema: { "instruction": "Show me quarterly gross revenue for the US region.", "input": null, "output": "SELECT fiscal_year, SUM(gross_revenue) FROM sales WHERE region = 'US' GROUP BY fiscal_year;" } - **Validate**: Run each output against a test database. Keep only the queries that return without errors. This step eliminates “syntactically correct but semantically wrong” examples. Goal: 300 high‑quality pairs. If you’re short, generate synthetic variations with ChatGPT (but treat them as “augmented” – still validate). ### Step 3: Create a Hugging Face datasets Object from datasets import Dataset import json, pathlib data_path = pathlib.Path("sql_corpus.jsonl") records = [json.loads(line) for line in data_path.open()] dataset = Dataset.from_dict({ "instruction": [r["instruction"] for r in records], "output": [r["output"] for r in records], }) dataset = dataset.train_test_split(test_size=0.1, seed=42) Splitting 90/10 gives you a quick validation set to watch over‑fitting. ### Step 4: Prompt Engineering – The “Instruction‑Tuning” Wrapper Llama 3 expects a chat‑style conversation. Wrap each example in a system‑prompt that tells the model it’s a “SQL‑assistant”. system_prompt = ( "You are an expert SQL assistant for the Acme Analytics team. " "Answer only with a valid PostgreSQL query. Do not add explanations unless asked." ) def format_example(instruction, output): return [ {"role": "system", "content": system_prompt}, {"role": "user", "content": instruction}, {"role": "assistant", "content": output} ] This consistent wrapper dramatically improves downstream performance. ### Step 5: Parameter‑Efficient Fine‑Tuning (PEFT) with LoRA Full fine‑tuning of 8 B parameters would be a waste of GPU memory. LoRA (Low‑Rank Adaptation) lets us train **only a few thousand extra weights**, keeping VRAM usage under 20 GB. pip install peft accelerate from transformers import AutoModelForCausalLM, AutoTokenizer from peft import LoraConfig, get_peft_model model = AutoModelForCausalLM.from_pretrained( "Meta-Llama-3-8B", device_map="auto", torch_dtype="auto" ) tokenizer = AutoTokenizer.from_pretrained("Meta-Llama-3-8B") tokenizer.pad_token = tokenizer.eos_token lora_cfg = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, lora_cfg) model.print_trainable_parameters() The output will show ~0.2 % of parameters are trainable – perfect for a $0.30/hr spot instance. ### Step 6: Training Loop – Keep an Eye on Cost We’ll use accelerate to handle distributed training automatically. Set gradient_accumulation_steps so the effective batch size stays around 32, which balances speed and GPU memory. from accelerate import Accelerator from transformers import Trainer, TrainingArguments accelerator = Accelerator() training_args = TrainingArguments( output_dir="llama3-sql", per_device_train_batch_size=4, per_device_eval_batch_size=4, gradient_accumulation_steps=8, learning_rate=2e-5, num_train_epochs=3, fp16=True, logging_steps=20, evaluation_strategy="steps", eval_steps=100, save_steps=200, save_total_limit=2, report_to="none" ) def data_collator(features): batch = tokenizer.pad( {"input_ids": [f["input_ids"] for f in features], "labels": [f["labels"] for f in features]}, padding=True, return_tensors="pt" ) batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100 return batch trainer = Trainer( model=model, args=training_args, train_dataset=dataset["train"], eval_dataset=dataset["test"], tokenizer=tokenizer, data_collator=data_collator, ) trainer.train() **Cost‑tracking tip**: After each training run, run accelerate status and record the total GPU‑hours in budget.txt. With a 3‑epoch run on a single RTX 4090, you should stay under $12. ### Step 7: Quick Evaluation – The “Real‑World” Test Suite Training loss is nice, but we need to know if the model actually respects your schema. - **Static Tests**: Feed 20 held‑out natural‑language prompts and compare the generated SQL string to the ground truth using difflib.SequenceMatcher. Aim for > 85 % exact match. - **Execution Tests**: Run each generated query against a sandbox copy of your DB. Capture rowcount and error_message. Flag any “column not found” or “syntax error” failures. - **Security Scan**: Verify that no table or column names appear in the *prompt* side‑channel (e.g., user‑provided text). If they do, strip them out before sending to the model. Below is a tiny helper you can drop into a notebook: def eval_sql(prompt): messages = format_example(prompt, "") input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt") output = model.generate( input_ids.to(model.device), max_new_tokens=200, temperature=0.0, do_sample=False ) sql = tokenizer.decode(output[0], skip_special_tokens=True) return sql.strip() ### Step 8: Deploying the Model – From Notebook to Production API We’ll spin up a [FastAPI](https://fastapi.tiangolo.com/) wrapper and serve it via [ngrok](https://github.com/ngrok/ngrok) for internal testing. When you’re ready, push the container to [ECR](https://aws.amazon.com/elasticcontainerregistry/) and run it on AWS Fargate (or GCP Cloud Run). All of this can be done for under $20/month. # fastapi_app.py from fastapi import FastAPI, HTTPException from pydantic import BaseModel app = FastAPI() class QueryRequest(BaseModel): question: str @app.post("/generate-sql") def generate_sql(req: QueryRequest): try: sql = eval_sql(req.question) return {"sql": sql} except Exception as e: raise HTTPException(status_code=500, detail=str(e)) Build a Dockerfile that copies the model weights, installs torch and transformers, and runs uvicorn fastapi_app:app --host 0.0.0.0 --port 8080. Keep the container size lean (use python:3.11-slim as base). ### Step 9: Monitoring & Continuous Improvement - **Telemetry**: Log every incoming question, the generated SQL, and a boolean execution_success flag from your sandbox runner. - **Feedback Loop**: If a query fails, automatically add the (question, corrected‑SQL) pair to a retrain_queue.jsonl file. Schedule a nightly trainer.train() run to ingest the new data. - **Cost Guardrails**: Set a CloudWatch alarm (or GCP Monitoring) that triggers when GPU usage exceeds $5 in any 24‑hour window. The alarm can spin down the spot instance automatically. Iterating in this fashion turns a one‑off fine‑tune into a living, self‑healing service. ### Step 10: Security & Compliance Checklist - Encrypt model checkpoints at rest (use AWS KMS or GCP CMEK). - Never log raw user prompts that contain PII; hash them instead. - Restrict API access to internal IP ranges or via a VPN. - Run a quarterly “model‑drift” audit: compare the model’s schema coverage against the latest information_schema dump. ### Real‑World Results – What I Saw After Deploying Within the first week, my team ran 1,240 generated queries. The success rate (queries that executed without error) was 93 %. The remaining 7 % were edge‑cases where the model needed a tiny business rule tweak (e.g., “exclude test accounts”). Adding those 15 new examples to the training set bumped the success rate to 96 % after a quick 30‑minute re‑train. Cost breakdown: - Initial fine‑tune on a spot GPU: $11.20 - Docker host on Fargate (2 vCPU, 4 GB RAM): $17.60/month - Data‑warehouse sandbox for validation: $5.20/month - **Total first‑month spend:** $33.80 That’s well under the “under $50” promise and dramatically cheaper than a commercial Text‑to‑SQL SaaS that charges per‑query. ### Actionable Checklist (Copy‑Paste into Your Project Management Tool) - [ ] Spin up a 24 GB GPU (RTX 4090 or Spot Instance) - [ ] Clone Llama 3 checkpoint - [ ] Export schema & harvest last‑30‑day queries - [ ] Write NL descriptions for each query (300 pairs minimum) - [ ] Validate each query against a sandbox DB

*This article continues on our podcast...*
