#
Fine‑Tuning Llama 3 for Production‑Ready SQL Generation (Under $50) Hey, it’s Nick Creighton from Build Log. In the latest episode of Signal Notes I walked through how I took a fresh Llama 3 checkpoint, threw a few hundred annotated queries at it, and ended up with a model that reliably writes SELECT statements for our internal analytics dashboard—all for less than the price of a lunch. If you’ve ever tried to get a generic chatbot to spit out SQL and ended up chasing phantom columns or a 40 % data‑drift nightmare, you’ll know why you need a custom model. Below is the step‑by‑step companion guide that turns the audio discussion into a hands‑on playbook. ### Why a Generic Text‑to‑SQL Model Is a Liability Let’s recap the hook from the episode: - Generic LLMs hallucinate schema elements (e.g., fiscal_quarter that doesn’t exist). - They expose sensitive table or column names in prompts, a compliance red flag. - Business‑specific jargon (“top line”) never makes it into a pre‑trained model. - OpenAI‑style API pricing can balloon when you start running hundreds of queries a day. Bottom line: you own the data, you own the model, you own the risk. ### What You’ll Need Before You Start - Hardware: A single RTX 4090 (or any 24 GB+ GPU) will do. If you don’t have one, look into Paperspace or Google Vertex AI spot instances – they run under $0.30/hr. - Software: python>=3.10, torch (CUDA‑enabled), transformers (v4.40+), datasets, and peft for parameter‑efficient fine‑tuning. - Data: A curated set of SQL‑in‑context examples from your own warehouses – ideally 200–500 pairs of natural‑language prompt → SQL query. - Budget Tracker: Keep a simple budget.txt file; we’ll update it after each cloud‑run to stay under $50. ### Step 1: Pull a Clean Llama 3 Checkpoint The Meta‑Llama‑3‑8B checkpoint is free for commercial use (as of the 2024 license). Grab the gguf version if you want to run inference on CPU later. git lfs install git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B cd Meta-Llama-3-8B Verify the download size (~15 GB) and checksum – you don’t want a corrupted model messing with your fine‑tuning loss. ### Step 2: Assemble a Domain‑Specific SQL Corpus Here’s the practical part. Your model will only be as good as the examples you feed it. - Export Schema: Run pg_dump -s -U your_user your_db > schema.sql (PostgreSQL) or the equivalent for Snowflake/Redshift. - Harvest Queries: Pull the last 30 days of production queries from your query‑audit table, anonymize any PII, and store them in queries_raw.sql. - Pair With NL: For each query, write a concise natural‑language description (30‑80 words). If you have a product manager, ask them to phrase the question as a business user would. - Normalize: Ensure every pair follows the same JSON schema: { "instruction": "Show me quarterly gross revenue for the US region.", "input": null, "output": "SELECT fiscal_year, SUM(gross_revenue) FROM sales WHERE region = 'US' GROUP BY fiscal_year;" } - Validate: Run each output against a test database. Keep only the queries that return without errors. This step eliminates “syntactically correct but semantically wrong” examples. Goal: 300 high‑quality pairs. If you’re short, generate synthetic variations with ChatGPT (but treat them as “augmented” – still validate). ### Step 3: Create a Hugging Face datasets Object from datasets import Dataset import json, pathlib data_path = pathlib.Path("sql_corpus.jsonl") records = [json.loads(line) for line in data_path.open()] dataset = Dataset.from_dict({ "instruction": [r["instruction"] for r in records], "output": [r["output"] for r in records], }) dataset = dataset.train_test_split(test_size=0.1, seed=42) Splitting 90/10 gives you a quick validation set to watch over‑fitting. ### Step 4: Prompt Engineering – The “Instruction‑Tuning” Wrapper Llama 3 expects a chat‑style conversation. Wrap each example in a system‑prompt that tells the model it’s a “SQL‑assistant”. system_prompt = ( "You are an expert SQL assistant for the Acme Analytics team. " "Answer only with a valid PostgreSQL query. Do not add explanations unless asked." ) def format_example(instruction, output): return [ {"role": "system", "content": system_prompt}, {"role": "user", "content": instruction}, {"role": "assistant", "content": output} ] This consistent wrapper dramatically improves downstream performance. ### Step 5: Parameter‑Efficient Fine‑Tuning (PEFT) with LoRA Full fine‑tuning of 8 B parameters would be a waste of GPU memory. LoRA (Low‑Rank Adaptation) lets us train only a few thousand extra weights, keeping VRAM usage under 20 GB. pip install peft accelerate from transformers import AutoModelForCausalLM, AutoTokenizer from peft import LoraConfig, get_peft_model model = AutoModelForCausalLM.from_pretrained( "Meta-Llama-3-8B", device_map="auto", torch_dtype="auto" ) tokenizer = AutoTokenizer.from_pretrained("Meta-Llama-3-8B") tokenizer.pad_token = tokenizer.eos_token lora_cfg = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, lora_cfg) model.print_trainable_parameters() The output will show ~0.2 % of parameters are trainable – perfect for a $0.30/hr spot instance. ### Step 6: Training Loop – Keep an Eye on Cost We’ll use accelerate to handle distributed training automatically. Set gradient_accumulation_steps so the effective batch size stays around 32, which balances speed and GPU memory. from accelerate import Accelerator from transformers import Trainer, TrainingArguments accelerator = Accelerator() training_args = TrainingArguments( output_dir="llama3-sql", per_device_train_batch_size=4, per_device_eval_batch_size=4, gradient_accumulation_steps=8, learning_rate=2e-5, num_train_epochs=3, fp16=True, logging_steps=20, evaluation_strategy="steps", eval_steps=100, save_steps=200, save_total_limit=2, report_to="none" ) def data_collator(features): batch = tokenizer.pad( {"input_ids": [f["input_ids"] for f in features], "labels": [f["labels"] for f in features]}, padding=True, return_tensors="pt" ) batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100 return batch trainer = Trainer( model=model, args=training_args, train_dataset=dataset["train"], eval_dataset=dataset["test"], tokenizer=tokenizer, data_collator=data_collator, ) trainer.train() Cost‑tracking tip: After each training run, run accelerate status and record the total GPU‑hours in budget.txt. With a 3‑epoch run on a single RTX 4090, you should stay under $12. ### Step 7: Quick Evaluation – The “Real‑World” Test Suite Training loss is nice, but we need to know if the model actually respects your schema. - Static Tests: Feed 20 held‑out natural‑language prompts and compare the generated SQL string to the ground truth using difflib.SequenceMatcher. Aim for > 85 % exact match. - Execution Tests: Run each generated query against a sandbox copy of your DB. Capture rowcount and error_message. Flag any “column not found” or “syntax error” failures. - Security Scan: Verify that no table or column names appear in the prompt side‑channel (e.g., user‑provided text). If they do, strip them out before sending to the model. Below is a tiny helper you can drop into a notebook: def eval_sql(prompt): messages = format_example(prompt, "") input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt") output = model.generate( input_ids.to(model.device), max_new_tokens=200, temperature=0.0, do_sample=False ) sql = tokenizer.decode(output[0], skip_special_tokens=True) return sql.strip() ### Step 8: Deploying the Model – From Notebook to Production API We’ll spin up a FastAPI wrapper and serve it via ngrok for internal testing. When you’re ready, push the container to ECR and run it on AWS Fargate (or GCP Cloud Run). All of this can be done for under $20/month. # fastapi_app.py from fastapi import FastAPI, HTTPException from pydantic import BaseModel app = FastAPI() class QueryRequest(BaseModel): question: str @app.post("/generate-sql") def generate_sql(req: QueryRequest): try: sql = eval_sql(req.question) return {"sql": sql} except Exception as e: raise HTTPException(status_code=500, detail=str(e)) Build a Dockerfile that copies the model weights, installs torch and transformers, and runs uvicorn fastapi_app:app --host 0.0.0.0 --port 8080. Keep the container size lean (use python:3.11-slim as base). ### Step 9: Monitoring & Continuous Improvement - Telemetry: Log every incoming question, the generated SQL, and a boolean execution_success flag from your sandbox runner. - Feedback Loop: If a query fails, automatically add the (question, corrected‑SQL) pair to a retrain_queue.jsonl file. Schedule a nightly trainer.train() run to ingest the new data. - Cost Guardrails: Set a CloudWatch alarm (or GCP Monitoring) that triggers when GPU usage exceeds $5 in any 24‑hour window. The alarm can spin down the spot instance automatically. Iterating in this fashion turns a one‑off fine‑tune into a living, self‑healing service. ### Step 10: Security & Compliance Checklist - Encrypt model checkpoints at rest (use AWS KMS or GCP CMEK). - Never log raw user prompts that contain PII; hash them instead. - Restrict API access to internal IP ranges or via a VPN. - Run a quarterly “model‑drift” audit: compare the model’s schema coverage against the latest information_schema dump. ### Real‑World Results – What I Saw After Deploying Within the first week, my team ran 1,240 generated queries. The success rate (queries that executed without error) was 93 %. The remaining 7 % were edge‑cases where the model needed a tiny business rule tweak (e.g., “exclude test accounts”). Adding those 15 new examples to the training set bumped the success rate to 96 % after a quick 30‑minute re‑train. Cost breakdown: - Initial fine‑tune on a spot GPU: $11.20 - Docker host on Fargate (2 vCPU, 4 GB RAM): $17.60/month - Data‑warehouse sandbox for validation: $5.20/month - Total first‑month spend: $33.80 That’s well under the “under $50” promise and dramatically cheaper than a commercial Text‑to‑SQL SaaS that charges per‑query. ### Actionable Checklist (Copy‑Paste into Your Project Management Tool) - [ ] Spin up a 24 GB GPU (RTX 4090 or Spot Instance) - [ ] Clone Llama 3 checkpoint - [ ] Export schema & harvest last‑30‑day queries - [ ] Write NL descriptions for each query (300 pairs minimum) - [ ] Validate each query against a sandbox DB
This article continues on our podcast...