Fine-Tuning Transformers vs LoRA vs QLoRA 2024

Nick Creighton, a developer and host of the *Build Log* podcast, has published a practical comparison of full-model fine-tuning, LoRA, and QLoRA for adapting large language models in 2024. The guide details how LoRA and QLoRA have largely replaced full retraining in production by using small, low-cost adapters that keep base model weights frozen, drastically reducing storage, latency, and GPU requirements. Creighton provides actionable setup instructions and benchmarks showing that QLoRA can fine-tune a 70-billion-parameter model on a single 24 GB GPU with only a 1–3% performance gap compared to full fine-tuning.

Hey folks, Nick Creighton here. If you've been listening to the latest Build Log episode, you know I'm all about shipping code that actually moves the needle. In this post I'm taking the audio-first conversation we just had and turning it into a step-by-step guide you can read, bookmark, and act on.

Just a year ago, the default way to adapt a large language model (LLM) was to re-train every single parameter. That meant pulling a ggml or torch checkpoint, slamming a GPU farm together, and waiting hours or days for a new .bin file to appear. The result? A massive artifact that cost you in storage, latency, and maintenance.

Fast-forward to today: LoRA and its cousin QLoRA have become the de facto tools for most production teams. They let you add a tiny adapter to a frozen model, keep the base weights untouched, and ship a few megabytes of delta weights instead of gigabytes of new model. The upside is immediate: lower cost, faster iteration, and a simpler deployment pipeline.

Below you'll find the practical, no-fluff details you need to decide which approach fits your project, how to set it up, and which pitfalls to avoid.

When I first started experimenting with LLMs, full retraining was the whole workflow: pull a checkpoint, line up the GPUs, and wait for the new weights. That's all well and good if you have a dedicated server farm and a budget that looks like a startup's seed round. But for most indie developers, SaaS founders, or data-science hobbyists, the cost-to-value ratio is terrible.

When It Still Makes Sense

Full fine-tuning still earns its cost when you need to change token embeddings or tweak the architecture itself.

LoRA is essentially a low-rank matrix factorisation trick. Instead of updating the full weight matrix W ∈ ℝ^{d×k}, you freeze W and learn two much smaller matrices A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k} such that ΔW = A·B. The rank r is typically 4–64, orders of magnitude smaller than d or k.
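To make the size difference concrete, here's a minimal sketch in plain Python that counts the parameters in a full update of W against the parameters in the A and B factors. The 4096×4096 layer shape is an assumption picked for illustration, and both helper names are hypothetical rather than part of any library.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters touched when the whole d x k matrix W is updated."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Parameters in the factors A (d x r) and B (r x k), where delta_W = A @ B."""
    return d * r + r * k


if __name__ == "__main__":
    d = k = 4096  # assumed layer shape, for illustration only
    for r in (4, 8, 16, 32, 64):
        full = full_update_params(d, k)
        lora = lora_update_params(d, k, r)
        print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} full ({lora / full:.2%})")
```

At r=8 under that assumed shape, the adapter holds 65,536 values against 16,777,216 for the full matrix, and doubling r doubles the adapter size.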
Why It Works

Because W stays frozen, only A and B receive gradients, so the trainable parameter count for that layer drops from d·k to r·(d + k).

Getting Started with LoRA

    from peft import LoraConfig, get_peft_model

    lora_cfg = LoraConfig(
        r=32,                                  # rank
        lora_alpha=64,
        target_modules=["q_proj", "v_proj"],   # typical for Llama
        lora_dropout=0.05,
        bias="none",
    )
    # base_model is an already loaded transformers model
    model = get_peft_model(base_model, lora_cfg)

Actionable Tip 1 – Keep the Rank Low, Then Scale

If you're not sure about the optimal rank, start with r=8 and evaluate on a validation set. If performance plateaus short of your target, bump to 16 or 32. The memory overhead grows linearly with r, so you'll instantly see the trade-off.

QLoRA builds on LoRA by quantising the base model to 4-bit or 8-bit using bitsandbytes while still training the adapter in full precision. The result: you can fine-tune a 70-B model on a single 24 GB GPU.

Key Benefits

Quality Retention: Empirical studies, including my own benchmarks, show a gap of only about 1–3% versus full fine-tuning.

Setting Up QLoRA

Install bitsandbytes (CUDA-compatible version):

    pip install bitsandbytes

Load the model with load_in_4bit=True and set bnb_4bit_compute_dtype=torch.float16, passed through a BitsAndBytesConfig:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit settings are grouped in a BitsAndBytesConfig object
    bnb_cfg = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-chat-hf",
        device_map="auto",
        quantization_config=bnb_cfg,
    )

Actionable Tip 2 – Use nf4 Quantisation for Better Stability

Bitsandbytes offers two 4-bit schemes: fp4 and nf4. nf4 (4-bit NormalFloat) tends to preserve the distribution of weights better, which translates to less "catastrophic forgetting" during LoRA training. If you hit sudden spikes in loss while using fp4, switch to nf4.
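If it helps to see what storing weights in 4 bits means in practice, here's a minimal sketch in plain Python. It is a simplified, evenly spaced absmax scheme that I'm using purely for illustration; it is not the fp4 or nf4 code book that bitsandbytes implements, and the function names and sample weights are made up.

```python
def quantize_block(weights: list[float]) -> tuple[list[int], float]:
    """Map a block of floats to integer codes in [-7, 7] plus one shared scale.

    Uses 15 of the 16 values a 4-bit code can hold, keeping the grid symmetric.
    """
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the codes and the block scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [0.12, -0.40, 0.03, 0.33, -0.07, 0.25, -0.18, 0.01]  # made-up weights
    codes, scale = quantize_block(block)
    restored = dequantize_block(codes, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("codes:", codes)
    print(f"scale: {scale:.4f}, worst-case error: {worst:.4f}")
```

Each weight costs 4 bits plus a share of the per-block scale, and the rounding error stays within half a step of the grid. Roughly speaking, nf4 keeps the same codes-plus-scale idea but places its levels where normally distributed weights are densest instead of spacing them evenly.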
| Criterion | Full-Model FT | LoRA | QLoRA |
|---|---|---|---|
| GPU Budget | Multiple A100-40G or V100-32G | Single RTX 4090 / A6000 | Single 24 GB GPU (RTX 4090, A6000) |
| Model Size | Up to ~13 B comfortably | Any size (adapter tiny) | Up to 70 B (quantised) |
| Deployment Complexity | High: new artifact, versioning | Low: swap adapters | Low: same as LoRA, but smaller runtime |
| Performance Gap vs Full-FT | 0 % (baseline) | ~2–5 % on average | ~1–3 % on average |
| Use-Case Fit | Token-embedding changes, architecture tweaks | Domain-specific chat, classification, summarisation | Large-scale embeddings, retrieval-augmented generation, heavy traffic services |

Below is a minimal script that works for both LoRA and QLoRA. Toggle the quantization_config argument (which carries the load_in_4bit flag) to switch between the two.

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        DataCollatorForSeq2Seq,
        Trainer,
        TrainingArguments,
    )
    from peft import LoraConfig, get_peft_model
    from datasets import load_dataset

    model_name = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

    # 4-bit settings, only used when the QLoRA line below is uncommented
    bnb_cfg = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.float16,
        # Uncomment the next line for QLoRA
        # quantization_config=bnb_cfg,
    )

    lora_cfg = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",  # wraps the model as a causal LM for the Trainer
    )
    model = get_peft_model(model, lora_cfg)

    data = load_dataset("json", data_files={"train": "train.jsonl", "valid": "valid.jsonl"})

    def tokenize_fn(batch):
        # Causal LM training needs labels aligned with input_ids, so tokenise
        # prompt + completion as one sequence and reuse the ids as labels.
        texts = [p + c + tokenizer.eos_token for p, c in zip(batch["prompt"], batch["completion"])]
        tokens = tokenizer(texts, truncation=True, max_length=512)
        tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]
        return tokens

    tokenized = data.map(tokenize_fn, batched=True, remove_columns=["prompt", "completion"])

    training_args = TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=20,
        save_steps=200,
        evaluation_strategy="steps",  # renamed eval_strategy in newer transformers releases
        eval_steps=100,
        load_best_model_at_end=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["valid"],
        data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),  # pads inputs and labels per batch
    )
    trainer.train()

    model.save_pretrained("my_adapter")
    tokenizer.save_pretrained("my_adapter")
    print("✅ Training complete – adapter saved")

Actionable Tip 3 – Use Gradient Accumulation to Fit Bigger Batches

Even on a 24 GB card you can simulate a batch size of 32–64 by setting per_device_train_batch_size=4 and gradient_accumulation_steps to between 8 and 16, since the effective batch size is the product of the two. Larger effective batches improve stability, especially with low-rank adapters.

Those numbers assume standard on-demand pricing (https://cloud.google.com/compute/pricing) and a modest 10 GB dataset. The takeaway: you can get production-grade results for pennies.

To prove the point, here are three production pipelines I've deployed on thirteen active websites:

Adapted from an episode of Signal Notes. Listen on your favorite podcast app.