Hey folks, Nick Creighton here. If you’ve been listening to the latest Build Log episode you know I’m all about shipping code that actually moves the needle. In this post I’m taking the audio‑first conversation we just had and turning it into a step‑by‑step guide you can read, bookmark, and act on.
Just a year ago the default way to adapt a large language model (LLM) was to re‑train every single parameter. That meant pulling a full PyTorch checkpoint, slamming a GPU farm together, and waiting hours (or days) for a new .bin file to appear. The result? A massive artifact that cost you in storage, latency, and maintenance.
Fast‑forward to today: LoRA and its cousin QLoRA have become the de‑facto tools for most production teams. They let you add a tiny adapter to a frozen model, keep the base weights untouched, and ship a few megabytes of delta‑weights instead of gigabytes of new model. The upside is immediate – lower cost, faster iteration, and a simpler deployment pipeline.
Below you’ll find the practical, no‑fluff details you need to decide which approach fits your project, how to set it up, and which pitfalls to avoid.
When I first started experimenting with LLMs, the workflow looked like this:
That’s all well and good if you have a dedicated server farm and a budget that looks like a startup’s seed round. But for most indie developers, SaaS founders, or data‑science hobbyists, the cost‑to‑value ratio is terrible.
When It Still Makes Sense
LoRA is essentially a matrix factorisation trick. Instead of updating the full weight matrix W ∈ ℝ^{d×k}, you freeze W and learn two much smaller matrices A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k} such that ΔW = A·B. The rank r is typically 4–64 – orders of magnitude smaller than d or k.
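To make the shapes concrete, here is a small pure‑Python sketch of the factorisation above. The toy sizes (d=4, k=3, r=1) and the helper names matmul and matvec are illustrative assumptions, not part of any library. Real adapters use tensors, and implementations such as PEFT also scale the update by lora_alpha / r, which is left out here.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


d, k, r = 4, 3, 1  # toy sizes; real layers have d and k in the thousands

W = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0]]  # frozen weights, d x k
A = [[1], [0], [2], [1]]                          # trainable, d x r
B = [[1, -1, 0]]                                  # trainable, r x k

delta_W = matmul(A, B)  # d x k, but built from only r * (d + k) numbers
x = [1, 2, 3]

# Merged path: add the update into W, then apply the result to x.
W_merged = [[w + dw for w, dw in zip(w_row, dw_row)] for w_row, dw_row in zip(W, delta_W)]
y_merged = matvec(W_merged, x)

# Adapter path: keep W frozen and add the low-rank branch A(Bx).
y_adapter = [a + b for a, b in zip(matvec(W, x), matvec(A, matvec(B, x)))]

full_params = d * k        # values in the full matrix
lora_params = r * (d + k)  # values in A and B together
print(y_merged, y_adapter, full_params, lora_params)
```

Both paths give the same output, which is why an adapter can either stay separate or be merged into the base weights later. At these toy sizes the saving is small (7 numbers instead of 12); with d = k = 4096 and r = 8 the same formula gives 65,536 trainable values against 16,777,216 in the full matrix.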
Why It Works
Getting Started with LoRA
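from peft import LoraConfig, get_peft_model

# base_model is assumed to be an already-loaded Hugging Face causal LM.
lora_cfg = LoraConfig(
    r=32,                                 # rank of the adapter matrices
    lora_alpha=64,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # typical for Llama
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_cfg)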
Actionable Tip #1 – Keep the Rank Low, Then Scale
If you’re not sure about the optimal rank, start with r=8 and evaluate on a validation set. If quality is still short of your target, bump to 16 or 32, and stop once the validation metric plateaus. The adapter's memory overhead grows linearly with r, so the trade‑off is easy to read off at each step.
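As a rough illustration of that linear growth, the sketch below counts adapter parameters for a Llama‑2‑7B‑like layout. The figures it assumes (32 decoder layers, 4096×4096 q_proj and v_proj matrices, 2 bytes per parameter) and the function name adapter_params are assumptions for the estimate, not values read from a real checkpoint.

```python
def adapter_params(rank, n_layers=32, modules_per_layer=2, d=4096, k=4096):
    """Trainable LoRA parameters: each adapted d x k matrix adds rank * (d + k)."""
    return n_layers * modules_per_layer * rank * (d + k)


BYTES_PER_PARAM = 2  # assumes the adapter is stored in 16-bit

for rank in (8, 16, 32):
    n = adapter_params(rank)
    size_mib = n * BYTES_PER_PARAM / 2**20
    print(f"r={rank:<2}  params={n:>10,}  size={size_mib:.0f} MiB")
```

Under these assumptions the adapter comes to about 8 MiB at r=8, 16 MiB at r=16, and 32 MiB at r=32: doubling the rank doubles the adapter.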
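QLoRA builds on LoRA by quantising the frozen base model to 4 bits using bitsandbytes (which also offers 8‑bit loading) while still training the adapter in higher precision. The result: the QLoRA paper reports fine‑tuning a 65B model on a single 48 GB GPU. A 70B model needs about 35 GB for its 4‑bit weights alone, so it will not fit on a single 24 GB GPU without offloading.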
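A quick back‑of‑the‑envelope check shows where the memory saving comes from. The helper below (weight_gb is a hypothetical name) counts only the base model's weights. Activations, adapter gradients, optimizer state, and quantisation constants come on top, so treat the numbers as a lower bound.

```python
def weight_gb(n_params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9


for size in (7, 13, 70):
    print(
        f"{size:>2}B  16-bit: {weight_gb(size, 16):6.1f} GB   "
        f"4-bit: {weight_gb(size, 4):5.1f} GB"
    )
```

A 70B model comes out at roughly 140 GB in 16‑bit and 35 GB in 4‑bit, so it is worth comparing the 4‑bit figure against your card's VRAM before committing to a model size.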
Key Benefits
Quality Retention: Empirical studies (including my own benchmarks) show …
Setting Up QLoRA
Install bitsandbytes (CUDA‑compatible version).
pip install bitsandbytes
Load the model with a BitsAndBytesConfig that sets load_in_4bit=True and bnb_4bit_compute_dtype=torch.float16.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    device_map="auto",
    quantization_config=bnb_cfg,
)
Actionable Tip #2 – Use nf4 Quantisation for Better Stability
bitsandbytes offers two 4‑bit schemes: fp4 and nf4. nf4 (4‑bit NormalFloat) is designed for normally distributed weights, so it tends to preserve their distribution better, which translates to lower quantisation error and more stable LoRA training. fp4 is the library default, so if you hit sudden spikes in loss, check that bnb_4bit_quant_type is set to "nf4".
| Criterion | Full‑Model FT | LoRA | QLoRA |
|---|---|---|---|
| GPU Budget | Multiple A100‑40G or V100‑32G | Single RTX 4090 / A6000 | Single 24 GB (RTX 4090) or 48 GB (A6000) GPU |
| Model Size | Up to ~13 B comfortably | Any size that fits in memory (adapter tiny) | Up to 70 B (quantised; ~35 GB of 4‑bit weights, so a 48 GB‑class card) |
| Deployment Complexity | High – new artifact, versioning | Low – swap adapters | Low – same as LoRA, but smaller runtime |
| Performance Gap vs Full‑FT | 0 % (baseline) | ~2–5 % on average | ~1–3 % on average |
| Use‑Case Fit | Token‑embedding changes, architecture tweaks | Domain‑specific chat, classification, summarisation | Large‑scale embeddings, retrieval‑augmented generation, heavy traffic services |
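Below is a minimal script that works for both LoRA and QLoRA. It runs plain LoRA as written; to switch to QLoRA, load the base model in 4‑bit (load_in_4bit=True) instead of in 16‑bit and leave everything else the same.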
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama tokenizers ship without a pad token; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

data = load_dataset("json", data_files={"train": "train.jsonl", "valid": "valid.jsonl"})


def tokenize_fn(batch):
    # A causal LM trains on prompt + completion as one sequence,
    # so the labels are a copy of the input IDs.
    texts = [p + c + tokenizer.eos_token for p, c in zip(batch["prompt"], batch["completion"])]
    tokens = tokenizer(texts, truncation=True, max_length=512)
    tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return tokens


tokenized = data.map(tokenize_fn, batched=True, remove_columns=["prompt", "completion"])

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,              # matches the bfloat16 weights; needs an Ampere-or-newer GPU
    logging_steps=20,
    save_steps=200,
    eval_strategy="steps",  # called evaluation_strategy in older transformers releases
    eval_steps=100,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["valid"],
    # Pads inputs and labels per batch; padded label positions are ignored by the loss.
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),
)
trainer.train()

model.save_pretrained("my_adapter")
tokenizer.save_pretrained("my_adapter")
print("✅ Training complete – adapter saved!")
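Once the adapter is saved, serving it means loading the frozen base model and attaching the adapter on top. The following is a minimal sketch, assuming the my_adapter directory written by the script above and the same base checkpoint. load_with_adapter and generate are hypothetical helper names, and the heavy imports sit inside the function so the module can be imported on a machine without a GPU.

```python
def load_with_adapter(base_name, adapter_dir, merge=False):
    """Load a frozen base model and attach a saved LoRA adapter to it."""
    # Imported lazily: these libraries are large and slow to import.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base = AutoModelForCausalLM.from_pretrained(
        base_name, device_map="auto", torch_dtype=torch.float16
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    if merge:
        # Fold the adapter into the base weights so inference has no extra branch.
        model = model.merge_and_unload()
    model.eval()
    return model, tokenizer


def generate(model, tokenizer, prompt, max_new_tokens=64):
    """Generate a continuation for a single prompt and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the base model, so it is left commented out):
# model, tokenizer = load_with_adapter("meta-llama/Llama-2-7b-chat-hf", "my_adapter")
# print(generate(model, tokenizer, "Summarise LoRA in one sentence."))
```

Keeping the adapter separate (merge=False) lets you swap adapters on one base model; merging trades that flexibility for a single set of weights at inference time.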
Actionable Tip #3 – Use Gradient Accumulation to Fit Bigger Batches
Even on a 24 GB card you can simulate a larger batch: per_device_train_batch_size=4 with gradient_accumulation_steps=8 gives an effective batch size of 32, and raising the accumulation steps to 16 gives 64. Larger effective batches improve stability, especially with low‑rank adapters.
Those numbers assume standard on‑demand pricing and a modest 10 GB dataset. The takeaway: you can get production‑grade results for pennies.
To prove the point, here are three production pipelines I’ve deployed on thirteen active websites:
Adapted from an episode of Signal Notes. Listen on your favorite podcast app.