# Fine-Tuning Transformers vs LoRA vs QLoRA 2024

> Source: <https://dev.to/samchenreviews/fine-tuning-transformers-vs-lora-vs-qlora-2024-l2>
> Published: 2026-06-12 16:21:57+00:00

Hey folks, Nick Creighton here. If you’ve been listening to the latest *Build Log* episode you know I’m all about shipping code that actually moves the needle. In this post I’m taking the audio‑first conversation we just had and turning it into a step‑by‑step guide you can read, bookmark, and act on.

Just a year ago the default way to adapt a large language model (LLM) was to **re‑train every single parameter**. That meant pulling a ggml or torch checkpoint, slamming a GPU farm together, and waiting hours (or days) for a new .bin file to appear. The result? A massive artifact that cost you in storage, latency, and maintenance.

Fast‑forward to today: **LoRA** and its cousin **QLoRA** have become the de‑facto tools for most production teams. They let you *add a tiny adapter* to a frozen model, keep the base weights untouched, and ship a few megabytes of delta‑weights instead of gigabytes of new model. The upside is immediate – lower cost, faster iteration, and a simpler deployment pipeline.

Below you’ll find the practical, no‑fluff details you need to decide which approach fits your project, how to set it up, and which pitfalls to avoid.

When I first started experimenting with LLMs, the workflow looked like this:

That’s all well and good if you have a dedicated server farm and a budget that looks like a startup’s seed round. But for most indie developers, SaaS founders, or data‑science hobbyists, the **cost‑to‑value ratio** is terrible.

### When It Still Makes Sense

LoRA is essentially a **matrix factorisation** trick. Instead of updating the full weight matrix W ∈ ℝ^{d×k}, you freeze W and learn two much smaller matrices A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k} such that ΔW = A·B. The rank r is typically 4–64 – orders of magnitude smaller than d or k.
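To make the saving concrete, here is a minimal sketch that counts trainable parameters for one weight matrix under full fine-tuning versus a rank-r adapter. The 4096 × 4096 shape is an assumption picked for illustration, not a figure from any specific model.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # full fine-tuning updates every entry of W
    lora = r * (d + k)  # A is d x r and B is r x k
    return full, lora


# Hypothetical 4096 x 4096 projection, used only as an example shape.
for r in (4, 8, 32, 64):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {lora:>9,} adapter params vs {full:,} full ({lora / full:.2%})")
```

Because the adapter count is r·(d + k), doubling the rank doubles the adapter size while the frozen matrix stays untouched.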

### Why It Works

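### Getting Started with LoRA

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=32,  # rank
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # typical for Llama
    lora_dropout=0.05,
    bias="none",
)

# base_model is the frozen model you loaded beforehand,
# e.g. with AutoModelForCausalLM.from_pretrained(...)
model = get_peft_model(base_model, lora_cfg)
```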

### Actionable Tip #1 – Keep the Rank Low, Then Scale

If you’re not sure about the optimal rank, start with r=8 and evaluate on a validation set. If the score is still below your target, bump to 16 or 32 and stop once it plateaus. Adapter memory grows linearly with r, so the trade-off is easy to see at each step.
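One way to turn that rank sweep into a rule is to pick the lowest rank whose validation score sits within a small tolerance of the best one. The sketch below assumes you have already collected one score per rank; the helper name, the tolerance, and the scores are all made up for illustration.

```python
def smallest_sufficient_rank(scores: dict[int, float], tolerance: float = 0.005) -> int:
    """Pick the lowest rank whose validation score is within `tolerance` of the best."""
    if not scores:
        raise ValueError("scores must contain at least one rank")
    best = max(scores.values())
    return min(rank for rank, score in scores.items() if best - score <= tolerance)


# Hypothetical validation scores (higher is better), not real benchmark results.
sweep = {8: 0.712, 16: 0.741, 32: 0.744, 64: 0.745}
print(smallest_sufficient_rank(sweep))
```

With these invented numbers the jump from 8 to 16 matters, while 32 and 64 add almost nothing, so the helper settles on the cheaper rank.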

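QLoRA builds on LoRA by **quantising the frozen base model to 4 bits with bitsandbytes** while still training the adapter in higher precision (bitsandbytes also offers 8-bit loading, though the QLoRA recipe itself is 4-bit). The result: models that used to need a multi-GPU node fit on a single card. A 24 GB GPU is realistic for models up to roughly 30 B parameters, while a 70 B model, whose 4-bit weights alone take about 35 GB, needs a 48 GB card.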
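A rough back-of-the-envelope check helps when sizing a GPU for a quantised model. The sketch below counts memory for the weights alone; it deliberately ignores quantisation constants, the adapter, optimizer state, and activations, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size:>2}B params: {fp16:6.1f} GB at 16-bit, {int4:5.1f} GB at 4-bit")
```

Treat the result as a lower bound: if the 4-bit weights already exceed your card's memory, no adapter setting will make the model fit.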

### Key Benefits

**Quality Retention**: Empirical studies (including my own benchmarks) show

### Setting Up QLoRA

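Install bitsandbytes (CUDA-compatible version).

```
pip install bitsandbytes
```

Load the model with a BitsAndBytesConfig that sets load_in_4bit=True and bnb_4bit_compute_dtype=torch.float16.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantisation options go into a BitsAndBytesConfig rather than loose keyword arguments.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    device_map="auto",
    quantization_config=bnb_cfg,
)
```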

### Actionable Tip #2 – Use nf4 Quantisation for Better Stability

bitsandbytes offers two 4-bit data types: fp4 and nf4. nf4 (4-bit NormalFloat) is built for weights that are roughly normally distributed, so it tends to lose less information during quantisation and gives LoRA training a more faithful base model. If you hit sudden spikes in loss with fp4, switch to nf4.

| Criterion | Full‑Model FT | LoRA | QLoRA |
| --- | --- | --- | --- |
| GPU Budget | Multiple A100‑40G or V100‑32G | Single RTX 4090 / A6000 | Single 24 GB to 48 GB GPU (RTX 4090, A6000) |
| Model Size | Up to ~13 B comfortably | Any size (adapter tiny) | Up to 70 B (quantised, on a 48 GB card) |
| Deployment Complexity | High: new artifact, versioning | Low: swap adapters | Low: same as LoRA, but smaller runtime |
| Performance Gap vs Full‑FT | 0 % (baseline) | ~2–5 % on average | ~1–3 % on average |
| Use‑Case Fit | Token‑embedding changes, architecture tweaks | Domain‑specific chat, classification, summarisation | Large‑scale embeddings, retrieval‑augmented generation, heavy traffic services |

Below is a minimal script that works for both LoRA and QLoRA. Uncomment the quantization_config argument to switch to QLoRA.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama tokenizers ship without a pad token; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    # Uncomment the next five lines for QLoRA
    # quantization_config=BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_compute_dtype=torch.float16,
    #     bnb_4bit_quant_type="nf4",
    # ),
)
# For QLoRA, peft's prepare_model_for_kbit_training(model) is usually called here.

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_cfg)

data = load_dataset("json", data_files={"train": "train.jsonl", "valid": "valid.jsonl"})


def tokenize_fn(batch):
    # Causal LM training: prompt and completion form one sequence, and the
    # labels mirror input_ids (the model shifts them internally).
    texts = [p + c + tokenizer.eos_token for p, c in zip(batch["prompt"], batch["completion"])]
    tokens = tokenizer(texts, truncation=True, max_length=512)
    tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return tokens


tokenized = data.map(tokenize_fn, batched=True, remove_columns=["prompt", "completion"])

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=20,
    save_steps=200,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=100,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["valid"],
    # Pads input_ids and labels per batch; label padding is ignored by the loss.
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),
)

trainer.train()

# Saves only the adapter weights, not the frozen base model.
model.save_pretrained("my_adapter")
tokenizer.save_pretrained("my_adapter")
print("✅ Training complete – adapter saved!")
```

### Actionable Tip #3 – Use Gradient Accumulation to Fit Bigger Batches

Even on a 24 GB card you can simulate an effective batch size of 32 by setting per_device_train_batch_size=4 and gradient_accumulation_steps=8 (raise the accumulation steps to 16 for an effective 64). Larger effective batches improve stability, especially with low‑rank adapters.
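To see why accumulation stands in for a larger batch, the toy sketch below compares the gradient of one batch of 32 samples with the sum of eight micro-batch gradients, each scaled by 1/8. It uses a one-parameter model y = w·x with a mean-squared-error loss in plain Python; the data and function names are invented for illustration, and this is not how Trainer is implemented internally.

```python
import math


def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean squared error with respect to w for y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float], micro_batch_size: int) -> float:
    """Sum micro-batch gradients, each scaled by 1 / number of accumulation steps."""
    steps = len(xs) // micro_batch_size  # assumes len(xs) divides evenly
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


# Made-up data: 32 samples from y = 3x + 1.
xs = [float(i) for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]

full_batch = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=4)
print(math.isclose(full_batch, accumulated))
```

The two gradients match because averaging over equal-sized micro-batches and then averaging those results is the same as averaging over all samples at once; only peak memory differs.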

Those numbers assume [standard on‑demand pricing](https://cloud.google.com/compute/pricing) and a modest 10 GB dataset. The takeaway: **you can get production‑grade results for pennies.**

To prove the point, here are three production pipelines I’ve deployed on *thirteen* active websites:

*Adapted from an episode of Signal Notes. Listen on your favorite podcast app.*
