Fine-Tuning Transformers vs LoRA vs QLoRA 2024

Source: dev.to · Published 2026-06-12

Nick Creighton, a developer and host of the *Build Log* podcast, has published a practical comparison of full-model fine-tuning, LoRA, and QLoRA for adapting large language models in 2024. The guide details how LoRA and QLoRA have largely replaced full retraining in production by using small, low-cost adapters that keep base model weights frozen, drastically reducing storage, latency, and GPU requirements. Creighton provides actionable setup instructions and benchmarks showing that QLoRA, which trains adapters on top of a 4-bit quantised base model, stays within a 1–3% performance gap of full fine-tuning while fitting models of up to 70 billion parameters on a single high-memory GPU.

4 min read · published Jun 12, 2026

Hey folks, Nick Creighton here. If you’ve been listening to the latest Build Log episode you know I’m all about shipping code that actually moves the needle. In this post I’m taking the audio‑first conversation we just had and turning it into a step‑by‑step guide you can read, bookmark, and act on.

Just a year ago the default way to adapt a large language model (LLM) was to re‑train every single parameter. That meant pulling a ggml or torch checkpoint, slamming a GPU farm together, and waiting hours (or days) for a new .bin file to appear. The result? A massive artifact that cost you in storage, latency, and maintenance.

Fast-forward to today: LoRA and its cousin QLoRA have become the de facto tools for most production teams. They let you add a tiny adapter to a frozen model, keep the base weights untouched, and ship a few megabytes of delta-weights instead of gigabytes of new model. The upside is immediate: lower cost, faster iteration, and a simpler deployment pipeline.

Below you’ll find the practical, no‑fluff details you need to decide which approach fits your project, how to set it up, and which pitfalls to avoid.

When I first started experimenting with LLMs, the workflow looked like this:

That’s all well and good if you have a dedicated server farm and a budget that looks like a startup’s seed round. But for most indie developers, SaaS founders, or data‑science hobbyists, the cost‑to‑value ratio is terrible.

When It Still Makes Sense

LoRA is essentially a matrix factorisation trick. Instead of updating the full weight matrix W ∈ ℝ^{d×k}, you freeze W and learn two much smaller matrices A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k} such that ΔW = A·B. The rank r is typically 4–64 – orders of magnitude smaller than d or k.
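To make the saving concrete, here is a small sketch of the parameter arithmetic for a single weight matrix. The function name and the 4096 × 4096 layer size are illustrative assumptions of mine, not figures from the episode.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k            # updating W directly
    lora = d * r + r * k    # A is d x r, B is r x k
    return full, lora


# Illustrative sizes (assumed): a 4096 x 4096 projection trained at rank 8.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
# prints: full: 16,777,216  lora: 65,536  reduction: 256x
```

At rank 8 this one layer trains 256 times fewer parameters, and the same ratio applies to every matrix you attach an adapter to.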

Why It Works

Getting Started with LoRA

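from peft import LoraConfig, get_peft_model

# base_model is assumed to be a causal LM you have already loaded,
# e.g. with AutoModelForCausalLM.from_pretrained(...)
lora_cfg = LoraConfig(
    r=32,                                 # rank of the low-rank update
    lora_alpha=64,                        # scaling: the update is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # typical for Llama
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only the adapter should be trainable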

Actionable Tip #1 – Keep the Rank Low, Then Scale

If you're not sure about the optimal rank, start with r=8 and evaluate on a validation set. If validation performance plateaus short of your target, bump to 16 or 32. The adapter's parameter count and memory overhead grow linearly with r, so the trade-off is easy to see.
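The linear growth is easy to check with a back-of-envelope calculation. The sketch below assumes a Llama-7B-like shape (32 layers, hidden size 4096, adapters on q_proj and v_proj only) and 16-bit adapter weights. Those defaults are my illustrative assumptions, so plug in your own model's numbers.

```python
def adapter_size_mib(r, layers=32, hidden=4096, modules_per_layer=2, bytes_per_param=2):
    """Approximate LoRA adapter size in MiB for square hidden x hidden projections."""
    params_per_module = r * (hidden + hidden)  # A (hidden x r) plus B (r x hidden)
    total_params = layers * modules_per_layer * params_per_module
    return total_params * bytes_per_param / 2**20


for r in (8, 16, 32):
    print(f"r={r:>2}: {adapter_size_mib(r):.0f} MiB")
```

Doubling the rank doubles the adapter size. This covers only the saved adapter weights; gradients and optimizer state for the adapter add a small multiple of the same figure during training.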

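QLoRA builds on LoRA by quantising the frozen base model to 4 bits with bitsandbytes while still training the adapter in higher precision (bitsandbytes also offers 8-bit loading, which saves less memory). The result: 4-bit weights take roughly half a byte per parameter, so a single 24 GB GPU can fine-tune models of around 30 B parameters, and a 70-B model (about 35 GB of weights) fits on a single 48 GB card!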
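A rough way to sanity-check whether a quantised base model fits on a given card is to estimate the memory for the weights alone. This is a sketch under simplifying assumptions: it ignores quantisation constants, activations, the adapter's gradients, and optimizer state, all of which add overhead on top. The function name is mine.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in GB (10**9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size:>2}B  fp16: {fp16:5.1f} GB   4-bit: {four_bit:5.1f} GB")
```

By this estimate the 4-bit weights of a 70-B model already take about 35 GB, so compare the figure against your card's memory, with headroom, before committing to a model size.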

Key Benefits

Quality Retention: Empirical studies (including my own benchmarks) show an average gap of only about 1–3 % versus full fine-tuning.

Setting Up QLoRA

Install bitsandbytes (CUDA‑compatible version).

pip install bitsandbytes

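Load the model with a BitsAndBytesConfig that sets load_in_4bit=True and bnb_4bit_compute_dtype=torch.float16.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Current transformers releases expect the 4-bit options in a BitsAndBytesConfig
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    device_map="auto",
    quantization_config=bnb_cfg,
)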

Actionable Tip #2 – Use nf4 Quantisation for Better Stability

bitsandbytes offers two 4-bit schemes: fp4 and nf4. nf4 (4-bit NormalFloat) spaces its quantisation levels to match normally distributed values, which is how pretrained weights tend to be distributed, so it usually introduces less quantisation error than fp4. If you hit sudden spikes in loss with fp4, switch to nf4.

| Criterion | Full-Model FT | LoRA | QLoRA |
|---|---|---|---|
| GPU Budget | Multiple A100-40G or V100-32G | Single RTX 4090 / A6000 | Single 24–48 GB GPU (RTX 4090, A6000) |
| Model Size | Up to ~13 B comfortably | Any size whose frozen 16-bit weights fit in memory (adapter tiny) | Up to 70 B (quantised, on a 48 GB card) |
| Deployment Complexity | High – new artifact, versioning | Low – swap adapters | Low – same as LoRA, but smaller runtime |
| Performance Gap vs Full-FT | 0 % (baseline) | ~2–5 % on average | ~1–3 % on average |
| Use-Case Fit | Token-embedding changes, architecture tweaks | Domain-specific chat, classification, summarisation | Large-scale embeddings, retrieval-augmented generation, heavy traffic services |

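Below is a minimal script that works for both LoRA and QLoRA. Flip the USE_4BIT flag to toggle.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
USE_4BIT = False  # False = LoRA on fp16 weights, True = QLoRA on 4-bit weights
MAX_LEN = 512

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

# 4-bit quantisation settings, only used for QLoRA
quant_cfg = None
if USE_4BIT:
    quant_cfg = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quant_cfg,
)
if USE_4BIT:
    model = prepare_model_for_kbit_training(model)

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

data = load_dataset("json", data_files={"train": "train.jsonl", "valid": "valid.jsonl"})

def tokenize_fn(example):
    # A causal LM reads prompt + completion as one sequence; -100 masks the
    # prompt so the loss is computed on the completion tokens only.
    prompt_ids = tokenizer(example["prompt"])["input_ids"]  # Llama tokenizer prepends BOS
    completion_ids = tokenizer(example["completion"], add_special_tokens=False)["input_ids"]
    completion_ids = completion_ids + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + completion_ids)[:MAX_LEN]
    labels = ([-100] * len(prompt_ids) + completion_ids)[:MAX_LEN]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids), "labels": labels}

tokenized = data.map(tokenize_fn, remove_columns=["prompt", "completion"])

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=20,
    save_steps=200,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=100,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["valid"],
    # Pads input_ids and labels (with -100) to the longest sequence in each batch
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),
)

trainer.train()

# For a PEFT model this saves only the adapter weights, not the base model
model.save_pretrained("my_adapter")
tokenizer.save_pretrained("my_adapter")
print("✅ Training complete – adapter saved!")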

Actionable Tip #3 – Use Gradient Accumulation to Fit Bigger Batches

Even on a 24 GB card you can simulate a batch size of 32–64 by setting per_device_train_batch_size=4 and gradient_accumulation_steps to 8 (4 × 8 = 32) or 16 (4 × 16 = 64). Larger effective batches improve stability, especially with low-rank adapters.
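If gradient accumulation feels like magic, the toy example below shows why it works: averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch. It uses a one-parameter model y = w * x with made-up data, and the helper names are mine; it is a sketch of the idea, not what Trainer runs internally.

```python
import math


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer step."""
    return per_device * accumulation_steps * num_devices


def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list, micro_batch: int) -> float:
    """Average the gradients of equal-sized micro-batches, as accumulation does."""
    starts = range(0, len(xs), micro_batch)
    grads = [grad_mse(w, xs[s:s + micro_batch], ys[s:s + micro_batch]) for s in starts]
    return sum(grads) / len(grads)


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

full = grad_mse(0.5, xs, ys)              # one batch of 8
accum = accumulated_grad(0.5, xs, ys, 4)  # two micro-batches of 4

print(effective_batch_size(4, 8))         # 32
print(math.isclose(full, accum))          # True
```

The equivalence holds when every micro-batch has the same size. The price is wall-clock time: each optimizer step now needs several forward and backward passes.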

Those numbers assume standard on‑demand pricing and a modest 10 GB dataset. The takeaway: you can get production‑grade results for pennies.

To prove the point, here are three production pipelines I’ve deployed on thirteen active websites:

Adapted from an episode of Signal Notes. Listen on your favorite podcast app.

Read original on dev.to → dev.to/samchenreviews/fine-tuning-transformers-v…
