{"slug": "fine-tuning-transformers-vs-lora-vs-qlora-2024", "title": "Fine-Tuning Transformers vs LoRA vs QLoRA 2024", "summary": "Nick Creighton, a developer and host of the *Build Log* podcast, has published a practical comparison of full-model fine-tuning, LoRA, and QLoRA for adapting large language models in 2024. The guide details how LoRA and QLoRA have largely replaced full retraining in production by using small, low-cost adapters that keep base model weights frozen, drastically reducing storage, latency, and GPU requirements. Creighton provides actionable setup instructions and benchmarks showing that QLoRA can fine-tune a 70-billion-parameter model on a single GPU with only a 1–3% performance gap compared to full fine-tuning.", "body_md": "Hey folks, Nick Creighton here. If you’ve been listening to the latest *Build Log* episode, you know I’m all about shipping code that actually moves the needle. In this post I’m taking the audio‑first conversation we just had and turning it into a step‑by‑step guide you can read, bookmark, and act on.\n\nJust a year ago the default way to adapt a large language model (LLM) was to **re‑train every single parameter**. That meant pulling a ggml or torch checkpoint, slamming a GPU farm together, and waiting hours (or days) for a new .bin file to appear. The result? A massive artifact that cost you in storage, latency, and maintenance.\n\nFast‑forward to today: **LoRA** and its cousin **QLoRA** have become the de facto tools for most production teams. They let you *add a tiny adapter* to a frozen model, keep the base weights untouched, and ship a few megabytes of delta‑weights instead of gigabytes of new model weights. 
The upside is immediate – lower cost, faster iteration, and a simpler deployment pipeline.\n\nBelow you’ll find the practical, no‑fluff details you need to decide which approach fits your project, how to set it up, and which pitfalls to avoid.\n\nWhen I first started experimenting with LLMs, the workflow looked like this:\n\nThat’s all well and good if you have a dedicated server farm and a budget that looks like a startup’s seed round. But for most indie developers, SaaS founders, or data‑science hobbyists, the **cost‑to‑value ratio** is terrible.\n\nWhen It Still Makes Sense\n\nLoRA is essentially a **matrix factorisation** trick. Instead of updating the full weight matrix W ∈ ℝ^{d×k}, you freeze W and learn two much smaller matrices A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k} such that ΔW = A·B. The rank r is typically 4–64 – orders of magnitude smaller than d or k.\n\nWhy It Works\n\nGetting Started with LoRA\n\nfrom peft import LoraConfig, get_peft_model\n\nlora_cfg = LoraConfig(\n\n    r=32,  # rank\n\n    lora_alpha=64,  # scaling factor; the update is scaled by lora_alpha / r\n\n    target_modules=[\"q_proj\", \"v_proj\"],  # typical for Llama\n\n    lora_dropout=0.05,\n\n    bias=\"none\",\n\n)\n\nmodel = get_peft_model(base_model, lora_cfg)  # base_model is the frozen, already loaded model\n\nActionable Tip #1 – Keep the Rank Low, Then Scale\n\nIf you’re not sure about the optimal rank, start with r=8. Evaluate on a validation set. If performance plateaus below your target, bump to 16 or 32. Each adapted matrix gains r·(d + k) trainable parameters, so the memory overhead grows linearly with r and the trade‑off is easy to see.\n\nQLoRA builds on LoRA by **quantising the base model to 4 bits (or 8 bits) using bitsandbytes** while still training the adapter in full precision. 
The result: the 4‑bit weights of a 70‑B model take roughly 35 GB, so it fits on a single 48 GB GPU, and a 24 GB card can handle models up to roughly 30 B.\n\nKey Benefits\n\n**Quality Retention**: Empirical studies (including my own benchmarks) show a gap of only about 1–3 % versus full fine‑tuning.\n\nSetting Up QLoRA\n\nInstall bitsandbytes (CUDA‑compatible version).\n\npip install bitsandbytes\n\nLoad the model with load_in_4bit=True and set bnb_4bit_compute_dtype=torch.float16.\n\nimport torch\n\nfrom transformers import AutoModelForCausalLM\n\nmodel = AutoModelForCausalLM.from_pretrained(\n\n    \"meta-llama/Llama-2-70b-chat-hf\",\n\n    device_map=\"auto\",\n\n    load_in_4bit=True,  # quantise the frozen base weights to 4 bits\n\n    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications\n\n    bnb_4bit_quant_type=\"nf4\",\n\n)\n\nActionable Tip #2 – Use nf4 Quantisation for Better Stability\n\nBitsandbytes offers two 4‑bit schemes: fp4 and nf4. nf4 (4‑bit NormalFloat) tends to preserve the distribution of weights better, which translates to less “catastrophic forgetting” during LoRA training. If you hit sudden spikes in loss with fp4, switch to nf4.\n\n| Criterion | Full‑Model FT | LoRA | QLoRA |\n| --- | --- | --- | --- |\n| GPU Budget | Multiple A100‑40G or V100‑32G | Single RTX 4090 / A6000 | Single 24 GB (RTX 4090) or 48 GB (A6000) GPU |\n| Model Size | Up to ~13 B comfortably | Any size (adapter tiny) | Up to ~30 B on 24 GB, 70 B on 48 GB (quantised) |\n| Deployment Complexity | High – new artifact, versioning | Low – swap adapters | Low – same as LoRA, but smaller runtime |\n| Performance Gap vs Full‑FT | 0 % (baseline) | ~2–5 % on average | ~1–3 % on average |\n| Use‑Case Fit | Token‑embedding changes, architecture tweaks | Domain‑specific chat, classification, summarisation | Large‑scale embeddings, retrieval‑augmented generation, heavy traffic services |\n\nBelow is a minimal script that works for both LoRA and QLoRA. 
Uncomment the three quantisation lines to switch from LoRA to QLoRA.\n\nimport torch\n\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments\n\nfrom peft import LoraConfig, get_peft_model\n\nfrom datasets import load_dataset\n\nmodel_name = \"meta-llama/Llama-2-7b-chat-hf\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\ntokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token\n\nmodel = AutoModelForCausalLM.from_pretrained(\n\n    model_name,\n\n    device_map=\"auto\",\n\n    torch_dtype=torch.float16,\n\n    # Uncomment the next three lines for QLoRA\n\n    # load_in_4bit=True,\n\n    # bnb_4bit_compute_dtype=torch.float16,\n\n    # bnb_4bit_quant_type=\"nf4\",\n\n)\n\nlora_cfg = LoraConfig(\n\n    r=32,\n\n    lora_alpha=64,\n\n    target_modules=[\"q_proj\", \"v_proj\"],\n\n    lora_dropout=0.05,\n\n    bias=\"none\",\n\n)\n\nmodel = get_peft_model(model, lora_cfg)\n\ndata = load_dataset(\"json\", data_files={\"train\": \"train.jsonl\", \"valid\": \"valid.jsonl\"})\n\ndef tokenize_fn(batch):\n\n    # batched=True passes lists, so join each prompt with its completion\n\n    texts = [p + c for p, c in zip(batch[\"prompt\"], batch[\"completion\"])]\n\n    tokens = tokenizer(texts, truncation=True, max_length=512)\n\n    # For causal LM training the labels are the input ids themselves\n\n    tokens[\"labels\"] = [ids.copy() for ids in tokens[\"input_ids\"]]\n\n    return tokens\n\ntokenized = data.map(tokenize_fn, batched=True, remove_columns=[\"prompt\", \"completion\"])\n\ntraining_args = TrainingArguments(\n\n    output_dir=\"outputs\",\n\n    per_device_train_batch_size=4,\n\n    gradient_accumulation_steps=4,\n\n    num_train_epochs=3,\n\n    learning_rate=2e-4,\n\n    fp16=True,\n\n    logging_steps=20,\n\n    save_steps=200,\n\n    evaluation_strategy=\"steps\",\n\n    eval_steps=100,\n\n    load_best_model_at_end=True,\n\n)\n\ntrainer = Trainer(\n\n    model=model,\n\n    args=training_args,\n\n    train_dataset=tokenized[\"train\"],\n\n    eval_dataset=tokenized[\"valid\"],\n\n    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),  # pads inputs and labels per batch\n\n)\n\ntrainer.train()\n\nmodel.save_pretrained(\"my_adapter\")  # saves only the adapter weights\n\ntokenizer.save_pretrained(\"my_adapter\")\n\nprint(\"✅ Training complete – adapter saved!\")\n\nActionable Tip #3 – Use Gradient Accumulation to Fit Bigger Batches\n\nEven on a 24 GB card you can simulate a batch size of 32–64 by setting 
per_device_train_batch_size=4 and gradient_accumulation_steps=8 (4 × 8 = 32) or 16 (4 × 16 = 64). Larger effective batches improve stability, especially with low‑rank adapters.\n\nThose numbers assume [standard on‑demand pricing](https://cloud.google.com/compute/pricing) and a modest 10 GB dataset. The takeaway: **you can get production‑grade results for pennies.**\n\nTo prove the point, here are three production pipelines I’ve deployed on *thirteen* active websites:\n\n*Adapted from an episode of Signal Notes. Listen on your favorite podcast app.*", "url": "https://wpnews.pro/news/fine-tuning-transformers-vs-lora-vs-qlora-2024", "canonical_source": "https://dev.to/samchenreviews/fine-tuning-transformers-vs-lora-vs-qlora-2024-l2", "published_at": "2026-06-12 16:21:57+00:00", "updated_at": "2026-06-12 16:41:32.047826+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "generative-ai", "ai-tools"], "entities": ["Nick Creighton", "LoRA", "QLoRA", "Build Log"], "alternates": {"html": "https://wpnews.pro/news/fine-tuning-transformers-vs-lora-vs-qlora-2024", "markdown": "https://wpnews.pro/news/fine-tuning-transformers-vs-lora-vs-qlora-2024.md", "text": "https://wpnews.pro/news/fine-tuning-transformers-vs-lora-vs-qlora-2024.txt", "jsonld": "https://wpnews.pro/news/fine-tuning-transformers-vs-lora-vs-qlora-2024.jsonld"}}