Source: dev.to · Topic: large-language-models

QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

A developer fine-tuned Qwen2.5-7B on a 16GB T4 GPU using QLoRA, quantizing the frozen base model to 4-bit NF4 to reduce memory footprint from 15GB to 5.44GB. The technique enables training large language models on consumer hardware by combining 4-bit quantization with LoRA adapters, though throughput is limited to about three examples per second. The approach roughly matched the accuracy of smaller models from prior experiments.

Published Jun 21, 2026 · 2 min read

In Part 2, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve.

A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would barely load, with no room left to actually train.
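The arithmetic behind that wall can be sketched in a few lines. This is a back-of-the-envelope estimate only: the parameter count below is an assumed round figure for a "7B" model, and it counts weights alone, not activations, gradients, or optimizer state.

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.6e9   # assumption: approximate parameter count of a "7B" model
GPU_GB = 16        # nominal T4 memory, as quoted in the text

fp16_gb = weight_footprint_gb(N_PARAMS, 16)
print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"headroom on a {GPU_GB} GB card: {GPU_GB - fp16_gb:.1f} GB")
```

Whatever headroom is left has to hold everything training needs besides the weights, which is why loading alone is not enough.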

QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision?

So you quantize the frozen base to 4-bit (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.
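To make the idea concrete, here is a simplified sketch of blockwise 4-bit quantization in pure Python. It is not the bitsandbytes implementation and does not use the exact NF4 table: the 16-value codebook is built from evenly spaced quantiles of a normal distribution as a stand-in, and the function names are hypothetical.

```python
from statistics import NormalDist

def make_codebook(levels=16):
    """16 code values from evenly spaced normal quantiles, rescaled to [-1, 1].
    A stand-in for NF4, not the real NF4 table."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, codebook):
    """Return (4-bit indices, absmax scale) for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    idx = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
           for w in block]
    return idx, absmax

def dequantize_block(idx, absmax, codebook):
    """Look each index up in the codebook and rescale by the block's absmax."""
    return [codebook[i] * absmax for i in idx]

codebook = make_codebook()
block = [0.02, -0.5, 0.13, 0.25]          # made-up weights for illustration
idx, scale = quantize_block(block, codebook)
restored = dequantize_block(idx, scale, codebook)
print("indices:", idx, "scale:", scale)
print("restored:", [round(w, 3) for w in restored])
```

Each weight is stored as a 4-bit index plus one shared scale per block. Because the code values cluster near zero, where most weights live, the rounding error stays small for typical values.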

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-7B"  # assumed Hub ID; the article only names the model as Qwen2.5-7B

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto")

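Each flag earns its place:

load_in_4bit: store the frozen base weights in 4-bit instead of 16-bit.

nf4: NormalFloat4, a 4-bit format tuned for how neural-net weights are distributed.

double_quant: quantize the quantization constants too, which trims the footprint a little further.

compute_dtype: the precision the 4-bit weights are dequantized to for the matmuls (fp16 here).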
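How much does double quantization buy? A rough worked example, using the block sizes described in the QLoRA paper (64 weights per block, 32-bit constants, then 8-bit constants grouped in blocks of 256). Whether a given bitsandbytes version uses exactly these sizes is an assumption here, as is the 7.6e9 parameter count.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight from storing one scale constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, inner_bits=8,
                               outer_block=256, outer_bits=32):
    """Overhead when the per-block constants are themselves quantized to
    inner_bits, with one outer_bits constant per outer_block of them."""
    return inner_bits / block_size + outer_bits / (block_size * outer_block)

plain = constant_overhead_bits()
doubled = double_quant_overhead_bits()
n_params = 7.6e9  # assumption: approximate size of a "7B" model
saved_gb = (plain - doubled) * n_params / 8 / 1e9

print(f"overhead per weight: {plain:.3f} bits -> {doubled:.3f} bits")
print(f"saved on {n_params:.1e} weights: about {saved_gb:.2f} GB")
```

The saving is modest next to the jump from 16-bit to 4-bit, but on a 16GB card a few hundred megabytes can decide whether a batch fits.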

One line of output:

loaded in 4-bit. footprint: 5.44 GB

I downloaded 15.2GB of weights and they sat in memory as 5.44GB. A model that couldn't be loaded for full fine-tuning was now training on a single consumer GPU — with room to spare. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)

Three more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, target all linear layers (the QLoRA paper found this matters), and use a paged 8-bit optimizer:

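from peft import prepare_model_for_kbit_training
from transformers import TrainingArguments

# Prepare the quantized base for training and enable gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# Paged 8-bit AdamW as the optimizer; the remaining arguments are elided
TrainingArguments(optim="paged_adamw_8bit", gradient_checkpointing=True, ...)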
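Targeting all linear layers changes how many adapter weights get trained. The sketch below counts LoRA parameters for one transformer block under made-up layer shapes and a made-up rank. None of these numbers are read from the Qwen2.5-7B checkpoint or from the notebook.

```python
def lora_params(shapes, rank):
    """LoRA adds two matrices per layer, (d_in x rank) and (rank x d_out),
    so each adapted layer costs rank * (d_in + d_out) parameters."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())

# Hypothetical shapes for one block: (d_in, d_out) per linear layer.
BLOCK = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
RANK = 16  # hypothetical LoRA rank

attention_only = {k: BLOCK[k] for k in ("q_proj", "v_proj")}
print("q/v only  :", lora_params(attention_only, RANK))
print("all linear:", lora_params(BLOCK, RANK))

# In PEFT this choice is typically expressed as
# LoraConfig(target_modules="all-linear", ...) in recent versions.
```

Even with every linear layer adapted, the trainable count per block stays in the low millions, tiny next to the frozen 4-bit base.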

A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But QLoRA isn't about speed — it's about fit. The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.

⚠️ Hardware note: bitsandbytes 4-bit is CUDA-first. It does not run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).
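A quick check before loading anything can save a confusing failure later. This is a minimal sketch with a hypothetical helper name. It only reports which backend PyTorch sees and does not verify that bitsandbytes itself is installed or working.

```python
def detect_backend():
    """Report which accelerator backend PyTorch can see, if any."""
    try:
        import torch
    except ImportError:
        return "no-torch"
    if torch.cuda.is_available():
        # ROCm builds of PyTorch also answer True here; torch.version.hip
        # is set on those builds and None on CUDA builds.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

backend = detect_backend()
if backend != "cuda":
    print(f"backend is '{backend}': 4-bit loading may not work here")
else:
    print("NVIDIA CUDA device found")
```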

[Your QLoRA accuracy + macro-F1 here.] It roughly tied the smaller models from Parts 1 and 2.

And the card_arrival vs card_delivery_estimate confusion that haunted both smaller models? [Say what happened at 7B: did it finally fix it, or hit the same wall?] Either way, it sets up the question I tackle in Part 4: if the 270M model already worked, why did I build any of this?

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b

Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.
