QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

A developer fine-tuned Qwen2.5-7B on a 16GB T4 GPU using QLoRA, quantizing the frozen base model to 4-bit NF4 to reduce memory footprint from 15GB to 5.44GB. The technique enables training large language models on consumer hardware by combining 4-bit quantization with LoRA adapters, though throughput is limited to about three examples per second. The approach roughly matched the accuracy of smaller models from prior experiments.

In Part 2 https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if , LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit ~3GB . Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve. A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would barely load, with no room left to actually train. QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision? So you quantize the frozen base to 4-bit NF4, a format tuned for how neural-net weights are distributed and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise. python from transformers import BitsAndBytesConfig bnb config = BitsAndBytesConfig load in 4bit=True, bnb 4bit quant type="nf4", NormalFloat4 bnb 4bit use double quant=True, quantize the quant constants too bnb 4bit compute dtype=torch.float16, dequantize to fp16 for the matmuls model = AutoModelForCausalLM.from pretrained MODEL ID, quantization config=bnb config, device map="auto" Each flag earns its place: load in 4bit nf4 double quant compute dtype One line of output: loaded in 4-bit. footprint: 5.44 GB I downloaded 15.2GB of weights and they sat in memory as 5.44GB. A model that couldn't be loaded for full fine-tuning was now training on a single consumer GPU — with room to spare. The download is still 15GB; bitsandbytes quantizes on the fly during load. Two more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, and target all linear layers the QLoRA paper found this matters , with a paged 8-bit optimizer: python from peft import prepare model for kbit training model = prepare model for kbit training model, use gradient checkpointing=True ... attach LoRA to every linear layer ... TrainingArguments optim="paged adamw 8bit", gradient checkpointing=True, ... A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But QLoRA isn't about speed — it's about fit. The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point. ⚠️ Hardware note: bitsandbytes 4-bit is CUDA-first. It doesnotrun on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU Kaggle/Colab T4 works . Your QLoRA accuracy + macro-F1 here. It roughly tied the smaller models from Parts 1 and 2. And the card arrival vs card delivery estimate confusion that haunted both smaller models? Say what happened at 7B — did it finally fix it, or hit the same wall? Either way, it sets up the question I tackle in Part 4 https://dev.to/sumanpro/if-a-270m-model-already-worked-why-did-i-fine-tune-a-7b-one-2ae3 : if the 270M model already worked, why did I build any of this? 📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.