# QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

> Source: <https://dev.to/sumanpro/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-54gb-in-front-of-me-28n4>
> Published: 2026-06-21 12:20:53+00:00

In [Part 2](https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if), LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to **Qwen2.5-7B** — and hit a wall that LoRA alone doesn't solve.

A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would *barely* load, with no room left to actually train.

QLoRA asks the question that naturally follows from LoRA: **the base is frozen and only ever read — so why store it in full precision?**

So you **quantize the frozen base to 4-bit** (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.

``` python
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto")
```

Each flag earns its place:

`load_in_4bit`

`nf4`

`double_quant`

`compute_dtype`

One line of output:

```
loaded in 4-bit. footprint: 5.44 GB
```

I downloaded 15.2GB of weights and they sat in memory as **5.44GB.** A model that couldn't be *loaded* for full fine-tuning was now training on a single consumer GPU — with room to spare. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)

Two more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, and target *all* linear layers (the QLoRA paper found this matters), with a paged 8-bit optimizer:

``` python
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# ... attach LoRA to every linear layer ...
TrainingArguments(optim="paged_adamw_8bit", gradient_checkpointing=True, ...)
```

A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But **QLoRA isn't about speed — it's about fit.** The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.

⚠️

Hardware note:`bitsandbytes`

4-bit is CUDA-first. It doesnotrun on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).

[Your QLoRA accuracy + macro-F1 here.] It roughly tied the smaller models from Parts 1 and 2.

And the `card_arrival`

vs `card_delivery_estimate`

confusion that haunted both smaller models? [Say what happened at 7B — did it finally fix it, or hit the same wall?] Either way, it sets up the question I tackle in [Part 4](https://dev.to/sumanpro/if-a-270m-model-already-worked-why-did-i-fine-tune-a-7b-one-2ae3): **if the 270M model already worked, why did I build any of this?**

📓 **Full runnable notebook on Kaggle:** [https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b](https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b)

*Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.*
