{"slug": "qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-5-4gb-in-front-of-me", "title": "QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)", "summary": "A developer fine-tuned Qwen2.5-7B on a 16GB T4 GPU using QLoRA, quantizing the frozen base model to 4-bit NF4 to reduce memory footprint from 15GB to 5.44GB. The technique enables training large language models on consumer hardware by combining 4-bit quantization with LoRA adapters, though throughput is limited to about three examples per second. The approach roughly matched the accuracy of smaller models from prior experiments.", "body_md": "In [Part 2](https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if), LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to **Qwen2.5-7B** — and hit a wall that LoRA alone doesn't solve.\n\nA 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would *barely* load, with no room left to actually train.\n\nQLoRA asks the question that naturally follows from LoRA: **the base is frozen and only ever read — so why store it in full precision?**\n\nSo you **quantize the frozen base to 4-bit** (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. 
The base shrinks dramatically; the trainable part stays small and precise.\n\n``` python\nimport torch\nfrom transformers import AutoModelForCausalLM, BitsAndBytesConfig\n\nbnb_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type=\"nf4\",             # NormalFloat4\n    bnb_4bit_use_double_quant=True,        # quantize the quant constants too\n    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls\n)\n# MODEL_ID is assumed to be set earlier to the Qwen2.5-7B checkpoint ID\nmodel = AutoModelForCausalLM.from_pretrained(\n    MODEL_ID, quantization_config=bnb_config, device_map=\"auto\")\n```\n\nEach flag earns its place:\n\n- `load_in_4bit`: store the frozen base weights in 4-bit.\n- `nf4`: NormalFloat4, the 4-bit format tuned for how neural-net weights are distributed.\n- `double_quant`: quantize the quantization constants too.\n- `compute_dtype`: dequantize to fp16 for the matmuls.\n\nOne line of output:\n\n```\nloaded in 4-bit. footprint: 5.44 GB\n```\n\nI downloaded 15.2GB of weights and they sat in memory as **5.44GB.** A model that couldn't be *loaded* for full fine-tuning was now training on a single consumer GPU, with room to spare. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)\n\nTwo more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, and target *all* linear layers (the QLoRA paper found this matters), with a paged 8-bit optimizer:\n\n``` python\nfrom peft import prepare_model_for_kbit_training\nfrom transformers import TrainingArguments\n\nmodel = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)\n# ... attach LoRA to every linear layer ...\nTrainingArguments(optim=\"paged_adamw_8bit\", gradient_checkpointing=True)  # other arguments omitted\n```\n\nA 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But **QLoRA isn't about speed; it's about fit.** The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.\n\n⚠️ **Hardware note:** `bitsandbytes` 4-bit is CUDA-first. It does *not* run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).\n\n[Your QLoRA accuracy + macro-F1 here.] 
It roughly tied the smaller models from Parts 1 and 2.\n\nAnd the `card_arrival` vs `card_delivery_estimate` confusion that haunted both smaller models? [Say what happened at 7B: did it finally fix it, or hit the same wall?] Either way, it sets up the question I tackle in [Part 4](https://dev.to/sumanpro/if-a-270m-model-already-worked-why-did-i-fine-tune-a-7b-one-2ae3): **if the 270M model already worked, why did I build any of this?**\n\n📓 **Full runnable notebook on Kaggle:** [https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b](https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b)\n\n*Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.*", "url": "https://wpnews.pro/news/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-5-4gb-in-front-of-me", "canonical_source": "https://dev.to/sumanpro/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-54gb-in-front-of-me-28n4", "published_at": "2026-06-21 12:20:53+00:00", "updated_at": "2026-06-21 12:36:38.705980+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "developer-tools"], "entities": ["Qwen2.5-7B", "QLoRA", "LoRA", "T4 GPU", "bitsandbytes", "Transformers", "PEFT", "Kaggle"], "alternates": {"html": "https://wpnews.pro/news/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-5-4gb-in-front-of-me", "markdown": "https://wpnews.pro/news/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-5-4gb-in-front-of-me.md", "text": "https://wpnews.pro/news/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-5-4gb-in-front-of-me.txt", "jsonld": "https://wpnews.pro/news/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-5-4gb-in-front-of-me.jsonld"}}