QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

Source: dev.to · Published: 2026-06-21T12:20Z

A developer fine-tuned Qwen2.5-7B on a 16GB T4 GPU using QLoRA, quantizing the frozen base model to 4-bit NF4 to reduce memory footprint from 15GB to 5.44GB. The technique enables training large language models on consumer hardware by combining 4-bit quantization with LoRA adapters, though throughput is limited to about three examples per second. The approach roughly matched the accuracy of smaller models from prior experiments.

In Part 2, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve.

A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would barely load, with no room left to actually train.

QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision?

So you quantize the frozen base to 4-bit (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.
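
To make "quantize the frozen base, dequantize for the math" concrete, here is a toy sketch of blockwise 4-bit quantization. It is an assumption-laden illustration, not what bitsandbytes does internally: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels non-uniformly to match the distribution of weights. The function names and the sample weights are hypothetical.

```python
def quantize_block(weights, levels=16):
    """Map one block of floats to integer codes 0..levels-1 plus one scale."""
    scale = max(abs(w) for w in weights) or 1.0   # absmax; fall back to 1.0 for an all-zero block
    step = 2.0 / (levels - 1)                     # spacing of the codes across [-1, 1]
    codes = [round((w / scale + 1.0) / step) for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Recover approximate floats from the codes; these feed the matmul."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * scale for c in codes]


block = [0.02, -0.15, 0.40, -0.33]   # hypothetical weights, not taken from the model
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, scale, max_error)
```

Each code fits in 4 bits because there are only 16 of them, and the one scale per block is the "quant constant" that double quantization later compresses as well. The rounding error is bounded by half a step times the scale, which is the precision given up on the frozen base while the adapters stay in normal precision.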

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-7B"  # assumption: the Hub ID is not shown in the original

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto")

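Each flag earns its place:

load_in_4bit: store the frozen base weights in 4-bit instead of 16-bit.

nf4: use NormalFloat4, the 4-bit format tuned for how neural-net weights are distributed.

double_quant: quantize the quantization constants too.

compute_dtype: dequantize to fp16 for the matmuls, so the math itself runs in normal precision.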

One line of output:

loaded in 4-bit. footprint: 5.44 GB

I downloaded 15.2GB of weights and they sat in memory as 5.44GB. A model that couldn't be loaded for full fine-tuning was now training on a single consumer GPU — with room to spare. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)
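
A back-of-envelope check suggests why the footprint lands near 5.4GB rather than a quarter of 15.2GB. The sketch below rests on two assumptions that are not from this article: the parameter counts are taken from the public Qwen2.5-7B model card, and only the non-embedding weights are assumed to drop to 4-bit while the embeddings and output head stay in 16-bit.

```python
BYTES_PER_GB = 1e9


def weights_gb(n_params, bits_per_param):
    """Memory for n_params weights stored at bits_per_param bits each."""
    return n_params * bits_per_param / 8 / BYTES_PER_GB


# Assumed parameter counts (public model card figures, not from this article).
TOTAL_PARAMS = 7.61e9
NON_EMBEDDING_PARAMS = 6.53e9
EMBEDDING_PARAMS = TOTAL_PARAMS - NON_EMBEDDING_PARAMS

# Everything in 16-bit: what the download and a plain fp16 load cost.
fp16_gb = weights_gb(TOTAL_PARAMS, 16)

# Assumption: linear layers go to 4-bit, embeddings and output head stay 16-bit.
mixed_gb = weights_gb(NON_EMBEDDING_PARAMS, 4) + weights_gb(EMBEDDING_PARAMS, 16)

print(f"16-bit: {fp16_gb:.1f} GB, mixed 4-bit: {mixed_gb:.1f} GB")
```

Under those assumptions the estimate comes out at about 15.2GB in 16-bit and roughly 5.4GB after quantization, in the same neighborhood as the measured 5.44GB. The quantization constants add a little on top, which this sketch ignores.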

Three additions beyond Part 2's LoRA setup: prepare the quantized model for training, target all linear layers (the QLoRA paper found this matters), and use a paged 8-bit optimizer:

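from peft import prepare_model_for_kbit_training
from transformers import TrainingArguments
# Freeze the base and get the 4-bit model ready for adapter training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# Paged 8-bit AdamW; the remaining arguments are elided in the original
args = TrainingArguments(optim="paged_adamw_8bit", gradient_checkpointing=True)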
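
Targeting all linear layers changes how many adapter weights get trained, and a quick count shows by how much. This is a sketch under assumptions: the layer shapes are taken from the public Qwen2.5-7B config rather than from this article, the rank of 16 is hypothetical, and the attention-only comparison is an illustrative baseline, not necessarily what Part 2 used.

```python
# Assumed Qwen2.5-7B shapes (public model config, not from this article).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3584, 18944, 512, 28

# (in_features, out_features) for each linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, targets):
    """LoRA adds A (rank x in) and B (out x rank) to every targeted layer."""
    per_block = sum(rank * sum(LINEAR_SHAPES[name]) for name in targets)
    return per_block * NUM_LAYERS


RANK = 16  # hypothetical rank; the article does not state one
attention_only = lora_params(RANK, ["q_proj", "v_proj"])
all_linear = lora_params(RANK, list(LINEAR_SHAPES))
print(f"q/v only: {attention_only:,}  all linear: {all_linear:,}")
```

With these assumed shapes, all-linear targeting trains about 40 million adapter parameters versus about 5 million for query and value projections alone, still well under 1% of the base. In PEFT, `LoraConfig(target_modules="all-linear")` is one way to select these layers, assuming the installed version supports that shortcut.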

A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But QLoRA isn't about speed — it's about fit. The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.
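
The two throughput figures can be cross-checked with simple arithmetic. The helper names below are hypothetical, and the result is only what the rounded numbers in the text imply, not a measured dataset size.

```python
def examples_processed(hours, examples_per_second):
    """How many examples fit in a run of the given length."""
    return round(hours * 3600 * examples_per_second)


def epoch_minutes(num_examples, examples_per_second):
    """Wall-clock minutes for one pass over num_examples."""
    return num_examples / examples_per_second / 60


implied_examples = examples_processed(1, 3)   # ~1 hour at ~3 examples/second
print(implied_examples, epoch_minutes(implied_examples, 3))
```

At roughly 3 examples per second, an hour-long epoch corresponds to a training set on the order of ten thousand examples.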

⚠️ Hardware note: bitsandbytes 4-bit is CUDA-first. It does not run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).

[Your QLoRA accuracy + macro-F1 here.] It roughly tied the smaller models from Parts 1 and 2.

And the card_arrival vs card_delivery_estimate confusion that haunted both smaller models? [Say what happened at 7B: did it finally fix it, or hit the same wall?] Either way, it sets up the question I tackle in Part 4: if the 270M model already worked, why did I build any of this?

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b

Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.
