I wanted to actually understand fine-tuning — not run a tutorial and nod along. So I gave myself a constraint: same task, three techniques, smallest model to largest. Full fine-tuning, then LoRA, then QLoRA. Hold the task fixed and the only variable is the method.
This first post is full fine-tuning — the most powerful and most expensive option: update every weight in the model.
The task is Banking77: ~13,000 real bank customer-support messages labeled with 77 intents such as `card_arrival`, `lost_or_stolen_card`, and `exchange_rate`. The model reads a message and names the intent.
I picked Gemma 3, 270M parameters: small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional. Full fine-tuning keeps gradients and optimizer state for every parameter on top of the weights themselves, which with an Adam-style optimizer comes to roughly 4× the model's size in memory before counting activations. I wanted to feel that, not read about it.
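To make the 4× figure concrete, here is a back-of-envelope sketch. It assumes fp32 for every tensor and an Adam-style optimizer that keeps two moment tensors per parameter, and it ignores activations entirely. The function name is made up for illustration; none of this is taken from the notebook.

```python
# Back-of-envelope memory for full fine-tuning.
# Assumptions: fp32 everywhere, an Adam-style optimizer with two moment
# tensors per parameter, activations and framework overhead ignored.

BYTES_PER_FP32 = 4


def full_finetune_memory_gb(num_params: int) -> dict:
    """Approximate memory in GB for each per-parameter tensor, plus a total."""
    per_copy_gb = num_params * BYTES_PER_FP32 / 1e9
    parts = {
        "weights": per_copy_gb,
        "gradients": per_copy_gb,
        "adam_first_moment": per_copy_gb,
        "adam_second_moment": per_copy_gb,
    }
    # The sum is evaluated before the "total" key is added.
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, gb in full_finetune_memory_gb(270_000_000).items():
        print(f"{name:>20}: {gb:.2f} GB")
```

Under those assumptions, a 270M-parameter model needs about 1.08 GB per copy and about 4.3 GB for the four copies together, which is why it fits on a laptop and a model ten times larger would not.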
The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model generate the intent as text: literally output `card_arrival`. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.
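Generating the label as text means two small pieces of glue: a prompt that ends where the answer should begin, and a parser that maps whatever the model generates back onto a known intent. The sketch below shows one way that could look. The template wording and both helper names are hypothetical, chosen for illustration rather than copied from the notebook.

```python
from typing import Optional, Sequence

# Three of the 77 Banking77 intents, enough for a demo.
INTENTS = ["card_arrival", "lost_or_stolen_card", "exchange_rate"]


def build_prompt(message: str) -> str:
    """Hypothetical template: the prompt stops right before the answer."""
    return f"Message: {message}\nIntent:"


def parse_intent(generated: str, intents: Sequence[str]) -> Optional[str]:
    """Return the generated label if it is a known intent, else None."""
    lines = generated.strip().splitlines()
    candidate = lines[0].strip() if lines else ""
    return candidate if candidate in intents else None


if __name__ == "__main__":
    print(build_prompt("Where is my new card?"))
    print(parse_intent(" card_arrival\n", INTENTS))
    print(parse_intent("something_else", INTENTS))
```

Returning `None` for anything outside the label set keeps malformed generations visible as errors instead of silently counting them as some nearby intent.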
The key detail is masking the loss so the model is graded only on the label tokens, not the prompt:
# Tokenize prompt and answer separately so we know where the answer starts.
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(" " + label_name + tokenizer.eos_token,
                       add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + target_ids
# -100 is the index the cross-entropy loss ignores.
labels = [-100] * len(prompt_ids) + target_ids  # only the label is graded
If you skip that masking, the model spends its capacity learning to reproduce the prompt instead of the answer.
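To see what the mask does to the loss itself, here is a toy version in plain Python. It is a sketch under simplifying assumptions: the token ids and probabilities are made up, `token_probs[i]` stands in for the probability the model assigned to the correct token at position `i`, and the one-position shift that causal language models apply between inputs and labels is left out. The only detail borrowed from the real stack is the value `-100`, which PyTorch's cross-entropy skips by default.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, target_ids):
    """Concatenate prompt and target; mask the prompt positions in labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


def masked_mean_nll(token_probs, labels):
    """Average -log(p) over positions whose label is not IGNORE_INDEX."""
    graded = [-math.log(p) for p, y in zip(token_probs, labels)
              if y != IGNORE_INDEX]
    return sum(graded) / len(graded)


if __name__ == "__main__":
    # Made-up ids: a 3-token prompt and a 2-token answer.
    input_ids, labels = build_example([11, 12, 13], [21, 22])
    # Made-up probabilities: poor on the prompt, decent on the answer.
    token_probs = [0.01, 0.01, 0.01, 0.5, 0.5]
    print("masked loss:  ", masked_mean_nll(token_probs, labels))
    # Passing input_ids as labels grades every position (no -100 anywhere).
    print("unmasked loss:", masked_mean_nll(token_probs, input_ids))
```

In the unmasked case the three prompt positions dominate the average, so most of the gradient would go toward predicting the prompt rather than the intent.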
Because you're updating all the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it. The training config is otherwise unremarkable — and that's the point:
TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,          # small, on purpose
    lr_scheduler_type="cosine",
    bf16=False, fp16=False,      # fp32 on MPS for stability
)
(The later projects freeze the base, which is exactly why they can tolerate a much higher learning rate: the pretrained weights can't be overwritten, and only the small, newly added adapter weights are being trained.)
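For a feel of what `lr_scheduler_type="cosine"` does to that 5e-5, here is a standalone sketch of a half-cosine decay with optional linear warmup. It follows the usual shape of this schedule but is a reimplementation for illustration, not the library's code, and the step counts below are made up rather than taken from the actual run.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_steps: int = 0) -> float:
    """Linear warmup to peak_lr, then half-cosine decay down to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 250, 500, 750, 1000):
        print(f"step {step:>4}: lr = {cosine_lr(step, total, 5e-5):.2e}")
```

With no warmup, the rate starts at the peak, passes through half the peak at the midpoint, and reaches zero at the final step, so the largest updates to the pretrained weights happen early and shrink from there.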
The result: ~96% accuracy on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task.
The one persistent slip: it confused `card_arrival` with `card_delivery_estimate`.
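Both numbers, the accuracy and the most-confused pair, fall out of a few lines once predictions are plain strings. This is a minimal sketch with made-up predictions, not the notebook's evaluation code; the helper names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples where the predicted intent matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) pairs where the prediction was wrong."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


if __name__ == "__main__":
    # Made-up labels and predictions, for illustration only.
    y_true = ["card_arrival", "card_arrival", "exchange_rate",
              "card_delivery_estimate"]
    y_pred = ["card_delivery_estimate", "card_arrival", "exchange_rate",
              "card_delivery_estimate"]
    print("accuracy:", accuracy(y_true, y_pred))
    print("top confusions:", top_confusions(y_true, y_pred))
```

Counting the off-diagonal pairs directly is often easier to read than a 77×77 confusion matrix when only a handful of intents are actually being mixed up.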
In Part 2, I take a model 5× bigger and train less than 1% of it — and get the same accuracy. That's LoRA.
📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m
Built with PyTorch + Hugging Face Transformers. Questions or corrections welcome in the comments.