I Fine-Tuned a 270M Model on My Laptop (Full Fine-Tuning, From Scratch)

A developer fully fine-tuned a 270M-parameter Gemma 3 model on a laptop using the Banking77 dataset, achieving ~96% accuracy on the common intents. The project used full fine-tuning with loss masking so that only the label tokens count toward the loss, and the training config used a low learning rate of 5e-5 to preserve pretrained knowledge.

I wanted to actually understand fine-tuning, not run a tutorial and nod along. So I gave myself a constraint: same task, three techniques, smallest model to largest. Full fine-tuning, then LoRA, then QLoRA. Hold the task fixed and the only variable is the method. This first post is full fine-tuning, the most powerful and most expensive option: update every weight in the model.

The task is Banking77 (https://huggingface.co/datasets/mteb/banking77): ~13,000 real bank customer-support messages, 77 intents like `card arrival`, `lost or stolen card`, `exchange rate`. The model reads a message and names the intent.

I picked Gemma 3, 270M parameters: small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional: full fine-tuning stores gradients and optimizer state for every parameter, roughly 4× the model's size in memory (weights, gradients, and, with an Adam-style optimizer, two moment buffers). I wanted to feel that, not read about it.

The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model generate the intent as text: literally output `card arrival`. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.

The key detail is masking the loss so the model is graded only on the label tokens, not the prompt:

```python
# build "prompt + label", but set prompt tokens to -100 so the loss ignores them
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(" " + label_name + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + target_ids
labels = [-100] * len(prompt_ids) + target_ids  # only the label is graded
```

If you skip that masking, the model spends its capacity learning to reproduce the prompt instead of the answer.

Because you're updating all the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it.
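To make the masking step concrete, here is a minimal sketch in plain Python, with no tokenizer or model involved. The token ids and probabilities are made up, and `build_example` and `masked_nll` are hypothetical helper names, not part of the notebook or of any library. It also leaves out the one-position shift that causal language models apply between inputs and labels; it only shows which positions end up counted in the loss.

```python
import math

IGNORE_INDEX = -100  # the label value PyTorch's cross-entropy skips by default


def build_example(prompt_ids, target_ids):
    """Concatenate prompt and label tokens, hiding the prompt from the loss."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not masked.

    token_log_probs[i] is the log-probability assigned to the token at
    position i (invented numbers below, not real model output).
    """
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Made-up token ids: 4 prompt tokens, then 3 label tokens (the last one is EOS).
input_ids, labels = build_example([11, 12, 13, 14], [21, 22, 2])

# Made-up log-probs: poor at predicting the prompt, decent at the label.
log_probs = [math.log(0.01)] * 4 + [math.log(0.5)] * 3

masked = masked_nll(log_probs, labels)        # only the 3 label positions count
unmasked = masked_nll(log_probs, input_ids)   # no -100 anywhere, so all 7 count

print(f"masked loss:   {masked:.3f}")
print(f"unmasked loss: {unmasked:.3f}")
```

With these invented numbers, the masked loss depends only on the three label tokens, while the unmasked loss is dominated by the four prompt tokens, which is the wasted capacity the masking is meant to avoid.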
The training config is otherwise unremarkable, and that's the point:

```python
from transformers import TrainingArguments

TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,          # small, on purpose
    lr_scheduler_type="cosine",
    bf16=False, fp16=False,      # fp32 on MPS for stability
)
```

The later projects freeze the base, which is exactly why they can tolerate a much higher learning rate: the pretrained weights never change, so there's no fragile knowledge to wreck.

The result: ~96% on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task. The one persistent slip: it confused `card arrival` with `card delivery estimate`.

In Part 2 (https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if), I take a model 5× bigger and train less than 1% of it, and get the same accuracy. That's LoRA.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m

Built with PyTorch + Hugging Face Transformers. Questions or corrections welcome in the comments.