11:33
2026-06-16
dev.to
large-language-models
Fine-Tuning Llama 3.2 3B on Medical QA: Week 4 - When Lower Loss Meant a Worse Model
A developer fine-tuning Llama 3.2 3B on medical QA found that lower evaluation loss (2.275 vs 2.495) corresponded to worse model performance, with generation breakdowns and incorrect medical advice. T…