{"slug": "the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training", "title": "The bf16 grad accumulator that killed our SDXL LoRA training", "summary": "Photoroom's SDXL LoRA fine-tuning for a product photography model silently corrupted its adapter weights over six days due to a bf16 gradient accumulation issue. The custom training loop, forked from an internal repo two years ago, accumulated gradients in bf16 instead of fp32, causing most gradients to quantize to zero during accumulation. The bug went unnoticed because evaluation scores remained within the normal range, and was only discovered when a developer re-enabled per-step gradient norm logging and found norms collapsing to ~1e-5.", "body_md": "**TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause was bf16 gradient accumulation interacting badly with a custom adapter init we'd ported from a paper. Eval scores stayed in the same range the whole time, which is why nobody noticed.**\n\nWe train SDXL LoRAs for product photography categories at Photoroom. Bottles, packaged food, soft goods. Each LoRA is 192MB. Training stack: PyTorch 2.3, bf16 mixed precision, gradient accumulation across 8 steps, A100 80GB GPUs.\n\nThe LoRA init follows a small modification of the OFT paper for better stability on small datasets. Specifically, we orthogonalize the down-projection before training begins, then let the up-projection drift freely. This had been working for nine months.\n\nSix days into a 7-day run, our automated CLIPScore check started showing variance that was technically inside our acceptance band but trending the wrong way. Our eval pipeline grades generations using a fan-out across three VLM providers (Claude vision, GPT-4o, Gemini 1.5) routed through an LLM gateway. 
We use [Bifrost](https://github.com/maximhq/bifrost) for that fan-out, which gives us provider-level failover when one of them rate-limits us mid-grade. Useful, and uneventful. The grade scores looked fine.\n\nThe real signal was a per-step gradient norm log we'd turned off a quarter earlier when it was spamming the dashboard. When I turned it back on for a sanity check, the grad norms had been collapsing to ~1e-5 in the down-projection layer since step 12,000.\n\nI added a hook to dump the raw bf16 gradient tensors before they hit the accumulator:\n\n```python\nimport torch\n\ndef grad_dump_hook(name):\n    # Build a tensor hook that reports suspiciously small or non-finite grads.\n    def hook(grad):\n        finite = torch.isfinite(grad).all().item()\n        absmax = grad.abs().max().item() if finite else float(\"nan\")\n        # Only print when the gradient is non-finite or its largest entry is tiny.\n        if not finite or absmax < 1e-4:\n            print(f\"[{name}] finite={finite} absmax={absmax:.2e}\")\n        return grad  # pass the gradient through unchanged\n    return hook\n\n# `lora` is the LoRA-wrapped module from our training loop.\nfor n, p in lora.named_parameters():\n    if p.requires_grad and \"lora_A\" in n:\n        p.register_hook(grad_dump_hook(n))\n```\n\nOutput across 200 steps:\n\n```\n[blocks.7.attn1.lora_A] finite=True absmax=4.32e-06\n[blocks.8.attn1.lora_A] finite=True absmax=2.11e-06\n[blocks.9.attn1.lora_A] finite=True absmax=8.54e-07\n```\n\nbf16 shares fp32's exponent range, so gradients of this size do not underflow on their own (the ~6e-8 floor people often quote is fp16's smallest subnormal). What bf16 lacks is precision: with an 8-bit significand, adding a micro-batch gradient smaller than about 2^-9 (roughly 1/500) of the running sum rounds straight back to the running sum. We were producing real gradients, but most of their contribution was being silently rounded away during the accumulation step, because the accumulator in our custom training loop was bf16, not fp32.\n\nThis is documented behavior. In the standard autocast setup that PyTorch and Accelerate use for mixed precision, parameters and their `.grad` buffers stay in fp32, so accumulation happens in fp32 by default. Our custom loop, forked from an internal repo two years ago, did not.\n\n```python\n# grad_accumulator_dtype is a key read by our custom loop, not a PyTorch option\n# before\noptim.param_groups[0][\"grad_accumulator_dtype\"] = torch.bfloat16\n\n# after\noptim.param_groups[0][\"grad_accumulator_dtype\"] = torch.float32\n```\n\nSingle line. Six days. We re-ran training with fp32 accumulation. 
Grad norms stabilized in the expected 1e-3 to 1e-2 range. Eval scores moved up by ~6% in our internal background-consistency metric.\n\nA few obvious questions:\n\n**Why didn't loss curves show it?** They did, mildly. The loss was still going down, only slower. Within noise of a normal run.\n\n**Why didn't the VLM eval catch it?** Because the generations were still \"good.\" Product on a clean background, lighting roughly correct. The drift was in finer details (brand text legibility, soft-good fabric texture) that our three-VLM grading averages out. We're now adding a per-category CLIPScore-vs-reference check that runs without averaging.\n\n**Why did we trust the init?** We had nine months of green runs. The OFT-style init only became a problem when we tightened the LR schedule three weeks ago, which made the gradient magnitudes smaller across the board.\n\n| Approach | Memory cost | Bookkeeping |\n|---|---|---|\n| bf16 accumulator | baseline | low |\n| fp32 accumulator (all params) | +4% peak | low |\n| fp32 only for LoRA params | +0.6% peak | painful |\n\nThe fp32 accumulator costs us ~4% more GPU memory per step. Not free. On the A100 80GBs it's invisible, but if you're tight on the H100 80GBs or sharing with another job, you'll feel it.\n\nYou can accumulate in fp32 only for the LoRA params and keep the base model gradients in bf16, but the bookkeeping is annoying. We took the 4% hit.\n\nThe deeper lesson: a custom training loop that worked for nine months is not a training loop you understand. It's a training loop that hasn't been stressed in the right place yet. 
I should have re-read it when we changed the LR schedule.", "url": "https://wpnews.pro/news/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training", "canonical_source": "https://dev.to/elise_moreau/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training-nc8", "published_at": "2026-05-27 05:37:20+00:00", "updated_at": "2026-05-27 05:52:39.962999+00:00", "lang": "en", "topics": ["machine-learning", "generative-ai", "ai-research", "mlops"], "entities": ["Photoroom", "SDXL", "LoRA", "OFT", "PyTorch", "A100", "Bifrost", "Claude"], "alternates": {"html": "https://wpnews.pro/news/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training", "markdown": "https://wpnews.pro/news/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training.md", "text": "https://wpnews.pro/news/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training.txt", "jsonld": "https://wpnews.pro/news/the-bf16-grad-accumulator-that-killed-our-sdxl-lora-training.jsonld"}}