TL;DR: Classifier-free guidance above a scale of ~7.5 pushed our SDXL product renders into oversaturation and clipped highlights. Adding CFG rescale at 0.7 plus dynamic thresholding fixed it with no retraining.
Around 18% of our automated product renders at Photoroom came back with blown-out highlights and oversaturated color once we raised the classifier-free guidance scale from 5.0 to 9.0 on our fine-tuned SDXL pipeline. The higher scale gave us sharper adherence to the prompt, which the catalog team wanted, but white backgrounds shifted toward grey-blue and metallic surfaces lost their specular detail. The problem was neither the prompt nor the fine-tune. It was the guidance arithmetic itself interacting with the noise schedule, and the failure mode is well documented if you know where to look.
Classifier-free guidance combines two model predictions at each denoising step: one conditioned on the prompt and one unconditioned. The sampler extrapolates along the vector between them, scaled by a guidance weight. A weight of 1.0 returns the conditional prediction unchanged, which amounts to no guidance, and weights of 5 to 9 are typical for SDXL. Higher weights increase prompt adherence at the cost of pushing latents outside the distribution the model was trained on.
The method comes from Ho and Salimans in Classifier-Free Diffusion Guidance. The formula at each step is straightforward: take the unconditional prediction and add the guidance scale times the difference between the conditional and unconditional predictions. The catch is that this extrapolation is unbounded. As you raise the scale, the standard deviation of the guided prediction grows past the statistics the model learned, and that excess energy shows up in the decoded image as clipping.
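Written out, the step looks like this. The symbols are notation introduced here for illustration: $\epsilon_u$ and $\epsilon_c$ are the unconditional and conditional predictions, $w$ is the guidance scale, and the variance is taken over the elements of one prediction.

```latex
\begin{align}
\hat{\epsilon}_w &= \epsilon_u + w\,(\epsilon_c - \epsilon_u)
                  = \epsilon_c + (w - 1)\,(\epsilon_c - \epsilon_u) \\
\operatorname{Var}(\hat{\epsilon}_w) &= \operatorname{Var}(\epsilon_c)
  + (w - 1)^2\,\operatorname{Var}(\epsilon_c - \epsilon_u)
  + 2\,(w - 1)\,\operatorname{Cov}(\epsilon_c,\ \epsilon_c - \epsilon_u)
\end{align}
```

The $(w - 1)^2$ term grows quadratically with the scale, so at large $w$ it dominates unless the conditional and unconditional predictions are nearly identical. Nothing in the formula caps it.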
The decoded pixel range is fixed: the VAE decoder outputs values in roughly [-1, 1], which are then mapped to RGB. When guidance inflates the variance of the predicted noise, the resulting latents carry larger magnitudes than the VAE was trained to reconstruct cleanly. Bright regions saturate to pure white, and color channels drift because the per-channel means shift together. We measured this directly: at guidance 9.0 the per-image latent standard deviation was about 1.4x the standard deviation of the conditional prediction alone.
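A minimal sketch of that kind of measurement, using only the standard library. Flat Python lists stand in for one image's prediction tensors, the Gaussian data is synthetic, and `cfg` and `std_ratio` are hypothetical helper names, so the printed ratios illustrate the trend rather than reproduce our numbers.

```python
import random
import statistics


def cfg(uncond, cond, scale):
    """Classifier-free guidance applied elementwise to two flat predictions."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def std_ratio(guided, cond):
    """Std of the guided prediction divided by std of the conditional one."""
    return statistics.pstdev(guided) / statistics.pstdev(cond)


rng = random.Random(0)
# Toy assumption: the conditional prediction is the unconditional one
# plus a small independent offset. Real predictions are correlated in
# more complicated ways.
uncond = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
cond = [u + rng.gauss(0.0, 0.1) for u in uncond]

for scale in (1.0, 5.0, 9.0):
    ratio = std_ratio(cfg(uncond, cond, scale), cond)
    print(f"scale={scale:.1f} std ratio={ratio:.3f}")
```

A ratio close to 1.0 means guidance is not inflating the prediction; a ratio that climbs with the scale is the signature described above.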
This is the same failure mode the Imagen team described in Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, where high guidance weights produced saturated, unnatural images. Their answer was dynamic thresholding. A second, complementary fix came later from Lin and colleagues in Common Diffusion Noise Schedules and Sample Steps are Flawed, which introduced guidance rescale to bring the guided prediction's variance back in line.
CFG rescale corrects the standard deviation of the guided prediction toward the conditional prediction, then blends between the corrected and raw versions by a factor. We set that factor to 0.7 after a sweep. Here is the core of what we run inside the sampler loop:
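```python
def apply_cfg_rescale(noise_cond, noise_uncond, guidance_scale, guidance_rescale=0.7):
    # Standard CFG: extrapolate from the unconditional toward the conditional prediction.
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # Per-sample standard deviation over (C, H, W); inputs are (B, C, H, W) tensors.
    std_cond = noise_cond.std(dim=[1, 2, 3], keepdim=True)
    std_cfg = noise_cfg.std(dim=[1, 2, 3], keepdim=True)
    # Rescale the guided prediction so its std matches the conditional prediction.
    noise_rescaled = noise_cfg * (std_cond / std_cfg)
    # Blend rescaled and raw guidance: 0.0 is plain CFG, 1.0 is fully rescaled.
    return guidance_rescale * noise_rescaled + (1.0 - guidance_rescale) * noise_cfg
```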
Dynamic thresholding works at a different layer. At each step it predicts the clean sample, computes a high percentile of the absolute pixel values (we use the 99.5th), and clamps to that value before renormalizing. The two corrections address different symptoms. Rescale fixes the variance inflation; thresholding clamps the residual outliers that survive. Running both at guidance 9.0 brought our oversaturation rate from 18% to under 2% on a held-out set of 4,000 SKUs.
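The thresholding step can be sketched as follows. This is an illustration on a flat list of floats rather than a tensor, so it runs with the standard library alone. `percentile` and `dynamic_threshold` are hypothetical helper names, and the floor of 1.0 on the threshold is an assumption taken from the Imagen formulation, where the clamp only kicks in once the percentile exceeds the nominal range.

```python
def percentile(values, q):
    """Linearly interpolated percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)  # pos is non-negative, so int() floors it
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def dynamic_threshold(x0_pred, q=99.5):
    """Clamp a predicted clean sample to its q-th percentile and renormalize."""
    s = percentile([abs(v) for v in x0_pred], q)
    # Assumption: never shrink the range below [-1, 1], so samples that
    # already sit inside the nominal range pass through unchanged.
    s = max(s, 1.0)
    return [max(-s, min(s, v)) / s for v in x0_pred]


if __name__ == "__main__":
    sample = [0.1, -0.2, 0.5, 3.0]
    # With q=100 the threshold is the largest magnitude, so nothing is
    # clamped and every value is divided by it.
    print(dynamic_threshold(sample, q=100))
    # With a lower percentile the outlier is clamped to the floor of 1.0.
    print(dynamic_threshold(sample, q=50))
```

In a real sampler the same three steps (percentile, clamp, divide) would be tensor operations applied per image at each denoising step.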
We swept the rescale factor across 0.0, 0.3, 0.5, 0.7, and 1.0 and scored each batch on two axes. The first was a saturation metric: the fraction of pixels with channel values above 0.97 after decoding. The second was CLIP image-text similarity, so we did not trade away the prompt adherence we raised guidance to get. A factor of 1.0 fully matched the conditional variance but flattened contrast on glossy products. A factor of 0.0 left the original problem. The factor of 0.7 held CLIP similarity within 0.4% of the unrescaled run while cutting the saturated-pixel fraction by more than half.
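A minimal sketch of the saturation metric, assuming decoded pixels arrive as `(r, g, b)` tuples scaled to [0, 1] and that a pixel counts as saturated when any one of its channels exceeds the threshold. Both the input layout and the any-channel rule are assumptions made for this example, and `saturated_fraction` is a hypothetical name.

```python
def saturated_fraction(pixels, threshold=0.97):
    """Fraction of pixels with at least one channel above the threshold."""
    if not pixels:
        return 0.0
    saturated = sum(1 for px in pixels if max(px) > threshold)
    return saturated / len(pixels)


if __name__ == "__main__":
    image = [
        (1.0, 1.0, 1.0),     # blown-out white
        (0.5, 0.5, 0.5),     # mid grey
        (0.98, 0.2, 0.1),    # one clipped channel
        (0.97, 0.97, 0.97),  # exactly at the threshold, not counted
    ]
    print(saturated_fraction(image))
```

Computing the same number for the unrescaled run and for each rescale factor gives the comparison used in the sweep.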
CFG rescale adds two standard deviation reductions and an elementwise blend per step. On our pipeline that is well under 1% of step latency, so cost is not the concern. The real trade-off is contrast. At rescale factors above 0.8 we saw glossy and metallic products lose specular punch, which matters for jewelry and electronics catalogs. Dynamic thresholding has its own edge case: on images that are genuinely meant to be bright and high-key, an aggressive percentile clamps legitimate highlights, so we tuned the percentile per product category rather than globally.
There is also a simpler path we rejected. You can lower the guidance scale back to 5.0 and avoid the whole question, but you lose the prompt fidelity the catalog team asked for. The corrections let us keep a scale of 8.0 to 9.0 without the artifacts, which was the actual goal.
If your renders saturate at high classifier-free guidance, measure the per-image latent standard deviation against the conditional-only prediction before reaching for retraining. The fix is almost always in the guidance arithmetic, not the weights. I would start with CFG rescale at 0.7, add dynamic thresholding only if outliers remain, and validate with a saturated-pixel metric alongside CLIP similarity so you do not silently trade away adherence.