Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?

Researchers report that human preference scores for text-to-image content can be predicted before generation begins, with negligible hardware overhead. Predicting these scores lets a system commit compute only to the more promising starting noise, which matters most for the smaller diffusion models used in local deployment, where the initial noise has a pronounced effect on output quality. This could reduce wasted compute by letting systems skip likely low-quality outputs before they are rendered.

arXiv:2606.05478v1. Abstract: Diffusion Models (DMs) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation, such as VAEs and GANs, were primarily evaluated on perceptual or visual similarity metrics such as FID and PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPMs) that model and quantify human judgment as scalar values. However, DMs synthesize content through an inherently stochastic process in which random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models intended for local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. We then investigate to what extent we can leverage such predictions to improve the quality of generated images, and study which HPMs are best suited for this task. Our investigation reveals that not only is this possible, but that it is feasible with negligible hardware overhead.
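The idea of scoring initial noise before committing compute can be illustrated with a minimal sketch. It is not the authors' method: `make_noise`, `toy_predictor`, and `select_seed` are hypothetical names introduced here, and the toy scoring rule is a placeholder for a learned predictor of an HPM score. The sketch only shows the control flow: draw several candidate seeds, score each (prompt, noise) pair cheaply, and run the expensive generation only for the best-scoring seed.

```python
import random
from typing import Callable, List, Sequence, Tuple


def make_noise(seed: int, dim: int = 16) -> List[float]:
    """Draw a reproducible Gaussian noise vector for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_predictor(prompt: str, noise: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned pre-generation HPM predictor.

    ASSUMPTION: a real predictor would be a small trained model taking the
    prompt and the initial noise. This placeholder ignores the prompt and
    simply prefers noise whose mean is close to zero.
    """
    return -abs(sum(noise) / len(noise))


def select_seed(
    prompt: str,
    seeds: Sequence[int],
    predictor: Callable[[str, Sequence[float]], float],
    dim: int = 16,
) -> Tuple[int, float]:
    """Return (best_seed, predicted_score) without running any generation."""
    scored = [(predictor(prompt, make_noise(s, dim)), s) for s in seeds]
    best_score, best_seed = max(scored)
    return best_seed, best_score


if __name__ == "__main__":
    seed, score = select_seed("a red bicycle", range(8), toy_predictor)
    # Only this seed would be handed to the (expensive) diffusion model.
    print(f"generate with seed={seed}, predicted score={score:.4f}")
```

Under these assumptions, the cost of selection is one cheap predictor call per candidate seed, which is the sense in which prediction could pay for itself if the predictor is much cheaper than a full denoising run.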