HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution

This article benchmarks the HiDream-O1-Image model to find the best speed-quality trade-offs for iterative, UI-based image generation. The author found that 1536x1536 at 28–36 steps with a guidance scale of 5.0 gives a good balance, cutting generation time from about 33 seconds to roughly 10 seconds. The recommended workflow is to explore ideas quickly at low resolution (1024x1024) and a low step count (24), then re-render promising results at higher-quality settings.

## TL;DR

I'm running HiDream-O1-Image Full as a persistent local server integrated into a Studio UI. The official recipe (2048x2048 / 50 steps / guidance 5.0) produces beautiful results, but each image takes around 33 seconds. That's too slow for iterative exploration.

So I held the prompt and seed constant and swept `steps`, `guidance`, and resolution. The sweet spots were clear.

| Config | Time | vs. Official |
|---|---|---|
| 2048 / 50 steps / g5 | 33.37s | 1.00x |
| 2048 / 28 steps / g5 | 18.41s | 1.81x |
| 1536 / 20 steps / g5 | 7.14s | 4.67x |
| 1024 / 20 steps / g5 | 3.83s | 8.71x |

The takeaway: explore direction at low resolution and low steps, then do the final render at full quality. In particular, 1536x1536 / 28–36 steps hits a very good speed-quality balance.

## Motivation

Once image generation is embedded in a UI, iteration speed matters more than peak quality. The real workflow isn't "generate one perfect image." It looks like this:

- Check composition, mood, outfit, background direction
- Tweak the prompt slightly
- Try different seeds
- Re-render only the promising candidates at full quality

Waiting 30+ seconds per generation makes that loop painful. Being able to see rough candidates in 5–10 seconds is a completely different experience.

The goal here isn't "the best single image." It's understanding how far you can cut exploration cost without breaking quality in a meaningful way.

## Environment

- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q, 96 GB VRAM
- Model: HiDream-O1-Image Full 8B, bf16
- Inference server: Custom Python HTTP server with the model kept resident
- Measured: One `/generate/t2i` request after model load
- Seed: 42
- Prompt: A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail

All comparison images use the same prompt and seed. Only `steps`, guidance scale, resolution, and resolution snapping are varied.

| Parameter | Value |
|---|---|
| prompt | A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail |
| seed | 42 |
| mode | t2i |
| dtype | bf16 |
| negative prompt | none |
| sampler / scheduler | HiDream pipeline default |

I used a portrait because hair, skin, background light, and fine detail are easy to compare. That said, a young woman's face has relatively little texture and wrinkle detail to begin with, so it's actually a forgiving subject for low-step generation. I'll come back to that.

Images in this article are contact sheets with results side by side. Pixel-peeping is easier at full resolution, but for UI-driven exploration the first question is "does this look worth keeping?", so I've prioritized at-a-glance comparison here.

## Start by Reducing Steps

I fixed `guidance=5.0` and 2048x2048, and varied only `steps`.

| Resolution | Steps | Guidance | Elapsed | Speedup vs 50 steps |
|---|---|---|---|---|
| 2048x2048 | 20 | 5.0 | 13.070s | 2.55x |
| 2048x2048 | 28 | 5.0 | 18.412s | 1.81x |
| 2048x2048 | 36 | 5.0 | 23.854s | 1.40x |
| 2048x2048 | 50 | 5.0 | 33.370s | 1.00x |

This is pretty much linear scaling: time per step stays at roughly 0.65–0.67s across all four rows. In this HiDream path, when `guidance > 1.0`, both the conditional and unconditional forwards run on every step, so reducing steps translates directly to lower latency.

Visually:

- 20 steps shows some roughness.
- 28 steps looks fine at first glance, though fine detail thins out under comparison.
- 36 steps holds up well for most use cases.

## guidance=1.0 Is Significantly Faster

Next I varied `guidance` as well, comparing practical preset candidates.

| Preset | Resolution | Steps | Guidance | CFG | Elapsed |
|---|---|---|---|---|---|
| Draft | 2048x2048 | 24 | 1.0 | off | 8.164s |
| Balanced | 2048x2048 | 36 | 3.0 | on | 23.664s |
| Official | 2048x2048 | 50 | 5.0 | on | 32.609s |

`guidance=1.0` effectively disables CFG, so each step runs one forward pass instead of two. That makes it faster than the step count alone would suggest: 24 steps lands in the 8-second range.
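
The timings in the two tables above are consistent with a simple cost model: elapsed time is roughly steps × forward passes per step × time per forward pass. The sketch below is only an illustration of that arithmetic. The function names are hypothetical, the per-forward constant is calibrated from the 50-step row (33.370s), and the `guidance > 1.0` threshold follows the description in this article rather than the pipeline source.

```python
# Hypothetical cost model for 2048x2048; not part of the HiDream codebase.

# Calibrated from the official row: 50 steps with CFG on = 100 forward passes.
SECONDS_PER_FORWARD_2048 = 33.370 / (50 * 2)


def forwards_per_step(guidance: float) -> int:
    """CFG runs a conditional and an unconditional forward; guidance=1.0 runs one."""
    return 2 if guidance > 1.0 else 1


def estimate_seconds(steps: int, guidance: float,
                     seconds_per_forward: float = SECONDS_PER_FORWARD_2048) -> float:
    """Estimate latency as steps x forwards per step x seconds per forward."""
    return steps * forwards_per_step(guidance) * seconds_per_forward


if __name__ == "__main__":
    # Compare against the measured Draft (8.164s) and Balanced (23.664s) presets.
    print(f"Draft    24 steps / g1.0: {estimate_seconds(24, 1.0):.2f}s")
    print(f"Balanced 36 steps / g3.0: {estimate_seconds(36, 3.0):.2f}s")
```

The estimates come out at about 8.0s and 24.0s, against measured values of 8.164s and 23.664s, so the model is close enough to predict how a preset will feel before benchmarking it.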

The trade-off of dropping to `guidance=1.0` is that lower guidance changes prompt adherence and overall aesthetics. That's fine for idea validation, but for prompts involving text, specific clothing details, or precise multi-element placement, staying at `guidance=3–5` is safer.

## The Resolution Trap: Requesting 1024 Doesn't Make It Faster

My first instinct was to just pass `width=1024, height=1024` and get a faster result. But the official pipeline doesn't use the requested resolution directly. It snaps to the nearest fixed aspect-ratio bucket.

Measured results:

| Requested | Actual |
|---|---|
| 512x512 | 2048x2048 |
| 1024x1024 | 2048x2048 |
| 2048x2048 | 2048x2048 |
| 1280x720 | 2560x1440 |
| 720x1280 | 1440x2560 |
| 1024x768 | 2304x1728 |

Sending 1024x1024 from the UI changes nothing: every square request resolves to 2048x2048. The snapping logic lives in `models/utils.py` under `PREDEFINED_RESOLUTIONS`, and it seems intentionally designed to favor output stability.

## Bypassing Buckets for True Low-Resolution Generation

For experimentation I added a `snap_resolution=false` flag that bypasses the pipeline's resolution snapping. For safety, arbitrary resolutions are constrained to:

- width and height aligned to 32px
- 256px minimum
- max 4.3MP total

Comparing 1024 / 1536 / 2048 at 20 steps / `guidance=5.0`:

| Resolution | Elapsed | Speedup vs 2048 |
|---|---|---|
| 1024x1024 | 3.831s | 3.47x |
| 1536x1536 | 7.139s | 1.86x |
| 2048x2048 | 13.278s | 1.00x |

This is where the real gains are. The official 2048 recipe sits at 30+ seconds, while 1536 + 28 steps should land around 10 seconds (scaling the 20-step time linearly, 7.139s × 28/20 ≈ 10.0s). That's a completely different feel.

1024 is fast but noticeably lower in information density. Good for directional checks, but probably too rough for regular output use.
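
The three safety constraints on arbitrary resolutions can be sketched as a small validation function. This is a minimal illustration, not the server's actual code: the function and constant names are hypothetical, rejecting with `ValueError` (rather than rounding to the nearest valid size) is an assumption, and "4.3MP" is interpreted as 4,300,000 pixels.

```python
# Hypothetical validator for requests sent with resolution snapping disabled.
ALIGN_PX = 32            # width and height must be multiples of 32
MIN_SIDE_PX = 256        # smallest allowed side length
MAX_PIXELS = 4_300_000   # assumption: "4.3MP" read as 4.3 million pixels


def validate_resolution(width: int, height: int) -> tuple[int, int]:
    """Return (width, height) unchanged if valid, otherwise raise ValueError."""
    for name, value in (("width", width), ("height", height)):
        if value < MIN_SIDE_PX:
            raise ValueError(f"{name}={value} is below the {MIN_SIDE_PX}px minimum")
        if value % ALIGN_PX != 0:
            raise ValueError(f"{name}={value} is not aligned to {ALIGN_PX}px")
    if width * height > MAX_PIXELS:
        raise ValueError(f"{width}x{height} exceeds {MAX_PIXELS} total pixels")
    return width, height


if __name__ == "__main__":
    for size in [(1024, 1024), (1536, 1536), (2048, 2048), (1000, 1024)]:
        try:
            print(size, "->", validate_resolution(*size))
        except ValueError as err:
            print(size, "-> rejected:", err)
```

Under this reading of the limit, all of the official bucket sizes measured earlier (up to 2048x2048, about 4.19MP) still pass, so the cap only blocks requests larger than anything the snapped pipeline would produce.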

## Presets in the Studio UI

Based on these results, here's what I settled on in the Studio UI:

| Use case | Resolution | Steps | Guidance | When to use |
|---|---|---|---|---|
| Quick preview | 1024x1024 | 20–24 | 1.0–3.0 | Composition / mood check |
| Standard | 1536x1536 | 28–36 | 3.0–5.0 | Day-to-day |
| High quality | 2048x2048 | 36–50 | 5.0 | Re-render of selected candidates |
| Official bucket | bucket | 50 | 5.0 | Match upstream recipe exactly |

Steps and resolution are independently selectable in the UI. The workflow is: explore with 1024 / 24 steps, then re-render promising results at 1536 or 2048 with the same prompt and seed.

## Cases Where Quality Degradation Shows Up

With this portrait, the difference between 28 steps and 50 steps was "visible under comparison" rather than obvious at a glance. But part of that is the subject matter.

Low steps and low resolution tend to hurt most with:

- Older faces, wrinkles, skin texture
- Hands, fingers, jewelry
- Fabric with fine patterns
- Text in signs or books
- Multiple people
- Busy indoor scenes with lots of background objects

Conversely, young faces, simple backgrounds, and soft lighting are forgiving, and low-cost settings hold up well.

That's why a single fixed preset isn't the right design. Giving users control over exploration cost depending on what they're generating is the better approach.

## Reproduction Commands

The benchmark script lives at `image_server/bench_quality_speed.py`. It calls the HTTP API after the model is already resident, so model load time is excluded from all measurements.

    ./image_server/start_image_server.sh

Steps comparison:

    python3 image_server/bench_quality_speed.py \
      --prompt "A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail" \
      --seed 42 \
      --variant s20_g5,20,5 \
      --variant s28_g5,28,5 \
      --variant s36_g5,36,5 \
      --variant s50_g5,50,5

Resolution comparison:

    python3 image_server/bench_quality_speed.py \
      --prompt "A cinematic portrait photo of a woman in a rainy neon street, detailed skin, 85mm lens, realistic lighting, high detail" \
      --seed 42 \
      --variant s20_g5,20,5 \
      --size 1024x1024 \
      --size 1536x1536 \
      --size 2048x2048 \
      --no-snap-resolution

## Summary

HiDream-O1-Image Full is excellent at its official settings but too slow for iterative use. When you break down steps, CFG, and resolution separately, the speedups are clean and predictable.

- Time scales almost linearly with step count
- `guidance=1.0` drops CFG and gives a large speed boost
- The official pipeline snaps resolutions to fixed buckets
- True low-resolution generation at 1024/1536 is dramatically faster
- 1536 / 28–36 steps is the practical sweet spot

For image generation UIs, low-cost exploration → high-quality final render is a much better flow than starting at maximum quality every time. This experiment gave me a solid basis for building exactly that.
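
As a closing illustration, the explore-then-re-render flow can be sketched in a few lines. This is a hedged sketch, not the Studio UI's implementation: `explore_then_render`, the `generate` callable, and the `keep` callback are all hypothetical names, and where the presets table gives a range, one value from that range is picked arbitrarily.

```python
from typing import Any, Callable, Dict, Iterable

# Values taken from the presets table; single values chosen from each range.
PRESETS: Dict[str, Dict[str, Any]] = {
    "quick_preview": {"width": 1024, "height": 1024, "steps": 24, "guidance": 3.0},
    "standard": {"width": 1536, "height": 1536, "steps": 28, "guidance": 5.0},
    "high_quality": {"width": 2048, "height": 2048, "steps": 50, "guidance": 5.0},
}


def explore_then_render(
    generate: Callable[..., Any],    # hypothetical: wraps the image server call
    prompt: str,
    seeds: Iterable[int],
    keep: Callable[[Any], bool],     # hypothetical: the "worth keeping?" decision
    final_preset: str = "standard",
) -> Dict[int, Any]:
    """Draft every seed cheaply, then re-render only the kept seeds."""
    finals: Dict[int, Any] = {}
    for seed in seeds:
        draft = generate(prompt=prompt, seed=seed, **PRESETS["quick_preview"])
        if keep(draft):
            # Same prompt and seed, higher-quality settings.
            finals[seed] = generate(prompt=prompt, seed=seed, **PRESETS[final_preset])
    return finals
```

Injecting `generate` keeps the sketch independent of any particular request format, so the same loop works whether the call goes over HTTP or directly into a pipeline object.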