HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked

Benchmarking the HiDream-O1-Image model revealed that its "skeleton mode" does not have a dedicated code path and instead processes all reference images (face, background, pose) through the same pipeline, relying solely on the prompt for differentiation. The key finding was that including a background reference image severely limits the model's ability to follow pose instructions, and removing the background ref while using a shift value of 2.0 produced the best results for natural, instruction-following try-on outputs.

TL;DR

After benchmarking HiDream-O1-Image (https://huggingface.co/HiDream-ai/HiDream-O1-Image; released 2026-05, OpenWeight 8B, ranked 8 on the Artificial Analysis Text-to-Image Arena) across 8 skeleton (try-on) mode patterns plus 3 layout patterns, three counterintuitive findings emerged.

- Passing an openpose ref locks the pose to the ref's composition. When you want dynamic poses, dropping the openpose ref and specifying the pose via prompt is more effective.
- Using 6 refs (face + bg + pose + parts, the full set) compresses each ref down to 768px, degrading fine details. Keeping it to 3–4 refs maintains 1024px and produces better quality.
- The README-recommended shift=1.0 is strictly for try-on use. For pose/outfit swaps use shift=2.0–2.5; for complete scene replacement use shift=3.0.

Reading pipeline.py reveals that there is no dedicated code path for skeleton mode. Both /generate/skeleton and /generate/ip go through exactly the same multi-ref pipeline internally, and whether a ref is a face, background, openpose, or clothing is communicated only through the prompt. That's the root cause of everything.

Motivation

After running HiDream-O1-Image on a local GPU (RTX PRO 6000 Blackwell, 96 GB) and integrating it into our own platform, we hit a problem: skeleton (try-on) mode wasn't following prompt instructions. Writing "jump with both hands raised" only produced stiff, upright try-on photos.

Suspecting guardrails (NSFW filters, safety policies, etc.), I grepped for safety|nsfw|guard|filter|moderate|censor. HiDream's codebase has none of that; the only hit was CSS backdrop-filter: blur. As expected from an MIT-licensed OpenWeight model, no censorship.

So what's actually wrong? Here's what I found after reading pipeline.py and running 8 + 3 patterns on real hardware.
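The keyword search described above can be reproduced in a few lines of Python. This is a minimal sketch, not the command that was actually run: the function name `find_guardrail_hits` and the sample inputs are hypothetical, and only the alternation pattern comes from the text. It also shows why a bare `filter` keyword matches harmless CSS such as `backdrop-filter: blur`.

```python
import re

# Same alternation as the grep pattern quoted above, matched case-insensitively.
PATTERN = re.compile(r"safety|nsfw|guard|filter|moderate|censor", re.IGNORECASE)

def find_guardrail_hits(files):
    """Return (name, line_number, line) for every line that matches PATTERN.

    `files` maps a file name to its text; reading from disk is left out
    so the sketch stays self-contained.
    """
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((name, lineno, line.strip()))
    return hits

# Hypothetical sample inputs, not the real HiDream sources.
sample = {
    "style.css": ".panel {\n  backdrop-filter: blur(8px);\n}",
    "pipeline.py": "def generate_image(prompt):\n    return prompt",
}
hits = find_guardrail_hits(sample)
```

A hit list that contains only styling lines, as in this sample, is a false positive for the purpose of finding content filters.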
Environment

- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
- PyTorch: 2.12.0 + CUDA 13.0
- flash-attn: 2.8.3 (sm_120-only build)
- Model: HiDream-O1-Image Full (8B, bf16, ~16.4 GiB resident)
- Inference server: custom Python BaseHTTPRequestHandler (port 8895)
- Resolution: pipeline internal bucket forces snap to 2048×2048

Measured wall time per 50-step generation:

| Mode | Time | Iteration speed |
|---|---|---|
| t2i (no ref) | ~33s | 1.52 it/s |
| edit (1 ref) | ~76s | 1.01 it/s |
| skeleton (multi ref) | ~84s | 1.34 it/s |
| ip (multi ref) | ~76s | 1.81 it/s |
| layout (multi ref + bbox) | ~83s | 1.21 it/s |

Test Assets

The HiDream repo's assets/IP skeleton/ includes a full skeleton set. These are used as-is for all tests.

| Ref (content) | Intended role |
|---|---|
| Person's face photo | Identity reference |
| Stick figure in OpenPose format | Pose specification |
| Background photo (interior) | Scene reference |
| Clothing parts (sweater, boots) | Outfit reference |

8-Pattern Skeleton Benchmark

Each pattern calls /api/studio/skeleton (i.e., generate_image with skeleton-mode-equivalent arguments). All parameters except shift and guidance scale are fixed (50 steps, seed=42).

A: Baseline (README defaults, all 6 refs)

    curl -X POST http://localhost:8895/generate/skeleton \
      -H 'Content-Type: application/json' \
      -d '{
        "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
        "ref_image_paths": ["face","bg","openpose","part 1","part 2","part 3"],
        "shift": 1.0,
        "seed": 42
      }'

Result: The bg ref's walls and shelves are reproduced exactly. The pose also matches the openpose ref's upright stance. Faithful as a try-on, but zero freedom of movement.
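The baseline request can also be issued from Python. The sketch below is a hedged equivalent of the curl call, not part of the HiDream repo: the helper names `build_skeleton_payload` and `post_json` are hypothetical, the underscore spelling of the `ref_image_paths` key is an assumption, and the server's response format is not described here, so the raw bytes are returned unchanged.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8895"  # custom inference server from the Environment section

def build_skeleton_payload(prompt, refs, shift=1.0, seed=42):
    """Assemble the JSON body for a multi-ref request (key names are assumed)."""
    return {
        "prompt": prompt,
        "ref_image_paths": list(refs),
        "shift": shift,
        "seed": seed,
    }

def post_json(path, payload, timeout=300):
    """POST the payload as JSON and return the raw response body."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

baseline = build_skeleton_payload(
    "Create a realistic try-on image of the person wearing the provided clothing.",
    ["face", "bg", "openpose", "part 1", "part 2", "part 3"],
)
# post_json("/generate/skeleton", baseline)  # needs the inference server to be running
```

Keeping payload construction separate from the HTTP call makes it easy to vary only `shift`, `seed`, or the ref list between patterns while everything else stays fixed.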
B: Higher shift (same 6 refs, shift=2.5)

    curl -X POST http://localhost:8895/generate/skeleton -d '{
      "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
      "ref_image_paths": ["face","bg","openpose","part 1","part 2","part 3"],
      "shift": 2.5,
      "seed": 42
    }'

C: Higher guidance (6 refs, shift=2.5, guidance_scale=7.0)

    curl -X POST http://localhost:8895/generate/skeleton -d '{
      "prompt": "...",
      "ref_image_paths": [...6 refs...],
      "shift": 2.5,
      "guidance_scale": 7.0,
      "seed": 42
    }'

Result: The necklace deforms strangely. Raising guidance starts producing artifacts. The Full model's sweet spot is 5.0; 7.0 is too much.

D: Trim to 3 refs (face + openpose + sweater) + specific prompt

    curl -X POST http://localhost:8895/generate/skeleton -d '{
      "prompt": "A young Asian woman wearing a gray oversized sweater dress, standing in a relaxed pose, full body shot, soft natural lighting, white studio background.",
      "ref_image_paths": ["face","openpose","part 1"],
      "shift": 2.0,
      "seed": 42
    }'

Result: Major improvement. The background becomes a clean white studio, the outfit is preserved, and the pose looks natural. Removing the bg ref made the biggest difference. This is what a correct try-on output should look like.

E: 4 refs + numbered-ref prompt

    curl -X POST http://localhost:8895/generate/skeleton -d '{
      "prompt": "Full body try-on photograph. Subject: the woman from image 1. Pose: identical to the skeleton in image 2. Wearing: the gray oversized knit sweater dress shown in image 3, brown leather ankle boots shown in image 4. Studio lighting, plain background.",
      "ref_image_paths": ["face","openpose","part 1","part 2"],
      "shift": 2.0,
      "seed": 42
    }'

Result: Quality on par with D; the boots are reflected, though somewhat subtly. Numbering refs in the prompt does help, but the effect isn't dramatic.
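Since the prompt is the only place where a ref's role can be stated, the numbered-ref wording of pattern E can be generated from the same list that fixes the ref order. A minimal sketch, assuming image numbers follow the order of the refs in the request (as pattern E's prompt implies); `numbered_ref_prompt` and its arguments are hypothetical names, not part of the HiDream API.

```python
def numbered_ref_prompt(subject, pose_ref, garments, style):
    """Build a prompt that names each ref by its 1-based position.

    subject  -- (ref, description) for the identity image, always image 1
    pose_ref -- ref for the openpose skeleton, always image 2
    garments -- list of (ref, description); numbered from image 3 onward
    Returns (prompt, refs) so the numbering and the ref order cannot drift apart.
    """
    refs = [subject[0], pose_ref] + [ref for ref, _ in garments]
    wearing = ", ".join(
        f"{desc} shown in image {i}" for i, (_, desc) in enumerate(garments, start=3)
    )
    prompt = (
        "Full body try-on photograph. "
        f"Subject: {subject[1]} from image 1. "
        "Pose: identical to the skeleton in image 2. "
        f"Wearing: {wearing}. "
        f"{style}"
    )
    return prompt, refs

prompt, refs = numbered_ref_prompt(
    subject=("face", "the woman"),
    pose_ref="openpose",
    garments=[
        ("part 1", "the gray oversized knit sweater dress"),
        ("part 2", "brown leather ankle boots"),
    ],
    style="Studio lighting, plain background.",
)
```

With these inputs the helper reproduces pattern E's prompt and ref list; adding or removing a garment renumbers the remaining references automatically.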
F: Drop openpose, specify pose via prompt

    curl -X POST http://localhost:8895/generate/skeleton -d '{
      "prompt": "Full body photograph of the woman wearing the gray sweater dress and brown ankle boots, dynamic dancing pose with both arms raised above her head, joyful expression, photo studio with white seamless background, professional lighting.",
      "ref_image_paths": ["face","part 1","part 2"],
      "shift": 2.5,
      "seed": 42
    }'

Result: 🏆 Both-arms-raised jump, complete success. Dynamic motion only appeared when the openpose ref was removed and the pose was specified purely via prompt. This indicates that the openpose ref suppresses prompt-driven pose.

G: Face only + freeform prompt (full outfit swap)

/generate/skeleton has a minimum-2-refs validation, so this pattern uses /generate/ip:

    curl -X POST http://localhost:8895/generate/ip -d '{
      "prompt": "Elegant full-body portrait of the woman wearing a vibrant red sequined evening gown with a thigh-high slit, standing confidently with one hand on her hip, soft cinematic lighting, dark blurred background.",
      "ref_image_paths": ["face"],
      "shift": 3.0,
      "seed": 42
    }'

Result: 🏆 Red evening gown generated perfectly. Facial identity is preserved; everything else is free. Face-only + shift=3.0 is the maximum-freedom pattern.

H: Same config as E, seed=999 (variance check)

    curl -X POST http://localhost:8895/generate/skeleton -d '{
      "prompt": "Full body try-on photograph. ...",
      "ref_image_paths": ["face","openpose","part 1","part 2"],
      "shift": 2.0,
      "seed": 999
    }'

Result: Marginal difference from E; the boots come out more clearly brown. Varying the seed is useful for fine-tuning details, so in production, running 3–5 seeds and picking best-of-N is standard practice.

Layout Mode Quick Look (3 Bonus Patterns)

layout_bboxes lets you specify where multiple subjects appear in the image using relative coordinates (x1, x2, y1, y2). Here's the actual behavior.
Input refs are face photos of two people (female, male):

L1: Side by side (female left, male right)

    "layout_bboxes": "[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]"

Result: Left and right were swapped (male left, female right). Correspondence between ref order and bbox order is not guaranteed.

L2: Top/bottom split (female top, male bottom)

    "layout_bboxes": "[[0.2,0.8,0.0,0.5],[0.2,0.8,0.5,1.0]]"

Result: The female subject appears in the background and the male in the foreground, a depth-layered composition rather than a literal top/bottom split.

L3: Size difference (female large, male small)

    "layout_bboxes": "[[0.1,0.65,0.1,0.95],[0.7,0.97,0.05,0.45]]"

Result: Both subjects are rendered at nearly the same size, side by side. Bbox size does not control relative scale.

→ Think of layout mode as a loose composition hint for group shots, not precise Photoshop-style placement. It gives a rough suggestion for fitting multiple subjects into a single image; don't expect coordinate accuracy.

Why This Happens: Reading pipeline.py

HiDream's behavior is governed by the generate_image function in models/pipeline.py. Three structural facts explain everything.

1. More refs = lower per-ref resolution

pipeline.py:198-202:

    if K == 1:   max_size = max(height, width)             # 2048
    elif K == 2: max_size = max(height, width) * 48 // 64  # 1536
    elif K <= 4: max_size = max(height, width) // 2        # 1024
    elif K <= 8: max_size = max(height, width) * 24 // 64  # 768
    else:        max_size = max(height, width) // 4        # 512

Feeding 6 refs compresses each to 768px. Thin openpose lines, fine clothing patterns, and facial detail all get crushed. Keeping it to 3–4 refs preserves 1024px and retains that detail.

2. Skeleton mode has no dedicated code path

Looking at pipeline.py:178-275, there is no skeleton-specific branch.
Both /generate/skeleton and /generate/ip run through exactly the same multi-ref path:

    content = [{"type": "image"} for _ in range(K)]
    content.append({"type": "text", "text": caption})
    messages = [{"role": "user", "content": content}]

The model receives no role hints indicating which ref is a face, which is an openpose skeleton, and which is clothing. All refs are treated as "K reference images in parallel." If you want roles to matter, you have to say so explicitly in the prompt text.

This is why "prompt beats openpose ref." The openpose ref is processed as "some line-art image among the references," with no explicit signal that it's a pose specification. Meanwhile, "dynamic dancing pose with both arms raised" in the prompt is parsed as explicit verbs and nouns at the vocabulary level.

3. How the shift parameter behaves

shift controls the strength of the scheduler's noise schedule. In practice:

- 1.0 = maximum fidelity to ref composition, zero freedom → try-on only
- 2.0–2.5 = practical range, allows deviation from refs
- 3.0+ = near-freeform generation, refs serve only as identity anchors

The README recommends 1.0 for IP/Skeleton/Layout because it assumes the typical try-on / character-consistency use case. If you want to change the pose, swap outfits, or build a new scene that differs from the refs, 2.0+ is required.

Best Practices by Use Case (Battle-Tested)

| Goal | Endpoint | Refs | Shift | Notes |
|---|---|---|---|---|
| Faithful try-on matching original scene | /skeleton | 6 (face+bg+pose+3 parts) | 1.0 | README default. Strongly faithful to all refs |
| Preserve outfit + natural standing pose | /skeleton | 3–4 (face + clothing, no bg/pose) | 2.0 | Dropping bg ref gives white studio; fewer refs raise each from 768px to 1024px |
| Dramatic pose change | /skeleton | 3 (no openpose) | 2.5 | Prompt controls motion better than openpose ref |
| Complete outfit swap | /ip | 1 (face only) | 3.0 | Maximum freedom; only face is preserved. Skeleton mode rejects < 2 refs |
| Group shot | /layout | Multiple face refs + rough bboxes | 1.0 | Bboxes are loose composition hints; size hierarchy doesn't work; ref↔bbox order not guaranteed |
| Fine detail optimization | Same config | Same | Same | Run 3–5 seeds and pick best-of-N |

Summary

Treating HiDream-O1-Image's skeleton mode as a "try-on simulator" leads to the frustrating feeling that "it won't listen", with no guardrails to blame. The real cause is pipeline structure: refs lose resolution as count increases, there's no skeleton-specific processing, and shift controls how hard the