{"slug": "hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked", "title": "HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked", "summary": "Benchmarking the HiDream-O1-Image model revealed that its \"skeleton mode\" does not have a dedicated code path and instead processes all reference images (face, background, pose) through the same pipeline, relying solely on the prompt for differentiation. The key finding was that including a background reference image severely limits the model's ability to follow pose instructions, and removing the background ref while using a shift value of 2.0 produced the best results for natural, instruction-following try-on outputs.", "body_md": "## TL;DR\n\nAfter benchmarking [HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) (released 2026-05, OpenWeight 8B, ranked #8 on Artificial Analysis Text-to-Image Arena) across 8 skeleton (try-on) mode patterns plus 3 layout patterns, three counterintuitive findings emerged.\n\n- **Passing an openpose ref actually locks the pose to the ref's composition.** When you want dynamic poses, dropping the openpose ref and specifying the pose via prompt is more effective.\n- Using 6 refs (face + bg + pose + parts, the full set) compresses each ref down to **768px, degrading fine details.** Keeping it to 3–4 refs maintains 1024px and produces better quality.\n- The README-recommended `shift=1.0` is strictly for try-on use. For pose/outfit swaps use `shift=2.0-2.5`; for complete scene replacement use `shift=3.0`.\n\nReading `pipeline.py` reveals that **there is no dedicated code path for skeleton mode.** Both `/generate/skeleton` and `/generate/ip` go through exactly the same multi-ref pipeline internally, and whether a ref is a face, background, openpose, or clothing is **communicated only through the prompt**. 
That's the root cause of everything.\n\n## Motivation\n\nAfter running HiDream-O1-Image on a local GPU (RTX PRO 6000 Blackwell, 96 GB) and integrating it into our own platform, we hit a problem: **skeleton (try-on) mode wasn't following prompt instructions.** Writing \"jump with both hands raised\" only produced stiff, upright try-on photos.\n\nSuspecting guardrails (NSFW filters, safety policies, etc.), I grepped for `safety|nsfw|guard|filter|moderate|censor` and found that **HiDream's codebase has none of that** (the only hit was CSS `backdrop-filter: blur`). As expected from an MIT-licensed OpenWeight model, there is no censorship.\n\nSo what's actually wrong? Here's what I found after reading `pipeline.py` and running 8 + 3 patterns on real hardware.\n\n## Environment\n\n- **GPU**: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)\n- **PyTorch**: 2.12.0 + CUDA 13.0\n- **flash-attn**: 2.8.3 (sm_120-only build)\n- **Model**: HiDream-O1-Image Full (8B, bf16, ~16.4 GiB resident)\n- **Inference server**: custom Python BaseHTTPRequestHandler (port 8895)\n- **Resolution**: the pipeline's internal bucketing forces a snap to 2048×2048\n\nMeasured wall time per 50-step generation:\n\n| Mode | Time | Iteration speed |\n|---|---|---|\n| t2i (no ref) | ~33s | 1.52 it/s |\n| edit (1 ref) | ~76s | 1.01 it/s |\n| skeleton (multi ref) | ~84s | 1.34 it/s |\n| ip (multi ref) | ~76s | 1.81 it/s |\n| layout (multi ref + bbox) | ~83s | 1.21 it/s |\n\n## Test Assets\n\nThe HiDream repo's `assets/IP_skeleton/` includes a full skeleton set. These are used as-is for all tests.\n\n| Content | Intended role |\n|---|---|\n| Person's face photo | Identity reference |\n| Stick figure in OpenPose format | Pose specification |\n| Background photo (interior) | Scene reference |\n| Clothing parts (sweater, boots) | Outfit reference |\n\n## 8-Pattern Skeleton Benchmark\n\nEach pattern calls `/api/studio/skeleton` (i.e., `generate_image()` with skeleton-mode-equivalent arguments). 
All parameters except `shift` and `guidance_scale` are fixed (50 steps, seed=42).\n\n### A — Baseline (README defaults, all 6 refs)\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"prompt\": \"Create a realistic try-on image of the person wearing the provided clothing.\",\n    \"ref_image_paths\": [\"face\",\"bg\",\"openpose\",\"part_1\",\"part_2\",\"part_3\"],\n    \"shift\": 1.0, \"seed\": 42\n  }'\n```\n\n**Result**: The bg ref's walls and shelves are reproduced exactly. The pose also matches the openpose ref's upright stance. Faithful as a try-on, but zero freedom of movement.\n\n### B — Higher shift (same 6 refs, shift=2.5)\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton -d '{\n  \"prompt\": \"Create a realistic try-on image of the person wearing the provided clothing.\",\n  \"ref_image_paths\": [\"face\",\"bg\",\"openpose\",\"part_1\",\"part_2\",\"part_3\"],\n  \"shift\": 2.5, \"seed\": 42\n}'\n```\n\n### C — Higher guidance (same 6 refs, shift=2.5, guidance_scale=7.0)\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton -d '{\n  \"prompt\": \"...\",\n  \"ref_image_paths\": [...6 refs...],\n  \"shift\": 2.5, \"guidance_scale\": 7.0, \"seed\": 42\n}'\n```\n\n**Result**: The necklace deforms strangely. **Raising guidance starts producing artifacts.** The Full model's guidance sweet spot is 5.0; 7.0 is too much.\n\n### D — Trim to 3 refs (face + openpose + sweater) + specific prompt\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton -d '{\n  \"prompt\": \"A young Asian woman wearing a gray oversized sweater dress, standing in a relaxed pose, full body shot, soft natural lighting, white studio background.\",\n  \"ref_image_paths\": [\"face\",\"openpose\",\"part_1\"],\n  \"shift\": 2.0, \"seed\": 42\n}'\n```\n\n**Result**: **Major improvement.** The background becomes a clean white studio, the outfit is preserved, and the pose looks natural. Removing the bg ref made the biggest difference. 
This is what a correct try-on output should look like.\n\n### E — 4 refs + numbered-ref prompt\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton -d '{\n  \"prompt\": \"Full body try-on photograph. Subject: the woman from image 1. Pose: identical to the skeleton in image 2. Wearing: the gray oversized knit sweater dress shown in image 3, brown leather ankle boots shown in image 4. Studio lighting, plain background.\",\n  \"ref_image_paths\": [\"face\",\"openpose\",\"part_1\",\"part_2\"],\n  \"shift\": 2.0, \"seed\": 42\n}'\n```\n\n**Result**: Quality on par with D; the boots are reflected in the output, if somewhat subtly. **Numbering refs in the prompt does help**, but the effect isn't dramatic.\n\n### F — Drop openpose, specify pose via prompt\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton -d '{\n  \"prompt\": \"Full body photograph of the woman wearing the gray sweater dress and brown ankle boots, dynamic dancing pose with both arms raised above her head, joyful expression, photo studio with white seamless background, professional lighting.\",\n  \"ref_image_paths\": [\"face\",\"part_1\",\"part_2\"],\n  \"shift\": 2.5, \"seed\": 42\n}'\n```\n\n**Result**: 🏆 **Both-arms-raised jump, complete success.** Dynamic motion only appeared when the openpose ref was removed and the pose was specified purely via prompt. **This confirms that the openpose ref suppresses prompt-driven pose.**\n\n### G — Face only + freeform prompt (full outfit swap)\n\n`/generate/skeleton` has a minimum-2-refs validation, so this pattern uses `/generate/ip`:\n\n```\ncurl -X POST http://localhost:8895/generate/ip -d '{\n  \"prompt\": \"Elegant full-body portrait of the woman wearing a vibrant red sequined evening gown with a thigh-high slit, standing confidently with one hand on her hip, soft cinematic lighting, dark blurred background.\",\n  \"ref_image_paths\": [\"face\"],\n  \"shift\": 3.0, \"seed\": 42\n}'\n```\n\n**Result**: 🏆 **Red evening gown generated perfectly.** Facial identity is preserved; everything else 
is free. **Face-only + shift=3.0** is the maximum-freedom pattern.\n\n### H — Same config as E, seed=999 (variance check)\n\n```\ncurl -X POST http://localhost:8895/generate/skeleton -d '{\n  \"prompt\": \"Full body try-on photograph. ...\",\n  \"ref_image_paths\": [\"face\",\"openpose\",\"part_1\",\"part_2\"],\n  \"shift\": 2.0, \"seed\": 999\n}'\n```\n\n**Result**: Marginal difference from E; the boots come out more clearly brown. **Varying the seed is useful for refining details**, so in production, running 3–5 seeds and picking best-of-N is standard practice.\n\n## Layout Mode Quick Look (3 Bonus Patterns)\n\n`layout_bboxes` lets you specify where multiple subjects appear in the image using relative coordinates `[x1, x2, y1, y2]`. Here's the actual behavior.\n\nInput refs are face photos of two people (female, male):\n\n### L1 — Side by side (female left, male right)\n\n```\n\"layout_bboxes\": \"[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]\"\n```\n\n**Result**: **Left and right were swapped** (male left, female right). Correspondence between ref order and bbox order is not guaranteed.\n\n### L2 — Top/bottom split (female top, male bottom)\n\n```\n\"layout_bboxes\": \"[[0.2,0.8,0.0,0.5],[0.2,0.8,0.5,1.0]]\"\n```\n\n**Result**: The female subject appears in the background and the male in the foreground: a depth-layered composition rather than a literal top/bottom split.\n\n### L3 — Size difference (female large, male small)\n\n```\n\"layout_bboxes\": \"[[0.1,0.65,0.1,0.95],[0.7,0.97,0.05,0.45]]\"\n```\n\n**Result**: Both subjects are rendered at nearly the same size, side by side. **Bbox size does not control relative scale.**\n\n→ Think of layout mode as a **loose composition hint for group shots**, not precise Photoshop-style placement. It gives a rough suggestion for fitting multiple subjects into a single image; don't expect coordinate accuracy.\n\n## Why This Happens — Reading `pipeline.py`\n\nHiDream's behavior is governed by the `generate_image()` function in `models/pipeline.py`. 
Three structural facts explain everything.\n\n### 1. More refs = lower per-ref resolution\n\n`pipeline.py:198-202`:\n\n```\nif K == 1: max_size = max(height, width)         # 2048\nelif K == 2: max_size = max(height, width) * 48 // 64   # 1536\nelif K <= 4: max_size = max(height, width) // 2  # 1024\nelif K <= 8: max_size = max(height, width) * 24 // 64   # 768\nelse: max_size = max(height, width) // 4         # 512\n```\n\n**Feeding 6 refs compresses each to 768px.** Thin openpose lines, fine clothing patterns, and facial detail all get crushed. Keeping it to 3–4 refs preserves 1024px and retains that detail.\n\n### 2. Skeleton mode has no dedicated code path\n\nLooking at `pipeline.py:178-275`, **there is no skeleton-specific branch.** Both `/generate/skeleton` and `/generate/ip` run through exactly the same multi-ref path:\n\n```\ncontent = [{\"type\": \"image\"} for _ in range(K)]\ncontent.append({\"type\": \"text\", \"text\": caption})\nmessages = [{\"role\": \"user\", \"content\": content}]\n```\n\nThe model receives **no role hints** indicating which ref is a face, which is an openpose skeleton, and which is clothing. All refs are treated as \"K reference images in parallel.\" If you want roles to matter, **you have to say so explicitly in the prompt text.** This is why \"prompt beats openpose ref.\" The openpose ref is processed as \"some line-art image among the references,\" with no explicit signal that it's a pose specification. Meanwhile, `dynamic dancing pose with both arms raised` in the prompt is parsed as explicit verbs and nouns at the vocabulary level.\n\n### 3. How the `shift` parameter behaves\n\n`shift` controls the strength of the scheduler's noise schedule. 
In practice:\n\n- **1.0** = maximum fidelity to ref composition, zero freedom → try-on only\n- **2.0-2.5** = practical range, allows deviation from refs\n- **3.0+** = near-freeform generation, refs serve only as identity anchors\n\nThe README recommends 1.0 for IP/Skeleton/Layout because it assumes the typical try-on / character-consistency use case. **If you want to change the pose, swap outfits, or build a new scene that differs from the refs, 2.0+ is required.**\n\n## Best Practices by Use Case (Battle-Tested)\n\n| Goal | Endpoint | Refs | Shift | Notes |\n|---|---|---|---|---|\n| Faithful try-on matching original scene | `/skeleton` | 6 (face+bg+pose+3parts) | 1.0 | README default. Strongly faithful to all refs |\n| Preserve outfit + natural standing pose | `/skeleton` | 3-4 (face + clothing, no bg/pose) | 2.0 | Dropping bg ref gives white studio; fewer refs raise each from 768px to 1024px |\n| Dramatic pose change | `/skeleton` | 3 (no openpose) | 2.5 | Prompt controls motion better than openpose ref |\n| Complete outfit swap | `/ip` | 1 (face only) | 3.0 | Maximum freedom; only face is preserved. Skeleton mode rejects < 2 refs |\n| Group shot | `/layout` | Multiple face refs + rough bboxes | 1.0 | Bboxes are loose composition hints; size hierarchy doesn't work; ref↔bbox order not guaranteed |\n| Fine detail optimization | Same config | Same | Same | Run 3–5 seeds and pick best-of-N |\n\n## Summary\n\nTreating HiDream-O1-Image's skeleton mode as a \"try-on simulator\" leads to the frustrating feeling that \"it won't listen\", with no guardrails to blame. 
The real cause is **pipeline structure**: refs lose resolution as count increases, there's no skeleton-specific processing, and `shift` controls how hard the ", "url": "https://wpnews.pro/news/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked-3bm7", "published_at": "2026-05-22 11:23:05+00:00", "updated_at": "2026-05-22 11:35:15.954298+00:00", "lang": "en", "topics": ["artificial-intelligence", "open-source", "research", "products"], "entities": ["HiDream-O1-Image", "Artificial Analysis", "RTX PRO 6000 Blackwell", "OpenPose"], "alternates": {"html": "https://wpnews.pro/news/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked", "markdown": "https://wpnews.pro/news/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked.md", "text": "https://wpnews.pro/news/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked.txt", "jsonld": "https://wpnews.pro/news/hidream-skeleton-mode-prompt-beats-openpose-ref-8-patterns-benchmarked.jsonld"}}