One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own

A developer has released one of the first publicly documented LoRA training runs and general-purpose visual-enhancement LoRAs for HiDream-O1-Image, a top-ranked open-weight text-to-image model that shipped inference-only. Because HiDream-O1-Image uses a Pixel-level Unified Transformer architecture without a VAE or separate text encoder, standard LoRA trainers like kohya and SimpleTuner cannot be used, so the developer reverse-engineered a working training loop from the inference code alone. The resulting ~150-line trainer produces a clean aesthetic LoRA that improves rendering quality, lighting, and stylization across diverse subjects, with a full open training recipe and before/after documentation.

[HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) is one of the strongest open-weight text-to-image models out right now (it debuted around #8 in the Artificial Analysis T2I Arena). But it shipped inference-only, and because its architecture is radically different from SDXL/Flux (no VAE, no separate text encoder, everything is one unified transformer), the usual LoRA trainers can't touch it.

This post is one of the first publicly documented LoRA training runs and general-purpose visual-enhancement LoRAs for HiDream-O1-Image. I'll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don't fit, how I reverse-engineered a working training loop from the inference code alone, and the ~150-line trainer that produces a clean aesthetic LoRA. Plus the gotchas that cost me a night.

What this LoRA is: a general-purpose anime / semi-real visual enhancement LoRA. It improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. It's not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.

The short version of the recipe:

- Patchify the image into `x0`, normalized to [-1, 1].
- Noise it as `z_t = (1 - σ)·x0 + σ·(8.0·ε)` and feed the model timestep `1 - σ`.
- Take the loss as `MSE(x_pred, x0)` on the image-token positions.
- Put the LoRA on the linears of the Qwen3-VL decoder.

To set expectations honestly: I'm not claiming "world's first LoRA file for O1." What I didn't find: a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image. If you know of one, I'd genuinely love to see it; the more the merrier. But as of publication, this appears to be one of the first, and the first with before/after documentation and a full open training recipe.

Most LoRA trainers assume the SDXL/Flux shape: a UNet/DiT denoiser + a VAE + one or two text encoders, all separate modules wired together by `diffusers`. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.
HiDream-O1-Image is a Pixel-level Unified Transformer (UiT). From its own description:

> a natively unified image generative foundation model built on a Pixel-level Unified Transformer without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.

Concretely, reading `models/qwen3_vl_transformers.py`:

- The backbone is a `Qwen3VLForConditionalGeneration`.
- `PATCH_SIZE = 32`, so an H×W image becomes `(H/32)·(W/32)` tokens, each a `3·32·32 = 3072`-dim vector of raw pixels.
- `x_embedder` projects the noised patch tokens into the hidden space; a `final_layer2` head projects hidden states back to patch space; a `t_embedder` injects the timestep at a dedicated `<|tms_token|>` position.
- Sampling runs through a flow-matching solver (`fm_solvers_unipc.py`), and image tokens are marked through `token_types`.

So none of kohya/ai-toolkit/SimpleTuner can touch it: there's no UNet, no VAE, no separate text encoder for them to hook. That's exactly why there are no articles: it's a new architecture, released inference-only.

The good news: because the backbone is a plain `transformers` model, LoRA can attach to its `nn.Linear`s natively. The hard part is the training objective, which has to be recovered from the inference code.

The inference loop (`models/pipeline.py:generate_image`) tells you everything. Per denoising step it does roughly:

```python
sigma = step_t / 1000.0        # noise level, in [0, 1]
t_pixeldit = 1.0 - sigma       # what the model receives as "timestep"
x_pred = model(..., vinputs=z, timestep=t_pixeldit).x_pred
v = (x_pred - z) / sigma       # ... and -v is fed to the FM scheduler
```

Two facts fall out of this:

1. `x_pred` is the model's prediction of the clean image `x0`. If `z_t = (1-σ)·x0 + σ·ε` and the prediction is exact (`x_pred = x0`), then `(x_pred - z_t)/σ = x0 - ε = -(ε - x0)`, and `ε - x0` is exactly the rectified-flow velocity the FlowMatch scheduler expects. Consistent ⇒ the head is `x0`-parameterized.
2. The noise is scaled. Sampling starts from `z = NOISE_SCALE · randn` with `NOISE_SCALE = 8.0`, while `x0` lives in [-1, 1].
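The two observations above can be checked numerically without loading the model. The snippet below is only a sanity-check sketch in plain Python: `noised` and `velocity_from_x0_pred` are hypothetical helper names (not functions from the HiDream repo), random numbers stand in for real pixels, and it assumes a perfect prediction `x_pred = x0`. It confirms that the `(x_pred - z) / sigma` conversion from the inference loop, negated, recovers `noise - x0` when the noise is scaled by 8.0.

```python
import random

NOISE_SCALE = 8.0  # value quoted from the pipeline above


def noised(x0, eps, sigma):
    """z_t = (1 - sigma) * x0 + sigma * (NOISE_SCALE * eps), elementwise."""
    return [(1.0 - sigma) * x + sigma * NOISE_SCALE * e for x, e in zip(x0, eps)]


def velocity_from_x0_pred(x_pred, z_t, sigma):
    """The conversion the inference loop applies: v = (x_pred - z_t) / sigma."""
    return [(p - z) / sigma for p, z in zip(x_pred, z_t)]


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(3072)]  # stand-in for one 3072-dim patch token
eps = [rng.gauss(0.0, 1.0) for _ in range(3072)]    # unit Gaussian noise
target = [NOISE_SCALE * e - x for x, e in zip(x0, eps)]  # scaled noise minus x0

max_err = 0.0
for sigma in (0.05, 0.5, 1.0):
    z_t = noised(x0, eps, sigma)
    v = velocity_from_x0_pred(x0, z_t, sigma)  # pretend the model predicts x0 exactly
    max_err = max(max_err, max(abs(-vi - ti) for vi, ti in zip(v, target)))

print(f"max |(-v) - (noise - x0)| = {max_err:.2e}")
```

The residual is at floating-point level for every `sigma` tried, which is what you would expect if the head predicts `x0` and the scheduler consumes `-v`. It says nothing about how well the real model predicts `x0`; it only checks that the algebra is self-consistent.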
So the interpolation the model was trained on is `z_t = (1-σ)·x0 + σ·(8.0·ε)`.

That gives the entire training step:

```python
import random

import torch
import torch.nn.functional as F

# x0, ids, pos, tt and vinput_mask come from the data pipeline;
# T_EPS is the trainer's lower bound on sigma.
sigma = random.uniform(T_EPS, 1.0)
eps = torch.randn_like(x0)
z_t = (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)  # NOISE_SCALE = 8.0
t = torch.tensor([1.0 - sigma])
out = gen(input_ids=ids, position_ids=pos, vinputs=z_t, timestep=t, token_types=tt)
x_pred = out.x_pred[0, vinput_mask[0]]  # image-token positions only
loss = F.mse_loss(x_pred.float(), x0[0].float())
```

`x0` is just the image, normalized to [-1, 1] and patchified with the same `einops` rearrange the pipeline uses for reference images. The token layout (prompt → `<|boi_token|>` → `<|tms_token|>` → image tokens) is built by reusing the pipeline's own `build_t2i_text_sample`, so positions and token types line up with what the forward expects. Uniform σ sampling and unweighted `x0`-MSE are enough to learn cleanly; no fancy loss weighting is needed for a first cut.

Because the denoiser is `model.model.language_model` (a stock Qwen3-VL decoder), PEFT targets its attention/MLP linears and freezes everything else:

```python
import torch
from peft import LoraConfig, get_peft_model

# Attention and MLP projections of the language model only; skip the vision tower.
suffixes = ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj")
targets = [
    n for n, m in model.named_modules()
    if isinstance(m, torch.nn.Linear)
    and n.endswith(suffixes)
    and "language_model" in n and "visual" not in n
]
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16, target_modules=targets, lora_dropout=0.0, bias="none"))
```

That's 252 linears, ~44M trainable params at rank 16. The vision encoder, `x_embedder`, `t_embedder`, and `final_layer2` stay frozen. One subtlety: PEFT swaps the `Linear`s in place, so a handle grabbed before `get_peft_model` (`gen = model.model`) still sees the LoRA layers. That is convenient for calling the generation forward directly and for `model.disable_adapter()` A/B renders.

Resolution is not fixed at 2048. The `find_closest_resolution` snapping you see in the pipeline is a quality default (the model is tuned for high res), not an architectural limit: `height`/`width` are free as long as they're multiples of 32.
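To make the multiple-of-32 constraint concrete, here is a minimal sketch of snapping an arbitrary image size onto the patch grid and counting the resulting image tokens. Both helpers (`snap_to_patch_grid`, `image_token_count`) are hypothetical names for illustration, not functions from the HiDream pipeline, and round-to-nearest is just one reasonable snapping rule.

```python
PATCH_SIZE = 32  # patch size quoted from the model code above


def snap_to_patch_grid(height, width, patch=PATCH_SIZE):
    """Round each side to the nearest multiple of `patch`, never below one patch."""
    def snap(v):
        return max(patch, (v + patch // 2) // patch * patch)
    return snap(height), snap(width)


def image_token_count(height, width, patch=PATCH_SIZE):
    """Number of image tokens for a size that already sits on the patch grid."""
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not a multiple of {patch}")
    return (height // patch) * (width // patch)


for h, w in [(2048, 2048), (1024, 1024), (1000, 1500)]:
    sh, sw = snap_to_patch_grid(h, w)
    print(f"{h}x{w} -> {sh}x{sw}, {image_token_count(sh, sw)} image tokens")
```

Because the count is a product of the two sides, halving each side cuts the image-token sequence to a quarter, which is where most of the per-step savings at lower resolution come from.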
Since image tokens scale as `(H/32)·(W/32)`:

| resolution | image tokens | relative attention cost |
|---|---|---|
| 2048² | 4096 | 1× |
| 1024² | 1024 | ~1/16 |

So I train at 1024: ~4× shorter sequences, far less VRAM and time per step. The workflow becomes "iterate cheaply at 1024, upscale the keepers." Aspect ratios are left native (each image snapped to the nearest ×32, batch size 1), so no bucketing is needed, and mixed portrait/landscape actually helps a style LoRA generalize.

For captions, HiDream wants natural-language prose, not danbooru tags (different text encoder lineage). I captioned ~190 images with a local multimodal VLM into one-to-three-sentence descriptions, each prefixed with a trigger phrase so the aesthetic stays prompt-controllable (invoke it when you want it, leave it off otherwise).

The before/after comparisons use the same prompt and the same seed, adapter off vs. on. All samples use the trigger phrase `kotonia style`.

The base model is competent but soft and a bit generic; the LoRA pushes rendering toward a polished modern-anime look (directional lighting, glossier hair and skin, more confident stylization), and it holds across very different subjects (schoolgirl slice-of-life → epic fantasy), so it learned an aesthetic rather than memorizing images.

Across training steps (same prompt, same seed, rank 16, ~190 images), it keeps refining without melting or obvious overfitting even at 2500 steps; the sweet spot is further out than I expected for a set this small. Loss drifts ~0.07 → 0.052.

NSFW content controllability (prompt-gating) was also tested as part of this LoRA: the model produces NSFW only when explicitly prompted, and the LoRA's contribution is primarily visual quality rather than "uncensoring." For the full story, including training data composition, motivation, and NSFW samples, see the [companion article on kotonia.ai](https://kotonia.ai/articles/hidream-o1-lora-why).

The whole trainer is ~150 lines.
Run:

```bash
uv pip install peft
CUDA_VISIBLE_DEVICES=0 python train_lora.py \
  --data_dir /path/to/images \
  --out_dir outputs/lora_run \
  --resolution 1024 --steps 2500 --rank 16 \
  --sample_every 500 --sample_prompt "<trigger>, ..."
```

`--sample_every` renders an adapter on/off pair so you can watch the LoRA bite. Inference loads the base model, applies the adapter with `PeftModel.from_pretrained`, and generates; `disable_adapter` gives you the baseline for free.

The gotchas:

- CPU RAM at load time. `from_pretrained(...).to(device)` materializes the full 8B model in CPU RAM before moving it to GPU; on a 60 GB host alongside other services this got OOM-killed mid-load. `low_cpu_mem_usage=True` streams the shards and fixes it.
- Resolution. Pass `height`/`width` as multiples of 32 and bypass the bucket snapping entirely.
- Detach the run. Launch it under `setsid`/`tmux`/systemd: if it's a child of your editor's terminal, an editor crash takes the run (and any GPU services in sibling terminals) down with it.
- `x0`-param, not v-param. The head predicts `x0` directly; if you assume velocity prediction the loss won't match the head and the LoRA won't converge to the right manifold.

Companion article (the story behind this LoRA): Why I Trained a HiDream-O1 LoRA, on kotonia.ai.

The LoRA is available on [kotonia.ai/studio](https://kotonia.ai/studio) (my own creative platform, where I serve the model alongside the LoRA, free to use). The full trainer code, captioning pipeline, and inference scripts are in the [GitHub repo](https://github.com/zhener562/hage) under `HiDream-O1-Image/`.

If you train something cool with this recipe (a character LoRA, a style LoRA, an NSFW-enhancing LoRA), I'd love to see it. The more community LoRAs exist for O1, the better for everyone.