{"slug": "one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own", "title": "One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own", "summary": "A developer has released one of the first publicly documented LoRA training runs and general-purpose visual-enhancement LoRAs for HiDream-O1-Image, a top-ranked open-weight text-to-image model that shipped inference-only. Because HiDream-O1-Image uses a Pixel-level Unified Transformer architecture without a VAE or separate text encoder, standard LoRA trainers like kohya and SimpleTuner cannot be used, so the developer reverse-engineered a working training loop from the inference code alone. The resulting ~150-line trainer produces a clean aesthetic LoRA that improves rendering quality, lighting, and stylization across diverse subjects, with a full open training recipe and before/after documentation.", "body_md": "[HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) is one of the strongest open-weight text-to-image models out right now (it debuted around **#8 in the Artificial Analysis T2I Arena**). But it shipped **inference-only**, and because its architecture is radically different from SDXL/Flux — no VAE, no separate text encoder, everything is one unified transformer — the usual LoRA trainers can't touch it.\n\nThis post is **one of the first publicly documented LoRA training runs and general-purpose visual-enhancement LoRAs for HiDream-O1-Image**. I'll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don't fit, how I reverse-engineered a working training loop from the *inference* code alone, and the ~150-line trainer that produces a clean aesthetic LoRA. Plus the gotchas that cost me a night.\n\n**What this LoRA is:** a general-purpose anime / semi-real visual enhancement LoRA — it improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. 
It's not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.\n\nThe short version of the recipe:\n\n- Patchify each image into `x0`, raw pixel patches (normalized to `[-1,1]`).\n- Noise it as `z_t = (1 - σ)·x0 + σ·(8.0·ε)` and feed the model timestep `1 - σ`.\n- Minimize `MSE(x_pred, x0)` on the image-token positions.\n- Put the LoRA on the linear layers of the `Qwen3-VL` decoder.\n\nTo set expectations honestly: I'm not claiming \"world's first LoRA file for O1.\"\n\nWhat I *didn't* find: **a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image.** If you know of one, I'd genuinely love to see it — the more the merrier. But as of publication, this appears to be one of the first, and the first with before/after documentation and a full open training recipe.\n\nMost LoRA trainers assume the SDXL/Flux shape: a **UNet/DiT** denoiser + a **VAE** + one or two **text encoders**, all separate modules wired together by `diffusers`. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.\n\nHiDream-O1-Image is a **Pixel-level Unified Transformer (UiT)**. 
From its own description:\n\n> a natively unified image generative foundation model built on a Pixel-level Unified Transformer without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.\n\nConcretely (reading `models/qwen3_vl_transformers.py`):\n\n- The backbone is a `Qwen3VLForConditionalGeneration`.\n- `PATCH_SIZE = 32`, so an `H×W` image becomes `(H/32)·(W/32)` tokens, each a `3·32·32 = 3072`-dim vector of raw pixels.\n- An `x_embedder` projects the noised patch tokens into the hidden space; a `final_layer2` head projects hidden states back to patch space; a `t_embedder` injects the timestep at a dedicated `<|tms_token|>` position.\n- Sampling uses a flow-matching solver (`fm_solvers_unipc.py`), and image tokens get [...] (`token_types` controls).\n\nSo none of kohya/ai-toolkit/SimpleTuner can touch it — there's no UNet, no VAE, no separate text encoder for them to hook. That's exactly *why* there are no articles: it's a new architecture, released inference-only.\n\nThe good news: because the backbone is a **plain transformers model**, the LoRA side is easy, since PEFT wraps its `nn.Linear`s natively. The hard part is the training objective.\n\nThe inference loop (`models/pipeline.py:generate_image`) tells you everything. Per denoising step it does roughly:\n\n```\nsigma = step_t / 1000.0                       # noise level, in (0, 1]\nt_pixeldit = 1.0 - sigma                       # what the model receives as \"timestep\"\nx_pred = model(..., vinputs=z, timestep=t_pixeldit).x_pred\nv = (x_pred - z) / sigma                        # ... and -v is fed to the FM scheduler\n```\n\nTwo facts fall out of this:\n\n- `x_pred` is the model's prediction of the clean image `x0`. If `z_t = (1-σ)·x0 + σ·ε` then `(x_pred - z_t)/σ = x0 - ε = -(ε - x0)`, and `ε - x0` is exactly the rectified-flow velocity the `FlowMatch` scheduler expects. 
Consistent ⇒ the head is `x0`-parameterized.\n- Sampling starts from `z = NOISE_SCALE · randn` with `NOISE_SCALE = 8.0`, while `x0` lives in `[-1, 1]`. So the interpolation the model was trained on is `z_t = (1-σ)·x0 + σ·(8.0·ε)`.\n\nThat gives the entire training step:\n\n```\nimport random\nimport torch\nimport torch.nn.functional as F\n\nsigma = random.uniform(T_EPS, 1.0)           # noise level, uniform in [T_EPS, 1]\neps   = torch.randn_like(x0)\nz_t   = (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)   # NOISE_SCALE = 8.0\nt     = torch.tensor([1.0 - sigma])          # the model's \"timestep\" is 1 - sigma\n\nout    = gen(input_ids=ids, position_ids=pos, vinputs=z_t,\n             timestep=t, token_types=tt)\nx_pred = out.x_pred[0, vinput_mask[0]]      # image-token positions only\nloss   = F.mse_loss(x_pred.float(), x0[0].float())\n```\n\n`x0` is just the image, normalized to `[-1,1]` and patchified with the same `einops` rearrange the pipeline uses for reference images. The token layout (prompt → `<|boi_token|>` → `<|tms_token|>` → image tokens) is built by reusing the pipeline's own `build_t2i_text_sample`, so positions and `token_types` line up with what the forward expects.\n\nUniform `σ` sampling and unweighted `x0`-MSE are enough to learn cleanly — no fancy loss weighting needed for a first cut.\n\nBecause the denoiser is `model.model.language_model` (a stock Qwen3-VL decoder), PEFT targets its attention/MLP linears and freezes everything else:\n\n```\nimport torch\nfrom peft import LoraConfig, get_peft_model\n\n# Collect the attention/MLP linears of the language model only (skip the vision tower).\ntargets = [n for n, m in model.named_modules()\n           if isinstance(m, torch.nn.Linear)\n           and n.endswith((\"q_proj\",\"k_proj\",\"v_proj\",\"o_proj\",\n                           \"gate_proj\",\"up_proj\",\"down_proj\"))\n           and \"language_model\" in n and \"visual\" not in n]\n\nmodel = get_peft_model(model, LoraConfig(\n    r=16, lora_alpha=16, target_modules=targets, lora_dropout=0.0, bias=\"none\"))\n```\n\nThat's **252 linears, ~44M trainable params** at rank 16. The vision encoder, `x_embedder`, `t_embedder`, and `final_layer2` stay frozen. 
One subtlety: PEFT swaps the `Linear`s **in place**, so a handle grabbed before `get_peft_model` (`gen = model.model`) still sees the LoRA layers — convenient for calling the generation forward directly and for `model.disable_adapter()` A/B renders.\n\n**Resolution is not fixed at 2048.** The `find_closest_resolution()` snapping you see in the pipeline is a *quality default* (the model is tuned for high res), not an architectural limit — `height`/`width` are free as long as they're multiples of 32. Since image tokens scale as `(H/32)·(W/32)`:\n\n| resolution | image tokens | relative attention cost |\n|---|---|---|\n| 2048² | 4096 | 1× |\n| 1024² | 1024 | ~1/16 |\n\nSo I train at **1024**: ~4× shorter sequences, far less VRAM and time per step. The workflow becomes \"iterate cheaply at 1024, upscale the keepers.\" Aspect ratios are left native (each image snapped to the nearest ×32, batch size 1) — no bucketing needed, and mixed portrait/landscape actually helps a style LoRA generalize.\n\nFor captions, HiDream wants **natural-language prose**, not danbooru tags (different text encoder lineage). I captioned ~190 images with a local multimodal VLM into one-to-three-sentence descriptions, each prefixed with a trigger phrase so the aesthetic stays **prompt-controllable** (invoke it when you want it, leave it off otherwise).\n\nSame prompt, same seed, adapter off vs on. 
All samples use the trigger phrase `kotonia style`.\n\nThe base model is competent but soft and a bit generic; the LoRA pushes rendering toward a polished modern-anime look — directional lighting, glossier hair and skin, more confident stylization — and it holds across very different subjects (schoolgirl slice-of-life → epic fantasy), so it learned an *aesthetic* rather than memorizing images.\n\nSame prompt, same seed, rank 16, ~190 images.\n\nIt keeps **refining** without melting or obvious overfitting even at 2500 steps — the sweet spot is further out than I expected for a set this small. (Loss drifts ~0.07 → 0.052.)\n\nNSFW content controllability (prompt-gating) was also tested as part of this LoRA — the model produces NSFW only when explicitly prompted, and the LoRA's contribution is primarily visual quality rather than \"uncensoring.\" For the full story including training data composition, motivation, and NSFW samples, see the [companion article on kotonia.ai](https://kotonia.ai/articles/hidream-o1-lora-why).\n\nThe whole trainer is ~150 lines. Run:\n\n```\nuv pip install peft\nCUDA_VISIBLE_DEVICES=0 python train_lora.py \\\n  --data_dir /path/to/images \\\n  --out_dir outputs/lora_run \\\n  --resolution 1024 --steps 2500 --rank 16 \\\n  --sample_every 500 --sample_prompt \"<trigger>, ...\"\n```\n\n`--sample_every` renders an adapter on/off pair so you can watch the LoRA bite. Inference loads the base model, applies the adapter with `PeftModel.from_pretrained`, and generates — `disable_adapter()` gives you the baseline for free.\n\n- `from_pretrained(...).to(device)` materializes the full 8B model in CPU RAM before moving it to GPU; on a 60 GB host alongside other services this got OOM-killed mid-load. 
`low_cpu_mem_usage=True` streams the shards and fixes it.\n- Set explicit `height`/`width` (multiples of 32) and bypass the bucket snapping entirely.\n- Detach the run with `setsid`/`tmux`/systemd — if it's a child of your editor's terminal, an editor crash takes the run (and any GPU services in sibling terminals) down with it.\n- The head is `x0`-param, not v-param. It predicts `x0` directly; if you assume velocity prediction the loss won't match the head and the LoRA won't converge to the right manifold.\n\n*Companion article (the story behind this LoRA): Why I Trained a HiDream-O1 LoRA — on kotonia.ai.*\n\nThe LoRA is available on [kotonia.ai/studio](https://kotonia.ai/studio) (my own creative platform where I serve the model alongside the LoRA, free to use). The full trainer code, captioning pipeline, and inference scripts are in the [GitHub repo](https://github.com/zhener562/hage) under `HiDream-O1-Image/`.\n\nIf you train something cool with this recipe — a character LoRA, a style LoRA, an NSFW-enhancing LoRA — I'd love to see it. 
The more community LoRAs exist for O1, the better for everyone.", "url": "https://wpnews.pro/news/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own-5hd1", "published_at": "2026-05-26 23:45:33+00:00", "updated_at": "2026-05-27 00:03:30.749861+00:00", "lang": "en", "topics": ["generative-ai", "machine-learning", "ai-research", "computer-vision", "ai-tools"], "entities": ["HiDream-O1-Image", "Artificial Analysis T2I Arena", "kohya", "ai-toolkit", "SimpleTuner", "Qwen3-VL", "HiDream-ai"], "alternates": {"html": "https://wpnews.pro/news/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own", "markdown": "https://wpnews.pro/news/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own.md", "text": "https://wpnews.pro/news/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own.txt", "jsonld": "https://wpnews.pro/news/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own.jsonld"}}