# One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own

> Source: <https://dev.to/shinji_shimizu_bb51276a5e/one-of-the-first-public-hidream-o1-image-loras-and-how-to-train-your-own-5hd1>
> Published: 2026-05-26 23:45:33+00:00

[HiDream-O1-Image](https://huggingface.co/HiDream-ai/HiDream-O1-Image) is one of the strongest open-weight text-to-image models out right now (it debuted around **#8 in the Artificial Analysis T2I Arena**). But it shipped **inference-only**, and because its architecture is radically different from SDXL/Flux — no VAE, no separate text encoder, everything is one unified transformer — the usual LoRA trainers can't touch it.

This post is **one of the first publicly documented LoRA training runs and general-purpose visual-enhancement LoRAs for HiDream-O1-Image**. I'll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don't fit, how I reverse-engineered a working training loop from the *inference* code alone, and the ~150-line trainer that produces a clean aesthetic LoRA. Plus the gotchas that cost me a night.

**What this LoRA is:** a general-purpose anime / semi-real visual enhancement LoRA — it improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. It's not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.

The short version of the recipe:

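- Treat the image as raw pixel patches `x0`, normalized to `[-1,1]` (there is no VAE latent).
- Noise it as `z_t = (1 - σ)·x0 + σ·(8.0·ε)` and feed the model timestep `1 - σ`.
- Minimize `MSE(x_pred, x0)` on the image-token positions.
- Attach the LoRA with PEFT to the linears of the `Qwen3-VL` decoder.

To set expectations honestly: I'm not claiming "world's first LoRA file for O1."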

What I *didn't* find: **a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image.** If you know of one, I'd genuinely love to see it — the more the merrier. But as of publication, this appears to be one of the first, and the first with before/after documentation and a full open training recipe.

Most LoRA trainers assume the SDXL/Flux shape: a **UNet/DiT** denoiser + a **VAE** + one or two **text encoders**, all separate modules wired together by `diffusers`. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.

HiDream-O1-Image is a **Pixel-level Unified Transformer (UiT)**. From its own description:

> a natively unified image generative foundation model built on a Pixel-level Unified Transformer without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.

Concretely (reading `models/qwen3_vl_transformers.py`):

- The backbone is a `Qwen3VLForConditionalGeneration` model.
- Images are patchified with `PATCH_SIZE = 32`, so an `H×W` image becomes `(H/32)·(W/32)` tokens, each a `3·32·32 = 3072`-dim vector of raw pixels.
- `x_embedder` projects the noised patch tokens into the hidden space; a `final_layer2` head projects hidden states back to patch space; a `t_embedder` injects the timestep at a dedicated `<|tms_token|>` position.
- Sampling runs through a flow-matching solver (`fm_solvers_unipc.py`), and image tokens get handled differently from text tokens (`token_types` controls this).

So none of kohya/ai-toolkit/SimpleTuner can touch it: there's no UNet, no VAE, no separate text encoder for them to hook. That's exactly *why* there are no articles. It's a new architecture, released inference-only.

The good news: because the backbone is a **plain transformers model**, the LoRA side is easy, since PEFT can wrap its `nn.Linear`s natively. The hard part is the training objective, which has to be recovered from the inference code.

The inference loop (`models/pipeline.py:generate_image`) tells you everything. Per denoising step it does roughly:

```
sigma = step_t / 1000.0                       # noise level, in (0, 1]
t_pixeldit = 1.0 - sigma                       # what the model receives as "timestep"
x_pred = model(..., vinputs=z, timestep=t_pixeldit).x_pred
v = (x_pred - z) / sigma                        # ... and -v is fed to the FM scheduler
```

Two facts fall out of this:

1. `x_pred` is the model's prediction of the clean image `x0`. If `z_t = (1-σ)·x0 + σ·ε`, then `(x_pred - z_t)/σ = x0 - ε = -(ε - x0)`, and `ε - x0` is exactly the rectified-flow velocity the `FlowMatch` scheduler expects. Consistent ⇒ the head is `x0`-parameterized.
2. The pipeline starts from `z = NOISE_SCALE · randn` with `NOISE_SCALE = 8.0`, while `x0` lives in `[-1, 1]`. So the interpolation the model was trained on is `z_t = (1-σ)·x0 + σ·(8.0·ε)`.

That gives the entire training step:

```
sigma = random.uniform(T_EPS, 1.0)
eps   = torch.randn_like(x0)
z_t   = (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)   # NOISE_SCALE = 8.0
t     = torch.tensor([1.0 - sigma])

out    = gen(input_ids=ids, position_ids=pos, vinputs=z_t,
             timestep=t, token_types=tt)
x_pred = out.x_pred[0, vinput_mask[0]]      # image-token positions only
loss   = F.mse_loss(x_pred.float(), x0[0].float())
```

`x0` is just the image, normalized to `[-1,1]` and patchified with the same `einops` rearrange the pipeline uses for reference images. The token layout (prompt → `<|boi_token|>` → `<|tms_token|>` → image tokens) is built by reusing the pipeline's own `build_t2i_text_sample`, so positions and `token_types` line up with what the forward expects.
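To make the patch layout concrete, here is a minimal NumPy sketch of turning an image into `(H/32)·(W/32)` tokens of 3072 raw pixel values each. It is an illustration only: the function names are hypothetical, and the channel-first ordering inside each patch is an assumption, so the real pipeline's `einops` pattern may order the 3072 values differently.

```python
import numpy as np

PATCH_SIZE = 32  # value quoted in the post


def to_model_range(img_uint8):
    """(H, W, 3) uint8 image -> (3, H, W) float32 array in [-1, 1]."""
    x = img_uint8.astype(np.float32) / 127.5 - 1.0
    return x.transpose(2, 0, 1)


def patchify(img):
    """(3, H, W) -> (num_patches, 3 * PATCH_SIZE * PATCH_SIZE).

    Assumed layout: patches in row-major order, channel-first inside a patch.
    """
    c, h, w = img.shape
    if h % PATCH_SIZE or w % PATCH_SIZE:
        raise ValueError("height and width must be multiples of PATCH_SIZE")
    gh, gw = h // PATCH_SIZE, w // PATCH_SIZE
    x = img.reshape(c, gh, PATCH_SIZE, gw, PATCH_SIZE)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, c, p, p)
    return x.reshape(gh * gw, c * PATCH_SIZE * PATCH_SIZE)


def unpatchify(tokens, h, w):
    """Inverse of patchify, useful for turning x_pred back into an image."""
    gh, gw = h // PATCH_SIZE, w // PATCH_SIZE
    x = tokens.reshape(gh, gw, 3, PATCH_SIZE, PATCH_SIZE)
    x = x.transpose(2, 0, 3, 1, 4)  # (c, gh, p, gw, p)
    return x.reshape(3, h, w)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 96, 3), dtype=np.uint8)
x0_img = to_model_range(image)
x0_tokens = patchify(x0_img)
print(x0_tokens.shape)  # (6, 3072)
```

The round trip through `unpatchify` is a cheap sanity check to run before training, because a mismatched patch layout would show up as a scrambled image rather than as an error.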

Uniform `σ` sampling and unweighted `x0`-MSE are enough to learn cleanly; no fancy loss weighting is needed for a first cut.
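The `x0`-parameterization argument can be checked numerically without the model. The sketch below uses plain Python lists as a stand-in for tensors and pretends the model is perfect (`x_pred == x0`); the helper names are hypothetical. With the scaled noise, the velocity recovered by the inference formula should equal `x0 - 8.0·ε`.

```python
import random

NOISE_SCALE = 8.0  # value quoted in the post


def noise_sample(x0, eps, sigma):
    """z_t = (1 - sigma) * x0 + sigma * (NOISE_SCALE * eps), elementwise."""
    return [(1.0 - sigma) * x + sigma * NOISE_SCALE * e for x, e in zip(x0, eps)]


def velocity_from_x_pred(x_pred, z_t, sigma):
    """v = (x_pred - z_t) / sigma, the quantity computed in the inference loop."""
    return [(p - z) / sigma for p, z in zip(x_pred, z_t)]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # toy "clean image"
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]     # unit Gaussian noise
sigma = 0.3

z_t = noise_sample(x0, eps, sigma)
v = velocity_from_x_pred(x0, z_t, sigma)          # perfect prediction: x_pred == x0
expected = [x - NOISE_SCALE * e for x, e in zip(x0, eps)]
max_err = max(abs(a - b) for a, b in zip(v, expected))
perfect_loss = mse(x0, x0)
```

Here `max_err` comes out at floating-point noise level and `perfect_loss` is zero, which is the consistency the training step relies on: a head that outputs `x0` yields the velocity the scheduler consumes.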

Because the denoiser is `model.model.language_model` (a stock Qwen3-VL decoder), PEFT targets its attention/MLP linears and freezes everything else:

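```
import torch
from peft import LoraConfig, get_peft_model

# `model` is the already-loaded base model.
# Collect the attention and MLP projections of the language-model decoder only;
# anything under the vision tower ("visual") is skipped and stays frozen.
targets = [n for n, m in model.named_modules()
           if isinstance(m, torch.nn.Linear)
           and n.endswith(("q_proj","k_proj","v_proj","o_proj",
                           "gate_proj","up_proj","down_proj"))
           and "language_model" in n and "visual" not in n]

# Wrap the selected linears with rank-16 adapters; PEFT freezes the base weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16, target_modules=targets, lora_dropout=0.0, bias="none"))
```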

That's **252 linears, ~44M trainable params** at rank 16. The vision encoder, `x_embedder`, `t_embedder`, and `final_layer2` stay frozen. One subtlety: PEFT swaps the `Linear`s **in place**, so a handle grabbed before `get_peft_model` (`gen = model.model`) still sees the LoRA layers, which is convenient for calling the generation forward directly and for `model.disable_adapter()` A/B renders.
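The 252-linear, ~44M-parameter figure can be reproduced with back-of-the-envelope arithmetic. The decoder dimensions below are an assumption (a Qwen3-8B-class configuration), not values read from the HiDream repository; they are used here because they reproduce the numbers quoted above.

```python
# Assumed decoder dimensions (Qwen3-8B-class config); not read from the HiDream repo.
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 16

# (in_features, out_features) for each targeted projection in one decoder layer.
LAYER_LINEARS = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """One adapter adds A (rank x in) and B (out x rank) next to a frozen linear."""
    return rank * (in_features + out_features)


num_linears = NUM_LAYERS * len(LAYER_LINEARS)
per_layer = sum(lora_params(i, o, RANK) for i, o in LAYER_LINEARS.values())
total = NUM_LAYERS * per_layer

print(num_linears)  # 252
print(total)        # 43646976, i.e. about 44M
```

Because the count is linear in `RANK`, doubling the rank to 32 would double the trainable parameters under the same assumptions.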

**Resolution is not fixed at 2048.** The `find_closest_resolution()` snapping you see in the pipeline is a *quality default* (the model is tuned for high res), not an architectural limit: `height`/`width` are free as long as they're multiples of 32. Image tokens scale as `(H/32)·(W/32)`:

| resolution | image tokens | relative attention cost |
|---|---|---|
| 2048² | 4096 | 1× |
| 1024² | 1024 | ~1/16 |

So I train at **1024**: ~4× shorter sequences, far less VRAM and time per step. The workflow becomes "iterate cheaply at 1024, upscale the keepers." Aspect ratios are left native (each image snapped to the nearest ×32, batch size 1) — no bucketing needed, and mixed portrait/landscape actually helps a style LoRA generalize.
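The token counts and the cost ratio in the table follow directly from the patch size. The sketch below is illustrative: the helper names are hypothetical, the snapping rule (round each side to the nearest multiple of 32) is one plausible reading of "snapped to the nearest ×32", and the cost estimate ignores the prompt tokens that share the sequence.

```python
PATCH_SIZE = 32  # value quoted in the post


def snap_to_patch_multiple(size):
    """Round a side length to the nearest multiple of PATCH_SIZE (at least one patch)."""
    return max(PATCH_SIZE, ((size + PATCH_SIZE // 2) // PATCH_SIZE) * PATCH_SIZE)


def image_tokens(height, width):
    """Number of image tokens for a resolution that is a multiple of PATCH_SIZE."""
    return (height // PATCH_SIZE) * (width // PATCH_SIZE)


def relative_attention_cost(height, width, ref=(2048, 2048)):
    """Self-attention cost grows with the square of the sequence length."""
    return (image_tokens(height, width) / image_tokens(*ref)) ** 2


print(image_tokens(2048, 2048))             # 4096
print(image_tokens(1024, 1024))             # 1024
print(relative_attention_cost(1024, 1024))  # 0.0625
print(snap_to_patch_multiple(1000), snap_to_patch_multiple(1080))  # 992 1088
```

The 1/16 figure applies to the attention term only; the per-token linear layers scale with sequence length, so they shrink by roughly 4× rather than 16×.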

For captions, HiDream wants **natural-language prose**, not danbooru tags (different text encoder lineage). I captioned ~190 images with a local multimodal VLM into one-to-three-sentence descriptions, each prefixed with a trigger phrase so the aesthetic stays **prompt-controllable** (invoke it when you want it, leave it off otherwise).

Same prompt, same seed, adapter off vs on. All samples use the trigger phrase `kotonia style`.

The base model is competent but soft and a bit generic; the LoRA pushes rendering toward a polished modern-anime look — directional lighting, glossier hair and skin, more confident stylization — and it holds across very different subjects (schoolgirl slice-of-life → epic fantasy), so it learned an *aesthetic* rather than memorizing images.

Same prompt, same seed, rank 16, ~190 images:

It keeps **refining** without melting or obvious overfitting even at 2500 steps — the sweet spot is further out than I expected for a set this small. (Loss drifts ~0.07 → 0.052.)

NSFW content controllability (prompt-gating) was also tested as part of this LoRA — the model produces NSFW only when explicitly prompted, and the LoRA's contribution is primarily visual quality rather than "uncensoring." For the full story including training data composition, motivation, and NSFW samples, see the [companion article on kotonia.ai](https://kotonia.ai/articles/hidream-o1-lora-why).

The whole trainer is ~150 lines. Run:

```
uv pip install peft
CUDA_VISIBLE_DEVICES=0 python train_lora.py \
  --data_dir /path/to/images \
  --out_dir outputs/lora_run \
  --resolution 1024 --steps 2500 --rank 16 \
  --sample_every 500 --sample_prompt "<trigger>, ..."
```

`--sample_every` renders an adapter on/off pair so you can watch the LoRA bite. Inference loads the base model, applies the adapter with `PeftModel.from_pretrained`, and generates; `disable_adapter()` gives you the baseline for free.

The gotchas that cost me a night:

- `from_pretrained(...).to(device)` materializes the full 8B model in CPU RAM before moving it to GPU; on a 60 GB host alongside other services this got OOM-killed mid-load. `low_cpu_mem_usage=True` streams the shards and fixes it.
- Pass explicit `height`/`width` (multiples of 32) and bypass the bucket snapping entirely.
- Run training under `setsid`/`tmux`/systemd: if it's a child of your editor's terminal, an editor crash takes the run (and any GPU services in sibling terminals) down with it.
- The head is `x0`-param, not v-param. The model predicts `x0` directly; if you assume velocity prediction the loss won't match the head and the LoRA won't converge to the right manifold.

*Companion article (the story behind this LoRA): Why I Trained a HiDream-O1 LoRA, on kotonia.ai.*

The LoRA is available on [kotonia.ai/studio](https://kotonia.ai/studio) (my own creative platform where I serve the model alongside the LoRA, free to use). The full trainer code, captioning pipeline, and inference scripts are in the [GitHub repo](https://github.com/zhener562/hage) under `HiDream-O1-Image/`.

If you train something cool with this recipe — a character LoRA, a style LoRA, an NSFW-enhancing LoRA — I'd love to see it. The more community LoRAs exist for O1, the better for everyone.