# Training-Free Single-Image Diffusion Models

> Source: <https://haojunqiu.github.io/efficient-SID/>
> Published: 2026-06-05 22:27:54+00:00



CVPR 2026 Highlight

We consider the problem of generating images whose internal structure—defined by the distribution of patches across multiple scales—matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.

There are multiple reasons why single-image generative modeling is useful. In particular, learning a prior over image patches **imposes strong constraints on the content of a generated image**. Consider combining the style of one image with the structure of another: large models like GPT Image 2 or Nano Banana fill in content drawn from internet-scale training data, so the resulting patch statistics deviate from those of the input images. In contrast, single-image models constrain the output to contain exactly the patches present in the inputs, keeping style and structure closely aligned.

This kind of constraint is valuable in applications where data is scarce. Moreover, single-image modeling provides **clear provenance over the content of the generated image**.

The optimal denoiser in diffusion models has a known closed-form solution: a weighted average over all clean signals in the dataset, where each weight reflects how likely that signal is to have produced the noisy input. For large datasets like ImageNet (millions of high-dimensional images), this sum is intractable, and as the noise level decreases it collapses to memorization of individual training images.

**But for patches of a single image, both problems disappear.** The dataset is finite and low-dimensional — the sum is tractable. And since we denoise *patches*, not whole images, stitching overlapping patch predictions together produces a rich distribution over global images, not a copy of the input.

No neural network. No training. No optimization.
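As a minimal sketch of such a closed-form denoiser, assuming noisy patches are clean patches plus isotropic Gaussian noise of standard deviation `sigma`, the posterior mean is a softmax-weighted average of the reference patches. The function name and array layout are illustrative, not taken from the authors' code.

```python
import numpy as np

def closed_form_denoise(noisy, patches, sigma):
    """Posterior-mean denoiser over a finite patch dataset (illustrative sketch).

    noisy:   (M, D) noisy patches, assumed to be clean patch + sigma * N(0, I)
    patches: (N, D) clean reference patches, i.e. the whole "dataset"
    sigma:   noise standard deviation at the current timestep
    """
    # Squared Euclidean distance between every noisy patch and every clean patch.
    d2 = ((noisy[:, None, :] - patches[None, :, :]) ** 2).sum(axis=-1)  # (M, N)
    # Gaussian log-likelihood of each clean patch, up to an additive constant.
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted average of clean patches.
    return weights @ patches
```

At small `sigma` the weights concentrate on the nearest reference patch; at large `sigma` they approach uniform and the output approaches the mean patch.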

Check out a related tutorial at CVPR 2026: [Analytic Understanding of Diffusion Models](https://analytic-diffusion.github.io).

We extract overlapping patches from the reference image as our dataset. Starting from noise, at each timestep we extract noisy patches, apply the closed-form denoiser, and stitch predictions back into a full image. We generate at a coarse scale first to establish global structure, then progressively add finer scales — preserving layout while adding detail.
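The extract-and-stitch part of this loop can be sketched as follows. Averaging overlapping predictions is an assumption made here for illustration (the text only says predictions are stitched back), and the helper names are hypothetical.

```python
import numpy as np

def extract_patches(image, p, stride=1):
    """Return overlapping p x p patches as flat rows, plus their top-left coordinates."""
    H, W, _ = image.shape
    coords = [(i, j)
              for i in range(0, H - p + 1, stride)
              for j in range(0, W - p + 1, stride)]
    patches = np.stack([image[i:i + p, j:j + p].reshape(-1) for i, j in coords])
    return patches, coords

def stitch_patches(patches, coords, shape, p):
    """Average overlapping patch predictions back into a full image (assumed rule)."""
    acc = np.zeros(shape, dtype=np.float64)
    count = np.zeros(shape[:2] + (1,), dtype=np.float64)
    for patch, (i, j) in zip(patches, coords):
        acc[i:i + p, j:j + p] += patch.reshape(p, p, shape[2])
        count[i:i + p, j:j + p] += 1.0
    # Guard against pixels left uncovered when stride > 1.
    return acc / np.maximum(count, 1.0)
```

In a full sampler, the rows returned by `extract_patches` on the current noisy image would be denoised against the reference patches before being passed to `stitch_patches`; stitching unmodified patches simply reproduces the input.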

The closed-form denoiser gives us the *training-free* part. But for the *efficient* part (scaling to megapixel images and beyond), the naïve O(N²) cost per denoising step, where N is the number of patches, is a bottleneck. Interestingly, all three of our acceleration techniques are inspired by ideas that drove the scaling of large generative models:

We show (Supp. Sec. 3.2) that our closed-form denoiser can be cast exactly as scaled dot-product attention, so no specialized kernel is needed. We simply call FlashAttention off-the-shelf for a ~2× speedup, fully exact.

↳ same kernel that powers large transformers

A pretrained VAE acts as a pure spatial downsampler: 8× per side → 64× fewer patches. Since the denoiser is quadratic, this means ~4096× fewer FLOPs per step.

↳ same idea behind latent diffusion (Stable Diffusion)

Cluster patches, probe nearby clusters only. Far-away patches have negligible attention weight, so we skip them. Reduces O(N²) to O(N^(3/2)).
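The attention equivalence can be checked numerically. The sketch below is one way to see it, assuming Gaussian noise: the query-norm term is constant per row and cancels in the softmax, and the key-norm term is folded in by appending one extra coordinate. This augmentation is an illustrative construction and may differ from the derivation in the supplement.

```python
import numpy as np

def sdpa(q, k, v):
    """Plain scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return (w / w.sum(axis=1, keepdims=True)) @ v

def denoise_as_attention(noisy, patches, sigma):
    """Closed-form denoiser written as one attention call (illustrative)."""
    d_aug = patches.shape[1] + 1
    # Extra key coordinate carries -||k||^2 / 2.
    bias = -0.5 * (patches ** 2).sum(axis=1, keepdims=True)
    keys = np.concatenate([patches, bias], axis=1)
    # Extra query coordinate is 1; the scale undoes attention's 1/sqrt(d) and applies 1/sigma^2.
    ones = np.ones((noisy.shape[0], 1))
    queries = np.concatenate([noisy, ones], axis=1) * (np.sqrt(d_aug) / sigma ** 2)
    return sdpa(queries, keys, patches)

def denoise_direct(noisy, patches, sigma):
    """Reference: softmax over -||x - p||^2 / (2 sigma^2), then average."""
    d2 = ((noisy[:, None] - patches[None]) ** 2).sum(-1)
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return (w / w.sum(axis=1, keepdims=True)) @ patches
```

Once the computation has this form, `sdpa` can in principle be swapped for a fused kernel such as PyTorch's `torch.nn.functional.scaled_dot_product_attention`, subject to that kernel's head-dimension and dtype constraints.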

Together: megapixel in under a second · gigapixel in minutes
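The scaling factors quoted above follow from short arithmetic. The clustering line assumes K equally sized clusters and a constant number c of probed clusters per query; this is an illustrative assumption, not necessarily the exact scheme used in the paper.

```latex
\begin{align*}
\text{latent space:}\quad
  & N' = \frac{N}{8 \cdot 8} = \frac{N}{64},
  \qquad \frac{N^2}{N'^2} = 64^2 = 4096 \\
\text{clustering:}\quad
  & \underbrace{N K}_{\text{queries vs. centroids}}
  + \underbrace{N \, c \, \frac{N}{K}}_{\text{queries vs. probed patches}}
  \;\overset{K = \sqrt{N}}{=}\; (1 + c)\, N^{3/2}
\end{align*}
```

For N = 10^6 patches, this is roughly 10^12 pairwise comparisons per step for the naïve denoiser versus on the order of 10^9 with clustering.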

Our patch-based diffusion framework naturally supports several downstream applications without any architectural changes.

**Retargeting is content-aware resizing:** given a single input image, we generate new images at arbitrary target aspect ratios.
Naive resizing stretches and distorts objects (e.g., penguins become fat!). Our method instead synthesizes new content while preserving the natural proportions and patch statistics of the original.

We generate tileable images by enforcing boundary consistency during diffusion. Each output tiles seamlessly in both directions. Different random seeds yield diverse tileable results from the same input.
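One plausible way to enforce this kind of boundary consistency, shown here purely as an assumption, is to treat the image as periodic: patches that cross the right or bottom edge wrap around to the opposite side, so the denoiser also constrains the seams. The helper names are hypothetical.

```python
import numpy as np

def extract_patches_wrapped(image, p):
    """One p x p patch per pixel, wrapping around the image borders."""
    H, W, _ = image.shape
    padded = np.pad(image, ((0, p - 1), (0, p - 1), (0, 0)), mode="wrap")
    return np.stack([padded[i:i + p, j:j + p].reshape(-1)
                     for i in range(H) for j in range(W)])

def stitch_patches_wrapped(patches, shape, p):
    """Average patch predictions back onto a periodic image."""
    H, W, C = shape
    acc = np.zeros(shape, dtype=np.float64)
    k = 0
    for i in range(H):
        for j in range(W):
            rows = (i + np.arange(p)) % H  # wrap vertically
            cols = (j + np.arange(p)) % W  # wrap horizontally
            acc[np.ix_(rows, cols)] += patches[k].reshape(p, p, C)
            k += 1
    # With periodic wrapping every pixel is covered by exactly p * p patches.
    return acc / (p * p)
```

Under this scheme every pixel, including those on the border, sits in the same number of patches, so the left and right (and top and bottom) edges are treated as neighbours.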

Given a structure image and a style image, we transfer the patch distribution of the style onto the spatial layout of the structure — without any text prompt or neural style-transfer network.

By combining our patch-based denoiser with CLIP guidance, we enable text-driven style transfer that preserves the internal patch statistics of the input.


```
@inproceedings{qiu2026efficientsid,
  title     = {Efficient and Training-Free Single-Image Diffusion Models},
  author    = {Qiu, Haojun and Kutulakos, Kiriakos N. and Lindell, David B.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```


