{"slug": "pid-fast-and-high-resolution-latent-decoding-with-pixel-diffusion", "title": "PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion", "summary": "Researchers have developed PiD, a pixel diffusion decoder that directly transforms latent representations into high-resolution images, bypassing the traditional decode-then-super-resolve pipeline. The system decodes latents of 512×512 images into 2048×2048 pixel images in under one second on a consumer RTX 5090 GPU and runs up to 5.9× faster than cascaded diffusion-based super-resolution methods while improving visual fidelity. PiD unifies decoding and upsampling into a single generative module, enabling 4× and 8× upscaling with low latency, and works with both conventional VAE latents and semantic latents.", "body_md": "# PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion\n\n*TL;DR: PiD directly decodes latent representations into high-resolution images, replacing the decode-then-super-resolve cascade while achieving lower latency and higher visual quality.*\n\n## Abstract\n\nMost practical high-resolution text-to-image systems rely on latent diffusion models, where generation is performed in a compact latent space and a decoder maps latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce **PiD**, a **Pi**xel diffusion **D**ecoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, **PiD** synthesizes 4× and even 8× upscaled images with low latency. 
For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling **PiD** to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. **PiD** applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. **PiD** decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.\n\n## Results\n\nFigure panels: From Latent to Pixels; 4K Decode; Baseline Comparison.\n\n### Quantitative Results (Decoding + Upsampling, 512² → 2048²)\n\nPiD is up to **5.9× faster** than SeedVR2 (211.2 ms vs 1237.5 ms).\n\nChart: % of evaluations where judges prefer **PiD** over each baseline.", "url": "https://wpnews.pro/news/pid-fast-and-high-resolution-latent-decoding-with-pixel-diffusion", "canonical_source": "https://research.nvidia.com/labs/sil/projects/pid/", "published_at": "2026-05-25 15:23:18+00:00", "updated_at": "2026-05-25 15:37:50.815265+00:00", "lang": "en", "topics": ["generative-ai", "computer-vision", "machine-learning", "neural-networks", "ai-research"], "entities": ["PiD", "Pixel Diffusion Decoder", "DMD2", "SigLI"], "alternates": {"html": "https://wpnews.pro/news/pid-fast-and-high-resolution-latent-decoding-with-pixel-diffusion", "markdown": "https://wpnews.pro/news/pid-fast-and-high-resolution-latent-decoding-with-pixel-diffusion.md", "text": "https://wpnews.pro/news/pid-fast-and-high-resolution-latent-decoding-with-pixel-diffusion.txt", "jsonld": "https://wpnews.pro/news/pid-fast-and-high-resolution-latent-decoding-with-pixel-diffusion.jsonld"}}