{"slug": "fastwan-qad-generating-a-5-second-video-in-1-78-seconds-on-a-single-nvidia-rtx", "title": "FastWan-QAD: Generating a 5-Second Video in 1.78 Seconds on a Single NVIDIA RTX 5090 via Quantization-Aware Distillation", "summary": "FastVideo released FastWan-QAD, a family of video generation models that can produce a 5-second 480p video in 1.78 seconds on a single NVIDIA RTX 5090 using quantization-aware distillation. The models achieve this speed by combining NVFP4 linear layers with FP4 attention backends and kernel fusion, while maintaining quality through a two-stage training process. The release includes three checkpoints targeting different hardware, all under the Apache-2.0 license.", "body_md": "**TL;DR:** **5 seconds of video. 1.78 seconds of generation. One RTX 5090.**\nFastVideo introduces **FastWan-QAD**, a family of video generation models trained with a new recipe we term **Quantization-Aware Distillation (QAD)**. Powered by FastVideo, we push a single RTX 5090 to its absolute limit: generating a 5-second 480p video in **1.78s end-to-end**, outperforming both TurboDiffusion and LightX2V. Our flagship model targets native NVFP4 for the RTX 5090. We are also releasing a variant that uses FP8 linear layers to extend support to the RTX 4090.\n\n## What We Are Releasing\n\nWe are excited to release three distilled checkpoints of Wan2.1-T2V-1.3B, alongside our full QAD training recipe and inference code:\n\n**FastWan-QAD-1.3B**: Designed for NVIDIA GPUs with native NVFP4 tensor cores. 
It combines NVFP4 linear layers with our modified SageAttention3 FP4 backend, achieving **1.78 seconds** end-to-end for a 5-second 480p video.\n\n**FastWan-QAD-1.3B-SA2**: Also uses NVFP4 linear layers but integrates SageAttention2++ instead, reaching **2.01 seconds** end-to-end for a 5-second 480p video with higher quality than the SageAttention3 variant.\n\n**FastWan-QAD-FP8-1.3B**: Built for previous-generation GPUs lacking FP4 tensor cores, specifically the RTX 4090. It swaps in FP8 linear layers alongside SageAttention2++ and is trained using the exact same QAD recipe as the Blackwell models.\n\nAll resources, including weights and scripts, are released under the **Apache-2.0** license.\n\n| Model Name | Checkpoint | Target Hardware | Precision (Linear + Attn) | Tier |\n|---|---|---|---|---|\n| `FastWan-QAD-1.3B` | | GPUs with native NVFP4 (RTX 5090) | NVFP4 + FP4 (SageAttention3) | **Flagship**: minimal latency via native NVFP4, 1.78s for a 5s 480p video. |\n| `FastWan-QAD-1.3B-SA2` | [Huggingface Model](https://huggingface.co/FastVideo/FastWan-QAD-1.3B-SA2) | GPUs with native NVFP4 (RTX 5090) | NVFP4 + FP8 (SageAttention2++) | **Alternative**: sharpest video quality at minimal latency cost, 2.01s for a 5s 480p video. |\n| `FastWan-QAD-FP8-1.3B` | [Huggingface Model](https://huggingface.co/FastVideo/FastWan-QAD-FP8-1.3B) | RTX 4090 | FP8 + FP8 (SageAttention2++) | **Compatibility**: full 8-bit pipeline fallback, 3.4s for a 5s 480p video. |\n\n## The Inference Stack\n\nWe achieve this performance by attacking every layer of the stack: precision, attention, kernel fusion, compilation, and decoding. To maximize video quality, we avoid sparse routing entirely and keep attention 100% dense, scaling the pipeline down to aggressive low-bit precisions across the three hardware-targeted configurations above.\n\n**Quantize Everything.** Every major linear layer in the DiT is quantized to its hardware-specific low-bit representation (NVFP4 or FP8), with activations quantized on the fly. 
We match the linear precision with either an FP4 (SageAttention3) or FP8 (SageAttention2++) dense attention backend.\n\n**Quantization-Aware Distillation.** None of these speedups matter if visual quality collapses, and naive low-bit attention (especially NVFP4) visibly degrades video. We recover the base model’s quality with a two-stage QAT recipe: a quantization-aware finetune that matches the target precision matrix, followed by quantization-aware DMD distillation down to just **3 sampling steps**. Throughout distillation, the attention path uses fake quantization in the backward pass following our [Attn-QAT](https://haoailab.com/blogs/attn-qat/) method, forcing the model to adapt to low-bit attention errors during training. We also found that the best training data differs by checkpoint: FastWan-QAD-1.3B is distilled on **real video data (Mixkit)**, while the SA2 and FP8 variants use our **synthetic Wan2.1-14B data**. We determined this split empirically: each configuration reaches its highest quality with the corresponding data source.\n\n**Kernel Fusion.** A large fraction of wall-clock time in a small DiT is the “glue” around the matmuls: ops like LayerNorm, AdaLN modulation, residual adds, and gating. We fuse these into single kernels: one pass for the pre-attention modulated norm, and a combined gated-residual-add + norm + scale + shift for the post-attention path. This collapses what were many small memory-bound launches per block into a couple of fused ops. On top of this, the DiT, text encoder, and decoder are fully compiled to eliminate launch and Python runtime overhead.\n\n**Fast Decoding and No CFG.** For decoding we swap the full Wan VAE for [TAEHV](https://github.com/madebyollin/taehv), a tiny autoencoder, removing the VAE as a latency bottleneck. 
We run the 3 sampling steps with CFG disabled, halving the per-step transformer cost. This is the final ingredient that brings the full pipeline to 1.78 seconds.\n\n## Comparison\n\nWe evaluate video generation on **a single RTX 5090 GPU**. End-to-end times below cover the full generation pipeline.\n\n| Method | E2E Time |\n|---|---|\n| Original Wan2.1-1.3B | 170s |\n| TurboDiffusion | 6.10s |\n| LightX2V Wan-NVFP4 | 6.91s |\n| FastWan-QAD (Ours) | 1.78s |\n\nA 4-way qualitative comparison across TurboDiffusion, LightX2V, our FP4 attention + FP4 linear model, and our FP8 attention + FP4 linear model, all generating **5-second 480p videos** on a single **RTX 5090**.\n\n| TurboDiffusion | LightX2V | FastWan-QAD | FastWan-QAD-SA2 |\n|---|---|---|---|\n\n## How to Run\n\nComing soon: we are waiting on the FastVideo code merge. Inference instructions and scripts will be added here once available.\n\n## Next Steps\n\nOptimizing the 1.3B architecture is just the beginning. We are actively extending the QAD recipe to larger frontier models, including Wan2.1-14B and the NVIDIA Cosmos 2.5 / 3 families. Furthermore, we are exploring image-to-video (I2V) distillation to bring **Dreamverse**, our interactive vibe directing workspace, off enterprise hardware and directly onto consumer GPUs. Stay tuned!\n\nWe welcome and value any feedback, contributions, and collaboration. If you have a feature or model request for Dreamverse, feel free to join our [Slack](https://join.slack.com/t/fastvideo/shared_invite/zt-412taon6b-~Ijpdj2UCeJPDjdgve~r3A) channel or submit an issue at our [repo](https://github.com/hao-ai-lab/FastVideo). 
To contribute, please check out [Contributing to FastVideo](https://hao-ai-lab.github.io/FastVideo/contributing/overview.html) for how to get involved!\n\n## Acknowledgements\n\nWe thank [NVIDIA](https://www.nvidia.com/en-us/) and [MBZUAI](https://mbzuai.ac.ae/) for supporting the development and release of FastWan-QAD.\n\n## FastVideo Team\n\n**Core contributors:** [Loay Rashid](https://x.com/l0ayrashid), [Matthew Noto](https://github.com/RandNMR73)\n\n**Contributors:** [Alex Zhang](https://alexzms.github.io), [Kaiqin Kong](https://github.com/H1yori233), [Kevin Lin](https://github.com/kevin314)\n\n**Tech leads:** [Loay Rashid](https://github.com/loaydatrain), [Will Lin](https://solitarythinker.github.io/), [Hao Zhang](https://haozhang.ai/)\n\n**Advisors:** [Hao Zhang](https://haozhang.ai/), [Eric Xing](https://www.cs.cmu.edu/~epxing/)", "url": "https://wpnews.pro/news/fastwan-qad-generating-a-5-second-video-in-1-78-seconds-on-a-single-nvidia-rtx", "canonical_source": "https://haoailab.com/blogs/fastwan-qad/", "published_at": "2026-06-15 19:00:00+00:00", "updated_at": "2026-06-16 21:20:43.719514+00:00", "lang": "en", "topics": ["generative-ai", "ai-infrastructure", "ai-research", "ai-tools", "computer-vision"], "entities": ["FastVideo", "FastWan-QAD", "NVIDIA RTX 5090", "NVIDIA RTX 4090", "SageAttention3", "SageAttention2++", "Mixkit", "Wan2.1-T2V-1.3B"], "alternates": {"html": "https://wpnews.pro/news/fastwan-qad-generating-a-5-second-video-in-1-78-seconds-on-a-single-nvidia-rtx", "markdown": "https://wpnews.pro/news/fastwan-qad-generating-a-5-second-video-in-1-78-seconds-on-a-single-nvidia-rtx.md", "text": "https://wpnews.pro/news/fastwan-qad-generating-a-5-second-video-in-1-78-seconds-on-a-single-nvidia-rtx.txt", "jsonld": "https://wpnews.pro/news/fastwan-qad-generating-a-5-second-video-in-1-78-seconds-on-a-single-nvidia-rtx.jsonld"}}