{"slug": "performance-analysis-and-optimization-of-3d-generative-diffusion-models-across", "title": "Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures", "summary": "In a paper posted to arXiv, researchers analyzed the performance of Med-DDPM, a 3D generative diffusion model for MRI synthesis, across three generations of NVIDIA GPUs. They identified inefficiencies in memory access and Tensor Core utilization, and applied TF32 Tensor Core activation and a 3D channels-last layout to achieve up to 100x reduction in SM cycles and dynamic instructions, with no loss in synthesis quality.", "body_md": "arXiv:2606.19365v1\nAbstract: Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization.
Guided by these insights, we evaluate two architecture-aware optimizations, TF32 Tensor Core activation and a 3D channels-last layout, and demonstrate that they reduce SM cycles by up to 100x, cut dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100, all without degrading synthesis quality.", "url": "https://wpnews.pro/news/performance-analysis-and-optimization-of-3d-generative-diffusion-models-across", "canonical_source": "https://arxiv.org/abs/2606.19365", "published_at": "2026-06-19 04:00:00+00:00", "updated_at": "2026-06-19 04:08:09.489603+00:00", "lang": "en", "topics": ["machine-learning", "generative-ai", "ai-infrastructure", "ai-chips", "computer-vision"], "entities": ["NVIDIA", "Med-DDPM", "A100", "cuDNN", "Tensor Core", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/performance-analysis-and-optimization-of-3d-generative-diffusion-models-across", "markdown": "https://wpnews.pro/news/performance-analysis-and-optimization-of-3d-generative-diffusion-models-across.md", "text": "https://wpnews.pro/news/performance-analysis-and-optimization-of-3d-generative-diffusion-models-across.txt", "jsonld": "https://wpnews.pro/news/performance-analysis-and-optimization-of-3d-generative-diffusion-models-across.jsonld"}}