Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures


Source: arxiv.org · Published: 2026-06-19T04:00Z · Topic: machine-learning

In a paper posted to arXiv, researchers analyzed the performance of Med-DDPM, a 3D generative diffusion model for MRI synthesis, across three generations of NVIDIA GPUs. They identified inefficiencies in memory access, tensor-layout conversions, and Tensor Core utilization, then enabled TF32 Tensor Cores and switched to a 3D channels-last layout. The reported result is a reduction of up to 100x in both SM cycles and dynamic instructions, with no loss in synthesis quality.

Published Jun 19, 2026 · 1 min read

arXiv:2606.19365v1

Abstract: Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization. Guided by these insights, we evaluate two architecture-aware optimizations (TF32 Tensor Core activation and a 3D channels-last layout) and demonstrate that they reduce SM cycles by up to 100x, cut dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100, all without degrading synthesis quality.
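To make the two optimizations named in the abstract more concrete, here is a minimal pure-Python sketch. It is not the authors' code. It illustrates what a 3D channels-last layout does to the memory strides of an (N, C, D, H, W) tensor, and roughly how much mantissa precision TF32 keeps relative to float32. The function names are hypothetical, and `to_tf32` truncates rather than rounds, which is a simplifying assumption about the hardware. In PyTorch, comparable settings are exposed as `torch.backends.cudnn.allow_tf32` and `torch.channels_last_3d`; the abstract does not say whether the authors used those exact switches.

```python
import struct


def strides_ncdhw(shape):
    """Element strides of a contiguous tensor with logical shape (N, C, D, H, W)."""
    n, c, d, h, w = shape
    return (c * d * h * w, d * h * w, h * w, w, 1)


def strides_channels_last_3d(shape):
    """Element strides when the same logical shape is stored physically as N, D, H, W, C.

    The channel dimension becomes the fastest-varying one (stride 1), so all
    channels of a single voxel sit next to each other in memory.
    """
    n, c, d, h, w = shape
    return (d * h * w * c, 1, h * w * c, w * c, c)


def to_tf32(x):
    """Approximate TF32 by keeping only the top 10 of float32's 23 mantissa bits.

    Assumption: plain truncation of the low 13 bits; real hardware may round.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


if __name__ == "__main__":
    shape = (2, 3, 4, 5, 6)  # hypothetical small volume: N=2, C=3, D=4, H=5, W=6
    print("contiguous strides:      ", strides_ncdhw(shape))
    print("channels-last 3D strides:", strides_channels_last_3d(shape))
    print("1 + 2**-10 in TF32:", to_tf32(1.0 + 2**-10))
    print("1 + 2**-11 in TF32:", to_tf32(1.0 + 2**-11))
```

With channels stored innermost, a convolution kernel that consumes every channel of a voxel can read them from adjacent addresses, which is one plausible reason a layout change would cut the tensor-layout conversions the abstract mentions. The TF32 helper shows the trade being made: the float32 exponent range is kept while the mantissa shrinks to 10 bits.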

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.19365
