[ARTICLE · art-37322] src=dev.to ↗ pub= topic=machine-learning verified=true sentiment=↑ positive

Channels-last memory format cut our conv backbone latency 22%

Photoroom reduced inference latency of its convolutional segmentation backbone by 22% on A100 GPUs by switching to PyTorch's channels-last memory format. The change required only four lines of code and did not affect accuracy. The speedup comes from cuDNN selecting more efficient kernels for NHWC layout on tensor-core hardware.

4 min read · Published Jun 24, 2026

TL;DR: Switching our convolutional segmentation backbone to PyTorch's channels-last memory format cut inference latency by about 22% on A100s, with no accuracy change and a four-line code edit.

Our background-removal model at Photoroom spent roughly 31 ms per 1024x1024 image on an A100, and profiling pointed most of that time at cuDNN convolution kernels rather than our diffusion sampler. The model is a fairly standard U-Net style encoder-decoder, all convolutions, running in float16 under torch.autocast. Before touching the architecture, I wanted to rule out the cheap wins, and the cheapest one turned out to be tensor memory layout. The channels-last memory format gave us most of the speedup we were chasing, and the change fit in a handful of lines. To be precise, the network math is identical; only the order in which tensor elements are laid out in memory changes.

The channels-last memory format stores a 4D tensor in NHWC order, keeping the channel values for one spatial position contiguous in memory. PyTorch keeps the logical NCHW shape, so your indexing and your model code stay the same. What changes is the stride pattern, which lets cuDNN select kernels that read contiguous channels and run more efficiently on tensor-core hardware.
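To make the stride change concrete, here is a small pure-Python sketch of how element strides work out for the same logical (N, C, H, W) shape under both layouts. The helper names and the example shape are illustrative, not part of PyTorch; in PyTorch itself, x.stride() reports these values.

```python
def nchw_strides(n, c, h, w):
    """Element strides for a contiguous NCHW (channels-first) tensor."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides for the same logical shape stored in NHWC order."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)  # hypothetical N, C, H, W
cf = nchw_strides(*shape)
cl = channels_last_strides(*shape)

# Distance in memory between channel 0 and channel 1 of the same pixel:
step_cf = offset((0, 1, 0, 0), cf) - offset((0, 0, 0, 0), cf)  # a whole H*W plane
step_cl = offset((0, 1, 0, 0), cl) - offset((0, 0, 0, 0), cl)  # the next element

print(cf, cl, step_cf, step_cl)
```

The logical index is the same in both cases; only the mapping from index to memory offset differs, which is why model code does not need to change.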

The default PyTorch layout is NCHW (channels-first), where all of one channel's pixels sit together. NVIDIA's tensor cores prefer the NHWC arrangement for convolutions, as documented in their convolution performance guide. When your tensors arrive in NCHW, cuDNN often inserts transpose passes around each convolution to reshuffle data, and those transposes are pure overhead. Converting once at the input and keeping the format consistent removes that per-layer reshuffling.

The conversion API has been stable since well before PyTorch 2.3, and the official memory format tutorial covers the details. Two things need the format: the module parameters and the input tensor. If only one of them is channels-last, the other has to be converted on the fly or the convolution runs on a mismatched layout, and you can lose most of the gain.

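import torch

# Convert the 4D parameters (conv weights) to channels-last once, up front.
model = model.to(memory_format=torch.channels_last)

# The input must be a 4D tensor; its logical NCHW shape does not change.
x = x.to(memory_format=torch.channels_last)

with torch.autocast("cuda", dtype=torch.float16):
    y = model(x)  # output is channels_last; convert back if a later op needs contiguous NCHW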
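Before timing anything, it can help to confirm that both halves of the conversion actually took effect. The following is a minimal sketch, assuming a toy convolution in place of the real backbone; the helper name conv_weights_are_channels_last is hypothetical, while is_contiguous(memory_format=...) is the standard PyTorch check.

```python
import torch
from torch import nn


def conv_weights_are_channels_last(model: nn.Module) -> bool:
    """True if every 4D parameter (conv weights) is laid out channels-last."""
    return all(
        p.is_contiguous(memory_format=torch.channels_last)
        for p in model.parameters()
        if p.dim() == 4
    )


# Toy stand-in for the real backbone; shapes are illustrative.
toy = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(2, 3, 16, 16)

before = conv_weights_are_channels_last(toy)
toy = toy.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)
after = conv_weights_are_channels_last(toy)

print(before, after, x.is_contiguous(memory_format=torch.channels_last))
print(tuple(x.shape), x.stride())  # logical shape is unchanged; strides are not
```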

One subtlety worth checking: x.to(memory_format=torch.channels_last) only works on a 4D tensor and raises a RuntimeError on a 3D one, so make sure your inputs carry an explicit batch dimension. After the forward pass, the output keeps channels-last strides. If you feed it into an operation that assumes contiguous NCHW, call .contiguous() there rather than reverting the whole pipeline.
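Both points can be sketched on a single unbatched image. The helper to_batched_channels_last is a hypothetical name, and the image size is arbitrary; the sketch adds the batch dimension before converting, then shows .contiguous() handing a plain NCHW tensor to a downstream consumer.

```python
import torch


def to_batched_channels_last(x: torch.Tensor) -> torch.Tensor:
    """Add a batch dimension to a CHW tensor if needed, then convert layout."""
    if x.dim() == 3:
        x = x.unsqueeze(0)  # (C, H, W) -> (1, C, H, W)
    if x.dim() != 4:
        raise ValueError(f"expected a 3D or 4D tensor, got {x.dim()}D")
    return x.to(memory_format=torch.channels_last)


img = torch.randn(3, 32, 32)  # a single image without a batch dimension
batch = to_batched_channels_last(img)
is_cl = batch.is_contiguous(memory_format=torch.channels_last)

# .contiguous() with no arguments returns the default NCHW layout.
nchw = batch.contiguous()
same_values = torch.equal(nchw, batch)

print(tuple(batch.shape), is_cl, nchw.is_contiguous(), same_values)
```

The values are identical before and after .contiguous(); only the strides differ, so the conversion is safe to place right at the boundary with the operator that needs it.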

Tensor cores execute matrix-multiply-accumulate on small tiles, and convolutions get lowered to those tile operations. With NHWC layout the channel dimension, which is the contracting dimension of the implicit matmul, is contiguous, so the kernel loads aligned vectors without gathering strided data. The effect grows with channel count. Our deepest encoder blocks at 512 channels saw the largest per-layer improvement, while the early high-resolution layers at 64 channels barely moved.
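One way to see why width matters is to write out the matrix shape a convolution maps to under the common implicit-GEMM formulation. This is a rough sketch: the function name and the spatial sizes are illustrative assumptions, only the 64 and 512 channel counts come from the text above, and real cuDNN kernels choose their own tiling.

```python
def implicit_gemm_shape(n, c_in, h, w, c_out, k=3, stride=1, pad=1):
    """(M, N, K) of the matmul that a k x k convolution maps to.

    M = number of output pixels, N = output channels,
    K = reduction length (input channels times filter taps).
    """
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    return (n * h_out * w_out, c_out, c_in * k * k)


# Hypothetical spatial sizes for an early and a deep encoder block.
early = implicit_gemm_shape(1, 64, 512, 512, 64)
deep = implicit_gemm_shape(1, 512, 32, 32, 512)

# In NHWC, the c_in values for one filter tap sit next to each other,
# so each tap contributes a contiguous run of c_in elements to K.
print("early:", early, "contiguous run:", 64)
print("deep: ", deep, "contiguous run:", 512)
```

Under these assumed shapes the deep block has a reduction dimension eight times longer than the early one, made of much longer contiguous runs, which is consistent with the per-layer pattern described above.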

The gain also depends on precision. Channels-last pairs best with float16 or bfloat16, because tensor cores engage mainly in reduced precision; in strict float32 (without TF32) the kernels route through CUDA cores, where the layout advantage shrinks. We were already running float16 under autocast, so the two optimizations stacked. Channels-last is not a free win in every configuration: it pays off when your convolutions are wide, your precision is reduced, and your hardware has tensor cores.

A layout change is easy to misattribute, so I measured carefully. I ran 200 warmup iterations, then timed 1000 forward passes with torch.cuda.synchronize() around each measurement window, since CUDA calls are asynchronous and an unsynchronized timer reports queue time rather than kernel time. I also confirmed the output tensors matched the NCHW baseline within float16 tolerance, so I knew I was timing the same computation.
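The measurement procedure can be sketched roughly as follows. This is not the exact harness used for the numbers in this post: the toy model, the input size, the shortened iteration counts, and the comparison tolerance are all assumptions, and the code falls back to CPU when no GPU is present so that it stays runnable.

```python
import time

import torch
from torch import nn


def sync():
    """Wait for queued GPU work; does nothing on a CPU-only machine."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()


def time_forward(model, x, warmup=200, iters=1000):
    """Mean milliseconds per forward pass over one synchronized window."""
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        sync()  # drain warmup work before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        sync()  # make sure every timed kernel has finished
    return (time.perf_counter() - start) * 1000 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
# Toy stand-in for the real backbone; sizes are illustrative.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device).eval()
x = torch.randn(1, 3, 64, 64, device=device)

with torch.no_grad():
    baseline_out = model(x)
ms_nchw = time_forward(model, x, warmup=5, iters=20)

model = model.to(memory_format=torch.channels_last)
x_cl = x.to(memory_format=torch.channels_last)
ms_cl = time_forward(model, x_cl, warmup=5, iters=20)

# Same computation? Compare against the baseline with an assumed tolerance.
with torch.no_grad():
    matches = torch.allclose(model(x_cl), baseline_out, atol=1e-2, rtol=1e-2)

print(f"NCHW {ms_nchw:.3f} ms, channels-last {ms_cl:.3f} ms, outputs match: {matches}")
```

On a toy model like this the two timings say nothing about real speedups; the point is the structure: warm up, synchronize, time a window, synchronize again, and check the outputs agree.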

The headline number was a drop from roughly 31 ms to 24 ms per image, about 22% on our A100. On a V100 the same change gave closer to 14%, which tracks with its older tensor-core generation. I would treat any single-number claim with suspicion until you reproduce it on your own shapes; the benefit is real but hardware-dependent and model-dependent.
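For reference, the percentage follows directly from the two rounded latencies. The snippet below only redoes that arithmetic; because both inputs are quoted as rough figures, the result should be read as approximate.

```python
def latency_change(before_ms, after_ms):
    """Return (fractional latency reduction, throughput speedup)."""
    reduction = (before_ms - after_ms) / before_ms
    speedup = before_ms / after_ms
    return reduction, speedup


reduction, speedup = latency_change(31.0, 24.0)
print(f"{reduction:.1%} lower latency, {speedup:.2f}x throughput")
# prints: 22.6% lower latency, 1.29x throughput
```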

The format is not universally beneficial. Networks dominated by pointwise operations, normalization, or attention rather than spatial convolutions show little or no improvement, because those ops do not hit the cuDNN convolution path that NHWC accelerates. Transformer backbones, for instance, rarely care.

There is also a compatibility trap. Mixing layouts inside a model can silently insert transposes that erase the gain, and some custom operators or older third-party layers assume contiguous NCHW and will either copy or error. If you run torch.compile, verify the format survives the traced graph rather than assuming it does. For very small channel counts the conversion overhead can outweigh the kernel savings, so profile before committing it everywhere.
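One way to spot where a model drops out of channels-last is to record each layer's output layout during a single forward pass. The sketch below uses forward hooks in eager mode only; it does not inspect a torch.compile graph. The names trace_output_layouts and ForceNCHW are hypothetical, with ForceNCHW standing in for a third-party layer that reverts the layout.

```python
import torch
from torch import nn


def trace_output_layouts(model: nn.Module, x: torch.Tensor):
    """Run one forward pass and return [(leaf_name, output_is_channels_last)]."""
    report, handles = [], []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor) and output.dim() == 4:
                report.append(
                    (name, output.is_contiguous(memory_format=torch.channels_last))
                )
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:
            handle.remove()  # leave the model as we found it
    return report


class ForceNCHW(nn.Module):
    """Hypothetical layer that silently reverts to contiguous NCHW."""

    def forward(self, x):
        return x.contiguous()


toy = nn.Sequential(nn.ReLU(), ForceNCHW(), nn.ReLU())
x = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
report = trace_output_layouts(toy, x)
for name, is_cl in report:
    print(name, "channels_last" if is_cl else "NOT channels_last")
```

In this toy case the first layer keeps the format, the offending layer drops it, and everything downstream inherits the NCHW layout, which is the pattern to look for in a real model.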

The channels-last memory format is one of the few optimizations that costs almost nothing to try and is straightforward to revert if it does not help. For a convolution-heavy vision model running in float16 on tensor-core GPUs, it is worth measuring before you reach for quantization or architectural surgery. What I would try next is combining it with torch.compile and a CUDA graph capture, then re-profiling to see how much transpose overhead is actually left in the trace.
