Source: dev.to · topic: machine-learning

Winograd convolutions cost us 2 mAP and we didn't notice for a month

A team of CV engineers deploying a pedestrian detector on a Cortex-A53 board enabled Winograd convolution for a speedup but lost 2.1 mAP to fp16 precision loss in the F(4,3) transform, with small objects dropping 7 points. The accuracy degradation went unnoticed for a month until a breakdown by object size revealed the issue. The team resolved it by applying Winograd selectively, keeping direct convolution on early layers to preserve small-object accuracy.

4 min read · published Jun 17, 2026

TL;DR: We turned on Winograd convolution to shave latency off a pedestrian detector running on a Cortex-A53, got a clean 18% speedup, and silently lost 2.1 mAP because the F(4,3) transform lost precision in fp16. The accuracy drop hid inside our aggregate metric for almost a month before a breakdown by object size caught it.

Winograd convolution is one of those optimisations that looks free. You replace the direct 3x3 convolution with an input transform, an elementwise multiply, and an output transform, and the multiply count drops. For F(4,3), the tiling we used, each 4x4 output tile needs 36 multiplies instead of the 144 a direct convolution spends. On paper that's a 4x reduction in multiplies for your 3x3 layers (the often-quoted 36-to-16, 2.25x figure belongs to the smaller F(2,3) tiling), and 3x3 is most of a modern backbone.
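The per-tile multiply counts follow directly from the tile sizes. A minimal sketch, assuming the usual 2D nesting F(m x m, r x r) and counting only the elementwise multiplies (the transforms themselves are additions and constant scalings that this count ignores); the function name is made up for illustration:

```python
def multiplies_per_tile(m, r):
    """Multiplies needed for one m x m output tile with an r x r kernel.

    Returns (direct, winograd). Direct convolution spends r*r multiplies
    per output pixel; 2D Winograd F(m x m, r x r) spends one multiply per
    element of its (m + r - 1) x (m + r - 1) transformed input tile.
    """
    direct = (m * m) * (r * r)
    winograd = (m + r - 1) ** 2
    return direct, winograd


for m in (2, 4):
    direct, wino = multiplies_per_tile(m, 3)
    print(f"F({m},3): {direct} -> {wino} multiplies, {direct / wino:.2f}x fewer")
```

The larger the output tile, the bigger the saving, and the wider the numeric range of the transform matrices that pay for it.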

We run a small detector on a Cortex-A53 board for an indoor people-counting product, MobileNetV3 backbone, roughly 4.2M params after pruning. The team is three CV engineers and one firmware person. We had a 41ms inference budget and were sitting at 39ms, which is the kind of margin that keeps you up at night.

Our runtime exposes Winograd as a per-layer flag. We flipped it on for every 3x3 stride-1 layer, rebuilt, and measured.

./bench --model det_v3.onnx --conv-algo direct

./bench --model det_v3.onnx --conv-algo winograd-f43 --precision fp16

18% off the mean, p99 comfortably under budget. We shipped it. Espresso, done, on to the next ticket.

The detector's overall mAP on our validation set moved from 0.612 to 0.608. Four thousandths. That's inside the noise we normally see between training runs, so nobody blinked. We pin our eval against a fixed 3,800-image set and a 0.004 wobble is genuinely not signal most days.

The problem only showed up when a customer reported that the counter undercounted in a large open atrium. People far from the camera, small in the frame, were getting dropped. When we broke mAP down by object size instead of looking at the single number, the picture was ugly.

Object size (px)   mAP direct   mAP Winograd fp16   delta
large (>96)        0.781        0.779               -0.002
medium (32-96)     0.644        0.631               -0.013
small (<32)        0.402        0.331               -0.071

Small objects lost 7 points. They're a minority of the boxes, so the aggregate barely moved, but for a people counter in a big room they're the whole game.
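The bucketing itself is simple to reproduce. A minimal sketch, assuming "size" means the square root of the box area in pixels (the article does not say whether it used area or side length) and using the table's thresholds; the function name and the example boxes are hypothetical:

```python
import math
from collections import Counter


def size_bucket(width_px, height_px):
    """Assign a ground-truth box to the small / medium / large bucket."""
    side = math.sqrt(width_px * height_px)  # assumption: sqrt of box area
    if side < 32:
        return "small"
    if side <= 96:
        return "medium"
    return "large"


# Hypothetical boxes as (width, height) in pixels, for illustration only.
boxes = [(20, 40), (96, 96), (100, 120), (12, 30)]
counts = Counter(size_bucket(w, h) for w, h in boxes)
print(counts)
```

Running the evaluation once per bucket, and checking how many boxes land in each, shows where a regression sits and how little weight that bucket carries in the single number.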

The F(4,3) transform matrices do not have small, friendly entries. In the commonly used construction (the exact values depend on the interpolation points a runtime picks) the input transform has entries as large as 5, the filter transform goes down to 1/24, and the output transform goes up to 8, so the intermediate values span a much wider dynamic range than a direct convolution's do. In fp32 this is fine. In fp16, with a 10-bit mantissa, the input transform scales activations up, every intermediate gets rounded, and the output transform then has to subtract large, nearly equal terms to recover the result. Catastrophic cancellation. The features that survive are the high-contrast ones, which correspond to large, well-lit objects. The faint gradient that says "small person at the back of the room" gets rounded into mush.
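The effect is easy to see on a single tile. A minimal NumPy sketch, assuming the widely published F(4x4, 3x3) matrices (the article does not say which matrices its runtime uses) and random data rather than real activations; it compares Winograd in fp32 and fp16 against a float64 direct convolution:

```python
import numpy as np

# Widely published F(4x4, 3x3) matrices; an assumption about the runtime.
BT = np.array([[4,  0, -5,  0, 1, 0],
               [0, -4, -4,  1, 1, 0],
               [0,  4, -4, -1, 1, 0],
               [0, -2, -1,  2, 1, 0],
               [0,  2, -1, -2, 1, 0],
               [0,  4,  0, -5, 0, 1]], dtype=np.float64)
G = np.array([[1 / 4,    0,       0],
              [-1 / 6,  -1 / 6,  -1 / 6],
              [-1 / 6,   1 / 6,  -1 / 6],
              [1 / 24,   1 / 12,  1 / 6],
              [1 / 24,  -1 / 12,  1 / 6],
              [0,        0,       1]], dtype=np.float64)
AT = np.array([[1, 1,  1, 1,  1, 0],
               [0, 1, -1, 2, -2, 0],
               [0, 1,  1, 4,  4, 0],
               [0, 1, -1, 8, -8, 1]], dtype=np.float64)


def direct(d, g):
    """Reference: 3x3 correlation over a 6x6 tile, giving a 4x4 output."""
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


def winograd(d, g, dtype):
    """Y = A^T [(G g G^T) * (B^T d B)] A with every operand cast to dtype."""
    bt, gm, at = (m.astype(dtype) for m in (BT, G, AT))
    d, g = d.astype(dtype), g.astype(dtype)
    u = gm @ g @ gm.T   # transformed filter, 6x6
    v = bt @ d @ bt.T   # transformed input tile, 6x6
    return (at @ (u * v) @ at.T).astype(np.float64)


rng = np.random.default_rng(0)
d = rng.standard_normal((6, 6))   # stand-in for an activation tile
g = rng.standard_normal((3, 3))   # stand-in for a 3x3 kernel

ref = direct(d, g)
err32 = np.max(np.abs(winograd(d, g, np.float32) - ref))
err16 = np.max(np.abs(winograd(d, g, np.float16) - ref))
print(f"max abs error  fp32: {err32:.2e}  fp16: {err16:.2e}")
```

On a real network the error that matters is relative to the feature magnitude, so low-amplitude feature maps are hit hardest, which matches the size breakdown above.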

We confirmed it by running the exact same weights with Winograd in fp32. Small-object mAP came back to 0.398, basically the direct number. The algorithm wasn't wrong. The algorithm in half precision was wrong for our data.

We did not throw Winograd away. We made it selective. The early layers, where the spatial resolution is high and small-object information lives, stayed on direct fp16. The deeper layers, lower resolution and more channels, kept Winograd. That recovered most of the speed without the accuracy hole.

conv_policy:
  default: winograd-f43
  precision: fp16
  overrides:
    - layers: ["stem", "stage1.*", "stage2.0"]
      algo: direct
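
One way to read that policy: the first override whose pattern matches a layer name wins, and everything else falls back to the default. A minimal sketch of that lookup, assuming glob-style patterns (the runtime's real matching rules are not stated in this post); `resolve_algo` and the layer names in the usage lines are hypothetical:

```python
from fnmatch import fnmatchcase

# Mirrors the YAML policy above as a plain dict.
CONV_POLICY = {
    "default": "winograd-f43",
    "precision": "fp16",
    "overrides": [
        {"layers": ["stem", "stage1.*", "stage2.0"], "algo": "direct"},
    ],
}


def resolve_algo(layer_name, policy=CONV_POLICY):
    """Return the conv algorithm for a layer: first matching override wins."""
    for override in policy["overrides"]:
        if any(fnmatchcase(layer_name, pat) for pat in override["layers"]):
            return override["algo"]
    return policy["default"]


for name in ("stem", "stage1.3", "stage2.0", "stage2.1", "stage4.2"):
    print(f"{name:10s} -> {resolve_algo(name)}")
```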

End result: 34.6ms mean, small-object mAP at 0.395. We gave back about 2.6ms versus full Winograd and bought back 6.4 points where it mattered.

One side note on validation. To trust the size-bucketed numbers we needed clean ground truth on a fresh holdout, and hand-labelling small distant figures is miserable and inconsistent between annotators. We auto-labelled a 600-image holdout with a VLM and had humans only correct it, routing those calls through Bifrost so we could fail over between two providers when one rate-limited us mid-batch. It was one option among a few; the point is the labels were consistent enough to make the per-bucket deltas believable.

This is not a "Winograd bad" post. F(4,3) in fp16 is a perfectly good default for a lot of models, and for a classifier where you only care about top-1 it would probably have been invisible and harmless.
