# Winograd convolutions cost us 2 mAP and we didn't notice for a month

> Source: <https://dev.to/marcorinaldi_ai/winograd-convolutions-cost-us-2-map-and-we-didnt-notice-for-a-month-1b3e>
> Published: 2026-06-17 07:22:23+00:00

**TL;DR: We turned on Winograd convolution to shave latency off a pedestrian detector running on a Cortex-A53, got a clean 18% speedup, and silently lost 2.1 mAP because the F(4,3) transform lost precision in fp16. The accuracy drop hid inside our aggregate metric for almost a month before a per-object-size breakdown caught it.**

Winograd convolution is one of those optimisations that looks free. You replace the direct 3x3 convolution with a set of input transforms, elementwise multiplies, and an output transform, and the arithmetic count drops. For F(4,3), the tiling we used, each 4x4 output tile goes from 144 multiplies down to 36. On paper that's a 4x reduction in multiplies for your 3x3 layers (the familiar 36-to-16, 2.25x figure is for the smaller F(2,3) tiling), and 3x3 is most of a modern backbone.
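The tile arithmetic is easy to check by hand or in a few lines. A minimal sketch, assuming the usual nested 2D construction where F(m, r) produces an m x m output tile from an (m + r - 1) x (m + r - 1) input tile; `multiply_counts` is a made-up helper for illustration, not part of any runtime:

```python
def multiply_counts(m: int, r: int) -> tuple:
    """Multiplies per m x m output tile for an r x r filter, per channel pair.

    Only the elementwise-multiply stage is counted. The additions inside
    the transforms are ignored, so this is an upper bound on the benefit.
    """
    direct = (m * m) * (r * r)       # one r x r dot product per output pixel
    winograd = (m + r - 1) ** 2      # one multiply per transformed-tile element
    return direct, winograd, direct / winograd


for m in (2, 4):
    direct, wino, ratio = multiply_counts(m, 3)
    print(f"F({m},3): direct={direct} winograd={wino} reduction={ratio:.2f}x")
```

This gives 36 vs 16 (2.25x) for F(2,3) and 144 vs 36 (4x) for F(4,3). The count covers multiplies only, so end-to-end latency gains on a real model will be a good deal smaller than either ratio.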

We run a small detector on a Cortex-A53 board for an indoor people-counting product: MobileNetV3 backbone, roughly 4.2M params after pruning. The team is three CV engineers and one firmware person. We had a 41ms inference budget and were sitting at 39ms, which is the kind of margin that keeps you up at night.

Our runtime exposes Winograd as a per-layer flag. We flipped it on for every 3x3 stride-1 layer, rebuilt, and measured.

```
# before
./bench --model det_v3.onnx --conv-algo direct
# mean 39.1ms  p99 44.0ms

# after
./bench --model det_v3.onnx --conv-algo winograd-f43 --precision fp16
# mean 32.0ms  p99 35.8ms
```

18% off the mean, p99 comfortably under budget. We shipped it. Espresso, done, on to the next ticket.

The detector's overall mAP on our validation set moved from 0.612 to 0.608. Four thousandths. That's inside the noise we normally see between training runs, so nobody blinked. We pin our eval against a fixed 3,800-image set and a 0.004 wobble is genuinely not signal most days.

The problem only showed up when a customer reported that the counter undercounted in a large open atrium. People far from the camera, small in the frame, were getting dropped. When we broke mAP down by object size instead of looking at the single number, the picture was ugly.

| Object size (px) | mAP direct | mAP Winograd fp16 | delta |
|---|---|---|---|
| large (>96) | 0.781 | 0.779 | -0.002 |
| medium (32-96) | 0.644 | 0.631 | -0.013 |
| small (<32) | 0.402 | 0.331 | -0.071 |

Small objects lost 7 points. They're a minority of the boxes, so the aggregate barely moved, but for a people counter in a big room they're the whole game.
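A check on the worst bucket, rather than on the aggregate, is the kind of gate that flags this. A minimal sketch using the numbers from the table above; the function name and the 0.01 tolerance are assumptions picked for illustration:

```python
def regressed_buckets(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Return {bucket: delta} for every bucket whose mAP dropped by more than tolerance."""
    flagged = {}
    for bucket, base_map in baseline.items():
        delta = candidate[bucket] - base_map
        if delta < -tolerance:
            flagged[bucket] = round(delta, 3)
    return flagged


# Per-size mAP copied from the table above.
direct = {"large": 0.781, "medium": 0.644, "small": 0.402}
winograd_fp16 = {"large": 0.779, "medium": 0.631, "small": 0.331}

print(regressed_buckets(direct, winograd_fp16))
```

With that tolerance the medium and small buckets are flagged, while the aggregate move from 0.612 to 0.608 would pass the same check untouched.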

The F(4,3) transforms have matrix entries that are not all small integers. The input transform has entries like 4 and -5, the filter transform has fractions like 1/6, 1/12 and 1/24, and the output transform goes up to 8, so the intermediate accumulations span a wider dynamic range than a direct convolution does. In fp32 this is fine. In fp16, with a 10-bit mantissa, the input transform inflates the intermediates, each one gets rounded at that larger scale, and the output transform then has to subtract large values from each other to get back to a small result. Catastrophic cancellation. The features that survive are the high-contrast ones, which correspond to large, well-lit objects. The faint gradient that says "small person at the back of the room" gets rounded into mush.
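The effect can be reproduced without any hardware. A minimal sketch of 1D F(4,3) using the commonly published transform matrices (interpolation points 0, ±1, ±2), with fp16 emulated by rounding through `struct`'s half-precision format after every multiply and add. Assumptions: uniform random inputs in [-1, 1], the filter transform precomputed in higher precision and then stored in fp16, and scalar rounding after each operation, which is not necessarily how any particular runtime accumulates.

```python
import random
import struct


def h(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def exact(x: float) -> float:
    """No rounding: keep full float64 precision (the reference path)."""
    return x


BT = [[4, 0, -5, 0, 1, 0],
      [0, -4, -4, 1, 1, 0],
      [0, 4, -4, -1, 1, 0],
      [0, -2, -1, 2, 1, 0],
      [0, 2, -1, -2, 1, 0],
      [0, 4, 0, -5, 0, 1]]
G = [[1 / 4, 0, 0],
     [-1 / 6, -1 / 6, -1 / 6],
     [-1 / 6, 1 / 6, -1 / 6],
     [1 / 24, 1 / 12, 1 / 6],
     [1 / 24, -1 / 12, 1 / 6],
     [0, 0, 1]]
AT = [[1, 1, 1, 1, 1, 0],
      [0, 1, -1, 2, -2, 0],
      [0, 1, 1, 4, 4, 0],
      [0, 1, -1, 8, -8, 1]]


def matvec(mat, vec, rnd):
    """Matrix-vector product with `rnd` applied after every multiply and add."""
    out = []
    for row in mat:
        acc = 0.0
        for a, x in zip(row, vec):
            acc = rnd(acc + rnd(a * x))
        out.append(acc)
    return out


def winograd_f43(d, g, rnd):
    """Four outputs from six inputs and a 3-tap filter: A^T [(G g) * (B^T d)]."""
    u = [rnd(x) for x in matvec(G, g, exact)]   # filter transform done offline
    v = matvec(BT, d, rnd)                      # input transform at run time
    m = [rnd(ui * vi) for ui, vi in zip(u, v)]  # elementwise multiply
    return matvec(AT, m, rnd)                   # output transform


def direct(d, g, rnd):
    """Plain 3-tap correlation, four outputs."""
    out = []
    for i in range(4):
        acc = 0.0
        for k in range(3):
            acc = rnd(acc + rnd(d[i + k] * g[k]))
        out.append(acc)
    return out


def mean_abs_error(trials: int = 2000, seed: int = 0):
    """Mean absolute error of each fp16 path against a float64 direct reference."""
    rng = random.Random(seed)
    err_direct = err_wino = 0.0
    for _ in range(trials):
        d = [h(rng.uniform(-1, 1)) for _ in range(6)]
        g = [h(rng.uniform(-1, 1)) for _ in range(3)]
        ref = direct(d, g, exact)
        err_direct += sum(abs(a - b) for a, b in zip(direct(d, g, h), ref))
        err_wino += sum(abs(a - b) for a, b in zip(winograd_f43(d, g, h), ref))
    n = trials * 4
    return err_direct / n, err_wino / n


err_direct, err_wino = mean_abs_error()
print(f"direct fp16   mean |error| = {err_direct:.2e}")
print(f"winograd fp16 mean |error| = {err_wino:.2e}")
```

Both paths start from identical fp16 inputs, so any gap between the two printed numbers comes from the transforms alone. The absolute figures depend on the input distribution and on the rounding assumptions, so treat them as a qualitative illustration rather than a prediction for a real network.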

We confirmed it by running the exact same weights with Winograd in fp32. Small-object mAP came back to 0.398, basically the direct number. The algorithm wasn't wrong. The algorithm in half precision was wrong for our data.

We did not throw Winograd away. We made it selective. The early layers, where the spatial resolution is high and small-object information lives, stayed on direct fp16. The deeper layers, lower resolution and more channels, kept Winograd. That recovered most of the speed without the accuracy hole.

```
conv_policy:
  default: winograd-f43
  precision: fp16
  overrides:
    # high-res early stages carry small-object signal
    - layers: ["stem", "stage1.*", "stage2.0"]
      algo: direct
```

End result: 34.6ms mean, small-object mAP at 0.395. We gave back about 2.6ms versus full Winograd and bought back 6.4 points where it mattered.

One side note on validation. To trust the size-bucketed numbers we needed clean ground truth on a fresh holdout, and hand-labelling small distant figures is miserable and inconsistent between annotators. We auto-labelled a 600-image holdout with a VLM and had humans only correct it, routing those calls through Bifrost so we could fail over between two providers when one rate-limited us mid-batch. It was one option among a few; the point is the labels were consistent enough to make the per-bucket deltas believable.

This is not a "Winograd bad" post. F(4,3) in fp16 is a perfectly good default for a lot of models, and for a classifier where you only care about top-1 it would probably have been invisible and harmless.
