Train Your Own AI Image Detector: Why Off-the-Shelf Detectors Fail on Your Data (DINOv2 + ConvNeXt…

A developer building a fashion-discovery feed found that off-the-shelf AI image detectors perform poorly on custom data, achieving only 0.68 AUC versus a simple linear model's 0.82. The article argues that AI-generated image detection is not solved due to cross-generator and cross-objective gaps, and that training on proprietary data is necessary.

Companion to How I Distilled a Gemini Vision Model into a 4.6M-Parameter Model.

Last time, the lesson was: you don’t need to fine-tune a backbone. Freeze a big model, train a tiny head, ship it. This time I needed the backbone. The reason is the whole point.

A fashion-discovery feed is only as good as its images. And a growing share of what people upload now isn’t a photo of a real outfit; it’s an AI render. Some of it is stunning. Most of it is slop: plastic skin, six-fingered hands, a dress that ignores gravity. Left alone, it rots the feed. The job: detect AI-generated images so we can filter them.

My first instinct was the lazy one, and yours probably is too: someone has already built this AI image detector. And they have, dozens of times. Hugging Face is full of “AI vs. human” detectors you can download and self-host this afternoon, and the brochures are glorious. One recent detector (https://huggingface.co/Bombek1/ai-image-detector-siglip-dinov2) reports a 0.9997 AUC on its validation set. Near-perfect.

So I downloaded a popular one and measured how well it agreed with the AI-probability labels we already trust in production. That measurement is the first snippet, and it’s the one that set the whole project in motion:

```python
from transformers import pipeline
from sklearn.metrics import roc_auc_score

# a popular off-the-shelf detector, downloaded and run locally
clf = pipeline("image-classification", model="Organika/sdxl-detector", device="mps")

def p_ai(path):
    # the detector's probability that an image is AI
    out = {d["label"]: d["score"] for d in clf(path)}
    return out["artificial"]

preds = [p_ai(p) for p in sample_paths]
y_true = [s >= 0.5 for s in teacher_ai_score]  # our production labels
print(roc_auc_score(y_true, preds))  # 0.68 - on our images
```

It scored 0.68. A throwaway linear model I’d wire up in the next section scored 0.82.

Three numbers: 0.9997, 0.68, 0.82. The first is what the shelf advertises. The second is what a real, published detector did on our data.
The third is a ten-minute warm-up beating it. The whole article lives in that gap, and it isn’t a story about one bad model.

This is a structural limit the research community keeps shouting about, and almost nobody building products hears. AI-generated image detection is not solved. At ICLR 2025, a paper bluntly titled A Sanity Check for AI-Generated Image Detection (https://github.com/shilinyan99/AIDE) built a deliberately hard benchmark.

There are two cliffs.

Cross-generator. Train a detector on GAN images and it falls apart on diffusion images: each generator family leaves a different fingerprint. You’re always detecting yesterday’s fakes.

Cross-objective, the one that got me. My detector wasn’t asked “is this AI?” in the abstract. It was asked to agree with the specific AI-scores we run in production, on our image distribution: Pinterest-style fashion, not whatever set the downloaded model grew up on. Different target, different data, same word “fake” meaning two different things.

No download fixes that. The signal lives in your data, so you have to train on your data, which sounds expensive, right up until it isn’t.

Generators build images by upsampling: doubling resolution again and again to fill in detail. That leaves faint, regular fingerprints in the frequency domain, toward the high-frequency end of the spectrum: periodic patterns your eye never registers but a model can (Frank et al., ICML 2020, https://proceedings.mlr.press/v119/frank20a/frank20a.pdf; Durall et al., CVPR 2020; CNNSpot, CVPR 2020). Detection isn’t about content (“is this a plausible dress?”). It’s about texture statistics: the microscopic signature of how the pixels were synthesized.

Hold that thought. It’s why the cheap trick almost worked, and why it didn’t quite.

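
To make the frequency-domain idea concrete, here is a toy sketch, not the method used in this project. It assumes nearest-neighbor 2x upsampling as a stand-in for a generator's learned upsampling layers, and the helper names `radial_power_spectrum` and `high_freq_ratio` are hypothetical. It averages the log-power spectrum over rings of equal spatial frequency and compares the top quarter of frequencies against the rest.

```python
import numpy as np

def radial_power_spectrum(gray):
    """Mean log-power in each integer-radius ring of the centered 2-D FFT."""
    centered = gray - gray.mean()  # drop the DC offset
    power = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(centered))) ** 2)
    h, w = gray.shape
    yy, xx = np.indices((h, w))
    ring = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    sums = np.bincount(ring.ravel(), weights=power.ravel())
    return sums / np.bincount(ring.ravel())

def high_freq_ratio(gray, cutoff=0.75):
    """Mean power above `cutoff` of the Nyquist radius, relative to below it."""
    n = min(gray.shape) // 2  # ignore the corners beyond the Nyquist ring
    spectrum = radial_power_spectrum(gray)[:n]
    k = int(cutoff * n)
    return spectrum[k:].mean() / spectrum[:k].mean()

rng = np.random.default_rng(0)

# "Native" texture: independent noise at full resolution (toy assumption).
native = rng.random((64, 64))

# "Upsampled" texture: half-resolution noise, each pixel repeated 2x2.
upsampled = np.kron(rng.random((32, 32)), np.ones((2, 2)))

native_ratio = high_freq_ratio(native)
upsampled_ratio = high_freq_ratio(upsampled)
print(f"native: {native_ratio:.3f}  upsampled: {upsampled_ratio:.3f}")
```

The two arrays are equally random to the eye, but the upsampled one carries visibly less energy in the highest-frequency rings. Real generators leave subtler traces than this, which is why a learned detector is needed rather than a single threshold on one ratio.
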
In my last piece (https://medium.com/design-bootcamp/how-i-distilled-a-gemini-vision-model-into-a-4-6m-parameter-a44a6c40e4c2), the move was: don’t train a vision model. Freeze a big one, grab its embeddings, train a tiny head. So I tried it here: every image through DINOv2 (Meta’s self-supervised ViT) → a 768-dim embedding → a dead-simple logistic regression. No fine-tuning yet, just a frozen DINOv2 feature extractor and a linear probe.

```python
import timm, torch
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# img_size=224 because this checkpoint defaults to 518px inputs in timm
dino = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True,
                         num_classes=0, img_size=224).eval()  # frozen - pure inference

@torch.no_grad()
def embed(batch):  # batch: (N, 3, 224, 224)
    return dino(batch).cpu().numpy()  # (N, 768)

# X = DINOv2 embeddings for ~36k images; y = (teacher_ai_score >= 0.5)
Xs = StandardScaler().fit_transform(X)
probe = LogisticRegression(max_iter=2000, class_weight="balanced").fit(Xs, y)
```

0.82 ROC-AUC. From a linear model. No fine-tuning, no GPU, a few seconds of training.

And here’s where the lazy hot-take “frozen embeddings are useless for detection” is just wrong: that 0.82 isn’t a failure. It’s a strong, free baseline, and the literature predicted it. UnivFD (CVPR 2023, https://www.researchgate.net/publication/373323377) showed a linear classifier on frozen CLIP features out-generalizes CNNs trained from scratch. And recent forensics work (https://arxiv.org/abs/2411.19117) found DINOv2 is especially good here: its self-supervised features preserve fine-grained texture better than CLIP’s semantic ones, so a linear head on DINOv2 can beat fully supervised models.

The frozen DINOv2 baseline got me most of the way for free. The only question left: is the last mile worth it?

Before fine-tuning anything, I looked at the label distribution.
This is the snippet I run on every new dataset, and it saved me an afternoon of chasing the wrong fix:

```python
import pandas as pd

labels = pd.read_parquet("labels.parquet")  # one row/image: ai_score in [0, 1]
print((labels.ai_score < 0.1).mean())   # 0.75 - 3 in 4 are obviously "real"
print((labels.ai_score >= 0.5).mean())  # 0.11 - only ~11% are "likely AI"
```

Three out of four images sat near zero; only ~11% were “likely AI.” I wrote “will need class weighting” in my notes before training a thing. Remember that: it’s the plot twist.

The frozen DINOv2 probe plateaued at 0.82 because a frozen backbone only hands you the signal it already encodes. To surface this teacher’s artifacts, the model has to see raw pixels and adapt its own low-level filters. So I fine-tuned a small CNN end-to-end: ConvNeXt-Tiny (not a random pick: a strong ICLR 2025 detector, AIDE, https://github.com/shilinyan99/AIDE, is also ConvNeXt-based).

Three choices mattered, and every one is backwards from normal image classification. The first one is a custom transform, and it’s the most important line in the whole project:

```python
import random
from PIL import Image

class NativeCrop:
    """Take a 224px square at NATIVE resolution. NEVER downscale - that smears
    the high-frequency artifacts we're trying to detect. Only upscale when the
    image is smaller than the crop."""

    def __init__(self, size=224, train=True):
        self.size, self.train = size, train

    def __call__(self, img: Image.Image) -> Image.Image:
        (w, h), s = img.size, self.size
        if min(w, h) < s:  # too small - upscale just enough
            scale = s / min(w, h)
            img = img.resize((round(w * scale), round(h * scale)))
            w, h = img.size
        # random crop while training, center crop at evaluation time
        left = random.randint(0, w - s) if self.train else (w - s) // 2
        top = random.randint(0, h - s) if self.train else (h - s) // 2
        return img.crop((left, top, left + s, top + s))
```

The other two choices:

Now the model, the loss, and the training step.
The loss is where the plot twist pays off:

```python
import timm, torch, torch.nn as nn

model = timm.create_model("convnext_tiny", pretrained=True, num_classes=1).to("mps")

# soft-label distillation: BCE against the teacher's 0–1 score.
# pos_weight counters the ~11%-positive imbalance we found above.
# It is created as float32 on the same device as the logits.
pos_weight = torch.tensor((1 - y.mean()) / y.mean(),
                          dtype=torch.float32, device="mps")  # ≈ 6
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

for x, target in loader:  # target = teacher's soft 0–1 score
    with torch.autocast("mps", dtype=torch.bfloat16):
        loss = loss_fn(model(x.to("mps")).squeeze(1), target.float().to("mps"))
    opt.zero_grad(); loss.backward(); opt.step()
```

Here’s the twist I set up earlier. My first fine-tune used a plain loss, with no pos_weight, as a baseline. It did exactly what an imbalanced dataset makes a model do: it plateaued at 0.785. Below the free probe.

With a plain loss, the model found the cheapest possible strategy: call everything "real," eat the 11% error, go home. Low loss. Useless detector.

The fix wasn’t a bigger model or more epochs. It was the pos_weight ≈ 6 you already saw, one line, so the rare AI images actually pulled on the gradient. That single change broke the plateau and the model climbed to 0.86 validation, 0.84–0.875 on the held-out test set.

Same model. Same data. One line. When a model stalls, suspect the loss and the data before you touch the architecture. I keep relearning this, and I keep writing it down so maybe one day I’ll learn it.

A detector that nails the training set and flops in production is worse than useless. So the last snippet is the one that earns trust: the held-out evaluation, reported the way an imbalanced problem demands, with ROC-AUC and PR-AUC, not accuracy (with ~11% positives, “always predict real” scores 89% accuracy and catches zero fakes).
```python
import torch
from sklearn.metrics import roc_auc_score, average_precision_score

@torch.no_grad()
def evaluate(model, loader):
    model.eval()  # inference mode for any train-only layers
    p, y = [], []
    for x, target in loader:
        p += torch.sigmoid(model(x.to("mps")).squeeze(1)).cpu().tolist()
        y += target.tolist()
    yb = [t >= 0.5 for t in y]  # binarize the teacher's soft scores
    return roc_auc_score(yb, p), average_precision_score(yb, p)

# train-sample vs held-out test: gap ≈ 0.035 - generalizing, not memorizing
print(evaluate(model, test_loader))  # (0.843, ...)
```

The train-vs-test ROC gap came out to ≈ 0.035, basically nothing. It generalized. A ConvNeXt-Tiny AI-image detector, fine-tuned end-to-end, agreeing with our production AI-scores at ~0.84–0.875 ROC-AUC.

The payoff still feels slightly illegal:

The detector I couldn’t download, I trained over an afternoon for the price of some electricity.

Three caveats, said out loud so you don’t have to find them:

When you need a model, the reflex is to go shopping. For AI-generated image detection the shelf is full, and most of it won’t fit your data. That’s not bad luck; it’s the documented generalization gap, and it’s the title of this article.

But the reflex hides the real lesson: training your own AI image detector stopped being the expensive option. A frozen foundation model hands you a strong baseline for free; a few minutes of fine-tuning on a laptop closes the gap. You don’t need the GPU cluster you were dreading. You need the right loss, native-resolution crops, and the nerve to close the download tab and open a notebook.

Last time I told you that you don’t need to fine-tune a backbone. This time I did, and now you can tell which room you’re standing in.

The detector you can download was trained on someone else’s images. Yours weren’t.

If you’ve built a model that had to survive contact with real, messy, in-the-wild data, what broke? I read every reply.

Train Your Own AI Image Detector: Why Off-the-Shelf Detectors Fail on Your Data DINOv2 + ConvNeXt… https://pub.towardsai.net/train-your-own-ai-image-detector-why-off-the-shelf-detectors-fail-on-your-data-dinov2-convnext-f5d1b992a71e was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.