In Small-Data Medical Imaging, Variance Is the Enemy

Pediatric tuberculosis is one of the harder problems in clinical imaging. In children the radiological signs are subtle and non-specific, microbiological confirmation is unreliable because the disease is often paucibacillary, and many of the settings with the highest burden have very few radiologists. A model that can reliably triage chest X-rays, flagging the children who need confirmatory testing and ruling out those who do not, has real value.

This post is a technical case study of the engineering behind one such model, developed in collaboration with a partner hospital and a public TB program in Brazil. My role was the machine learning engineering: the preprocessing, the modeling, and the deployment. Clinical direction and label definitions came from the radiology team.

A note on anonymization: institutional names and a few specifics are withheld here because the underlying results are still under peer review. Publishing them in full beforehand could count as prior publication and jeopardize the journal submission.

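I want to focus on the three problems that took the most work: building a preprocessing pipeline that produces consistent inputs across ages and centers, controlling variance on a small, imbalanced dataset, and deploying the model on hardware inside the institution.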

The development set was 2,011 pediatric chest X-rays collected across several clinical sites in Brazil. Two characteristics shaped every decision that followed.

The first is class imbalance. TB prevalence in the development cohort was about 9.8 percent (197 positives). With a positive class that small, plain accuracy is meaningless and even AUROC can flatter a model, so the metrics that mattered were AUPRC and balanced accuracy alongside AUROC. AUPRC is the honest one here, because its no-skill baseline is the prevalence itself, around 0.10.
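The baselines behind that argument can be checked directly from the counts above. This is a minimal arithmetic sketch; the function name is illustrative and not taken from the project code.

```python
def imbalance_baselines(n_total: int, n_pos: int) -> dict:
    """No-skill reference points for an imbalanced binary cohort."""
    prevalence = n_pos / n_total
    return {
        # A random ranking has an expected AUPRC of roughly the prevalence.
        "auprc_no_skill": prevalence,
        # Predicting "negative" for every image already scores this accuracy.
        "accuracy_all_negative": 1.0 - prevalence,
        # That same trivial classifier has sensitivity 0 and specificity 1.
        "balanced_accuracy_all_negative": 0.5,
        # The AUROC of a random ranking does not depend on prevalence.
        "auroc_no_skill": 0.5,
    }


baselines = imbalance_baselines(n_total=2011, n_pos=197)
for name, value in baselines.items():
    print(f"{name}: {value:.3f}")
```

A classifier that never flags anyone is right about nine times in ten on this cohort, which is why accuracy says nothing useful here, while the same classifier sits at the floor on AUPRC and balanced accuracy.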

The second is heterogeneity. Pediatric thoraxes change shape dramatically between an infant and an adolescent, so a fixed crop or a fixed normalization does not transfer across ages. On top of that, each center used different equipment, exposure settings, and file conventions. Some of that heterogeneity showed up in ways that are easy to underestimate until they corrupt your labels, which I will come back to.

A word on what “positive” means, because pediatric TB makes this harder than it sounds. In children the disease is often paucibacillary, so microbiological confirmation frequently fails even when the child genuinely has TB. Our positive class therefore combined two kinds of label: cases confirmed by laboratory PCR, and presumptive cases where the clinical team diagnosed TB from the full clinical picture together with the radiograph. Folding presumptive cases into the positive class is the deliberate choice for a screening tool, where the cost of missing a case is high and waiting for microbiological proof that often never comes would systematically drop real positives.

The goal here was a single pipeline that takes a raw chest X-ray from any age and any center and produces a consistent input for the network. A few steps did most of the work.

Detecting the actual X-ray field. Many radiographs arrive with collimation borders, black margins, or burned-in console text that are not part of the image. The pipeline locates the exposed field in two passes: a coarse pass that reads the transition out of the near-black background to bracket the film, and a refinement pass that scores candidate rectangles by area, fill, and how centered and chest-shaped they are. The result is a bounding box around the real radiograph, and that box is what gets cropped and fed forward. On a clean full-frame image the box hugs the edges and does little; on an image with borders or a console strip it does real work. That is where most of the cross-center variance lived.
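The coarse pass can be illustrated with a few lines of NumPy. This is a sketch of the general idea only, not the production code: the function name `coarse_field_box` and the thresholds `black_thresh` and `min_fill` are assumptions made for illustration, and the refinement pass that scores candidate rectangles is not shown.

```python
import numpy as np


def coarse_field_box(img: np.ndarray, black_thresh: int = 10, min_fill: float = 0.5):
    """Bracket the exposed field by dropping near-black border rows and columns.

    Returns (top, bottom, left, right), with bottom and right exclusive.
    """
    bright = img > black_thresh
    # Fraction of non-black pixels in each row and each column.
    row_fill = bright.mean(axis=1)
    col_fill = bright.mean(axis=0)
    rows = np.flatnonzero(row_fill >= min_fill)
    cols = np.flatnonzero(col_fill >= min_fill)
    if rows.size == 0 or cols.size == 0:
        # Nothing looks exposed: fall back to the full frame.
        return 0, img.shape[0], 0, img.shape[1]
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1


# Synthetic example: a 100x120 black frame with an exposed field inside it.
frame = np.zeros((100, 120), dtype=np.uint8)
frame[10:90, 20:100] = 128
top, bottom, left, right = coarse_field_box(frame)
field = frame[top:bottom, left:right]
print((top, bottom, left, right), field.shape)
```

Requiring a minimum fill per row and column, rather than any single bright pixel, is what keeps a thin strip of burned-in console text from stretching the box. The right threshold depends on the data and would need tuning per source.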

Why this mattered for our data specifically. The clinical images came from several centers, on different equipment, at genuinely different quality levels: some sites produced clean, well-exposed films, others did not. That heterogeneity is the whole reason field detection and contrast normalization earned their place. Without a step that standardizes the field and the contrast, the network would be learning the scanner and the center as much as the disease. Standardizing the signal before it reached the CNNs was vital, not cosmetic.

Aspect-preserving resize to 512. The field crop is letterboxed to 512 by 512, so the thorax keeps its true proportions instead of being stretched to a square. This is the step that lets a toddler’s image and an adolescent’s image be compared as network inputs without distorting either.
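A letterbox resize can be sketched as follows. This is an illustrative version under stated assumptions: nearest-neighbour sampling is used only to keep the example dependency-free, and the padding value and centering are guesses rather than details confirmed by the pipeline.

```python
import numpy as np


def letterbox(img: np.ndarray, size: int = 512, pad_value: int = 0) -> np.ndarray:
    """Fit a 2-D image inside a size x size canvas without changing its aspect ratio."""
    h, w = img.shape
    scale = size / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    # Nearest-neighbour sampling keeps this sketch dependency-free;
    # a real pipeline would more likely use an area or bilinear interpolator.
    row_idx = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    col_idx = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = img[row_idx][:, col_idx]
    # Centre the resized image on a square canvas filled with pad_value.
    canvas = np.full((size, size), pad_value, dtype=img.dtype)
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas


# A tall 800x600 crop: height maps to 512, width to 384, leaving 64 px of padding per side.
crop = np.full((800, 600), 200, dtype=np.uint8)
out = letterbox(crop)
print(out.shape, out[:, :64].max(), out[:, 64:448].min())
```

The longer side sets the scale, so a narrow infant thorax and a broad adolescent one both land on the same canvas with their own proportions intact.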

Contrast normalization with CLAHE. Exposure varied across machines, so I applied CLAHE (Contrast Limited Adaptive Histogram Equalization) to the grayscale crop. CLAHE equalizes contrast locally, which brings out lung detail consistently across sources without blowing up noise the way global equalization can. The output is single-channel grayscale replicated across three channels, which is what the ImageNet-pretrained backbones expect.

Figure 1. The pipeline applied to public Montgomery and Shenzhen chest X-rays. Left, the original with the detected field outlined in green. Right, the CLAHE-normalized 512 by 512 output. One pipeline produces consistent inputs across two different sources and both classes.

Note on the images: every figure here was generated by running the real production pipeline on public data, the Montgomery and Shenzhen TB datasets. No images from the clinical training set appear anywhere in this post, since those are pediatric patient radiographs from clinical sites and are not mine to publish.

With a small, imbalanced dataset, the central risk is variance. A single deep network trained on 2,000 images will overfit, and its performance will swing depending on which images happen to land in the validation split. Two design choices addressed this.

Five-fold cross-validation, used for everything. Rather than carving off a single fixed validation set and wasting a fifth of an already small dataset, I used five-fold cross-validation and reported out-of-fold (OOF) predictions. Every image gets a prediction from a model that never saw it in training, which gives an honest estimate while using all the data. Because the study reports OOF predictions, it is more accurate to describe a single development cohort than to pretend there is a separate held-out validation split.
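The out-of-fold bookkeeping looks roughly like this. It is a sketch with a toy one-feature model standing in for a network; `stratified_folds`, `out_of_fold_predictions`, `fit`, and `predict` are hypothetical names, and the data is synthetic, sized to match the cohort described above.

```python
import numpy as np


def stratified_folds(labels: np.ndarray, n_folds: int = 5, seed: int = 0) -> np.ndarray:
    """Assign each sample a fold id, spreading each class evenly across folds."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of


def out_of_fold_predictions(features, labels, fit, predict, n_folds=5, seed=0):
    """Score every sample with the one model that did not train on it."""
    fold_of = stratified_folds(labels, n_folds, seed)
    oof = np.full(len(labels), np.nan)
    for k in range(n_folds):
        train, held_out = fold_of != k, fold_of == k
        model = fit(features[train], labels[train])
        oof[held_out] = predict(model, features[held_out])
    return oof, fold_of


def fit(x, y):
    """Toy stand-in for training: remember the two class means."""
    return x[y == 1].mean(), x[y == 0].mean()


def predict(model, x):
    """Toy stand-in for inference: logistic score around the midpoint of the means."""
    pos_mean, neg_mean = model
    return 1.0 / (1.0 + np.exp(-(x - (pos_mean + neg_mean) / 2)))


rng = np.random.default_rng(42)
y = np.zeros(2011, dtype=int)
y[:197] = 1                               # cohort size and positives from the text
x = rng.normal(loc=y * 0.8, scale=1.0)    # synthetic feature, not real data
oof, fold_of = out_of_fold_predictions(x, y, fit, predict)
print(np.isnan(oof).sum(), np.bincount(fold_of[y == 1]))
```

Stratifying matters at this prevalence: with 197 positives, each fold holds 39 or 40 of them, so no fold is evaluated on a handful of cases by accident.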

An ensemble across many architectures. Instead of betting on one network, I trained 13 architectures, each across the five folds, for 65 models in total. Each was initialized from ImageNet weights, with the backbone frozen to train the classifier head first and the last stage then unfrozen for fine-tuning. The final score for an image is the mean predicted probability across the ensemble. Averaging diverse architectures reduces variance far more reliably than tuning a single model, and in a small-data regime that variance reduction is most of the battle.
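The averaging step itself is one line; the synthetic example below is only meant to show why it helps. It assumes each architecture's error is independent noise around a shared signal, which is more optimistic than reality, since real models make correlated mistakes and the gain is correspondingly smaller.

```python
import numpy as np


def ensemble_mean(probs_by_arch: np.ndarray) -> np.ndarray:
    """Average probabilities; rows are architectures, columns are images."""
    return probs_by_arch.mean(axis=0)


# Synthetic illustration: every "architecture" sees the same underlying signal
# plus its own independent noise. These are not the study's predictions.
rng = np.random.default_rng(0)
n_arch, n_images = 13, 2011
signal = rng.uniform(0.0, 1.0, size=n_images)
noise = rng.normal(0.0, 0.2, size=(n_arch, n_images))
probs_by_arch = np.clip(signal + noise, 0.0, 1.0)

mean_prob = ensemble_mean(probs_by_arch)
single_err = np.abs(probs_by_arch - signal).mean(axis=1)   # one value per architecture
ensemble_err = np.abs(mean_prob - signal).mean()
print(f"best single error: {single_err.min():.3f}, ensemble error: {ensemble_err:.3f}")
```

For out-of-fold scoring, each image has one prediction per architecture (from the fold that held it out), so the array has 13 rows. At inference time on a new image, all 65 fold models can contribute to the mean.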

The 13 were ConvNeXt-Tiny, ConvNeXt-Small, EfficientNet-B0, EfficientNet-V2-S, MobileNetV3-Small, MobileNetV3-Large, ShuffleNetV2-x1.0, ResNet18, ResNet50, DenseNet121, RegNet-Y-8GF, ViT-B/16, and a DenseNet121 initialized from CheXNet chest-X-ray weights. Most are convolutional, with one vision transformer and one backbone pretrained on chest radiographs instead of natural images, so the ensemble spans different inductive biases and different pretraining, not just different depths.

The payoff shows up in the numbers. Individually, these models landed between about 0.63 and 0.72 AUROC on the internal OOF cohort, with ResNet50 the strongest single architecture at 0.718. The ensemble mean reached 0.744, above every individual member. That gap is the variance reduction doing its job, and it is the whole reason for training thirteen models instead of crowning one. Worth noting against intuition: the chest-X-ray-pretrained DenseNet was the weakest single model here at 0.632, a reminder that in-domain pretraining is not automatically better than ImageNet on a given task.

I also tested whether selecting the top few architectures by validation AUPRC would help. It looked better on paper but it was selection bias dressed up as an improvement, so I dropped it. The plain mean of all architectures is the cleaner, more defensible primary metric, and that is what the paper reports.

One more negative result is worth stating plainly, because negative results rarely get published. I tested adding lateral views through cascade and fusion approaches. At this dataset scale, frontal-only consistently matched or beat the multi-view approaches on AUPRC. The lateral views added noise rather than signal, so the production model is frontal-only.

Figure 2. Architecture of the ensemble: 13 architectures, each trained across 5 folds, averaged into a single mean probability.

The headline numbers come from two cohorts: the internal development cohort (OOF predictions, n = 2,011) and an external validation cohort (n = 199), drawn from the pediatric images within the Montgomery and Shenzhen datasets (the main public TB chest X-ray sets, otherwise adult) plus independent clinical data.

On the internal cohort the ensemble mean reached an AUROC of 0.744 (95% CI 0.705 to 0.781) and an AUPRC of 0.279 (0.226 to 0.342). On the external cohort it reached an AUROC of 0.860 (0.785 to 0.924) and an AUPRC of 0.412 (0.237 to 0.650). Against a no-skill baseline equal to the prevalence, around 0.10, that external AUPRC is roughly four times chance, which says more about real performance on the rare positive class than the AUROC does.
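Intervals like these are commonly obtained by resampling images. The post does not say which method produced the reported intervals, so the percentile bootstrap below is an assumption, shown on synthetic scores for a cohort of the same size as the external one; `auroc` and `bootstrap_ci` are illustrative names.

```python
import numpy as np


def auroc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


def bootstrap_ci(y_true, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over images; resamples missing a class are skipped."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue
        stats.append(metric(y_true[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic cohort with the external cohort's size and positive count (199, 20).
rng = np.random.default_rng(1)
y = np.zeros(199, dtype=int)
y[:20] = 1
scores = rng.normal(loc=y * 1.5, scale=1.0)   # made-up scores, not model output
point = auroc(y, scores)
lo, hi = bootstrap_ci(y, scores, auroc)
print(f"AUROC {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

With only 20 positives, each resample draws a noticeably different set of positive cases, which is why intervals on a cohort this size come out wide.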

The external number being higher than the internal one is not a contradiction. The two cohorts have different case mixes, and the only test that counts for a screening tool meant to leave its home institution is whether it generalizes to data it never trained on. It does, at least on this cohort.

One honest caveat on that external number. The public images are, on average, cleaner and acquired on better-controlled equipment than the heterogeneous multi-center clinical data the model trained on. So part of the high external AUROC reflects an easier, higher-quality cohort, not pure generalization skill. What that caveat does not touch is the version-over-version gain: 0.751 to 0.860 is measured on the same external cohort, so the improvement is real and not an artifact of cohort quality.

The bigger story is that jump from the previous model version, and what drove it. The previous version was trained on about 1,700 images; the new one adds roughly 300 more, bringing the development cohort to 2,011. That is the only substantive change between the two, so the gain is attributable to more data rather than to architecture or hyperparameter tuning. Internally the two versions are essentially tied, AUROC 0.741 to 0.744, a difference of roughly 0.003. Externally the new version went from 0.751 to 0.860, a gain of 0.109 with a DeLong p of 0.0015 and a 95% CI on the difference of 0.044 to 0.183. The improvement is specifically in cross-site generalization, which is exactly where it counts, and it came from adding a few hundred well-curated images to a small dataset, a reminder of how much each image is worth in this regime.

Figure 3. ROC curves for the ensemble mean across the development (OOF) and external validation cohorts.

For a screening tool the threshold matters as much as the curve. At the Youden’s J operating point on the external cohort the model showed a clear screening profile: sensitivity 0.95, specificity 0.61, NPV 0.991, PPV 0.22, and balanced accuracy 0.78. It missed one of the twenty positives. That is the shape you want for triage, because a negative result is highly trustworthy, so the model can rule children out and concentrate confirmatory testing where it is needed. The honest caveat: twenty positives is a small denominator, so the sensitivity estimate is encouraging but wide, and this threshold was chosen on the same cohort it is reported on, so treat the operating point as illustrative rather than locked in. We are actively expanding the external validation cohort with pediatric images from other countries, to test the model on a wider range of scanners, populations, and disease patterns than the current set covers. Once that larger, more diverse external cohort is in place, it becomes the right basis for selecting and locking a production threshold, rather than tuning the operating point on the same handful of cases it is reported on. Until then, the numbers above are a credible screening profile, not a fixed deployment setting.
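The operating-point figures all follow from four confusion-matrix counts. In the sketch below, the 20 positives and the single miss come from the text; the split of the 179 negatives into 110 true negatives and 69 false positives is an assumption, chosen because it reproduces the rounded figures reported above. It is not a published count.

```python
def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard screening metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Youden's J is the quantity maximized when choosing the threshold.
        "youden_j": sensitivity + specificity - 1.0,
        "npv": tn / (tn + fn),
        "ppv": tp / (tp + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }


# tp and fn are stated in the text; tn and fp are assumed values that are
# consistent with the reported specificity, NPV, and PPV after rounding.
metrics = screening_metrics(tp=19, fn=1, tn=110, fp=69)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The arithmetic also shows why a low PPV is expected rather than alarming: at roughly 10 percent prevalence, even a modest false-positive rate among the 179 negatives outnumbers the 19 true positives, while a single false negative keeps the NPV above 0.99.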

Figure 4. Operating point on the external ROC curve, with sensitivity, specificity, and NPV annotated.

A model that only runs in a notebook does not help anyone. Training ran on AWS SageMaker, on an ml.g4dn.xlarge GPU instance, but the deployment target was a different world: a server inside the partner program rather than a cloud environment I controlled. The production model was packaged as a container with DICOM handling built in, so it ingests the format hospitals actually produce rather than expecting pre-converted images. It runs frontal-only inference as the production configuration: as the lateral-view experiments above showed, frontal-only matched or beat the multi-view approaches at this dataset scale, so there was no reason to ship anything heavier.

The deployment had a deliberate constraint: it had to run inside the institution, so no patient image leaves it. The ensemble was exported to ONNX, DICOM decoding was handled in-container, and the model was exposed through a local REST endpoint so the equipment calls it over the internal network. That on-prem, data-stays-local design is what keeps it compliant and deployable in a public-health setting. It was validated end to end on the program’s own hardware, a single NVIDIA RTX A4000 16 GB, a modest prosumer GPU rather than a datacenter cluster, which matters because the sites with the highest TB burden are the least likely to have serious GPU infrastructure to spare.

The point I would make to anyone doing similar work: budget real time for the deployment surface. DICOM decoding, codec support, and getting a heavy container to run reliably on someone else’s hardware is its own project, separate from the modeling.

Used as intended, this is a triage and rule-out tool. The high sensitivity and very high negative predictive value let it prioritize confirmatory testing, directing limited expert attention to the children most likely to need it. It is not a replacement for a radiologist or for microbiological confirmation, and it was never designed to be.

The limitations are real and worth naming plainly. The training data came entirely from clinical sites in Brazil, so good performance elsewhere has to be earned, not assumed.

There is no dedicated open pediatric TB imaging set. The two main public TB datasets, Montgomery and Shenzhen, are otherwise adult, and the groups that hold large pediatric cohorts generally do not release them. That scarcity is itself one of the hardest constraints of this problem, and it is why the external cohort had to be assembled from the few pediatric cases sitting inside those public sets plus independent clinical data, at 199 images and 20 positives. Good numbers on a problem this data-starved are hard-won, and the same scarcity is exactly what keeps the external validation small. The model did well on it, which is encouraging. Even so, a couple of hundred images pulled from a handful of sources is not broad, multi-site, independent pediatric validation, and I will not pretend it is.

The risk that follows is distribution drift. A new deployment site brings its own scanners, exposure settings, patient mix, and disease patterns, none of which have to match the Brazilian development distribution. This is exactly why the planned cross-country external validation is not a formality: images from other countries will carry their own distributions, and the screening profile that looks clean here could move in either direction on data that differs from what the model has seen. The honest next step is prospective validation on independent pediatric cohorts in deployment, with monitoring for drift, rather than treating the external AUROC as a settled property of the model.

Where this is meant to go. The goal was never a benchmark number; it was a tool that reaches the children who need it. The direction we are working toward is a deployable screening aid integrated into the public diagnostic pathway for pediatric TB in Brazil, through the SUS, the country’s universal public health system that covers the entire population at no cost at the point of care. The case for it is concrete: many towns in the interior of Brazil have no radiologist on site at all, so a chest X-ray taken there may wait days to be read, or be read by a non-specialist. In that gap, a preliminary screening tool that runs locally and flags the films that need urgent expert attention has real triage value, and it is most useful precisely where specialists are scarcest. The framing matters: this is one more instrument in the diagnostic process for pediatric TB, sitting alongside the clinician and the confirmatory test to help triage and prioritize, not a system that decides on its own. One could imagine it entering the network first as exactly that preliminary screen and evolving from there as it earns broader validation. That is a direction, not a roadmap, and it is the kind of future that building for accessible, in-facility hardware from the start is meant to keep open.

If I had to compress this into a few lessons for the next person building a medical imaging model on a small, multi-center dataset, none of them would be exotic. They are the unglamorous discipline of small-data medical imaging: respect the variance, distrust the easy win, and put the work into the inputs and the validation rather than the cleverness of the model.

It comes down to this. When the positive class is fewer than two hundred cases, every single one is precious, and there is no volume of data to paper over a sloppy pipeline. That is what makes the preprocessing almost artisanal: each image has to be found, verified, field-detected, and normalized with care, because one silently dropped or distorted positive is a meaningful fraction of everything the model has to learn from. And it is why the internal cross-validation is not a box to tick but the backbone of the whole effort, the only way to get an honest read on a dataset too small to spare a held-out split. Get the careful, per-image work and the rigorous internal validation right, and the reward is not a leaderboard score. It is a model that still behaves when it leaves the building it was trained in, which, for a tool meant to help find sick children where specialists are scarce, is the only result that matters.

In Small-Data Medical Imaging, Variance Is the Enemy was originally published in Towards AI on Medium.
