
MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

A new online calibration method called MARGIN corrects systematic mis-calibration in foundation model confidence scores, achieving 3-6x lower calibration error than existing design-time methods under distribution shift. In multi-agent coordination tasks, MARGIN raises pairwise resolution from below-random levels (45-56%) to 70-89%, surpassing the always-best-model oracle on three of four benchmarks. The method requires no model access, no held-out data, and no retraining, learning per-agent calibration factors directly from the task stream.

1 min read · published May 25, 2026

arXiv:2605.22949v1. Abstract: Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
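
To make the idea concrete, here is a minimal Python sketch of an online, per-agent, per-confidence-band calibrator in the spirit of the abstract. It is not the paper's algorithm: the abstract gives no update rule, so the class name `OnlineCalibrator`, the helper `select`, the three hyperparameters (`alpha`, `n_bands`, `prior_strength`) and their defaults are all assumptions made for illustration. "Symmetric" is read here as using the same step size for correct and incorrect outcomes, and "Bayesian shrinkage" as a pseudo-count blend between the raw confidence and the tracked accuracy. The sketch tracks observed accuracy per band, whereas the abstract speaks of calibration factors, whose exact form is not stated.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class OnlineCalibrator:
    """Hypothetical sketch; names and defaults are assumptions, not from the paper."""
    alpha: float = 0.1            # EWMA step size, same for hits and misses
    n_bands: int = 5              # number of equal-width confidence bands
    prior_strength: float = 10.0  # pseudo-count controlling shrinkage
    _ewma: dict = field(default_factory=dict)
    _count: dict = field(default_factory=lambda: defaultdict(int))

    def _band(self, confidence: float) -> int:
        # Map a confidence in [0, 1] to a band index in [0, n_bands - 1].
        idx = int(confidence * self.n_bands)
        return max(0, min(idx, self.n_bands - 1))

    def calibrated(self, agent: str, confidence: float) -> float:
        # Blend the raw confidence with the tracked accuracy for this
        # (agent, band); with few observations the raw value dominates.
        key = (agent, self._band(confidence))
        n = self._count[key]
        if n == 0:
            return confidence
        w = n / (n + self.prior_strength)
        return w * self._ewma[key] + (1.0 - w) * confidence

    def update(self, agent: str, confidence: float, correct: bool) -> None:
        # Move the tracked accuracy toward the observed outcome (1 or 0).
        key = (agent, self._band(confidence))
        outcome = 1.0 if correct else 0.0
        prev = self._ewma.get(key, confidence)
        self._ewma[key] = prev + self.alpha * (outcome - prev)
        self._count[key] += 1


def select(cal: OnlineCalibrator, confidences: dict) -> str:
    # Coordinator rule: trust the agent with the highest calibrated confidence.
    return max(confidences, key=lambda a: cal.calibrated(a, confidences[a]))
```

In this sketch, an agent that keeps reporting 0.9 while being wrong sees its calibrated score in that band fall toward its observed accuracy, so a coordinator calling `select` would eventually prefer a less confident but more accurate agent. Because the update uses only the reported confidence and the observed outcome, it needs no model access, held-out data, or retraining, which matches the constraints the abstract describes.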
