MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

MARGIN, a new online calibration method, corrects systematic mis-calibration in foundation model confidence scores and achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection on hard benchmarks, raw verbalized confidence yields pairwise resolution of only 45-56%, which the paper describes as worse than random; MARGIN raises it to 70-89%, surpassing the always-best-model oracle on three of four benchmarks. The method requires no model access, no held-out data, and no retraining: it learns per-agent, per-confidence-band calibration factors directly from the task stream.
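To make "calibration error" concrete, here is a minimal sketch of expected calibration error (ECE), a common binned metric. The summary does not say which calibration metric the paper reports, so the choice of ECE, the function name, and the bin count are all assumptions made for illustration.

```python
def expected_calibration_error(confs, correct, n_bins=10):
    """Binned ECE: weighted mean of |avg confidence - accuracy| per bin.

    confs   : confidences in [0, 1]
    correct : matching 0/1 outcomes
    n_bins  : number of equal-width bins (illustrative default)
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    n = len(confs)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

An agent that reports 0.9 confidence but is right only one time in four would score an ECE of about 0.65 under this definition, while a perfectly calibrated agent would score 0.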

Published May 25, 2026

arXiv:2605.22949v1

Abstract: Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift.

We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults.

Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
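The abstract names the ingredients (per-agent, per-confidence-band state, symmetric exponentially weighted moving averages, Bayesian shrinkage, three hyperparameters) but gives no formulas. The sketch below is one plausible way those pieces could fit together, not the paper's algorithm. The class name, the update rule, the shrinkage form `n / (n + k)`, and the three hyperparameters `alpha`, `n_bands`, and `shrink_k` with their defaults are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BandState:
    ewma_acc: float = 0.0  # EWMA of observed correctness (0 or 1)
    weight: float = 0.0    # EWMA normalizer, used for bias correction
    count: int = 0         # number of observations seen in this band


class OnlineCalibrator:
    """Hypothetical online calibrator; NOT the MARGIN reference code."""

    def __init__(self, alpha=0.05, n_bands=5, shrink_k=10.0):
        self.alpha = alpha        # EWMA step size (assumed hyperparameter)
        self.n_bands = n_bands    # confidence bands per agent (assumed)
        self.shrink_k = shrink_k  # shrinkage strength (assumed)
        self.state = defaultdict(BandState)

    def _band(self, conf):
        return min(int(conf * self.n_bands), self.n_bands - 1)

    def calibrated(self, agent, conf):
        """Blend the band's tracked accuracy with the raw confidence."""
        s = self.state[(agent, self._band(conf))]
        if s.count == 0:
            return conf  # no evidence yet: trust the raw score
        observed = s.ewma_acc / s.weight
        lam = s.count / (s.count + self.shrink_k)  # shrinkage weight
        return lam * observed + (1.0 - lam) * conf

    def update(self, agent, conf, correct):
        """Symmetric update: same step size for right and wrong answers."""
        s = self.state[(agent, self._band(conf))]
        a = self.alpha
        s.ewma_acc = (1.0 - a) * s.ewma_acc + a * float(correct)
        s.weight = (1.0 - a) * s.weight + a
        s.count += 1


def pick_agent(calibrator, reported):
    """Choose the agent whose calibrated confidence is highest.

    reported: dict mapping agent name -> raw self-reported confidence.
    """
    return max(reported, key=lambda ag: calibrator.calibrated(ag, reported[ag]))
```

In this sketch a coordinator would call `pick_agent` on each task and then feed the graded outcome back through `update`. An agent that keeps reporting 0.9 while being wrong sees its calibrated score fall toward its tracked accuracy, so a more modest but more accurate agent starts to win the selection. Because the averages are exponentially weighted, old evidence decays, which is what would let such a scheme follow distribution shift where a fixed design-time correction cannot.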


Read original on arxiv.org → arxiv.org/abs/2605.22949
