{"slug": "margin-runtime-confidence-calibration-for-multi-agent-foundation-model", "title": "MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination", "summary": "A new online calibration method called MARGIN corrects systematic mis-calibration in foundation model confidence scores, achieving 3-6x lower calibration error than existing design-time methods under distribution shift. In multi-agent coordination tasks, MARGIN raises pairwise resolution from below-random levels (45-56%) to 70-89%, surpassing the always-best-model oracle on three of four benchmarks. The method requires no model access, no held-out data, and no retraining, learning per-agent calibration factors directly from the task stream.", "body_md": "arXiv:2605.22949v1 Announce Type: new\nAbstract: Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift.\nWe present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.", "url": "https://wpnews.pro/news/margin-runtime-confidence-calibration-for-multi-agent-foundation-model", "canonical_source": "https://arxiv.org/abs/2605.22949", "published_at": "2026-05-25 04:00:00+00:00", "updated_at": "2026-05-25 15:14:09.699829+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-agents", "ai-research"], "entities": ["MARGIN"], "alternates": {"html": "https://wpnews.pro/news/margin-runtime-confidence-calibration-for-multi-agent-foundation-model", "markdown": "https://wpnews.pro/news/margin-runtime-confidence-calibration-for-multi-agent-foundation-model.md", "text": "https://wpnews.pro/news/margin-runtime-confidence-calibration-for-multi-agent-foundation-model.txt", "jsonld": "https://wpnews.pro/news/margin-runtime-confidence-calibration-for-multi-agent-foundation-model.jsonld"}}