{"slug": "a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-scaling", "title": "A Deep Dive into Calibration of Language Models: Platt Scaling, Isotonic Regression, Temperature Scaling", "summary": "Large language models (LLMs) frequently exhibit miscalibration, where their reported confidence scores do not match actual accuracy rates, with studies showing mean calibration scores as low as 23.9% in biomedical models. Three post-hoc recalibration methods—temperature scaling, Platt scaling, and isotonic regression—are commonly used to correct this gap, but applying them to LLMs requires careful adaptation due to the models' exponentially large output spaces and limited API access.", "body_md": "# A Deep Dive into Calibration of Language Models: Platt Scaling, Isotonic Regression, Temperature Scaling\n\nDiscover three post-hoc methods for closing the gap between confidence and accuracy.\n\n## # Introduction\n\nA model that says it is 90% confident should be right 90% of the time. When that relationship breaks down, you get a **miscalibration** problem. The model's scores stop telling you anything useful about reliability.\n\nFor **large language models** (LLMs), miscalibration is widespread. A [2024 NAACL survey](https://aclanthology.org/2024.naacl-long.366/) found that confidence scores diverge from actual correctness rates across factual QA, code generation, and reasoning tasks.\n\nAnother [study](https://www.biorxiv.org/content/10.1101/2025.02.11.637373v1.full) on biomedical models found mean calibration scores ranging from only 23.9% to 46.6% across all tested models. 
The gap is consistent.\n\nThe standard solution in **classical machine learning** is post-hoc recalibration: fit a simple function on a held-out validation set to map raw confidence scores to better-calibrated probabilities.\n\n**Three** methods dominate: [temperature scaling](https://github.com/gpleiss/temperature_scaling), [**Platt scaling**](https://www.blog.trainindata.com/complete-guide-to-platt-scaling/), and [**isotonic regression**](https://en.wikipedia.org/wiki/Isotonic_regression#:~:text=Isotonic%20regression%20is%20used%20iteratively,of%20supervised%20machine%20learning%20models.). All three were designed for [discriminative classifiers](https://medium.com/@akankshamalhotra24/generative-classifiers-v-s-discriminative-classifiers-1045f499d8cc), and applying them to LLMs requires care.\n\n## # Measuring Calibration\n\nThe dominant metric is [Expected Calibration Error](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d/) (ECE). It groups predictions into confidence bins, computes the absolute gap between mean confidence and observed accuracy in each bin, and averages those gaps weighted by the number of predictions in each bin. ECE = 0 is perfect calibration.\n\nA reliability diagram plots confidence against accuracy. A perfectly calibrated model sits on the diagonal. An overconfident model falls below it: confidence is high, but accuracy doesn't keep up.\n\nA [2025 evaluation](https://aejaspan.github.io/posts/2025-09-01-LLM-Clasifier-Confidence-Scores) of GPT-4o-mini as a text classifier found that 66.7% of its errors occurred at over 80% confidence, the canonical overconfidence pattern.\n\nECE alone is increasingly viewed as insufficient. A [research paper](https://arxiv.org/html/2512.16030) recommends pairing ECE with the [Brier score](https://en.wikipedia.org/wiki/Brier_score), overconfidence rates, and reliability diagrams. 
A single number obscures meaningful variation in where and how a model misbehaves.\n\n## # Why LLMs Complicate the Standard Setup\n\nThe three methods we cover assume a fixed output space. A classifier produces one **probability** per class, and calibration maps those probabilities to better estimates.\n\n**LLMs** don't work this way.\n\nFour complications matter here.\n\nFirst, the output space is exponentially large, so sequence-level confidence can't be enumerated. Second, semantically equivalent outputs may have very different token-level probabilities. Third, confidence disagrees across granularities; a [research paper](https://aclanthology.org/2024.naacl-long.366/) on atomic calibration showed that generative models exhibit their lowest average confidence in the middle of generation, not at the start or end.\n\nFourth, many LLMs only expose top-k token probabilities through their **API**, so classical calibration approaches that rely on full logit access need modification.\n\n## # Applying Temperature Scaling\n\nTemperature scaling divides the logit vector by a scalar T before applying softmax. When T > 1, the distribution flattens and confidence drops. When T < 1, the distribution sharpens and confidence rises.\n\nT is fit on a held-out validation set by minimizing negative log-likelihood. The method adds one parameter, preserves prediction rankings, and is cheap to compute.\n\nThe [original formulation](https://github.com/gpleiss/temperature_scaling) targeted DenseNet image classifiers. For LLMs, temperature controls the probability distribution over the vocabulary at each decoding step, so the same logic applies.\n\nThe problem is [Reinforcement Learning from Human Feedback](https://huggingface.co/blog/rlhf) (RLHF). 
Post-RLHF models develop input-dependent overconfidence: the degree of miscalibration varies across inputs, and a single T can't account for that variation.\n\nAverage ECE scores above 0.377 have been documented for models like GPT-3 in verbalized confidence tasks, and a [2025 survey](https://arxiv.org/html/2505.18658v2) confirms that RLHF-tuned models consistently overestimate their confidence.\n\n[Adaptive Temperature Scaling](https://arxiv.org/abs/2409.19817) (ATS) addresses this directly. Instead of using a single fixed T, ATS predicts a per-token temperature from token-level hidden features, with the predictor fit on a supervised fine-tuning dataset. The authors report that ATS improved calibration by 10–50% without hurting task performance. For any RLHF-tuned model, ATS is a stronger baseline than standard temperature scaling.\n\nStandard temperature scaling still works well for base models before RLHF. When miscalibration is roughly uniform across inputs, a single T is often enough to correct systematic over- or underconfidence. The limitation is specific to post-RLHF models.\n\n## # Applying Platt Scaling\n\nPlatt scaling fits a logistic function over the uncalibrated scores: p = σ(A·s + B), where s is the uncalibrated score, σ is the sigmoid function, and A and B are learned from a held-out validation set with binary correctness labels.\n\nThe sigmoid shape gives a parametric mapping with two free parameters.\n\nPlatt scaling was originally developed for SVMs but generalizes to any system that produces a scalar confidence score.\n\nThe two-parameter fit is also data-efficient compared to isotonic regression: it can produce usable estimates from a smaller calibration set, which matters in deployment contexts where labeled correctness data is limited.\n\nIn LLM contexts, Platt scaling operates over sequence-level or token-level confidence scores.\n\nA [paper](https://www.software-lab.org/publications/icse2025_calibration.pdf) on LLM-generated 
code confidence found that Platt scaling produced better-calibrated confidence than the raw, uncalibrated scores. Another study on LLMs for text-to-SQL introduced [Multivariate Platt Scaling](https://arxiv.org/html/2409.10855v1) (MPS), which extends single-variable Platt scaling to combine sub-clause frequency scores across multiple generated samples and consistently outperformed single-score baselines.\n\nTwo **limitations** are documented. First, global sequence-level Platt scaling is too coarse for tasks where correctness depends on local edit decisions: a single sigmoid mapping can't capture sample-dependent miscalibration patterns.\n\nSecond, Platt scaling can degrade proper scoring performance for strong models.\n\n## # Applying Isotonic Regression\n\nIsotonic regression takes the non-parametric route.\n\nIt learns a piecewise-constant, monotonically non-decreasing mapping from uncalibrated scores to calibrated probabilities using the [Pool Adjacent Violators Algorithm](https://medium.com/@jhimli.c1/unveiling-the-magic-of-pava-a-simple-path-to-monotonic-regression-37f19ffa60df) (PAVA). There's no assumed shape for the calibration function, which makes it more flexible than Platt scaling when the confidence-accuracy relationship isn't sigmoid-shaped.\n\nThe piecewise-constant output can approximate any monotone shape: linear, stepped, or concave. That adaptability is the main reason isotonic regression tends to outperform Platt scaling in empirical comparisons.\n\nThe cost is overfitting risk on small calibration sets. 
The mapping only generalizes well when there's enough data to constrain it.\n\nThe empirical evidence favors isotonic regression when data is plentiful.\n\nA rigorous [comparison](https://arxiv.org/html/2509.23665v1) across multiple datasets and architectures found that isotonic regression beat Platt scaling on ECE and Brier score with statistical significance, using paired t-tests with Bonferroni correction at α = 0.003.\n\nIn that study, a Random Forest baseline improved from a reliability score of 0.8268 uncalibrated to 0.9551 with Platt scaling and 0.9660 with isotonic regression. Both methods could degrade proper scoring performance for strong models, but the isotonic edge held consistently.\n\nFor LLM multiclass settings, normalization-aware extensions of isotonic regression have been reported to improve on the standard method, consistently outperforming both one-vs-rest (OvR) isotonic regression and standard parametric methods on negative log-likelihood (NLL) and ECE.\n\nThe data requirement is the binding constraint. Isotonic regression's advantage is real, but it may not transfer to low-data deployment scenarios.\n\n## # What the Literature Leaves Open\n\nThree **gaps** are worth flagging before deploying any of these methods.\n\nThe **RLHF** interaction has been studied only for temperature scaling. How **Platt scaling** and isotonic regression perform on post-RLHF models hasn't been systematically tested. **ATS** exists because standard temperature scaling needed an explicit fix for this case. Whether the other two methods need similar extensions is an open question.\n\nMost direct **comparisons** of all three methods come from the general machine learning calibration literature. LLM-specific benchmarks that test all three head-to-head are rare. The ICSE 2025 code calibration [paper](https://www.software-lab.org/publications/icse2025_calibration.pdf) is one of the few, and its scope is limited to code generation.\n\nCalibration set size is a real deployment constraint. 
Published isotonic regression results assume datasets large enough to constrain the mapping. In production with limited labeled examples, the gap between isotonic regression and Platt scaling may close or reverse.\n\n## # Conclusion\n\n**Temperature scaling** is the right starting point for most teams. For base models without RLHF, a single T often does enough.\n\nFor **RLHF**-tuned models, switch to ATS: the per-token temperature handles the input-dependent overconfidence that a global scalar misses.\n\n**Platt scaling** is the practical choice when the calibration set is small or when calibration needs to slot into a larger pipeline. It's data-efficient and straightforward to implement. The limitation is scope: it can't capture miscalibration that varies across samples, and it can degrade proper scoring performance for strong models.\n\n**Isotonic regression** has the strongest empirical track record of the three. Use it when the calibration set is large enough to constrain the mapping without overfitting, and pair it with normalization-aware extensions in multiclass settings.\n\nThe decision that comes before all of these is what \"**confidence**\" means for the task. Token probability, sequence probability, verbalized confidence, and consistency across samples can give different values for the same output. A calibration method applied to the wrong signal doesn't improve reliability. Getting that definition right is the prerequisite for any of the methods above to work.\n\n[Nate Rosidi](https://twitter.com/StrataScratch) is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. 
Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.", "url": "https://wpnews.pro/news/a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-scaling", "canonical_source": "https://www.kdnuggets.com/a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-regression-temperature-scaling", "published_at": "2026-06-05 14:00:11+00:00", "updated_at": "2026-06-05 14:53:05.169852+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "artificial-intelligence", "neural-networks", "natural-language-processing"], "entities": ["Platt Scaling", "Isotonic Regression", "Temperature Scaling", "NAACL"], "alternates": {"html": "https://wpnews.pro/news/a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-scaling", "markdown": "https://wpnews.pro/news/a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-scaling.md", "text": "https://wpnews.pro/news/a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-scaling.txt", "jsonld": "https://wpnews.pro/news/a-deep-dive-into-calibration-of-language-models-platt-scaling-isotonic-scaling.jsonld"}}