You’ve probably seen the softmax function often while working with machine learning, and for good reason. It began as a tool for simple classification tasks, such as deciding whether an image contains a cat or a dog, by turning raw scores into usable probabilities. Today, it’s a core part of large language models like GPT, where it converts prediction scores into a probability distribution for selecting the next token. Softmax also drives the attention mechanism by normalizing similarity scores so the model can focus on relevant inputs.
Given how central it is to both classic ML tasks and modern AI systems, it’s natural to ask: why this specific formula — exponentials divided by their sum? The explanation traces back through 150 years of physics, psychology, and mathematics.
The formula isn’t new. It wasn’t invented in a machine learning lab in the 1990s. The same mathematical form emerged independently three times over roughly 120 years: first from a physicist studying thermodynamics, then from a psychologist modeling human choice, and finally from an engineer trying to make neural networks output valid probabilities. Each arrived at the formula by asking variations of the same question. None of them knew the others were coming.
This is the story of how the same answer kept appearing in wildly different fields, and what that tells us about mathematics.
Picture Vienna. Ludwig Boltzmann is 24 years old, brilliant, and obsessed with a question that seems almost philosophical: Why do gas molecules behave the way they do?
Not “what laws do they follow?” That was already understood. But why those laws? What principle of nature makes trillions of randomly bouncing particles produce the consistent, predictable behavior we observe?
The dominant thinking was that nature naturally selects the most probable state. Boltzmann thought differently. He proposed something radical:
Nature doesn’t care about probability. It cares about disorder.
Among all the ways molecules could be arranged while obeying the constraints (fixed total energy, fixed volume), nature explores the arrangements that can occur in the most ways. The maximum disorder. The maximum entropy.
This sounds backwards until you think about a simpler example. Shuffle a deck of cards once. You get one specific arrangement. Shuffle a billion times. You’ll almost never see the cards in perfect order (2♠, 2♣, 2♥, 2♦, 3♠…) because there’s only ONE way for that to happen. But you’ll almost always see random-looking mixes, because nearly all of the roughly 8 × 10^67 possible orderings of a 52-card deck look random.
The same principle applies to gas molecules. There are vastly more ways for molecules to be evenly distributed throughout a container than for them all to be crammed in one corner. So that’s what you observe: even distribution. Not because the molecules “want” it, but because that’s the only arrangement that’s probable enough to actually happen.
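The counting argument can be made concrete with a toy model. The sketch below is an illustration, not Boltzmann's own calculation: it assumes a hypothetical gas of 100 molecules, each equally likely to sit in the left or right half of a box, and counts the arrangements behind each outcome.

```python
import math

# Hypothetical toy gas: N molecules, each in the left or right half of a box.
N = 100

def ways(n_left, n_total=N):
    """Number of distinct arrangements with n_left molecules in the left half."""
    return math.comb(n_total, n_left)

all_in_one_half = ways(0)      # every molecule on the right: exactly 1 way
evenly_split = ways(N // 2)    # 50/50 split: about 1e29 ways
total = 2 ** N                 # each molecule independently picks a side

# Probability that the split is within 5 molecules of perfectly even.
near_even = sum(ways(k) for k in range(45, 56)) / total

print(f"ways(all on one side) = {all_in_one_half}")
print(f"ways(50/50 split)     = {evenly_split:.3e}")
print(f"P(all on one side)    = {all_in_one_half / total:.3e}")
print(f"P(45 to 55 on left)   = {near_even:.3f}")
```

Even with only 100 molecules, the near-even outcomes dominate. With the number of molecules in a real container, the imbalance becomes overwhelming.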
Boltzmann turned this intuition into mathematics. He asked: What probability distribution maximizes entropy while respecting the constraint that total energy is fixed?
To solve this, Boltzmann used a standard technique: Lagrange multipliers, a method for turning a constrained optimization problem into an unconstrained one.
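He set up the problem: maximize the entropy -Σ pᵢ ln(pᵢ) subject to two constraints, Σ pᵢ = 1 (the probabilities sum to one) and Σ pᵢεᵢ = E (the average energy is fixed), where pᵢ is the probability of a state with energy εᵢ.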
The Lagrangian looked like this:
ℒ = -Σ pᵢ ln(pᵢ) + λ(1 - Σ pᵢ) + β(E - Σ pᵢεᵢ)
Then he took derivatives and set them to zero. Here’s the key step:
∂ℒ/∂pᵢ = -(ln(pᵢ) + 1) - λ - βεᵢ = 0
Solving for pᵢ:
ln(pᵢ) = -λ - 1 - βεᵢ
And then — this is the crucial moment — he exponentiated both sides:
pᵢ = exp(-λ - 1) · exp(-βεᵢ)
The exponential didn’t come from nowhere. It’s the inverse of the logarithm that appears in the entropy formula. You can’t solve this equation without exponentiating. The math forced the exponential form.
Cleaning this up and applying the normalization constraint (probabilities sum to one):
pᵢ = exp(-βεᵢ) / Σⱼ exp(-βεⱼ)
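The cleanup step is short enough to spell out. In the sketch below, Z is shorthand introduced here for the normalizing sum (the text above does not name it). Summing the exponentiated solution over all states and requiring the total to equal one fixes the prefactor:

```latex
\begin{align}
\sum_i p_i = 1
  &\;\Longrightarrow\; e^{-\lambda - 1} \sum_j e^{-\beta \varepsilon_j} = 1 \\
  &\;\Longrightarrow\; e^{-\lambda - 1} = \frac{1}{Z},
     \qquad Z = \sum_j e^{-\beta \varepsilon_j} \\
  &\;\Longrightarrow\; p_i = \frac{e^{-\beta \varepsilon_i}}{Z}
\end{align}
```

The multiplier λ drops out entirely. Its only job is to enforce normalization, which is why the final formula contains just β and the energies.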
With β = 1/(kT), where k is Boltzmann’s constant and T is temperature:
pᵢ = exp(-εᵢ/kT) / Σⱼ exp(-εⱼ/kT)
This is the Boltzmann distribution. It describes how particles spread across energy states in a system at thermal equilibrium. More than thirty years later, J. Willard Gibbs formalized and extended this work in his 1902 textbook Elementary Principles in Statistical Mechanics, which is why it’s also called the Gibbs distribution.
This is softmax. Boltzmann just didn’t call it that.
Boltzmann had discovered something profound: if you want a probability distribution that respects certain constraints but makes no extra assumptions beyond those constraints, there’s exactly one answer. Not many answers. One.
The exponential form is unique. It emerges necessarily from the mathematics. You can’t choose a different formula without either violating the constraints or adding hidden assumptions.
This is important. Hold onto it.
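The uniqueness claim can be spot-checked numerically. The following is a minimal sketch with assumed values (three equally spaced energy levels and β = 1, both chosen arbitrarily for illustration). It builds rival distributions that satisfy the same two constraints and compares their entropy with that of the Boltzmann distribution.

```python
import math

def boltzmann(energies, beta):
    """Return p_i = exp(-beta * e_i) / sum_j exp(-beta * e_j)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(p):
    """Shannon entropy -sum p_i ln p_i (terms with p_i = 0 contribute 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mean_energy(p, energies):
    return sum(x * e for x, e in zip(p, energies))

energies = [0.0, 1.0, 2.0]         # hypothetical, equally spaced levels
p = boltzmann(energies, beta=1.0)  # beta = 1 is an arbitrary choice

# For equally spaced levels, shifting probability by d * (1, -2, 1) changes
# neither sum(p) nor the mean energy, so each q obeys the same constraints.
rivals = []
for d in (-0.05, -0.01, 0.01, 0.05):
    q = [p[0] + d, p[1] - 2 * d, p[2] + d]
    rivals.append(q)
    print(f"d={d:+.2f}  mean energy={mean_energy(q, energies):.4f}  "
          f"entropy={entropy(q):.4f}")

print(f"Boltzmann  mean energy={mean_energy(p, energies):.4f}  "
      f"entropy={entropy(p):.4f}")
```

Every rival has the same total probability and the same mean energy, yet lower entropy. A handful of perturbations is not a proof, but it matches what the Lagrange multiplier derivation predicts.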
Fast forward 91 years. R. Duncan Luce, a mathematical psychologist, is studying something completely different: how people choose between options.
Luce was building a formal theory of choice. If you’re a rational agent, what mathematical properties should your choices satisfy?
He proposed an axiom that sounds abstract but is actually profound:
If you prefer coffee to tea with a certain probability ratio, introducing a third option (like hot chocolate) shouldn’t change that ratio.
This is called Independence of Irrelevant Alternatives. It captures something intuitive: my preference between two options shouldn’t depend on whether a completely unrelated third option exists.
Mathematically, for any two options A and B:
P(A) / P(B) = f(A) / f(B)
for some value function f, regardless of what other options are available.
Now here’s the question Luce asked: what function f satisfies this property?
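He proved something striking: the axiom holds exactly when each option carries a positive weight and is chosen with probability proportional to that weight. And if the ratio of two choice probabilities is to depend only on the difference between the options’ values, that weight must be an exponential of the value.
If each option has a value vᵢ, then: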
P(i) = exp(vᵢ) / Σⱼ exp(vⱼ)
Luce didn’t know about Boltzmann’s thermodynamics work when he derived this. He was solving a problem in psychology, not physics. Yet the mathematics demanded the exponential form.
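The independence property is easy to check for the exponential form. The sketch below uses made-up values for coffee, tea, and hot chocolate (the numbers are hypothetical, chosen only for illustration) and compares the coffee-to-tea probability ratio with and without the third option.

```python
import math

def softmax(values):
    """P(i) = exp(v_i) / sum_j exp(v_j), computed stably by subtracting the max."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values for each drink.
v_coffee, v_tea, v_cocoa = 1.5, 0.5, 1.0

two_options = softmax([v_coffee, v_tea])
three_options = softmax([v_coffee, v_tea, v_cocoa])

ratio_two = two_options[0] / two_options[1]
ratio_three = three_options[0] / three_options[1]

print(f"P(coffee) / P(tea) with two options:   {ratio_two:.6f}")
print(f"P(coffee) / P(tea) with three options: {ratio_three:.6f}")
print(f"exp(v_coffee - v_tea):                 {math.exp(v_coffee - v_tea):.6f}")
```

Adding hot chocolate lowers both individual probabilities, but their ratio stays at exp(v_coffee - v_tea), because the shared denominator cancels.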
What you observe in a gas at thermal equilibrium and what you observe in human choice behavior follow the same distribution. The formula is universal.
Why? Because both problems are fundamentally asking the same thing: How do you convert values into probabilities in a way that’s mathematically consistent and doesn’t encode hidden assumptions?
The answer, it turns out, is always the exponential. The mathematics doesn’t care what field you’re in.
Now jump ahead 30 years to Malvern, England, in the late 1980s. John S. Bridle, a researcher at the Royal Signals and Radar Establishment, has a practical problem.
He’s building neural networks for speech recognition. But the networks have an annoying flaw: they don’t output probabilities.
Neural networks at the time:
For a classifier, you want probabilities. You want to say “this signal is 78% likely to be a fricative consonant” not “this signal produced a raw score of 2.3.”
How do you force a neural network to output valid probabilities?
Bridle realized something elegant. If you use:
Then you’re actually performing maximum mutual information estimation.
Mutual information I(X; Y) measures how much knowing X reduces uncertainty about Y. For a classifier: how much do the input features reduce uncertainty about the correct class?
In his landmark 1989 NIPS paper, Bridle showed that:
He wasn’t inventing a new formula. He was recognizing that the exponential normalization — the same form Boltzmann used in 1868, the same form Luce used in 1959 — was exactly what neural networks needed.
For contrast, consider the hardmax function, which is a one-hot encoding of the argmax:
hardmax([2.0, 1.0, 0.1]) = [1, 0, 0]
It picks the single largest value and returns 1 for it, 0 for everything else. Hard. Discrete. Non-differentiable. You can’t backpropagate through it.
Now softmax:
softmax([2.0, 1.0, 0.1]) ≈ [0.659, 0.242, 0.099]
It’s a “soft” version of max. It still emphasizes the largest value, but smoothly. It’s differentiable everywhere, preserves the ranking of the inputs, and produces outputs that can be read as probabilities.
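Both functions fit in a few lines. This is a minimal standard-library sketch, not a production implementation. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not mentioned above, and it leaves the result unchanged.

```python
import math

def hardmax(scores):
    """One-hot vector with a 1 at the position of the largest score."""
    best = scores.index(max(scores))
    return [1 if i == best else 0 for i in range(len(scores))]

def softmax(scores):
    """exp(s_i) / sum_j exp(s_j), shifted by max(scores) to avoid overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
hard = hardmax(scores)
soft = softmax(scores)

print(hard)                          # [1, 0, 0]
print([round(p, 3) for p in soft])   # [0.659, 0.242, 0.099]

# Large scores would overflow a naive exp(), but the shifted version is fine.
big = softmax([1000.0, 999.0])
print([round(p, 3) for p in big])
```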
The gradient of softmax combined with cross-entropy loss is remarkably clean:
∂L/∂z = (predicted probability - actual probability)
That’s it. No complicated derivatives. Backpropagation becomes elegant. In his 1990 follow-up paper, Probabilistic Interpretation of Feedforward Classification Network Outputs, Bridle wrote:
“We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.”
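The clean gradient of softmax with cross-entropy can be verified with a finite-difference check. The sketch below assumes a one-hot target and uses the same example scores as the softmax illustration. The step size `eps` and the target index are arbitrary choices made for this example.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Loss L = -log(softmax(z)[target]) for a one-hot target at index `target`."""
    return -math.log(softmax(z)[target])

z = [2.0, 1.0, 0.1]
target = 0  # hypothetical correct class

# Analytic gradient: predicted probability minus one-hot target.
p = softmax(z)
analytic = [p[i] - (1.0 if i == target else 0.0) for i in range(len(z))]

# Numerical gradient by central differences.
eps = 1e-6
numeric = []
for i in range(len(z)):
    up = list(z)
    down = list(z)
    up[i] += eps
    down[i] -= eps
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={n:+.6f}")
```

The two columns agree to within the precision of the finite-difference approximation.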
The term “softmax” was immediately adopted by the neural networks community. It was intuitive. It described exactly what the function did. And the mathematical foundation — connecting to maximum entropy, information theory, and exponential families — gave it legitimacy.
By the early 2000s, softmax + cross-entropy was the standard for classification. It still is. Over 35 years later, when you build a classifier, this is what you use.
What was the Boltzmann distribution in physics and Luce’s choice rule in psychology is now softmax in machine learning.
This is the part that matters. The question isn’t just “how did three fields discover the same formula?” but “why did they have to?”
The maximum entropy principle states: among all probability distributions consistent with known constraints, choose the one with highest entropy (most uncertainty).
This principle appears everywhere in science:
And when you apply maximum entropy to categorical variables with linear constraints, you always get exponential form. Not sometimes. Always.
Why? Because the logarithm in the entropy formula forces the exponential in the solution. They’re inverse functions. You can’t escape it.
The exponential function has properties that make it perfect for probability:
These properties aren’t accidents. They’re why the exponential appears in so many contexts.
For two classes, softmax reduces to the logistic sigmoid:
P(class 1) = 1 / (1 + exp(-(z₁ - z₂)))
Logistic regression is one of the oldest, most well-studied models in statistics. It has deep connections to maximum likelihood estimation and information theory.
Softmax is its natural generalization to multiple classes. The theory was already there. We just extended it.
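The two-class reduction can be confirmed directly. This sketch uses arbitrary example scores and compares the first softmax output with the logistic sigmoid applied to the score difference.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary example scores for the two classes.
pairs = [(1.3, -0.4), (0.0, 0.0), (-2.0, 3.5)]

differences = []
for z1, z2 in pairs:
    via_softmax = softmax([z1, z2])[0]
    via_sigmoid = sigmoid(z1 - z2)
    differences.append(abs(via_softmax - via_sigmoid))
    print(f"z1={z1:+.1f} z2={z2:+.1f}  softmax={via_softmax:.6f}  sigmoid={via_sigmoid:.6f}")
```

Only the difference z₁ - z₂ matters, which is why a binary classifier needs just one output score rather than two.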
Right now, when you chat with Claude or ChatGPT or any language model, here’s what’s happening:
The model looks at everything you’ve typed. It processes it through a neural network with billions of parameters. At the very end, it produces a score for every possible next token in its vocabulary, maybe 50,000 scores for 50,000 tokens.
Raw scores. Meaningless without context. The model might output:
How does it decide what to say next? It needs probabilities.
Softmax transforms those scores into probabilities.
"the" → 0.68 (68% likely)
"a" → 0.24 (24% likely)
"hello" → 0.07 (7% likely)
"banana" → 0.01 (1% likely)
Now the model can pick. Usually it picks the highest probability word. Sometimes it samples from the distribution to add randomness. But the underlying mechanism is softmax — the same formula Boltzmann derived in 1868.
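The selection step can be sketched in a few lines. Everything specific here is an assumption made for illustration: the four-word vocabulary, the raw scores (picked to give probabilities roughly like those listed above), the `temperature` parameter, and the random seed. Real models differ in vocabulary size and sampling details.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Softmax over scores / temperature; higher temperature flattens the result."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "a", "hello", "banana"]
logits = [4.2, 3.2, 1.9, 0.0]   # hypothetical raw scores

probs = softmax(logits)

# Greedy decoding: always take the most probable word.
greedy = vocab[probs.index(max(probs))]

# Sampling: draw words in proportion to their probabilities.
rng = random.Random(0)          # fixed seed so the run is repeatable
sampled = rng.choices(vocab, weights=probs, k=10)

# A higher temperature spreads probability more evenly across words.
flat_probs = softmax(logits, temperature=5.0)

print([round(p, 2) for p in probs])
print("greedy:", greedy)
print("sampled:", sampled)
print([round(p, 2) for p in flat_probs])
```

The temperature knob is the same T that appears in the Boltzmann distribution: treating each score as a negative energy and setting kT equal to the temperature recovers exp(-εᵢ/kT) / Σⱼ exp(-εⱼ/kT).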
Transformers do this thousands of times per second, for every token, in every conversation. The attention mechanism uses softmax to focus on relevant parts of text. The final output layer uses softmax to select the next word.
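For the attention case, here is a minimal sketch of single-query scaled dot-product attention, the formulation commonly used in transformers. It is an illustration under assumptions: the tiny two-dimensional keys, values, and query are made up, and real models batch this over many queries and attention heads.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by softmax(query . key / sqrt(d)) and sum them."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Hypothetical toy inputs: three positions, two dimensions each.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The softmax turns raw similarity scores into weights that sum to one, so the output is a weighted average of the values, tilted toward the positions whose keys best match the query.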
Every word you’re reading right now, in responses from any language model, came from softmax deciding what probability to assign.
A formula from thermodynamics. Discovered to explain gases. Now deciding which word the AI says to you next.
That’s how universal this formula is.
A Brief History of Softmax: What It Is, Where It Came From, and How It Became Essential was originally published in Towards AI on Medium.