AdaBoost from Scratch: How a Pile of Dumb Rules Becomes a Smart Classifier

wpnews.pro

Here is a question that sounds like a trick: can you build an accurate classifier out of models that are barely better than flipping a coin?

Surprisingly, yes. That is the whole idea behind boosting, and AdaBoost is the algorithm that made it famous. I built it from scratch and dropped it into an interactive demo — here's how it actually works, real math, no hand-waving.

Play with the live version: https://dev48v.infy.uk/ml/day21-adaboost.html

AdaBoost's building block is the simplest classifier you can imagine: a decision stump. It is a decision tree with exactly one split. Look at one feature, compare it to one threshold, and call everything on one side "+1" and everything on the other side "−1". That's it. One line, one cut.

def stump_predict(X, dim, thresh, polarity):
    pred = np.ones(len(X))
    if polarity == 1:
        pred[X[:, dim] <= thresh] = -1
    else:
        pred[X[:, dim] >  thresh] = -1
    return pred

On anything that isn't trivially separable, a single stump is hopeless — on a checkerboard layout it barely passes 55-60%. That is exactly why it's a "weak learner": a model that only beats random guessing by a hair. The magic is in how we combine hundreds of them.

The engine of AdaBoost is a weight on every training point that says "how much does getting this one right matter?" Everything starts equal:

n = len(X)
w = np.full(n, 1.0 / n)   # uniform: every point weighs 1/n

These weights are a probability distribution — they sum to 1. After each round they change: points we got right get lighter, points we missed get heavier. Since we always pick the next stump to minimise weighted error, the heavy points end up dominating the search. The next stump is effectively forced to stare at whatever the committee keeps blowing.

When we hunt for the best stump each round, we don't count mistakes — we add up the weight of the mistakes:

def weighted_error(pred, y, w):
    return w[pred != y].sum()   # weight of the misses, not the count

Early on, with uniform weights, this is just the usual error rate. But once some points are heavy, a stump that nails those heavy points scores a low weighted error even if it fumbles a few light ones. So "best stump" quietly shifts every round toward the current hard cases — and we never had to tell it which points are hard. The weights say it for us.

Once we know a stump's weighted error, we decide how loud its vote will be in the final ensemble:

eps = 1e-10
err = min(max(err, eps), 1 - eps)      # guard the log
alpha = 0.5 * np.log((1 - err) / err)

Stare at the shape of that formula, because every piece earns its place:

err → 0

, the ratio (1-err)/err

explodes and alpha → +∞

. A near-perfect stump dominates.err = 0.5

, the ratio is 1, ln(1) = 0

, so a coin-flip stump gets alpha = 0

— no say at all.err > 0.5

, the log goes negative, so a worse-than-random stump gets a The logarithm isn't decoration. It's the exact value that minimises the exponential loss AdaBoost is secretly doing gradient descent on. That is why boosting provably drives training error down.

Now we reshape the weights so the next stump faces a harder problem:

pred = stump_predict(X, dim, thresh, polarity)
w = w * np.exp(-alpha * y * pred)   # right shrinks, wrong grows by exp(alpha)
w = w / w.sum()                     # renormalise so sum(w) == 1 again

When the stump is right, y * pred = +1

, the exponent is negative, and the weight shrinks. When it's wrong the weight grows by exactly exp(alpha)

— a confident stump reweights harder. Then we divide by the total so the weights sum back to 1, a valid distribution again.

I verified the chain numerically: after every round the renormalised weights sum to 1.0 to ten decimals, and alpha tracks the formula exactly (0.0 at err=0.5, 1.099 at err=0.1, 2.298 at err=0.01). In the demo this is why the misclassified points visibly swell round after round.

The final model isn't a plain majority vote. It's a weighted one:

def predict(ensemble, X):
    total = np.zeros(len(X))
    for alpha, dim, thr, pol in ensemble:
        total += alpha * stump_predict(X, dim, thr, pol)
    return np.sign(total)

Ask every stump for its ±1 answer, scale each by its alpha, add them up, take the sign. Confident stumps swing the sum hard; weak ones barely nudge it. Formally, F(x) = sign(Σ αₜ·hₜ(x))

— an additive model. In my demo, the blocky shaded background is the sign of exactly this sum evaluated across the whole plane. On XOR-style data I watched it climb from 60% train accuracy to 85% over 25 rounds, with each individual stump still stuck near 40% error the entire time. That is the payoff: no single learner improved, but the committee did.

Contrast it with random forests. Bagging averages many strong, low-bias trees to cut variance. Boosting does the opposite: it starts with high-bias stumps that badly underfit and adds them one at a time, each correcting the residual of the whole. So the ensemble's bias falls steadily and the boundary grows more expressive every round. Boosting turns underfitting models into a flexible one — that's its signature.

Boosting can overshoot. Enough rounds will drive training error to zero, but past a point AdaBoost starts fitting the noise and test error creeps back up. Because it weights hard points heavily, it's especially touchy about mislabelled examples and outliers — it keeps doubling down on points it can never win. The cures are the usual: cap the number of rounds, shrink each alpha with a learning rate, and pick both with cross-validation.

You'd never hand-roll this in production. Scikit-learn hands it to you in one object:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump
    n_estimators=50,     # T rounds
    learning_rate=1.0,   # shrinks each alpha
)
clf.fit(X_train, y_train)

max_depth=1

is exactly our stump. n_estimators

is the number of rounds. learning_rate

is the alpha-shrinkage that fights overfitting. Everything maps straight onto the loop we just built by hand.

AdaBoost with stumps is what powered the classic Viola-Jones face detector that made real-time face detection possible. Gradient boosting (XGBoost, LightGBM) has largely taken over since, but AdaBoost is still the clearest way to see boosting: reweight, refit, revote.

Drag the rounds slider and watch it happen live: https://dev48v.infy.uk/ml/day21-adaboost.html

source & further reading

dev.to — original article Graph of Thoughts: when a tree of reasoning isn't enough, let the branches merge DGX Spark hitting 83 C under sustained Ollama load — solved by clock-locking via nvidia-smi -lgc Anthropic hid tracking signals in Unicode apostrophes. That's not telemetry, that's steganography.

AdaBoost from Scratch: How a Pile of Dumb Rules Becomes a Smart Classifier

Run your AI side-project on zahid.host