AdaBoost from Scratch: How a Pile of Dumb Rules Becomes a Smart Classifier

A developer built AdaBoost from scratch and created an interactive demo showing how the algorithm combines hundreds of weak classifiers—decision stumps—into a strong ensemble. The implementation uses weighted error minimization and an exponential reweighting scheme that forces each new stump to focus on previously misclassified points. The demo visualizes how AdaBoost's training error provably decreases over rounds.

Here is a question that sounds like a trick: can you build an accurate classifier out of models that are barely better than flipping a coin? Surprisingly, yes. That is the whole idea behind boosting, and AdaBoost is the algorithm that made it famous. I built it from scratch and dropped it into an interactive demo — here's how it actually works, real math, no hand-waving. Play with the live version: https://dev48v.infy.uk/ml/day21-adaboost.html https://dev48v.infy.uk/ml/day21-adaboost.html AdaBoost's building block is the simplest classifier you can imagine: a decision stump . It is a decision tree with exactly one split. Look at one feature, compare it to one threshold, and call everything on one side "+1" and everything on the other side "−1". That's it. One line, one cut. python def stump predict X, dim, thresh, polarity : pred = np.ones len X if polarity == 1: pred X :, dim <= thresh = -1 else: pred X :, dim thresh = -1 return pred On anything that isn't trivially separable, a single stump is hopeless — on a checkerboard layout it barely passes 55-60%. That is exactly why it's a "weak learner": a model that only beats random guessing by a hair. The magic is in how we combine hundreds of them. The engine of AdaBoost is a weight on every training point that says "how much does getting this one right matter?" Everything starts equal: n = len X w = np.full n, 1.0 / n uniform: every point weighs 1/n These weights are a probability distribution — they sum to 1. After each round they change: points we got right get lighter, points we missed get heavier. Since we always pick the next stump to minimise weighted error, the heavy points end up dominating the search. The next stump is effectively forced to stare at whatever the committee keeps blowing. When we hunt for the best stump each round, we don't count mistakes — we add up the weight of the mistakes: python def weighted error pred, y, w : return w pred = y .sum weight of the misses, not the count Early on, with uniform weights, this is just the usual error rate. But once some points are heavy, a stump that nails those heavy points scores a low weighted error even if it fumbles a few light ones. So "best stump" quietly shifts every round toward the current hard cases — and we never had to tell it which points are hard. The weights say it for us. Once we know a stump's weighted error, we decide how loud its vote will be in the final ensemble: eps = 1e-10 err = min max err, eps , 1 - eps guard the log alpha = 0.5 np.log 1 - err / err Stare at the shape of that formula, because every piece earns its place: err → 0 , the ratio 1-err /err explodes and alpha → +∞ . A near-perfect stump dominates. err = 0.5 , the ratio is 1, ln 1 = 0 , so a coin-flip stump gets alpha = 0 — no say at all. err 0.5 , the log goes negative, so a worse-than-random stump gets a The logarithm isn't decoration. It's the exact value that minimises the exponential loss AdaBoost is secretly doing gradient descent on. That is why boosting provably drives training error down. Now we reshape the weights so the next stump faces a harder problem: pred = stump predict X, dim, thresh, polarity w = w np.exp -alpha y pred right shrinks, wrong grows by exp alpha w = w / w.sum renormalise so sum w == 1 again When the stump is right, y pred = +1 , the exponent is negative, and the weight shrinks. When it's wrong the weight grows by exactly exp alpha — a confident stump reweights harder. Then we divide by the total so the weights sum back to 1, a valid distribution again. I verified the chain numerically: after every round the renormalised weights sum to 1.0 to ten decimals, and alpha tracks the formula exactly 0.0 at err=0.5, 1.099 at err=0.1, 2.298 at err=0.01 . In the demo this is why the misclassified points visibly swell round after round. The final model isn't a plain majority vote. It's a weighted one: python def predict ensemble, X : total = np.zeros len X for alpha, dim, thr, pol in ensemble: total += alpha stump predict X, dim, thr, pol return np.sign total Ask every stump for its ±1 answer, scale each by its alpha, add them up, take the sign. Confident stumps swing the sum hard; weak ones barely nudge it. Formally, F x = sign Σ αₜ·hₜ x — an additive model. In my demo, the blocky shaded background is the sign of exactly this sum evaluated across the whole plane. On XOR-style data I watched it climb from 60% train accuracy to 85% over 25 rounds, with each individual stump still stuck near 40% error the entire time. That is the payoff: no single learner improved, but the committee did. Contrast it with random forests. Bagging averages many strong, low-bias trees to cut variance . Boosting does the opposite: it starts with high-bias stumps that badly underfit and adds them one at a time, each correcting the residual of the whole. So the ensemble's bias falls steadily and the boundary grows more expressive every round. Boosting turns underfitting models into a flexible one — that's its signature. Boosting can overshoot. Enough rounds will drive training error to zero, but past a point AdaBoost starts fitting the noise and test error creeps back up. Because it weights hard points heavily, it's especially touchy about mislabelled examples and outliers — it keeps doubling down on points it can never win. The cures are the usual: cap the number of rounds, shrink each alpha with a learning rate, and pick both with cross-validation. You'd never hand-roll this in production. Scikit-learn hands it to you in one object: python from sklearn.ensemble import AdaBoostClassifier from sklearn.tree import DecisionTreeClassifier clf = AdaBoostClassifier estimator=DecisionTreeClassifier max depth=1 , a stump n estimators=50, T rounds learning rate=1.0, shrinks each alpha clf.fit X train, y train max depth=1 is exactly our stump. n estimators is the number of rounds. learning rate is the alpha-shrinkage that fights overfitting. Everything maps straight onto the loop we just built by hand. AdaBoost with stumps is what powered the classic Viola-Jones face detector that made real-time face detection possible. Gradient boosting XGBoost, LightGBM has largely taken over since, but AdaBoost is still the clearest way to see boosting: reweight, refit, revote. Drag the rounds slider and watch it happen live: https://dev48v.infy.uk/ml/day21-adaboost.html https://dev48v.infy.uk/ml/day21-adaboost.html