How I Built a Pluribus-Style Poker AI From Scratch

A developer built a Pluribus-style poker AI from scratch, implementing Counterfactual Regret Minimization (CFR) and its variants to solve imperfect-information games. The project achieved Nash equilibrium convergence in Leduc Hold'em and explored abstraction techniques for full-scale no-limit Texas Hold'em.

Chess and Go are perfect-information games: both players see everything, and the challenge is purely computational. Poker is an imperfect-information game. You can't see your opponent's cards, so every decision has to account for the full range of hands they might hold. Standard minimax doesn't work, because you can't evaluate a position without knowing your opponent's hand.

The solution is Nash equilibrium, a strategy where neither player can improve their EV by unilaterally changing what they do. In poker this is called GTO (Game-Theoretically Optimal). Unexploitable by definition.

CFR (Zinkevich et al., 2007) is the algorithm that made strong poker AI possible. The intuition:

1. Play against yourself repeatedly.
2. After each game, ask: "how much better would I have done if I'd always taken a different action?" That's regret.
3. Update your strategy: play actions proportional to their cumulative positive regret.
4. Repeat thousands of times.

The time-average of your strategies converges to Nash equilibrium.

I implemented this on Leduc Hold'em, a 6-card toy game with 216 information sets, the standard poker AI research testbed.

```python
def get_strategy(self, reach_prob: float) -> list[float]:
    n = len(self.regrets)
    # Regret matching: only positive cumulative regret counts
    pos = [max(r, 0.0) for r in self.regrets]
    total = sum(pos)
    # Fall back to uniform play when no action has positive regret
    strat = [p / total for p in pos] if total > 0 else [1 / n] * n
    # Accumulate average strategy: this is what converges to Nash
    for i in range(n):
        self.strat_sum[i] += reach_prob * strat[i]
    return strat
```

After 10,000 iterations (2.2 seconds):

That last point surprises people. GTO isn't about always making the "right" move. It's about being unpredictable enough that no opponent strategy can consistently beat you.

Full NLHE has ~10¹⁶⁰ game states. Full tree traversal every iteration is impossible. MCCFR (Monte Carlo CFR) samples a subset each iteration.
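Sampling, in this setting, means drawing one action at random according to the current strategy's probabilities instead of recursing into every action. A minimal sketch of such a draw follows; `sample_action` and its arguments are hypothetical names chosen for illustration, not code from the project.

```python
import random


def sample_action(actions, strategy, rng=random):
    """Draw one action with probability given by `strategy`.

    `actions` and `strategy` are parallel sequences. `strategy` is assumed
    to hold non-negative weights; it need not be perfectly normalized.
    """
    if len(actions) != len(strategy):
        raise ValueError("actions and strategy must have the same length")
    # random.choices normalizes the weights internally.
    return rng.choices(actions, weights=strategy, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    acts = ["fold", "call", "raise"]
    draws = [sample_action(acts, [0.2, 0.5, 0.3], rng) for _ in range(10_000)]
    for a in acts:
        print(a, draws.count(a) / len(draws))
```

Over many draws the empirical frequencies approach the strategy's probabilities, which is why a sampled traversal can stand in for a full one on average.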
I used external sampling:

```python
# Excerpt from the body of the recursive traverse function
if player == traverser:
    # Explore ALL actions, update regrets
    for action in actions:
        v = traverse(next_state, traverser)
        action_values[action] = v
    update_regrets(action_values)
else:
    # SAMPLE one opponent action
    action = sample_from_strategy(strategy)
    return traverse(next_state, traverser)
```

On Leduc Hold'em, MCCFR converged to the same equilibrium as vanilla CFR at 1.9x the speed.

Even with MCCFR, full NLHE has too many states. Solution: group similar hands into buckets.

The naive approach clusters by average equity, which is wrong in an important way. A flopped flush draw and a top pair can have identical average equity but completely different equity distributions over future runouts. The flush draw is either far ahead or far behind. The made hand is consistently ahead. GTO strategy treats these differently.

The right metric: Earth Mover's Distance between equity histograms.

```python
import numpy as np


def emd(hist_a: np.ndarray, hist_b: np.ndarray) -> float:
    """Wasserstein-1 distance: captures distribution shape, not just mean."""
    cdf_a = np.cumsum(hist_a)
    cdf_b = np.cumsum(hist_b)
    return float(np.sum(np.abs(cdf_a - cdf_b)))
```

Abstraction scheme:

Optimization that halved computation: compute equity histograms for both players simultaneously from the same random rollouts, rather than running separate Monte Carlo samples.

Still too many information sets for a hash table. Deep CFR (Brown & Sandholm, 2019) replaces the table with neural networks.
Two networks per player:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdvantageNetwork(nn.Module):
    """Approximates cumulative counterfactual regret per action."""

    def regret_matching(self, features):
        advantages = self.forward(features)
        pos = F.relu(advantages)
        total = pos.sum(dim=-1, keepdim=True)
        # Uniform fallback over the action dimension
        uniform = torch.ones_like(advantages) / advantages.shape[-1]
        return torch.where(total > 1e-6, pos / total, uniform)


class StrategyNetwork(nn.Module):
    """Approximates the average strategy: this converges to Nash."""

    def forward(self, x):
        return F.softmax(super().forward(x), dim=-1)
```

Feature encoding (373 dimensions):

Reservoir buffers ensure all past iterations are represented in training, preventing catastrophic forgetting without unbounded memory.

The biggest performance win: keeping networks in eval mode during traversal rather than toggling per inference.

```python
# Before: called eval on every inference (61ms/traversal)
model.eval()
with torch.no_grad():
    output = model(x)

# After: set eval once before traversal loop (10ms/traversal)
player.set_inference_mode()  # called once
# ... thousands of traversals ...
```

One line of code. 6x speedup. Profile before you optimize.

The blueprint strategy from Deep CFR knows a lot but uses coarse abstractions. Real-time search fixes this. At each decision point:

```python
def solve(self, gs: GameState, player: int) -> dict:
    self.nodes = {}  # Fresh local tree per decision
    t_start = time.time()
    while self._iters < self.config.n_iters:
        # Stop early once the per-decision time budget is spent
        if (time.time() - t_start) * 1000 >= self.config.time_limit_ms:
            break
        self._traverse(gs, self._iters % 2, depth=1)
        self._iters += 1
    return self._root_node().get_avg_strategy(actions)
```

Blueprint bootstrapping blends local and blueprint strategies early in search for stability:

```python
blend = min(self._iters / self.config.n_iters, 1.0)
strat = (1 - blend) * blueprint_strat + blend * local_strat
```

Average decision time: 75ms on CPU.

300 duplicate hand pairs (600 total hands) per matchup. Duplicate scoring controls for card luck.
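Duplicate scoring, as it is usually described, replays each deal with the seats swapped so that both agents get to play the same cards; averaging within each pair cancels much of the card luck. A minimal sketch of the bookkeeping follows, assuming each pair is recorded as agent A's winnings in big blinds for the two seatings. The function name, the input layout, and the 1.96 threshold are illustrative assumptions, not the project's evaluation code.

```python
import math
import statistics


def duplicate_score(pairs):
    """Summarize duplicate-pair results for agent A.

    `pairs` is a list of (bb_seat1, bb_seat2): A's winnings in big blinds
    when the same deal is played with A in each seat.
    Returns (mean mBB/hand, standard error in mBB/hand, significant).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to estimate variance")
    # Per-pair average in milli-big-blinds per hand. The pair is the
    # independent unit, because both hands of a pair share the same cards.
    per_pair = [1000.0 * (a + b) / 2.0 for a, b in pairs]
    mean = statistics.fmean(per_pair)
    stderr = statistics.stdev(per_pair) / math.sqrt(len(per_pair))
    # Assumption: a two-sided, roughly 95% normal-approximation test.
    significant = abs(mean) > 1.96 * stderr
    return mean, stderr, significant
```

Treating the pair rather than the single hand as the sample keeps the standard error honest, since the two hands of a pair are strongly correlated by construction.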
| Matchup | mBB/hand | Significant |
|---|---|---|
| Blueprint vs Random | +28,403 | ✓ |
| Search vs Random | +28,134 | ✓ |
| Search vs Blueprint | +31,798 | ✓ |

Head to head, search outperforms blueprint-only play, while against a random opponent the two win by similar margins. The head-to-head result is the core empirical claim of Pluribus, validated.

| | This project | Pluribus |
|---|---|---|
| Players | 2 (heads-up) | 6 |
| Traversals | ~50,000 | 12.4M |
| Hardware | Single CPU | 64-core CPU |
| Training | ~30 min | ~8 days |

Same architecture. Scale is the difference.

1. Profile earlier. The 6x traversal speedup from eval mode was sitting there the whole time. I spent days on algorithmic optimizations before finding it.
2. Start with finer bet abstraction. Five bet sizes is enough to demonstrate the technique but too coarse for real strategic depth. Pluribus used 14. The strategy changes meaningfully.
3. Build the evaluation framework first. I ran hundreds of training iterations before having reliable exploitability metrics. Convergence looks different than you expect: EV oscillating near zero is not the same as converging to Nash.

GitHub: [github.com/griff-ui/poker-ai](https://github.com/griff-ui/poker-ai)

Five stages, 40 files, 27 tests passing, full documentation. MIT licensed: use it for research, study, or as the foundation for your own solver.

Live demo: [griff-ui.github.io/poker-ai](https://griff-ui.github.io/poker-ai)

Browser-based hand analyzer: select cards, set game state, see GTO strategy frequencies. No Python required.