{"slug": "how-i-built-a-pluribus-style-poker-ai-from-scratch", "title": "How I Built a Pluribus-Style Poker AI From Scratch", "summary": "A developer built a Pluribus-style poker AI from scratch, implementing Counterfactual Regret Minimization (CFR) and its variants to solve imperfect-information games. The project achieved Nash equilibrium convergence in Leduc Hold'em and explored abstraction techniques for full-scale no-limit Texas Hold'em.", "body_md": "Chess and Go are perfect-information games: both players see everything. The challenge is purely computational.\n\nPoker is an **imperfect-information game**. You can't see your opponent's cards. Every decision has to account for the full range of hands they might hold.\n\nStandard minimax doesn't work here: you can't evaluate a position without knowing your opponent's hand.\n\nThe solution is **Nash equilibrium**, a strategy profile where neither player can improve their EV by unilaterally changing what they do. In poker this is called GTO (Game Theory Optimal). In a two-player zero-sum game it is unexploitable by definition.\n\nCFR (Zinkevich et al., 2007) is the algorithm that made strong poker AI possible.\n\n**The intuition:** Play against yourself repeatedly. After each game, ask: “how much better would I have done if I'd always taken a different action?” That's regret.\n\nUpdate your strategy: play each action with probability proportional to its cumulative positive regret. Repeat thousands of times. 
The **time-average** of your strategies converges to Nash equilibrium.\n\nI implemented this on Leduc Hold'em, a 6-card toy game with 216 information sets that serves as the standard poker AI research testbed.\n\n```python\ndef get_strategy(self, reach_prob: float) -> list[float]:\n    n = len(self.regrets)  # number of actions at this information set\n    # Regret matching: only positive cumulative regrets get probability mass\n    pos = [max(r, 0.0) for r in self.regrets]\n    total = sum(pos)\n    # Fall back to uniform play when no action has positive regret\n    strat = [p / total for p in pos] if total > 0 else [1 / n] * n\n    # Accumulate average strategy (this is what converges to Nash)\n    for i in range(n):\n        self.strat_sum[i] += reach_prob * strat[i]\n    return strat\n```\n\nAfter 10,000 iterations (2.2 seconds):\n\nThat last point surprises people. GTO isn't about always making the \"right\" move. It's about being unpredictable enough that no opponent strategy can consistently beat you.\n\nFull NLHE has ~10¹⁶⁰ game states. Full tree traversal every iteration is impossible.\n\n**MCCFR** samples a subset of the tree each iteration. I used external sampling:\n\n```\nif player == traverser:\n    # Traverser's node: explore ALL actions, then update regrets\n    action_values = {}\n    for action in actions:\n        # next_state is the state reached by taking `action`\n        v = traverse(next_state, traverser)\n        action_values[action] = v\n    update_regrets(action_values)\nelse:\n    # Opponent's node: SAMPLE one action from the current strategy\n    action = sample_from_strategy(strategy)\n    return traverse(next_state, traverser)\n```\n\nOn Leduc Hold'em, MCCFR converged to the same equilibrium as vanilla CFR at **1.9x the speed**.\n\nEven with MCCFR, full NLHE has too many states. Solution: group similar hands into buckets.\n\n**The naive approach** clusters by average equity, which is wrong in an important way.\n\nA flopped flush draw and a top pair can have identical average equity but completely different equity *distributions* over future runouts. The flush draw is either far ahead or far behind. The made hand is consistently ahead. 
GTO strategy treats these differently.\n\n**The right metric:** Earth Mover's Distance between equity histograms.\n\n```python\nimport numpy as np\n\ndef emd(hist_a: np.ndarray, hist_b: np.ndarray) -> float:\n    \"\"\"Wasserstein-1 distance between two normalized histograms on the same bins (in bin-width units); captures distribution shape, not just mean.\"\"\"\n    cdf_a = np.cumsum(hist_a)\n    cdf_b = np.cumsum(hist_b)\n    return float(np.sum(np.abs(cdf_a - cdf_b)))\n```\n\n**Abstraction scheme:**\n\n**Optimization that halved computation:** compute equity histograms for both players simultaneously from the same random rollouts, rather than running separate Monte Carlo samples.\n\nThat is still too many information sets for a lookup table. Deep CFR (Brown et al., 2019) replaces the table with neural networks.\n\n**Two networks per player:**\n\n```python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n# Excerpt: layer definitions and the base forward pass are omitted here\nclass AdvantageNetwork(nn.Module):\n    \"\"\"Approximates cumulative counterfactual regret per action.\"\"\"\n    def regret_matching(self, features):\n        advantages = self.forward(features)\n        pos = F.relu(advantages)\n        total = pos.sum(dim=-1, keepdim=True)\n        n = advantages.shape[-1]  # number of actions\n        uniform = torch.ones_like(advantages) / n\n        # clamp_min avoids 0/0; rows with no positive advantage fall back to uniform\n        return torch.where(total > 1e-6, pos / total.clamp_min(1e-6), uniform)\n\nclass StrategyNetwork(nn.Module):\n    \"\"\"Approximates the average strategy; this converges to Nash.\"\"\"\n    def forward(self, x):\n        return F.softmax(super().forward(x), dim=-1)\n```\n\n**Feature encoding (373 dimensions):**\n\n**Reservoir buffers** keep samples from all past iterations represented in training, preventing catastrophic forgetting without unbounded memory.\n\n**The biggest performance win:** keeping networks in `eval()` mode during traversal rather than toggling per inference.\n\n```\n# Before: called eval() on every inference (61 ms/traversal)\nmodel.eval()\nwith torch.no_grad():\n    output = model(x)\n\n# After: set eval() once before traversal loop (10 ms/traversal)\nplayer.set_inference_mode()  # called once\n# ... thousands of traversals ...\n```\n\nOne line of code. 
**6x speedup.** Profile before you optimize.\n\nThe blueprint strategy from Deep CFR knows a lot but uses coarse abstractions. Real-time search fixes this.\n\nAt each decision point:\n\n```python\nimport time\n\ndef solve(self, gs: GameState, player: int) -> dict:\n    self.nodes = {}  # Fresh local tree per decision\n    t_start = time.time()\n\n    # Run until either the iteration budget or the time budget is exhausted\n    while self._iters < self.config.n_iters:\n        if (time.time() - t_start) * 1000 >= self.config.time_limit_ms:\n            break\n        # Alternate the traversing player on each iteration\n        self._traverse(gs, self._iters % 2, depth=1)\n        self._iters += 1\n\n    # `actions` is assumed to be the legal actions at the root (defined outside this excerpt)\n    return self._root_node().get_avg_strategy(actions)\n```\n\n**Blueprint bootstrapping** blends local and blueprint strategies early in search for stability:\n\n```\nblend = min(self._iters / self.config.n_iters, 1.0)\nstrat = (1 - blend) * blueprint_strat + blend * local_strat\n```\n\nAverage decision time: **75 ms on CPU**.\n\n300 duplicate hand pairs (600 total hands per matchup). Duplicate scoring controls for card luck.\n\n| Matchup | mBB/hand | Significant |\n|---|---|---|\n| Blueprint vs Random | +28,403 | ✓ |\n| Search vs Random | +28,134 | ✓ |\n| Search vs Blueprint | +31,798 | ✓ |\n\nSearch consistently outperforms blueprint-only play. This validates the core empirical claim of Pluribus.\n\n| | This project | Pluribus |\n|---|---|---|\n| Players | 2 (heads-up) | 6 |\n| Traversals | ~50,000 | 12.4M |\n| Hardware | Single CPU | 64-core CPU |\n| Training | ~30 min | ~8 days |\n\nSame blueprint-plus-search design. Scale is the difference.\n\n**1. Profile earlier.** The 6x traversal speedup from eval mode was sitting there the whole time. I spent days on algorithmic optimizations before finding it.\n\n**2. Start with finer bet abstraction.** Five bet sizes is enough to demonstrate the technique but too coarse for real strategic depth. Pluribus considered up to 14, depending on the situation. With finer sizing, the strategy changes meaningfully.\n\n**3. Build the evaluation framework first.** I ran hundreds of training iterations before having reliable exploitability metrics. 
Convergence looks different than you expect: EV oscillating near zero is not the same as converging to Nash.\n\n**GitHub:** [github.com/griff-ui/poker-ai](https://github.com/griff-ui/poker-ai)\n\nFive stages, 40 files, 27 tests passing, full documentation. MIT licensed; use it for research, study, or as the foundation for your own solver.\n\n**Live demo:** [griff-ui.github.io/poker-ai](https://griff-ui.github.io/poker-ai)\n\nBrowser-based hand analyzer: select cards, set the game state, and see GTO strategy frequencies. No Python required.", "url": "https://wpnews.pro/news/how-i-built-a-pluribus-style-poker-ai-from-scratch", "canonical_source": "https://dev.to/griffin_henning_15c2421a4/how-i-built-a-pluribus-style-poker-ai-from-scratch-2a8b", "published_at": "2026-06-24 05:35:07+00:00", "updated_at": "2026-06-24 06:13:38.866986+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-research"], "entities": ["Pluribus", "CFR", "MCCFR", "Deep CFR", "Leduc Hold'em", "No-Limit Texas Hold'em", "Zinkevich", "Brown & Sandholm"], "alternates": {"html": "https://wpnews.pro/news/how-i-built-a-pluribus-style-poker-ai-from-scratch", "markdown": "https://wpnews.pro/news/how-i-built-a-pluribus-style-poker-ai-from-scratch.md", "text": "https://wpnews.pro/news/how-i-built-a-pluribus-style-poker-ai-from-scratch.txt", "jsonld": "https://wpnews.pro/news/how-i-built-a-pluribus-style-poker-ai-from-scratch.jsonld"}}