# How I Built a Pluribus-Style Poker AI From Scratch

> Source: <https://dev.to/griffin_henning_15c2421a4/how-i-built-a-pluribus-style-poker-ai-from-scratch-2a8b>
> Published: 2026-06-24 05:35:07+00:00

Chess and Go are perfect information games: both players see everything. The challenge is purely computational.

Poker is an **imperfect information game**. You can't see your opponent's cards. Every decision has to account for the full range of hands they might hold.

Standard minimax doesn't work. You can't evaluate a position without knowing your opponent's hand.

The solution is **Nash equilibrium**, a strategy where neither player can improve their EV by unilaterally changing what they do. In poker this is called GTO (Game-Theoretically Optimal). Unexploitable by definition.

Counterfactual Regret Minimization, or CFR (Zinkevich et al., 2007), is the algorithm that made strong poker AI possible.

**The intuition:** Play against yourself repeatedly. After each game, ask: “how much better would I have done if I'd always taken a different action?” That's regret.

Update your strategy: play actions proportional to their cumulative positive regret. Repeat thousands of times. The **time-average** of your strategies converges to Nash equilibrium.
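To make the regret-matching loop concrete before moving to poker, here is a minimal sketch on rock-paper-scissors. It is not code from the project: `PAYOFF`, `regret_matching`, `train_rps`, and the biased `initial_regrets` are illustrative assumptions, and it uses exact expected payoffs rather than sampled games so the run is deterministic.

```python
# Payoff to the player choosing the row action against the column action.
# Order: rock, paper, scissors. (Illustrative toy game, not from the project.)
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    n = len(regrets)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n


def train_rps(iterations, initial_regrets=(1.0, 0.0, 0.0)):
    """Self-play with expected payoffs; returns both players' average strategies.

    The biased initial regrets (an assumption for the demo) make both players
    start on pure rock, so the current strategy visibly moves.
    """
    regrets = [list(initial_regrets), list(initial_regrets)]
    strat_sum = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strats = [regret_matching(r) for r in regrets]
        for p in (0, 1):
            opp = strats[1 - p]
            # Expected payoff of each pure action against the opponent's current mix.
            action_values = [sum(PAYOFF[a][b] * opp[b] for b in range(3))
                             for a in range(3)]
            node_value = sum(s * v for s, v in zip(strats[p], action_values))
            for a in range(3):
                # Regret: how much better action a would have done than what we played.
                regrets[p][a] += action_values[a] - node_value
                strat_sum[p][a] += strats[p][a]
    # The time-average, not the last strategy, is what approaches equilibrium.
    return [[s / iterations for s in sums] for sums in strat_sum]


if __name__ == "__main__":
    for player, avg in enumerate(train_rps(100_000)):
        print(player, [round(x, 3) for x in avg])
```

The current strategy keeps cycling between actions; only the averaged strategy settles near the uniform mix, which is the equilibrium of this game.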

I implemented this on Leduc Hold'em, a 6-card toy game with 216 information sets that serves as the standard poker AI research testbed.

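```python
def get_strategy(self, reach_prob: float) -> list[float]:
    n = len(self.regrets)  # number of actions at this information set
    # Regret matching: only positive cumulative regret earns probability
    pos = [max(r, 0.0) for r in self.regrets]
    total = sum(pos)
    # Fall back to uniform play when no action has positive regret
    strat = [p / total for p in pos] if total > 0 else [1.0 / n] * n
    # Accumulate average strategy (this is what converges to Nash)
    for i in range(n):
        self.strat_sum[i] += reach_prob * strat[i]
    return strat
```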

After 10,000 iterations (2.2 seconds):

That last point surprises people. GTO isn't about always making the "right" move. It's about being unpredictable enough that no opponent strategy can consistently beat you.

Full no-limit hold'em (NLHE) has ~10¹⁶⁰ game states. Full tree traversal every iteration is impossible.

**Monte Carlo CFR (MCCFR)** samples a subset of the tree each iteration. I used external sampling:

```
if player == traverser:
    # Explore ALL actions, update regrets
    for action in actions:
        v = traverse(next_state, traverser)
        action_values[action] = v
    update_regrets(action_values)
else:
    # SAMPLE one opponent action
    action = sample_from_strategy(strategy)
    return traverse(next_state, traverser)
```
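The pseudocode above can be turned into a small runnable sketch on Kuhn poker (three cards, one bet), which is tiny enough to fit in one block. This is not the project's implementation: the game choice, `Node`, `payoff`, `traverse`, and `train` are assumptions made for illustration. The average strategy is accumulated at opponent nodes, which is one common convention for external sampling.

```python
import random

ACTIONS = ["p", "b"]  # "p" = pass (check or fold), "b" = bet (or call)


class Node:
    """Regrets and average-strategy sums for one information set."""

    def __init__(self):
        self.regrets = [0.0, 0.0]
        self.strat_sum = [0.0, 0.0]

    def strategy(self):
        pos = [max(r, 0.0) for r in self.regrets]
        total = sum(pos)
        return [p / total for p in pos] if total > 0 else [0.5, 0.5]

    def average(self):
        total = sum(self.strat_sum)
        return [s / total for s in self.strat_sum] if total > 0 else [0.5, 0.5]


def payoff(cards, history):
    """Terminal payoff for player 0, or None if the hand is still running."""
    if history in ("pp", "bb", "pbb"):          # showdown
        stake = 1 if history == "pp" else 2
        return stake if cards[0] > cards[1] else -stake
    if history == "bp":                          # player 1 folds to a bet
        return 1
    if history == "pbp":                         # player 0 folds to a bet
        return -1
    return None


def traverse(nodes, cards, history, traverser, rng):
    """Return the sampled value of this history for the traverser."""
    p0 = payoff(cards, history)
    if p0 is not None:
        return p0 if traverser == 0 else -p0
    player = len(history) % 2
    node = nodes.setdefault(f"{cards[player]}{history}", Node())
    strat = node.strategy()
    if player == traverser:
        # Explore ALL of the traverser's actions and update regrets.
        values = [traverse(nodes, cards, history + a, traverser, rng)
                  for a in ACTIONS]
        node_value = sum(s * v for s, v in zip(strat, values))
        for i, v in enumerate(values):
            node.regrets[i] += v - node_value
        return node_value
    # Opponent node: record the current strategy, then SAMPLE one action.
    for i, s in enumerate(strat):
        node.strat_sum[i] += s
    action = rng.choices(ACTIONS, weights=strat)[0]
    return traverse(nodes, cards, history + action, traverser, rng)


def train(iterations, seed=0):
    rng = random.Random(seed)
    nodes = {}
    for _ in range(iterations):
        for traverser in (0, 1):
            cards = rng.sample([0, 1, 2], 2)  # chance is sampled too
            traverse(nodes, cards, "", traverser, rng)
    return nodes


if __name__ == "__main__":
    for key, node in sorted(train(20_000).items()):
        print(key, [round(p, 2) for p in node.average()])
```

Information-set keys are the acting player's card followed by the betting history, so `"2b"` is player 1 holding the highest card and facing a bet. Calling is dominant there, which makes it a quick sanity check on the output.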

On Leduc Hold'em: MCCFR converged to the same equilibrium as vanilla CFR at **1.9x the speed**.

Even with MCCFR, full NLHE has too many states. Solution: group similar hands into buckets.

**The naive approach** clusters by average equity, which is wrong in an important way.

A flopped flush draw and a top pair can have identical average equity but completely different equity *distributions* over future runouts. The flush draw is either far ahead or far behind. The made hand is consistently ahead. GTO strategy treats these differently.

**The right metric:** Earth Mover's Distance between equity histograms.

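```python
import numpy as np

def emd(hist_a: np.ndarray, hist_b: np.ndarray) -> float:
    """Wasserstein-1 distance: captures distribution shape, not just mean."""
    # For 1-D histograms on the same bins, W1 is the summed gap between CDFs
    # (measured in units of bin width).
    cdf_a = np.cumsum(hist_a)
    cdf_b = np.cumsum(hist_b)
    return float(np.sum(np.abs(cdf_a - cdf_b)))
```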
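A small worked example shows why the mean alone is not enough. The sketch below re-implements the same CDF-difference formula in pure Python so it runs without NumPy; the two five-bin histograms are made-up numbers chosen so that both hands have the same average equity, not data from the project.

```python
from itertools import accumulate


def emd(hist_a, hist_b):
    """Wasserstein-1 distance between histograms on the same bins (bin-width units)."""
    cdf_a = accumulate(hist_a)
    cdf_b = accumulate(hist_b)
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


def mean_equity(hist, centers):
    """Average equity implied by a histogram over equity bins."""
    return sum(h * c for h, c in zip(hist, centers))


# Bin centers for equity ranges 0-20%, 20-40%, ..., 80-100%.
CENTERS = [0.1, 0.3, 0.5, 0.7, 0.9]

# Hypothetical draw: usually far ahead after the runout, sometimes far behind.
draw = [0.25, 0.0, 0.0, 0.0, 0.75]
# Hypothetical made hand: consistently moderately ahead.
made = [0.0, 0.0, 0.0, 1.0, 0.0]

if __name__ == "__main__":
    print("mean draw:", mean_equity(draw, CENTERS))
    print("mean made:", mean_equity(made, CENTERS))
    print("EMD:", emd(draw, made))
```

Both histograms average to the same equity, so a mean-based clustering would put them in one bucket, while the EMD between them is clearly nonzero.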

**Abstraction scheme:**

**Optimization that halved computation:** compute equity histograms for both players simultaneously from the same random rollouts, rather than running separate Monte Carlo samples.

Still too many information sets for a hash table. Deep CFR (Brown & Sandholm, 2019) replaces the table with neural networks.

**Two networks per player:**

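```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Excerpt: layer definitions (and AdvantageNetwork.forward) are omitted here.

class AdvantageNetwork(nn.Module):
    """Approximates cumulative counterfactual regret per action."""
    def regret_matching(self, features):
        advantages = self.forward(features)
        pos = F.relu(advantages)
        total = pos.sum(dim=-1, keepdim=True)
        n = advantages.shape[-1]  # number of actions
        uniform = torch.ones_like(advantages) / n
        # Uniform fallback when no action has positive predicted advantage
        return torch.where(total > 1e-6, pos / total, uniform)

class StrategyNetwork(nn.Module):
    """Approximates the average strategy, which is what converges to Nash."""
    def forward(self, x):
        return F.softmax(super().forward(x), dim=-1)
```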

**Feature encoding (373 dimensions):**

**Reservoir buffers** ensure all past iterations are represented in training, preventing catastrophic forgetting without unbounded memory.
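As a rough sketch of the idea, classic reservoir sampling (Algorithm R) keeps a fixed-size buffer in which every sample seen so far has an equal chance of being present. The `ReservoirBuffer` class and its method names below are assumptions for illustration, not the project's buffer.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform random sample of everything added."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items ever offered to the buffer
        self._rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not full yet: always keep the item.
            self.items.append(item)
            return
        # Buffer full: keep the new item with probability capacity / seen,
        # evicting a uniformly chosen old item.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


if __name__ == "__main__":
    buf = ReservoirBuffer(capacity=5, seed=0)
    for iteration in range(1000):
        buf.add(iteration)  # stand-in for a training sample
    print(sorted(buf.items))
```

Because old and new samples survive with equal probability, the training set stays a fair cross-section of all iterations while memory stays bounded by `capacity`.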

**The biggest performance win:** keeping networks in `eval()` mode during traversal rather than toggling per inference.

```python
# Before: called eval() on every inference (61ms/traversal)
model.eval()
with torch.no_grad():
    output = model(x)

# After: set eval() once before the traversal loop (10ms/traversal)
player.set_inference_mode()  # called once
# ... thousands of traversals ...
```

One line of code. **6x speedup.** Profile before you optimize.

The blueprint strategy from Deep CFR knows a lot but uses coarse abstractions. Real-time search fixes this.

At each decision point:

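```python
import time

def solve(self, gs: GameState, player: int) -> dict:
    self.nodes = {}  # Fresh local tree per decision
    self._iters = 0  # Restart the iteration count (assumes it is not reset elsewhere)
    t_start = time.time()

    # Run until the iteration budget or the time budget is exhausted
    while self._iters < self.config.n_iters:
        if (time.time() - t_start) * 1000 >= self.config.time_limit_ms:
            break
        # Alternate the traversing player each iteration
        self._traverse(gs, self._iters % 2, depth=1)
        self._iters += 1

    # `actions`: the legal actions at the root (not shown in this excerpt)
    return self._root_node().get_avg_strategy(actions)
```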

**Blueprint bootstrapping** blends local and blueprint strategies early in search for stability:

```
blend = min(self._iters / self.config.n_iters, 1.0)
strat = (1 - blend) * blueprint_strat + blend * local_strat
```
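To see what this schedule does over the course of a search, here is a standalone sketch of the same linear blend. The function name `blended_strategy` and the example probabilities are made up for illustration.

```python
def blended_strategy(blueprint_strat, local_strat, iters_done, n_iters):
    """Linearly shift weight from the blueprint to the local search strategy."""
    blend = min(iters_done / n_iters, 1.0)
    return [(1 - blend) * b + blend * l
            for b, l in zip(blueprint_strat, local_strat)]


if __name__ == "__main__":
    blueprint = [0.6, 0.3, 0.1]   # hypothetical fold / call / raise frequencies
    local = [0.2, 0.2, 0.6]
    for done in (0, 50, 100):
        print(done, blended_strategy(blueprint, local, done, n_iters=100))
```

At iteration 0 the result is exactly the blueprint, at the final iteration it is exactly the local strategy, and every mix in between is still a valid probability distribution because it is a convex combination of two distributions.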

Average decision time: **75ms on CPU**.

300 duplicate hand pairs (600 total hands per matchup). Duplicate scoring controls for card luck.

| Matchup | mBB/hand | Significant |
|---|---|---|
| Blueprint vs Random | +28,403 | ✓ |
| Search vs Random | +28,134 | ✓ |
| Search vs Blueprint | +31,798 | ✓ |

Search consistently outperforms blueprint-only play. This is the core empirical claim of Pluribus, validated.
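For readers unfamiliar with duplicate scoring, here is a minimal sketch of how a win rate in mBB/hand might be computed from duplicate pairs. Each deal is played twice with the seats swapped, and the two results are averaged so that the cards each side was dealt cancel out. The function name, the input format, and the normal-approximation significance check are assumptions; the post does not say which test produced the "Significant" column.

```python
import math
import statistics


def duplicate_score(pairs):
    """Win rate and standard error in mBB/hand from duplicate pairs.

    pairs: list of (result_seat_a, result_seat_b), both in big blinds from the
    same bot's perspective, for one deal played once in each seat.
    """
    # Averaging within a pair removes most of the luck of the deal.
    per_pair = [(x + y) / 2 for x, y in pairs]
    mean = statistics.fmean(per_pair)
    stderr = statistics.stdev(per_pair) / math.sqrt(len(per_pair))
    return 1000 * mean, 1000 * stderr


def is_significant(mbb, stderr, z=1.96):
    """Assumed check: the 95% normal-approximation interval excludes zero."""
    return abs(mbb) > z * stderr


if __name__ == "__main__":
    # Made-up results: big swings within a pair, small differences between pairs.
    pairs = [(2.0, -1.0), (0.5, 0.5), (-3.0, 4.0), (1.0, -1.0)]
    mbb, se = duplicate_score(pairs)
    print(f"{mbb:.0f} mBB/hand, standard error {se:.0f}, significant: {is_significant(mbb, se)}")
```

The individual hand results in the example swing by several big blinds, but the per-pair averages are tightly grouped, which is the variance reduction duplicate scoring is meant to provide.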

| | This project | Pluribus |
|---|---|---|
| Players | 2 (heads-up) | 6 |
| Traversals | ~50,000 | 12.4M |
| Hardware | Single CPU | 64-core CPU |
| Training | ~30 min | ~8 days |

Same architecture. Scale is the difference.

**1. Profile earlier.** The 6x traversal speedup from eval mode was sitting there the whole time. I spent days on algorithmic optimizations before finding it.

**2. Start with finer bet abstraction.** Five bet sizes is enough to demonstrate the technique but too coarse for real strategic depth. Pluribus used 14. The strategy changes meaningfully.

**3. Build the evaluation framework first.** I ran hundreds of training iterations before having reliable exploitability metrics. Convergence looks different than you expect: EV oscillating near zero is not the same as converging to Nash.

**GitHub:** [github.com/griff-ui/poker-ai](https://github.com/griff-ui/poker-ai)

Five stages, 40 files, 27 tests passing, full documentation. MIT licensed: use it for research, study, or as the foundation for your own solver.

**Live demo:** [griff-ui.github.io/poker-ai](https://griff-ui.github.io/poker-ai)

Browser-based hand analyzer: select cards, set game state, see GTO strategy frequencies. No Python required.