How I Built a Pluribus-Style Poker AI From Scratch

Chess and Go are perfect-information games: both players see everything. The challenge is purely computational.

Poker is an imperfect-information game. You can't see your opponent's cards. Every decision has to account for the full range of hands they might hold.

Standard minimax doesn't work. You can't evaluate a position without knowing your opponent's hand.

The solution is a Nash equilibrium: a pair of strategies where neither player can improve their expected value (EV) by unilaterally changing what they do. In poker this is called GTO (Game Theory Optimal) play. Unexploitable by definition.

Counterfactual Regret Minimization (CFR; Zinkevich et al., 2007) is the algorithm that made strong poker AI possible.

The intuition: Play against yourself repeatedly. After each game, ask: “how much better would I have done if I'd always taken a different action?” That's regret.

Update your strategy: play actions proportional to their cumulative positive regret. Repeat thousands of times. In a two-player zero-sum game, the time-average of your strategies converges to a Nash equilibrium.

I implemented this on “Leduc Hold'em”, a 6-card toy game with 216 information sets, the standard poker AI research testbed.

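def get_strategy(self, reach_prob: float) -> list[float]:
    # Regret matching: play each action in proportion to its positive cumulative regret.
    n = len(self.regrets)
    pos = [max(r, 0.0) for r in self.regrets]
    total = sum(pos)
    # Fall back to uniform play when no action has positive regret.
    strat = [p / total for p in pos] if total > 0 else [1.0 / n] * n
    # Accumulate the reach-weighted strategy; its average is what converges.
    for i in range(n):
        self.strat_sum[i] += reach_prob * strat[i]
    return strat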
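To watch regret matching converge end to end, here is a minimal self-play sketch on rock-paper-scissors rather than Leduc Hold'em. The game, the PAYOFF table, the function names, and the lopsided starting regrets are all illustrative assumptions, not code from the project.

```python
# PAYOFF[a][b]: payoff for playing action a against action b (0=rock, 1=paper, 2=scissors).
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    n = len(regrets)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n


def train_rps(iterations=20000):
    """Self-play with expected-value regret updates; returns both average strategies."""
    # Assumption: lopsided starting regrets, so play does not start at the equilibrium.
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    strat_sum = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strats = [regret_matching(r) for r in regrets]
        for p in (0, 1):
            opp = strats[1 - p]
            # Expected payoff of each pure action against the opponent's current mix.
            action_values = [sum(PAYOFF[a][b] * opp[b] for b in range(3)) for a in range(3)]
            node_value = sum(strats[p][a] * action_values[a] for a in range(3))
            for a in range(3):
                regrets[p][a] += action_values[a] - node_value
                strat_sum[p][a] += strats[p][a]
    return [[s / iterations for s in row] for row in strat_sum]


if __name__ == "__main__":
    for player, avg in enumerate(train_rps()):
        print(player, [round(x, 3) for x in avg])
```

The current strategy keeps cycling between actions; only the averaged strategy settles near one third per action, which is why the sum in strat_sum matters.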

After 10,000 iterations (2.2 seconds):

That last point surprises people. GTO isn't about always making the "right" move. It's about being unpredictable enough that no opponent strategy can consistently beat you.

Full no-limit Texas hold'em (NLHE) has ~10¹⁶⁰ game states. Traversing the full tree every iteration is impossible.

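Monte Carlo CFR (MCCFR) samples a subset of the tree each iteration. I used external sampling, which explores every action for the traversing player and samples a single action for the opponent: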

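if player == traverser:
    # Traverser's node: explore every action and record its value.
    action_values = {}
    for action in actions:
        # next_state: the state reached by taking `action`
        v = traverse(next_state, traverser)
        action_values[action] = v
    update_regrets(action_values)
else:
    # Opponent's node: sample one action from the current strategy.
    action = sample_from_strategy(strategy)
    # next_state: the state reached by taking the sampled action
    return traverse(next_state, traverser)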

On Leduc Hold'em: MCCFR converged to the same equilibrium as vanilla CFR at 1.9x the speed.
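The external-sampling traversal above is only a fragment, so here is a self-contained sketch of the same scheme on Kuhn poker (3 cards, one bet), which is smaller than Leduc Hold'em. The Node class, the "card:history" keys, and the train() helper are hypothetical names chosen for this illustration and are not taken from the project.

```python
import random

ACTIONS = ["p", "b"]  # "p" = pass (check or fold), "b" = bet (bet or call)


class Node:
    """Regret and average-strategy accumulators for one information set."""

    def __init__(self):
        self.regrets = [0.0, 0.0]
        self.strat_sum = [0.0, 0.0]

    def strategy(self):
        pos = [max(r, 0.0) for r in self.regrets]
        total = sum(pos)
        return [p / total for p in pos] if total > 0 else [0.5, 0.5]

    def average_strategy(self):
        total = sum(self.strat_sum)
        return [s / total for s in self.strat_sum] if total > 0 else [0.5, 0.5]


def terminal_payoff(cards, history, player):
    """Payoff to `player` at a terminal history, or None if the hand continues."""
    if history in ("pp", "bb", "pbb"):  # showdown for 1 chip (pp) or 2 chips
        stake = 1 if history == "pp" else 2
        return stake if cards[player] > cards[1 - player] else -stake
    if history in ("bp", "pbp"):  # the last bet was folded to
        bettor = 0 if history == "bp" else 1
        return 1 if player == bettor else -1
    return None


def traverse(nodes, cards, history, traverser, rng):
    payoff = terminal_payoff(cards, history, traverser)
    if payoff is not None:
        return payoff
    player = len(history) % 2
    node = nodes.setdefault(f"{cards[player]}:{history}", Node())
    strat = node.strategy()
    if player == traverser:
        # Traverser's node: explore every action, then update regrets.
        values = [traverse(nodes, cards, history + a, traverser, rng) for a in ACTIONS]
        node_value = sum(s * v for s, v in zip(strat, values))
        for i in range(2):
            node.regrets[i] += values[i] - node_value
        return node_value
    # Opponent's node: accumulate the average strategy, then sample one action.
    for i in range(2):
        node.strat_sum[i] += strat[i]
    action = rng.choices(ACTIONS, weights=strat)[0]
    return traverse(nodes, cards, history + action, traverser, rng)


def train(iterations=20000, seed=0):
    rng = random.Random(seed)
    nodes = {}
    for _ in range(iterations):
        cards = rng.sample([1, 2, 3], 2)  # 1 = Jack, 2 = Queen, 3 = King
        for traverser in (0, 1):
            traverse(nodes, cards, "", traverser, rng)
    return nodes
```

After training, nodes["3:b"].average_strategy() should put almost all its weight on calling (a King facing a bet), and nodes["1:b"] almost all on folding, which makes a quick sanity check before moving to a larger game.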

Even with MCCFR, full NLHE has too many states. Solution: group similar hands into buckets.

The naive approach clusters hands by average equity, which is wrong in an important way.

A flopped flush draw and a top pair can have identical average equity but completely different equity distributions over future runouts. The flush draw is either far ahead or far behind. The made hand is consistently ahead. GTO strategy treats these differently.

The right metric: Earth Mover's Distance between equity histograms.

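import numpy as np

def emd(hist_a: np.ndarray, hist_b: np.ndarray) -> float:
    """Wasserstein-1 distance: captures distribution shape, not just mean."""
    # For normalized 1-D histograms on the same bins, W1 is the summed
    # absolute difference between the two CDFs (in units of bin widths).
    cdf_a = np.cumsum(hist_a)
    cdf_b = np.cumsum(hist_b)
    return float(np.sum(np.abs(cdf_a - cdf_b)))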
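A small worked example shows why the mean alone is not enough. The sketch below restates the same CDF-difference formula without NumPy so it runs anywhere; the two 5-bin histograms are made-up numbers chosen to have identical mean equity, not measurements from the project.

```python
from itertools import accumulate


def emd(hist_a, hist_b):
    """Summed absolute CDF difference between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(accumulate(hist_a), accumulate(hist_b)))


def mean_equity(hist, centers):
    return sum(h * c for h, c in zip(hist, centers))


# Hypothetical equity histograms over 5 bins with centers 0.1 .. 0.9.
CENTERS = [0.1, 0.3, 0.5, 0.7, 0.9]
polarized = [0.5, 0.0, 0.0, 0.0, 0.5]  # far behind or far ahead, like a draw
steady = [0.0, 0.0, 1.0, 0.0, 0.0]     # always close to the same equity

same_mean = abs(mean_equity(polarized, CENTERS) - mean_equity(steady, CENTERS)) < 1e-9
distance = emd(polarized, steady)

print(same_mean, distance)  # True 2.0
```

Clustering on the mean would put these two hands in the same bucket, while the distance of 2.0 bin widths keeps them apart.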

Abstraction scheme:

One optimization halved the computation: computing equity histograms for both players simultaneously from the same random rollouts, rather than running separate Monte Carlo samples for each.

Even after abstraction, there are still too many information sets for a hash table. Deep CFR (Brown & Sandholm, 2019) replaces the table with neural networks.

Two networks per player:

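import torch
import torch.nn as nn
import torch.nn.functional as F

class AdvantageNetwork(nn.Module):
    """Approximates cumulative counterfactual regret per action."""
    # Layer definitions and forward() are omitted in this excerpt.
    def regret_matching(self, features):
        advantages = self.forward(features)
        pos = F.relu(advantages)                    # keep positive advantages only
        total = pos.sum(dim=-1, keepdim=True)
        n = advantages.shape[-1]                    # number of actions
        uniform = torch.ones_like(advantages) / n   # fallback when nothing is positive
        return torch.where(total > 1e-6, pos / total, uniform)

class StrategyNetwork(nn.Module):
    """Approximates the average strategy, which converges to Nash."""
    # Note: super().forward() needs a parent class that defines forward()
    # (e.g. a shared MLP base, not shown); plain nn.Module does not.
    def forward(self, x):
        return F.softmax(super().forward(x), dim=-1)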

Feature encoding (373 dimensions):

Reservoir buffers ensure all past iterations are represented in training, preventing catastrophic forgetting without unbounded memory.
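Reservoir sampling is what keeps that memory bounded: every sample ever offered has the same chance of being in the buffer, no matter when it arrived. A minimal sketch of the standard algorithm follows; the class and attribute names are assumptions for illustration, and the project's buffer may store tensors and iteration weights rather than plain Python objects.

```python
import random


class ReservoirBuffer:
    """Fixed-size uniform sample over everything ever added (Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not full yet: keep everything.
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen existing slot.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


if __name__ == "__main__":
    buf = ReservoirBuffer(capacity=100, seed=0)
    for sample_id in range(1000):
        buf.add(sample_id)
    print(len(buf.items), buf.seen)  # 100 1000
```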

The biggest performance win: keeping networks in eval() mode during traversal rather than toggling per inference.

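# Before: toggling eval mode on every inference call
model.eval()
with torch.no_grad():
    output = model(x)

# After: switch to inference mode once, before traversal starts
player.set_inference_mode()  # called once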

One line of code. 6x speedup. Profile before you optimize.

The blueprint strategy from Deep CFR knows a lot but uses coarse abstractions. Real-time search fixes this.

At each decision point:

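import time

def solve(self, gs: GameState, player: int) -> dict:
    self.nodes = {}  # Fresh local tree per decision
    self._iters = 0  # Restart the iteration count for this decision
    t_start = time.perf_counter()

    while self._iters < self.config.n_iters:
        # Stop early once the per-decision time budget is spent.
        if (time.perf_counter() - t_start) * 1000 >= self.config.time_limit_ms:
            break
        # Alternate the traversing player on each iteration.
        self._traverse(gs, self._iters % 2, depth=1)
        self._iters += 1

    # `actions`: the legal actions at the root, computed elsewhere in the solver.
    return self._root_node().get_avg_strategy(actions)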

Blueprint bootstrapping blends local and blueprint strategies early in search for stability:

blend = min(self._iters / self.config.n_iters, 1.0)
strat = (1 - blend) * blueprint_strat + blend * local_strat
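To make the schedule concrete, the sketch below applies the same linear blend to whole strategies stored as action-to-probability dicts. The function name, the dict representation, and the example probabilities are assumptions for illustration only.

```python
def blended_strategy(blueprint, local, iters, n_iters):
    """Linear blend: all blueprint at iteration 0, all local search at n_iters."""
    blend = min(iters / n_iters, 1.0)
    return {
        action: (1 - blend) * blueprint[action] + blend * local.get(action, 0.0)
        for action in blueprint
    }


if __name__ == "__main__":
    # Made-up probabilities for one decision point.
    blueprint = {"fold": 0.2, "call": 0.5, "raise": 0.3}
    local = {"fold": 0.0, "call": 0.2, "raise": 0.8}
    for iters in (0, 50, 100):
        print(iters, blended_strategy(blueprint, local, iters, n_iters=100))
```

Because both inputs sum to one and the weights sum to one, the blended strategy is still a valid probability distribution at every iteration.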

Average decision time: 75ms on CPU.

The evaluation used 300 duplicate hand pairs (600 total hands per matchup). Duplicate scoring, where each deal is replayed with the seats swapped, controls for card luck.

Matchup	mBB/hand	Significant
Blueprint vs Random	+28,403	✓
Search vs Random	+28,134	✓
Search vs Blueprint	+31,798	✓

Search consistently outperforms blueprint-only play, which validates the core empirical claim of Pluribus.
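For readers who want to reproduce this kind of table, here is one way duplicate results could be turned into a win rate with an error bar. It is a sketch under stated assumptions: results are recorded in big blinds per hand from one agent's point of view, each pair is averaged before any statistics are taken, and significance is a simple 1.96 standard-error check. The project's actual test is not described in the article, and the sample numbers are made up.

```python
import math
import statistics


def duplicate_winrate(pair_results_bb):
    """pair_results_bb: list of (seat_a_result, seat_b_result) for one agent, in big blinds."""
    # Averaging within a pair cancels most of the luck of the deal.
    per_pair = [(a + b) / 2 for a, b in pair_results_bb]
    mean_bb = statistics.mean(per_pair)
    std_err_bb = statistics.stdev(per_pair) / math.sqrt(len(per_pair))
    return {
        "mbb_per_hand": 1000 * mean_bb,
        "std_err_mbb": 1000 * std_err_bb,
        # Assumption: two-sided check at roughly the 95% level.
        "significant": abs(mean_bb) > 1.96 * std_err_bb,
    }


if __name__ == "__main__":
    pairs = [(2.0, -1.0), (0.5, 0.5), (-1.0, 3.0)]  # made-up results
    print(duplicate_winrate(pairs))
```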

	This project	Pluribus
Players	2 (heads-up)	6
Traversals	~50,000	12.4M
Hardware	Single CPU	64-core CPU
Training	~30 min	~8 days

Same architecture. Scale is the difference.

1. Profile earlier. The 6x traversal speedup from eval mode was sitting there the whole time. I spent days on algorithmic optimizations before finding it.

2. Start with finer bet abstraction. Five bet sizes is enough to demonstrate the technique but too coarse for real strategic depth. Pluribus used 14. The strategy changes meaningfully.

3. Build the evaluation framework first. I ran hundreds of training iterations before having reliable exploitability metrics. Convergence looks different than you expect: EV oscillating near zero is not the same as converging to Nash.
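The gap between "EV near zero" and "close to Nash" from the third lesson is easy to show on rock-paper-scissors. The sketch below uses hypothetical helper names and a toy game rather than the project's exploitability code: it measures how much a best-responding opponent wins against a fixed strategy.

```python
# PAYOFF[a][b]: payoff for playing action a against action b (0=rock, 1=paper, 2=scissors).
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def expected_value(strategy, opponent):
    """EV for `strategy` when both players follow fixed mixed strategies."""
    return sum(
        strategy[a] * opponent[b] * PAYOFF[a][b]
        for a in range(3)
        for b in range(3)
    )


def exploitability(strategy):
    """Best-response payoff against `strategy`; zero only at the equilibrium."""
    return max(
        sum(PAYOFF[a][b] * strategy[b] for b in range(3))
        for a in range(3)
    )


if __name__ == "__main__":
    always_rock = [1.0, 0.0, 0.0]
    uniform = [1 / 3, 1 / 3, 1 / 3]
    print(expected_value(always_rock, always_rock))  # self-play EV looks fine
    print(exploitability(always_rock))               # but paper beats it every time
    print(exploitability(uniform))
```

Always-rock against itself has an EV of exactly zero, yet a best response wins one unit per hand against it; only the uniform strategy has zero exploitability.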

GitHub: github.com/griff-ui/poker-ai

Five stages, 40 files, 27 tests passing, full documentation. MIT licensed: use it for research, study, or as the foundation for your own solver.

Live demo: griff-ui.github.io/poker-ai

Browser-based hand analyzer: select cards, set the game state, and see GTO strategy frequencies. No Python required.

Original article: dev.to