Stop letting your AI agent eyeball A/B picks — wire in a real contextual bandit via MCP (free, no key)

A developer demonstrates that using an LLM agent to pick A/B variants is flawed because the model lacks concepts of sample size and exploration, leading to suboptimal decisions. Instead, they show how to route the decision to a deterministic contextual bandit algorithm via the OraClaw MCP server, which can be integrated into any MCP-compatible agent with a single line of configuration. The approach provides auditable, reproducible decisions that properly balance exploration and exploitation.

If you give an LLM agent a table of A/B variants and ask "which one should we send next?", it will confidently pick the one with the highest conversion rate. That feels right. It is often wrong.

The model has no concept of sample size, exploration, or regret. It pattern-matches "biggest number = winner" and moves on. For a one-off question, fine. But inside an agent loop that picks a variant on every request (email subject lines, ad copy, model routing, recommendation ranking), that naïve pick quietly accumulates regret and starves the options it never gave a fair chance.

The fix isn't a better prompt. It's to not ask the LLM to do the math at all. Route the decision to a real bandit algorithm and let the model do what it's good at (orchestration, language) while a deterministic solver does what it's good at (the optimization).

This post is a copy-paste demo you can run in your terminal right now, no signup, no API key. I'll use [OraClaw](https://www.npmjs.com/package/@oraclaw/mcp-server), a deterministic decision-intelligence MCP server, but the point stands regardless of tool: stop letting the model guess at math it can verify.

Here's a realistic state mid-experiment. Three subject lines, different amounts of traffic:

| Variant | Pulls | Rewards (conversions) | Raw rate |
|---|---|---|---|
| A | 120 | 18 | 15.0% |
| B | 80 | 17 | 21.3% |
| C | 15 | 4 | 26.7% |

Ask an LLM "which should we send next?" and you'll usually get B: it has the best rate among the well-tested variants, and C "only has 15 samples, too noisy to trust." That reasoning sounds responsible. It's exactly backwards. With only 15 pulls, C is under-explored. We don't actually know it's worse, and the cost of finding out is tiny. A bandit's whole job is to weigh that uncertainty instead of hand-waving it away.

Let's get a real answer. OraClaw exposes a free, no-auth REST endpoint.
Paste this into your terminal (nothing to install, nothing to sign up for):

```bash
curl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "ucb1",
    "arms": [
      {"id": "variant a", "name": "Subject line A", "pulls": 120, "totalReward": 18},
      {"id": "variant b", "name": "Subject line B", "pulls": 80, "totalReward": 17},
      {"id": "variant c", "name": "Subject line C", "pulls": 15, "totalReward": 4}
    ]
  }'
```

The response (this is the actual output, abbreviated):

```json
{
  "selected": { "id": "variant c", "name": "Subject line C" },
  "score": 1.4633997784480877,
  "algorithm": "ucb1",
  "exploitation": 0.2666666666666666,
  "exploration": 1.196733111781421,
  "regret": {
    "cumulativeRegret": 18.333333333333314,
    "averageRegret": 0.08527131782945728,
    "estimatedOptimalArm": "variant c",
    "totalPulls": 215
  }
}
```

UCB1 picks C, and the response shows why in a way you can audit: a small exploitation term (its observed rate, 0.267) plus a much larger exploration bonus (1.197, because we've barely tested it). The sum, the upper confidence bound, is what it actually optimizes. That's the principled "give the under-sampled option a shot" reasoning the LLM only gestured at.

Two things worth noticing:

- Run the curl again and you get the same `score: 1.4633997784480877`. UCB1 has no randomness; the same inputs always yield the same decision. That's the difference between a tool you can put in a CI test and a model whose answer drifts run to run.
- If you prefer Thompson sampling, set `"algorithm": "thompson"`.

The REST call is the proof. The real ergonomics come from MCP: your agent calls it like any other tool, no glue code. Add the server to Claude Code or any MCP client in one line:

```bash
claude mcp add oraclaw -- npx -y @oraclaw/mcp-server
```

Or drop it straight into a client config:

```json
{
  "mcpServers": {
    "oraclaw": {
      "command": "npx",
      "args": ["-y", "@oraclaw/mcp-server"]
    }
  }
}
```

Now your agent has an optimize bandit tool.
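If you want to audit the UCB1 response from earlier in this post by hand, the arithmetic is small enough to redo locally. The following is a minimal sketch, not OraClaw's source: the bonus form `sqrt(4 * ln(N) / n)` and the regret formula (best observed mean times total pulls, minus total reward) are assumptions inferred from the returned numbers, and `ucb_scores` and `bonus_scale` are hypothetical names. Textbook UCB1 uses a factor of 2 rather than 4.

```python
import math

# Same arm statistics as the request body in the post.
arms = [
    {"id": "variant a", "pulls": 120, "totalReward": 18},
    {"id": "variant b", "pulls": 80, "totalReward": 17},
    {"id": "variant c", "pulls": 15, "totalReward": 4},
]


def ucb_scores(arms, bonus_scale=4.0):
    """Return {arm_id: (exploitation, exploration, score)}.

    bonus_scale=4.0 is an assumption inferred from the API output;
    the textbook UCB1 bonus uses 2.0.
    """
    total_pulls = sum(a["pulls"] for a in arms)
    scores = {}
    for a in arms:
        exploitation = a["totalReward"] / a["pulls"]  # observed mean reward
        exploration = math.sqrt(bonus_scale * math.log(total_pulls) / a["pulls"])
        scores[a["id"]] = (exploitation, exploration, exploitation + exploration)
    return scores


scores = ucb_scores(arms)
selected = max(scores, key=lambda arm_id: scores[arm_id][2])

# Regret relative to the arm with the best observed mean (assumed definition).
total_pulls = sum(a["pulls"] for a in arms)
total_reward = sum(a["totalReward"] for a in arms)
best_mean = max(exploitation for exploitation, _, _ in scores.values())
cumulative_regret = best_mean * total_pulls - total_reward
average_regret = cumulative_regret / total_pulls

print(selected, scores[selected])
print(cumulative_regret, average_regret)
```

Under those assumptions the sketch selects `variant c` and reproduces the returned score and regret figures up to floating-point rounding. Calling `ucb_scores(arms, bonus_scale=2.0)` shrinks every bonus, but C still ranks first for this table.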
Instead of prompting the model to reason about exploration, you let it call the solver and act on a verifiable result. The MCP call returns the identical payload (same `score: 1.4633997784480877`); the MCP server and the REST API are the same engine.

The plain bandit assumes the best arm is fixed. Often it isn't: the right model/route/variant depends on the request. That's a contextual bandit, and it's a one-tool swap (optimize contextual, a LinUCB implementation). Feed a feature vector describing the current situation:

```bash
curl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/contextual-bandit \
  -H "Content-Type: application/json" \
  -d '{
    "arms": [
      {"id": "small", "name": "small-fast-model"},
      {"id": "mid", "name": "mid-model"},
      {"id": "frontier", "name": "frontier-model"}
    ],
    "context": [0.9, 0.2, 1.0],
    "history": [
      {"armId": "small", "context": [0.1, 0.1, 0.0], "reward": 1.0},
      {"armId": "frontier", "context": [0.9, 0.2, 1.0], "reward": 0.95},
      {"armId": "small", "context": [0.9, 0.2, 1.0], "reward": 0.2}
    ]
  }'
```

Here the context vector might encode [task difficulty, latency budget, needs reasoning]. The model that wins on an easy, latency-sensitive task is not the one that wins on a hard reasoning task. LinUCB learns that mapping from history instead of you maintaining a brittle `if difficulty > 0.7` ladder by hand.

This is the honest version of "let the agent pick which model to call": don't have the LLM introspect about cost/quality tradeoffs in a prompt; give it a learner.

The bandit is one of ~20 algorithms in the same server: forecasting (ARIMA / Holt-Winters), anomaly detection, linear/MIP optimization (HiGHS), Monte Carlo, PageRank/graph analysis, CMA-ES, conformal scoring. Same pattern every time: the agent describes the problem, a deterministic solver returns an answer you can check.

This call lists every endpoint:

```bash
curl https://oraclaw-api.onrender.com/api/v1/health
```

To add the MCP server:

```bash
claude mcp add oraclaw -- npx -y @oraclaw/mcp-server
```

The free MCP tools need no key.
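For intuition about what a LinUCB learner does with the context and history from the model-routing example, here is a minimal pure-Python sketch of the standard disjoint LinUCB rule. It is not OraClaw's implementation: the identity ridge prior, the `alpha=1.0` exploration weight, and the names `LinUCBArm` and `linucb_select` are all assumptions for illustration, so its scores need not match what the API returns.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


class LinUCBArm:
    """Per-arm ridge regression state: A^-1 (starting at identity) and b."""

    def __init__(self, dim):
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^-1 after adding x x^T to A.
        ax = mat_vec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        n = len(x)
        self.a_inv = [
            [self.a_inv[i][j] - ax[i] * ax[j] / denom for j in range(n)]
            for i in range(n)
        ]
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

    def score(self, x, alpha):
        theta = mat_vec(self.a_inv, self.b)      # ridge estimate of the weights
        width = math.sqrt(dot(x, mat_vec(self.a_inv, x)))  # uncertainty at x
        return dot(theta, x) + alpha * width


def linucb_select(arm_ids, history, context, alpha=1.0):
    arms = {arm_id: LinUCBArm(len(context)) for arm_id in arm_ids}
    for event in history:
        arms[event["armId"]].update(event["context"], event["reward"])
    scores = {arm_id: arm.score(context, alpha) for arm_id, arm in arms.items()}
    return max(scores, key=scores.get), scores


# Same arms, context, and history as the request body in the post.
history = [
    {"armId": "small", "context": [0.1, 0.1, 0.0], "reward": 1.0},
    {"armId": "frontier", "context": [0.9, 0.2, 1.0], "reward": 0.95},
    {"armId": "small", "context": [0.9, 0.2, 1.0], "reward": 0.2},
]
selected, scores = linucb_select(
    ["small", "mid", "frontier"], history, context=[0.9, 0.2, 1.0]
)
print(selected, scores)
```

With these assumed settings the sketch ranks `frontier` first for the hard, reasoning-heavy context: `small` has already been observed doing poorly there, while the untested `mid` arm gets a pure exploration bonus and lands just behind. A different `alpha` can change that ordering, which is exactly the exploration knob you would otherwise be asking a prompt to improvise.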
To sign up, post your email to the signup endpoint:

```bash
curl -s -X POST https://oraclaw-api.onrender.com/api/v1/auth/signup \
  -H "Content-Type: application/json" \
  -d '{"email":"you@example.com"}'
```

If you outgrow the free tier, higher limits start at $9/mo, with [direct checkout here](https://oraclaw-api.onrender.com/api/v1/billing/checkout?tier=starter). But you can do everything in this post for $0.

If your agent is making decisions, make them ones you can verify. Stop asking the model to eyeball the math; route it to something that gets it provably right.

Run the demo, then tell me in the comments what your agent was eyeballing that it shouldn't have been.