Source: dev.to · topic: machine-learning

Stop letting your AI agent eyeball A/B picks — wire in a real contextual bandit via MCP (free, no key)

A developer demonstrates that using an LLM agent to pick A/B variants is flawed because the model lacks concepts of sample size and exploration, leading to suboptimal decisions. Instead, they show how to route the decision to a deterministic contextual bandit algorithm via the OraClaw MCP server, which can be integrated into any MCP-compatible agent with a single line of configuration. The approach provides auditable, reproducible decisions that properly balance exploration and exploitation.

5 min read · published Jun 24, 2026

If you give an LLM agent a table of A/B variants and ask "which one should we send next?", it will confidently pick the one with the highest conversion rate.

That feels right. It is often wrong.

The model has no concept of sample size, exploration, or regret. It pattern-matches "biggest number = winner" and moves on. For a one-off question, fine. But inside an agent loop that picks a variant on every request — email subject lines, ad copy, model routing, recommendation ranking — that naïve pick quietly accumulates regret and starves the options it never gave a fair chance.

The fix isn't a better prompt. It's to not ask the LLM to do the math at all. Route the decision to a real bandit algorithm and let the model do what it's good at (orchestration, language) while a deterministic solver does what it's good at (the optimization).

This post is a copy-paste demo you can run in your terminal right now, no signup, no API key. I'll use OraClaw — a deterministic decision-intelligence MCP server — but the point stands regardless of tool: stop letting the model guess at math it can verify.

Here's a realistic state mid-experiment. Three subject lines, different amounts of traffic:

Variant   Pulls   Rewards (conversions)   Raw rate
A         120     18                      15.0%
B         80      17                      21.3%
C         15      4                       26.7%

Ask an LLM "which should we send next?" and you'll usually get B — it has the best rate among the well-tested variants, and C "only has 15 samples, too noisy to trust."

That reasoning sounds responsible. It's exactly backwards. With only 15 pulls, C is under-explored — we don't actually know it's worse, and the cost of finding out is tiny. A bandit's whole job is to weigh that uncertainty instead of hand-waving it away.
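To put a number on "under-explored", here is a minimal sketch that computes a 95% Wilson score interval for each variant in the table. The Wilson interval is my choice of illustration, not something the post or OraClaw uses; the point is only to show how much wider C's plausible range is than A's or B's.

```python
import math

# (pulls, conversions) copied from the table above.
ARMS = {"A": (120, 18), "B": (80, 17), "C": (15, 4)}


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


for name, (pulls, rewards) in ARMS.items():
    low, high = wilson_interval(rewards, pulls)
    print(f"{name}: rate={rewards / pulls:.3f}  95% interval=[{low:.3f}, {high:.3f}]")
```

C's interval is by far the widest and its upper end sits well above B's, which is the uncertainty a bandit is built to act on rather than dismiss.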

Let's get a real answer.

OraClaw exposes a free, no-auth REST endpoint. Paste this into your terminal — nothing to install, nothing to sign up for:

curl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "ucb1",
    "arms": [
      {"id": "variant_a", "name": "Subject line A", "pulls": 120, "totalReward": 18},
      {"id": "variant_b", "name": "Subject line B", "pulls": 80,  "totalReward": 17},
      {"id": "variant_c", "name": "Subject line C", "pulls": 15,  "totalReward": 4}
    ]
  }'

The response (this is the actual output, abbreviated):

{
  "selected": { "id": "variant_c", "name": "Subject line C" },
  "score": 1.4633997784480877,
  "algorithm": "ucb1",
  "exploitation": 0.2666666666666666,
  "exploration": 1.196733111781421,
  "regret": {
    "cumulativeRegret": 18.333333333333314,
    "averageRegret": 0.08527131782945728,
    "estimatedOptimalArm": "variant_c",
    "totalPulls": 215
  }
}
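Inside an agent loop you would consume that response in code rather than by eye. The sketch below uses only Python's standard library and assumes the endpoint and response shape are exactly as shown in this post; the helper names (build_payload, request_decision, check_decision) are hypothetical, and the network call is defined but not executed here.

```python
import json
import math
import urllib.request

# Endpoint as given in the post; availability is not guaranteed.
BANDIT_URL = "https://oraclaw-api.onrender.com/api/v1/optimize/bandit"


def build_payload(arms, algorithm="ucb1"):
    """Build the request body from (id, name, pulls, total_reward) tuples."""
    return {
        "algorithm": algorithm,
        "arms": [
            {"id": i, "name": n, "pulls": p, "totalReward": r}
            for i, n, p, r in arms
        ],
    }


def request_decision(arms, algorithm="ucb1", timeout=10):
    """POST the payload and return the decoded JSON response."""
    body = json.dumps(build_payload(arms, algorithm)).encode("utf-8")
    req = urllib.request.Request(
        BANDIT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def check_decision(response):
    """Return the selected arm id after checking score = exploitation + exploration."""
    expected = response["exploitation"] + response["exploration"]
    if not math.isclose(response["score"], expected, rel_tol=1e-9):
        raise ValueError("score does not match exploitation + exploration")
    return response["selected"]["id"]


# The abbreviated response from the post, used here instead of a live call.
SAMPLE_RESPONSE = json.loads("""
{
  "selected": {"id": "variant_c", "name": "Subject line C"},
  "score": 1.4633997784480877,
  "algorithm": "ucb1",
  "exploitation": 0.2666666666666666,
  "exploration": 1.196733111781421
}
""")

print(check_decision(SAMPLE_RESPONSE))
```

Checking that the score equals the sum of its two parts is a cheap sanity test that the agent is acting on a well-formed decision.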

UCB1 picks C, and the response shows why in a way you can audit: a small exploitation term (C's observed rate of 0.267 rests on only 15 pulls) but a large exploration bonus (we've barely tested it). The sum, the upper confidence bound, is what it actually optimizes. That's the principled "give the under-sampled option a shot" reasoning the LLM only gestured at.
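The numbers in the response can be reproduced by hand. The sketch below is my reconstruction, not OraClaw's source: it assumes the exploration bonus is c * sqrt(ln(N) / n) with c = 2, a constant inferred from the published output (the textbook UCB1 bonus is sqrt(2 * ln(N) / n), which is smaller by a factor of sqrt(2)). It also assumes regret is measured against the arm with the best observed rate.

```python
import math

# (pulls, total reward) per arm, from the request above.
ARMS = {
    "variant_a": (120, 18),
    "variant_b": (80, 17),
    "variant_c": (15, 4),
}

# Assumption: inferred from the response shown in the post, not documented.
EXPLORATION_CONSTANT = 2.0


def ucb_breakdown(arms, c=EXPLORATION_CONSTANT):
    """Return {arm: (exploitation, exploration, score)}; needs pulls > 0 everywhere."""
    total_pulls = sum(pulls for pulls, _ in arms.values())
    result = {}
    for arm, (pulls, reward) in arms.items():
        exploitation = reward / pulls  # observed mean reward
        exploration = c * math.sqrt(math.log(total_pulls) / pulls)
        result[arm] = (exploitation, exploration, exploitation + exploration)
    return result


def cumulative_regret(arms):
    """Reward lost so far relative to always pulling the best observed arm."""
    best = max(reward / pulls for pulls, reward in arms.values())
    return sum(pulls * (best - reward / pulls) for pulls, reward in arms.values())


breakdown = ucb_breakdown(ARMS)
selected = max(breakdown, key=lambda arm: breakdown[arm][2])
for arm, (exploit, explore, score) in breakdown.items():
    print(f"{arm}: exploit={exploit:.4f} explore={explore:.4f} score={score:.4f}")
print("selected:", selected)
print("cumulative regret:", round(cumulative_regret(ARMS), 4))
```

Under those assumptions C's exploration term comes out near 1.1967 and the cumulative regret near 18.33, in line with the response, while A and B score roughly 0.57 and 0.73.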

Worth noticing: run the same curl again and you get the same score: 1.4633997784480877. UCB1 has no randomness; the same inputs always yield the same decision. That's the difference between a tool you can put in a CI test and a model whose answer drifts run to run. (That changes if you switch to "algorithm": "thompson", since Thompson sampling draws random samples.)

The REST call is the proof. The real ergonomics come from MCP: your agent calls it like any other tool, no glue code.

Add the server to Claude Code (or any MCP client) in one line:

claude mcp add oraclaw -- npx -y @oraclaw/mcp-server

Or drop it straight into a client config:

{
  "mcpServers": {
    "oraclaw": {
      "command": "npx",
      "args": ["-y", "@oraclaw/mcp-server"]
    }
  }
}

Now your agent has an optimize_bandit tool. Instead of prompting the model to reason about exploration, you let it call the solver and act on a verifiable result. The MCP call returns the identical payload (same score: 1.4633997784480877); the MCP server and the REST API are the same engine.

The plain bandit assumes the best arm is fixed. Often it isn't: the right model/route/variant depends on the request. That's a contextual bandit, and it's a one-tool swap (optimize_contextual, a LinUCB implementation). Feed a feature vector describing the current situation:

curl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/contextual-bandit \
  -H "Content-Type: application/json" \
  -d '{
    "arms": [
      {"id": "small",   "name": "small-fast-model"},
      {"id": "mid",     "name": "mid-model"},
      {"id": "frontier","name": "frontier-model"}
    ],
    "context": [0.9, 0.2, 1.0],
    "history": [
      {"armId": "small",    "context": [0.1, 0.1, 0.0], "reward": 1.0},
      {"armId": "frontier", "context": [0.9, 0.2, 1.0], "reward": 0.95},
      {"armId": "small",    "context": [0.9, 0.2, 1.0], "reward": 0.2}
    ]
  }'

Here the context vector might encode [task_difficulty, latency_budget, needs_reasoning]. The model that wins on an easy, latency-sensitive task is not the one that wins on a hard reasoning task. LinUCB learns that mapping from history instead of you maintaining a brittle if difficulty > 0.7 ladder by hand. This is the honest version of "let the agent pick which model to call": don't have the LLM introspect about cost/quality tradeoffs in a prompt; give it a learner.
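To make "LinUCB learns that mapping from history" concrete, here is a dependency-free sketch of textbook disjoint LinUCB run on the request above. It is not OraClaw's implementation: the identity ridge term, alpha = 1.0, and the function names are my assumptions, so the service's actual scores may differ.

```python
import math


def solve(A, y):
    """Solve A z = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linucb_scores(arm_ids, history, context, alpha=1.0):
    """Score each arm as theta.x + alpha * sqrt(x^T A^-1 x), one model per arm."""
    d = len(context)
    scores = {}
    for arm in arm_ids:
        # A starts as the identity (ridge prior); b accumulates reward * context.
        A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        b = [0.0] * d
        for event in history:
            if event["armId"] != arm:
                continue
            x, reward = event["context"], event["reward"]
            for i in range(d):
                b[i] += reward * x[i]
                for j in range(d):
                    A[i][j] += x[i] * x[j]
        theta = solve(A, b)           # ridge-regression estimate of reward weights
        a_inv_x = solve(A, context)   # A^-1 x, used for the uncertainty width
        scores[arm] = dot(theta, context) + alpha * math.sqrt(dot(context, a_inv_x))
    return scores


HISTORY = [
    {"armId": "small", "context": [0.1, 0.1, 0.0], "reward": 1.0},
    {"armId": "frontier", "context": [0.9, 0.2, 1.0], "reward": 0.95},
    {"armId": "small", "context": [0.9, 0.2, 1.0], "reward": 0.2},
]

scores = linucb_scores(["small", "mid", "frontier"], HISTORY, [0.9, 0.2, 1.0])
for arm, score in scores.items():
    print(f"{arm}: {score:.4f}")
print("selected:", max(scores, key=scores.get))
```

With these settings the hard-task context favors the frontier arm, the untried mid arm stays competitive purely on its uncertainty bonus, and the small arm is penalized for its poor reward on an identical context.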

The bandit is one of ~20 algorithms in the same server — forecasting (ARIMA / Holt-Winters), anomaly detection, linear/MIP optimization (HiGHS), Monte Carlo, PageRank/graph analysis, CMA-ES, conformal scoring. Same pattern every time: the agent describes the problem, a deterministic solver returns an answer you can check.

curl https://oraclaw-api.onrender.com/api/v1/health
(lists every endpoint)

   claude mcp add oraclaw -- npx -y @oraclaw/mcp-server

The free MCP tools need no key.

   curl -s -X POST https://oraclaw-api.onrender.com/api/v1/auth/signup \
     -H "Content-Type: application/json" -d '{"email":"you@example.com"}'

If you outgrow the free tier, higher limits start at $9/mo. But you can do everything in this post for $0.

If your agent is making decisions, make them ones you can verify. Stop asking the model to eyeball the math — route it to something that gets it provably right.

Run the demo, then tell me in the comments what your agent was eyeballing that it shouldn't have been.
