{"slug": "stop-letting-your-ai-agent-eyeball-a-b-picks-wire-in-a-real-contextual-bandit-no", "title": "Stop letting your AI agent eyeball A/B picks — wire in a real contextual bandit via MCP (free, no key)", "summary": "A developer demonstrates that using an LLM agent to pick A/B variants is flawed because the model lacks concepts of sample size and exploration, leading to suboptimal decisions. Instead, they show how to route the decision to a deterministic contextual bandit algorithm via the OraClaw MCP server, which can be integrated into any MCP-compatible agent with a single line of configuration. The approach provides auditable, reproducible decisions that properly balance exploration and exploitation.", "body_md": "If you give an LLM agent a table of A/B variants and ask \"which one should we send next?\", it will confidently pick the one with the highest conversion rate.\n\nThat feels right. It is often wrong.\n\nThe model has no concept of *sample size*, *exploration*, or *regret*. It pattern-matches \"biggest number = winner\" and moves on. For a one-off question, fine. But inside an agent loop that picks a variant on every request — email subject lines, ad copy, model routing, recommendation ranking — that naïve pick quietly accumulates regret and starves the options it never gave a fair chance.\n\nThe fix isn't a better prompt. It's to **not ask the LLM to do the math at all.** Route the decision to a real bandit algorithm and let the model do what it's good at (orchestration, language) while a deterministic solver does what *it's* good at (the optimization).\n\nThis post is a copy-paste demo you can run in your terminal **right now**, no signup, no API key. I'll use [OraClaw](https://www.npmjs.com/package/@oraclaw/mcp-server) — a deterministic decision-intelligence MCP server — but the point stands regardless of tool: stop letting the model guess at math it can verify.\n\nHere's a realistic state mid-experiment. 
Three subject lines, different amounts of traffic:\n\n| Variant | Pulls | Rewards (conversions) | Raw rate |\n|---|---|---|---|\n| A | 120 | 18 | 15.0% |\n| B | 80 | 17 | 21.3% |\n| C | 15 | 4 | 26.7% |\n\nAsk an LLM \"which should we send next?\" and you'll usually get **B** — it has the best rate among the well-tested variants, and C \"only has 15 samples, too noisy to trust.\"\n\nThat reasoning sounds responsible. It's exactly backwards. With only 15 pulls, C is *under-explored* — we don't actually know it's worse, and the cost of finding out is tiny. A bandit's whole job is to weigh that uncertainty instead of hand-waving it away.\n\nLet's get a real answer.\n\nOraClaw exposes a free, no-auth REST endpoint. Paste this into your terminal — nothing to install, nothing to sign up for:\n\n```\ncurl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"algorithm\": \"ucb1\",\n    \"arms\": [\n      {\"id\": \"variant_a\", \"name\": \"Subject line A\", \"pulls\": 120, \"totalReward\": 18},\n      {\"id\": \"variant_b\", \"name\": \"Subject line B\", \"pulls\": 80,  \"totalReward\": 17},\n      {\"id\": \"variant_c\", \"name\": \"Subject line C\", \"pulls\": 15,  \"totalReward\": 4}\n    ]\n  }'\n```\n\nThe response (this is the actual output, abbreviated):\n\n```\n{\n  \"selected\": { \"id\": \"variant_c\", \"name\": \"Subject line C\" },\n  \"score\": 1.4633997784480877,\n  \"algorithm\": \"ucb1\",\n  \"exploitation\": 0.2666666666666666,\n  \"exploration\": 1.196733111781421,\n  \"regret\": {\n    \"cumulativeRegret\": 18.333333333333314,\n    \"averageRegret\": 0.08527131782945728,\n    \"estimatedOptimalArm\": \"variant_c\",\n    \"totalPulls\": 215\n  }\n}\n```\n\nUCB1 picks **C**, and the response shows *why* in a way you can audit: an `exploitation` term of about 0.267 (C's observed rate, the highest of the three but resting on only 15 pulls) plus a much larger `exploration` bonus of about 1.197 (we've barely tested it). 
The sum — the upper confidence bound — is what it actually optimizes. That's the principled \"give the under-sampled option a shot\" reasoning the LLM only gestured at.\n\nWorth noticing: run that `curl` again and you get the same `score: 1.4633997784480877`. UCB1 has no randomness; the same inputs always yield the same decision. That's the difference between a tool you can put in a CI test and a model whose answer drifts run to run. (If you want randomized exploration instead, set `\"algorithm\": \"thompson\"`.)\n\nThe REST call is the proof. The real ergonomics come from MCP — your agent calls it like any other tool, no glue code.\n\nAdd the server to Claude Code (or any MCP client) in one line:\n\n```\nclaude mcp add oraclaw -- npx -y @oraclaw/mcp-server\n```\n\nOr drop it straight into a client config:\n\n```\n{\n  \"mcpServers\": {\n    \"oraclaw\": {\n      \"command\": \"npx\",\n      \"args\": [\"-y\", \"@oraclaw/mcp-server\"]\n    }\n  }\n}\n```\n\nNow your agent has an `optimize_bandit` tool. Instead of *prompting* the model to reason about exploration, you let it call the solver and act on a verifiable result. The MCP call returns the identical payload (same `score: 1.4633997784480877`) — the MCP server and the REST API are the same engine.\n\nThe plain bandit assumes the best arm is fixed. Often it isn't — the right model/route/variant depends on the request. That's a **contextual** bandit, and it's a one-tool swap (`optimize_contextual`, a LinUCB implementation). 
Feed a feature vector describing the current situation:\n\n```\ncurl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/contextual-bandit \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"arms\": [\n      {\"id\": \"small\",   \"name\": \"small-fast-model\"},\n      {\"id\": \"mid\",     \"name\": \"mid-model\"},\n      {\"id\": \"frontier\",\"name\": \"frontier-model\"}\n    ],\n    \"context\": [0.9, 0.2, 1.0],\n    \"history\": [\n      {\"armId\": \"small\",    \"context\": [0.1, 0.1, 0.0], \"reward\": 1.0},\n      {\"armId\": \"frontier\", \"context\": [0.9, 0.2, 1.0], \"reward\": 0.95},\n      {\"armId\": \"small\",    \"context\": [0.9, 0.2, 1.0], \"reward\": 0.2}\n    ]\n  }'\n```\n\nHere the context vector might encode `[task_difficulty, latency_budget, needs_reasoning]`. The model that wins on an easy, latency-sensitive task is not the one that wins on a hard reasoning task — LinUCB learns that mapping from history instead of you maintaining a brittle `if difficulty > 0.7` ladder by hand. This is the honest version of \"let the agent pick which model to call\": don't have the LLM introspect about cost/quality tradeoffs in a prompt — give it a learner.\n\nThe bandit is one of ~20 algorithms in the same server — forecasting (ARIMA / Holt-Winters), anomaly detection, linear/MIP optimization (HiGHS), Monte Carlo, PageRank/graph analysis, CMA-ES, conformal scoring. 
Same pattern every time: the agent describes the problem, a deterministic solver returns an answer you can check.\n\n`curl https://oraclaw-api.onrender.com/api/v1/health` (lists every endpoint).\n\n```\nclaude mcp add oraclaw -- npx -y @oraclaw/mcp-server\n```\n\nThe free MCP tools need no key.\n\n```\ncurl -s -X POST https://oraclaw-api.onrender.com/api/v1/auth/signup \\\n  -H \"Content-Type: application/json\" -d '{\"email\":\"you@example.com\"}'\n```\n\nIf you outgrow the free tier, higher limits start at $9/mo — [direct checkout here](https://oraclaw-api.onrender.com/api/v1/billing/checkout?tier=starter). But you can do everything in this post for $0.\n\nIf your agent is making decisions, make them ones you can verify. Stop asking the model to eyeball the math — route it to something that gets it provably right.\n\n*Run the demo, then tell me in the comments what your agent was eyeballing that it shouldn't have been.*", "url": "https://wpnews.pro/news/stop-letting-your-ai-agent-eyeball-a-b-picks-wire-in-a-real-contextual-bandit-no", "canonical_source": "https://dev.to/whatsonyourmind/stop-letting-your-ai-agent-eyeball-ab-picks-wire-in-a-real-contextual-bandit-via-mcp-free-no-gi1", "published_at": "2026-06-24 06:33:45+00:00", "updated_at": "2026-06-24 06:43:48.100091+00:00", "lang": "en", "topics": ["machine-learning", "ai-agents", "developer-tools", "large-language-models"], "entities": ["OraClaw", "UCB1", "MCP", "Claude Code"], "alternates": {"html": "https://wpnews.pro/news/stop-letting-your-ai-agent-eyeball-a-b-picks-wire-in-a-real-contextual-bandit-no", "markdown": "https://wpnews.pro/news/stop-letting-your-ai-agent-eyeball-a-b-picks-wire-in-a-real-contextual-bandit-no.md", "text": "https://wpnews.pro/news/stop-letting-your-ai-agent-eyeball-a-b-picks-wire-in-a-real-contextual-bandit-no.txt", "jsonld": "https://wpnews.pro/news/stop-letting-your-ai-agent-eyeball-a-b-picks-wire-in-a-real-contextual-bandit-no.jsonld"}}