{"slug": "which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out", "title": "Which LLM is the best stock picker? I built a benchmark to find out.", "summary": "A benchmark called \"1rok\" that evaluates seven frontier large language models (LLMs) as stock pickers, giving each $100,000 in paper capital and identical tools to select stocks weekly. The project, which started on January 20, 2026, aims to test models on decision-making under uncertainty rather than on standard coding or math benchmarks, with results tracked on a live leaderboard. The author notes that the goal is not to beat the S&P 500 but to provide an objective, downstream task for comparing LLM performance in planning, tool use, and committing to decisions.", "body_md": "Every other week there's a new GPT-vs-Claude-vs-Gemini benchmark on coding or math or reasoning. None of them tell you whether the model can actually make a decision under uncertainty, where the answer isn't in the training data and the result shows up two weeks later in a P&L.\nSo I built a different kind of eval. Seven frontier LLMs, $100,000 of paper capital each, identical tools, identical prompts, identical data. Every Monday they pick stocks. The market grades them.\nThe project is 1rok. Live leaderboard: investingbench.vercel.app. The clock started January 20, 2026.\nEach model gets its own isolated Alpaca paper account. Same tool registry, same prompts, same screener output. The LLM is the only variable.\nI want to be upfront. I don't think a weekly LLM-driven portfolio is going to beat the S&P. If it could, hedge funds would already be doing it. Some are; results so far are mixed.\nThe point of 1rok isn't alpha. It's that \"which model should I use\" is the most-asked question in AI engineering, and most of the answers are vibes. Coding evals are saturated. Math benchmarks get gamed. 
I wanted a downstream task where the model has to plan, call tools, synthesize conflicting signals, and commit to a decision, with an objective scoreboard at the end.\nStock picking happens to fit. The fact that everyone has an opinion about it is a bonus.\nEvery Monday at 9:45 ET, a cron fires and kicks off one run per model in parallel. Each run is 10 agents in 4 stages:\nWalking through it:\nThe composite formula lives in one place and looks like this:\ncomposite =\nfundamental * 0.20 // business quality\n+ valuation * 0.20 // price discipline\n+ (100 - risk) * 0.20 // capital preservation (inverted)\n+ technical * 0.15\n+ catalyst * 0.15\n+ sentiment * 0.10\nRisk is inverted on purpose. A high-conviction buy with high tail risk should be smaller, not bigger. The constructor caps any single position at 40%, holds at most 8 names, and won't let cash run above 15%.\nHow agents actually get data. Each pipeline run spins up its own tool registry. There are ~32 tools across 8 groups: market overview, stock data, screening, technicals, options, earnings, portfolio, web search. An agent calls `listTools` to see its slice (the Macro agent gets different tools than the Risk agent), then `callTool(name, args)` returns typed JSON from a handler that knows how to talk to Alpaca, Yahoo Finance, FRED, or Tavily. Retries, rate limits, and circuit breaking live in the handler layer, so agents never have to deal with a 429 or a flaky socket mid-thought.\nTwo commands, never one. `run` produces a portfolio-construction JSON artifact. `execute` reads the artifact and places orders. They're always separate.\nbun run 1rok -- run --model gpt-5.5\nbun run 1rok -- execute ./results/openai/gpt-5.5/portfolio-2026-04-16.json\n`run` never touches a broker. `--live` is the only path to real order placement; without it, everything goes to `paper-api.alpaca.markets`. 
This means I can re-run any model on last week's data without accidentally trading, and I can audit exactly what the model decided before a single order leaves the box.\nOpen questions I'm watching:\nI don't have answers yet. The whole experiment is about not having answers yet.\n`bun run 1rok -- run --model <id>`.\nStar the repo if you want milestones. I'll write up findings as the leaderboard separates.", "url": "https://wpnews.pro/news/which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out", "canonical_source": "https://dev.to/achaljhawar/which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out-1hco", "published_at": "2026-05-20 19:01:47+00:00", "updated_at": "2026-05-20 19:34:06.102019+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "research", "products"], "entities": ["1rok", "Alpaca", "S&P", "GPT", "Claude", "Gemini"], "alternates": {"html": "https://wpnews.pro/news/which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out", "markdown": "https://wpnews.pro/news/which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out.md", "text": "https://wpnews.pro/news/which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out.txt", "jsonld": "https://wpnews.pro/news/which-llm-is-the-best-stock-picker-i-built-a-benchmark-to-find-out.jsonld"}}