Which LLM is the best stock picker? I built a benchmark to find out.

1rok is a benchmark that evaluates seven frontier large language models (LLMs) as stock pickers, giving each $100,000 in paper capital and identical tools to select stocks weekly. The project, which started on January 20, 2026, tests models on decision-making under uncertainty rather than on standard coding or math benchmarks, with results tracked on a live leaderboard. The author notes that the goal is not to beat the S&P 500 but to provide an objective downstream task for comparing how LLMs plan, use tools, and commit to decisions.

Every other week there's a new GPT-vs-Claude-vs-Gemini benchmark on coding or math or reasoning. None of them tell you whether the model can actually make a decision under uncertainty, where the answer isn't in the training data and the result shows up two weeks later in a P&L.

So I built a different kind of eval. Seven frontier LLMs, $100,000 of paper capital each, identical tools, identical prompts, identical data. Every Monday they pick stocks. The market grades them.

The project is 1rok. Live leaderboard: investingbench.vercel.app. The clock started January 20, 2026. Each model gets its own isolated Alpaca paper account. Same tool registry, same prompts, same screener output. The LLM is the only variable.

I want to be upfront. I don't think a weekly LLM-driven portfolio is going to beat the S&P. If it could, hedge funds would already be doing it. Some are; results so far are mixed.

The point of 1rok isn't alpha. It's that "which model should I use" is the most-asked question in AI engineering, and most of the answers are vibes. Coding evals are saturated. Math benchmarks get gamed. I wanted a downstream task where the model has to plan, call tools, synthesize conflicting signals, and commit to a decision, with an objective scoreboard at the end. Stock picking happens to fit. The fact that everyone has an opinion about it is a bonus.

Every Monday at 9:45 ET, a cron fires and kicks off one run per model in parallel. Each run is 10 agents in 4 stages.

The composite formula lives in one place and looks like this:

    composite =
        fundamental  * 0.20   // business quality
      + valuation    * 0.20   // price discipline
      + (100 - risk) * 0.20   // capital preservation (inverted)
      + technical    * 0.15
      + catalyst     * 0.15
      + sentiment    * 0.10

Risk is inverted on purpose. A high-conviction buy with high tail risk should be smaller, not bigger. The constructor caps any single position at 40%, holds at most 8 names, and won't let cash run above 15%.

How agents actually get data.
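The pattern this section describes, a per-run registry where each agent sees only its own slice of tools, can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: only listTools and callTool come from the post. The ToolRegistry class, the group names, the tool names (getMarketOverview, getRsi), and the stub handlers are hypothetical, and handlers are synchronous here for brevity where real ones would presumably be async.

```typescript
// Hypothetical sketch: groups, tool names, and handlers are illustrative stubs.
type ToolGroup = "market" | "technicals" | "portfolio";

interface Tool {
  name: string;
  group: ToolGroup;
  // A real handler would wrap retries, rate limits, and circuit breaking.
  handler: (args: Record<string, unknown>) => unknown;
}

class ToolRegistry {
  private tools = new Map<string, Tool>();
  private allowed: Set<ToolGroup>;

  constructor(allowedGroups: ToolGroup[]) {
    this.allowed = new Set(allowedGroups);
  }

  register(tool: Tool): void {
    this.tools.set(tool.name, tool);
  }

  // An agent only sees the tools in its allowed groups.
  listTools(): string[] {
    return [...this.tools.values()]
      .filter((t) => this.allowed.has(t.group))
      .map((t) => t.name);
  }

  // Calls outside the agent's slice are rejected rather than executed.
  callTool(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool || !this.allowed.has(tool.group)) {
      throw new Error(`Tool not available to this agent: ${name}`);
    }
    return tool.handler(args);
  }
}

// Builds a registry with the same stub tools but a different visible slice.
function buildRegistry(allowedGroups: ToolGroup[]): ToolRegistry {
  const registry = new ToolRegistry(allowedGroups);
  registry.register({
    name: "getMarketOverview",
    group: "market",
    handler: () => ({ regime: "stub" }),
  });
  registry.register({
    name: "getRsi",
    group: "technicals",
    handler: (args) => ({ symbol: args.symbol, rsi: 50 }),
  });
  return registry;
}

const macroAgent = buildRegistry(["market"]);
const riskAgent = buildRegistry(["market", "technicals"]);
console.log(macroAgent.listTools()); // [ "getMarketOverview" ]
console.log(riskAgent.listTools()); // [ "getMarketOverview", "getRsi" ]
```

Enforcing the slice inside callTool, and not only in listTools, is a design choice of this sketch: a model that guesses a tool name it was never shown gets an explicit error instead of data it should not have.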
Each pipeline run spins up its own tool registry. There are ~32 tools across 8 groups: market overview, stock data, screening, technicals, options, earnings, portfolio, web search. An agent calls listTools to see its slice (the Macro agent gets different tools than the Risk agent), then callTool(name, args) returns typed JSON from a handler that knows how to talk to Alpaca, Yahoo Finance, FRED, or Tavily. Retries, rate limits, and circuit breaking live in the handler layer, so agents never have to deal with a 429 or a flaky socket mid-thought.

Two commands, never one.

run produces a portfolio-construction JSON artifact. execute reads the artifact and places orders. They're always separate.

    bun run 1rok -- run --model gpt-5.5
    bun run 1rok -- execute ./results/openai/gpt-5.5/portfolio-2026-04-16.json

run never touches a broker. --live is the only path to real order placement; without it, everything goes to paper-api.alpaca.markets. This means I can re-run any model on last week's data without accidentally trading, and I can audit exactly what the model decided before a single order leaves the box.

Open questions I'm watching:

I don't have answers yet. The whole experiment is about not having answers yet.

    bun run 1rok -- run --model <id>

Star the repo if you want milestones. I'll write up findings as the leaderboard separates.
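
For readers who want to check the scoring arithmetic, the composite formula and constructor limits from earlier in the post can be sketched in TypeScript. The weights, the inverted risk term, and the three limits (40% per position, 8 names, 15% cash) are taken from the post. The 0-100 score scale is an assumption inferred from the 100 - risk term, and the names Scores, composite, and constraintViolations are hypothetical, not the project's identifiers.

```typescript
// Assumption: every sub-score is on a 0-100 scale, as the `100 - risk` term suggests.
interface Scores {
  fundamental: number;
  valuation: number;
  risk: number;
  technical: number;
  catalyst: number;
  sentiment: number;
}

// Weights from the post; they sum to 1.0.
const WEIGHTS = {
  fundamental: 0.2,
  valuation: 0.2,
  risk: 0.2,
  technical: 0.15,
  catalyst: 0.15,
  sentiment: 0.1,
};

// Risk is inverted: a higher risk score lowers the composite.
function composite(s: Scores): number {
  return (
    s.fundamental * WEIGHTS.fundamental +
    s.valuation * WEIGHTS.valuation +
    (100 - s.risk) * WEIGHTS.risk +
    s.technical * WEIGHTS.technical +
    s.catalyst * WEIGHTS.catalyst +
    s.sentiment * WEIGHTS.sentiment
  );
}

// Checks target weights (fractions of equity) against the constructor limits.
// Whatever is not allocated to a position is treated as cash.
function constraintViolations(weights: Record<string, number>): string[] {
  const problems: string[] = [];
  const values = Object.values(weights);
  const cash = 1 - values.reduce((a, b) => a + b, 0);
  if (values.length > 8) problems.push("more than 8 names");
  if (values.some((w) => w > 0.4)) problems.push("position above 40%");
  if (cash > 0.15 + 1e-9) problems.push("cash above 15%");
  return problems;
}

// Two otherwise identical stocks that differ only in risk.
const base = { fundamental: 70, valuation: 70, technical: 70, catalyst: 70, sentiment: 70 };
console.log(composite({ ...base, risk: 20 }).toFixed(1)); // 72.0
console.log(composite({ ...base, risk: 80 }).toFixed(1)); // 60.0

console.log(constraintViolations({ AAA: 0.5, BBB: 0.2 }));
// [ "position above 40%", "cash above 15%" ]
```

With every other score held at 70, moving risk from 20 to 80 costs 12 composite points, which is the inversion doing its job.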