Only three AI models finished above starting capital in a 500-day startup survival test

Princeton University researchers developed CEO-Bench, a 500-day startup survival test for AI agents, finding that only three AI models finished above starting capital while a simple rule-based heuristic outperformed most. The benchmark simulates running a fictional software company, requiring long-term strategic decisions under uncertainty, highlighting a gap in current AI capabilities for complex, real-world tasks.

Researchers at Princeton University built CEO-Bench, a test where AI agents have to run a fictional software company for 500 simulated days. Most current models go broke, and a simple rule-based heuristic with no AI beats nearly all of them.

AI agents are getting increasingly good at narrow tasks: fixing a bug, following a service policy in a conversation, or completing a web-based workflow. These tasks share a simple structure, according to the Princeton study: the agent gets a clear goal, acts briefly, and receives quick feedback.

Many important real-world tasks look nothing like that. They involve long chains of decisions under uncertainty, where you have to set priorities, allocate limited resources, read noisy signals, and adapt to changing conditions. To test exactly these skills, the researchers developed CEO-Bench (https://arxiv.org/abs/2606.18543). The benchmark simulates a realistic example of this kind of long-horizon task: running a startup for 500 simulated days.

The researchers point to a famous example: in 1997, Apple was 90 days from bankruptcy. Steve Jobs drew a simple two-by-two grid (consumer and pro, desktop and portable) and decided Apple would only build products for those four quadrants. The iMac, iPod, and iPhone followed.

This type of strategic steering intelligence is fundamentally different from what AI agents do today, the authors argue. Agents are getting better at individual tasks fast. But steering an entire organization toward long-term goals? That's a different problem entirely. CEO-Bench is a first attempt at measuring exactly this "steering intelligence."

An AI CEO for a fictional software company

In CEO-Bench, an agent runs a made-up subscription software company called NovaMind. It starts with zero customers and one million dollars in the bank. Performance is measured by remaining cash at the end.
If the balance drops below zero even once, the company is bankrupt and the simulation ends.

The agent controls the company through a Python API with 34 tools and a database of 19 tables. Instead of just issuing individual commands, it writes its own code, queries the database with SQL, and builds custom workflows from the results. That puts it in front of the same challenges a human CEO would face, the researchers say.

There's a lot to decide: pricing and tiers, ad spend across channels, product quality and R&D, infrastructure capacity and customer support, plus multi-round negotiations with enterprise clients. On top of that, there's a simulated social network where the agent can read complaints, competitor news, and economic trends, and publish posts of its own.

Delayed feedback and hidden variables make the test hard

What makes the task hard is time and uncertainty. Decisions play out on realistic business timelines: revenue only arrives at billing dates, R&D projects take days to weeks, and mistakes often don't show up until later through churn or damaged reputation. Costs hit right away. The agent has to spend money whose payoff might not show up for weeks.

Much of the company's state stays hidden. The agent can't directly see customer satisfaction, willingness to pay, or minimum quality expectations. It has to piece these together from noisy signals like cancellations, support tickets, or reactions on the social network. The simulation models 26 customer segments and individual customers, each with their own budgets, price sensitivities, and expectations.

The world keeps changing, too. Competitors periodically raise customer quality expectations, preferences shift over time, and a simulated business cycle affects demand and willingness to pay, so the agent has to keep adjusting.

The researchers deliberately chose fixed, transparent rules rather than a language model as referee.
They wanted to avoid a weakness they see in Vending-Bench (https://andonlabs.com/evals/vending-bench-2), a test with a simulated vending machine (https://the-decoder.de/anthropic-experiment-zeigt-staerkere-ki-modelle-verhandeln-bessere-preise-ohne-dass-es-jemand-merkt/): there, an AI-simulated supplier can reward an agent for unrealistic verbal promises.

Most models go bankrupt

Of fourteen tested models, most fail the task. Nearly all can generate valid commands and database queries, but none can maintain a coherent strategy over time. Many go bankrupt before the simulation ends.

Only three models finish their best run above the starting capital of one million dollars: Claude Fable 5 at $47.15 million, Claude Opus 4.8 at $27.8 million, and GPT-5.5 at $21.3 million. Claude Fable 5 is the only model that lands above starting capital in more than one run. There's a caveat, though. One Fable 5 run aborted because the model refused to continue, and in the other two, some requests fell back to Opus 4.8. GPT-5.5 went bankrupt in two of its three runs.

The most telling comparison is with a simple rule-based heuristic that never calls a language model at all. It sets fixed prices, quotas, and tiers, focuses advertising and targeted development on a small set of customer segments, and adjusts capacity based on recent usage. This heuristic reaches $15.76 million, beating every model except Fable 5, Opus 4.8, and GPT-5.5.

The researchers also roughly estimate the upper bound of achievable final cash at around $2.2 billion. Even the best agents fall far short. The test is nowhere near maxed out, the authors say.

Exploration beats caution

Analyzing the decision trajectories reveals clear behavioral differences. GPT-5.5 and Claude Opus 4.8 keep trying new strategies as conditions change, whether that means ramping up customer acquisition, adjusting tiers, or shifting support and R&D budgets.
Claude Opus 4.7, by contrast, mostly responds to setbacks by cutting costs and preserving cash. This passive approach lets the model survive to the end but prevents it from turning a profit.

Interestingly, Opus 4.8 and GPT-5.5 reach similar final results through very different paths: Opus 4.8 acquires more customers early on but drops to zero customers mid-simulation, while GPT-5.5 holds its customer base throughout. Both write surprisingly sophisticated code. Opus 4.8 builds its own internal simulation that models customer cohorts to predict future cash flow. GPT-5.5 digs through negotiation history in the database to uncover hidden customer preferences.

The researchers measure four capabilities that correlate with success:

- uncovering hidden information, like which ad channel works best for a given customer segment,
- predicting the future, measured by error in four-week cash forecasts,
- adapting quickly to change, measured by how fast a model notices a competitor's move,
- and planning ahead, measured partly by how often if-then scenarios appear in the agent's notes.

On all four points, Opus 4.8 and GPT-5.5 score above the average of the other models.

The tool environment matters too

Another finding concerns the software environment agents use to act. The researchers also tested Claude Opus 4.7 with Claude Code and GPT-5.5 with Codex, two popular coding assistants. In both cases, the agents acted far less often and performed worse. The researchers suspect the system prompts in these tools, which are tuned for software development, are the cause.

Shortening the time horizon doesn't solve the problem either. When the simulation is compressed to 50 days, only GPT-5.5 manages to finish with a profit. Most models, the researchers conclude, remain weak at coordinating decisions even toward a short-term goal.

The authors acknowledge limits in their setup.
The product is represented by a single quality score because they found no reliable way to evaluate qualitative product changes. Compliance, security, and fundraising are left out to keep each run economically feasible. Still, CEO-Bench exposes a gap between the local tool competence of today's models and the ability to connect actions over long time horizons into a coherent strategy, they say.
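
To make the delayed-feedback problem concrete, here is a minimal Python sketch of the cash rule as this article describes it: costs are charged immediately, revenue arrives only on billing dates, and a balance below zero on any day ends the run. This is a toy illustration, not CEO-Bench's actual code or API. The function name `simulate`, the `RunResult` type, the 30-day billing interval, and the cost and revenue figures are all assumptions chosen for the example; only the one-million-dollar starting capital and the 500-day horizon come from the article.

```python
from dataclasses import dataclass
from typing import Optional

STARTING_CASH = 1_000_000  # starting capital reported in the article
HORIZON_DAYS = 500         # simulated days reported in the article


@dataclass
class RunResult:
    final_cash: float
    bankrupt_on_day: Optional[int]  # None if the company survived


def simulate(daily_cost: float, revenue_per_cycle: float,
             billing_interval: int = 30,  # assumed; not from the article
             starting_cash: float = STARTING_CASH,
             horizon: int = HORIZON_DAYS) -> RunResult:
    """Toy ledger: costs hit every day, revenue only on billing days."""
    cash = starting_cash
    for day in range(1, horizon + 1):
        cash -= daily_cost                 # costs are charged right away
        if day % billing_interval == 0:
            cash += revenue_per_cycle      # revenue arrives only at billing dates
        if cash < 0:                       # below zero even once ends the run
            return RunResult(final_cash=cash, bankrupt_on_day=day)
    return RunResult(final_cash=cash, bankrupt_on_day=None)


# Hypothetical plan A: modest spending, modest revenue.
cautious = simulate(daily_cost=10_000, revenue_per_cycle=350_000)

# Hypothetical plan B: earns more per cycle than it spends,
# but runs out of cash before the first billing date.
aggressive = simulate(daily_cost=40_000, revenue_per_cycle=1_300_000)

print(cautious)
print(aggressive)
```

In this toy setup, plan B would earn more per billing cycle than plan A, yet it goes bankrupt on day 26 because the first payment only arrives on day 30. That mirrors the article's point that the agent has to spend money whose payoff might not show up for weeks.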