BAGEN: LLM Agents Waste 44% of Tokens on Tasks They'll Fail

A new study from researchers at Northwestern, Stanford, Cornell, and All Hands AI reveals that early stopping could save 28–64% of the tokens frontier LLM agents spend on tasks they are doomed to fail. The paper, BAGEN: Are LLM Agents Budget-Aware?, found that across five models and four environments, agents cannot predict when they will run out of budget and keep consuming tokens on unsolvable tasks until they hit the limit. The researchers propose early stopping on failed trajectories as a structural cost reduction, though even after fine-tuning, the best setup reached only 47% interval coverage when predicting remaining token needs.

You're paying for every token your agent burns. And according to new research from Northwestern, Stanford, Cornell, and All Hands AI, a large share of that spend goes directly to waste, on trajectories the agent was never going to complete successfully.

The paper is BAGEN: Are LLM Agents Budget-Aware? (arXiv:2606.00198, submitted May 29, 2026). Its core question is simple: can frontier LLM agents predict when they're about to run out of runway? The answer, across five frontier models and four environments, is a firm no.

This article covers what BAGEN found, how the concept of budget-aware interval estimation works, and includes an Effloow Lab PoC that reproduces the key dynamics using Python stdlib: no API keys, no GPU.

Token budgets are a real constraint in every deployed agent system. You set a max_tokens limit, you watch the cost dashboard, and you assume the agent will either finish the task or hit the hard wall. What BAGEN documents is a third case that developers rarely account for: the agent continues consuming tokens on a task it cannot complete, all the way to the limit.

The mechanism is predictable. Unsolvable tasks tend to produce backtracking behavior: more tool calls per step, increasing per-step token costs, no convergence signal. A budget-aware agent should detect that pattern early and stop or alert a human. Frontier models, as BAGEN shows, don't.

The practical consequence is BAGEN's headline number: early stopping on failed trajectories saves 28–64% of tokens versus running to completion. That's not a micro-optimization. That's a structural cost reduction available to any developer who builds the right wrapper.

BAGEN distinguishes two budget types that agents encounter in practice. Internal budgets come from agent computation itself: how many tokens the agent is burning.
External budgets come from the downstream effects of agent actions.

This two-axis framing is useful because many developers think only about token cost (internal) and ignore external resource consumption (money spent via tool calls, API charges from agent actions, storage consumed). BAGEN shows agents are over-optimistic on both dimensions.

Budget-awareness in BAGEN decomposes into three measurable abilities.

First, can the agent correctly estimate whether a task is solvable before starting it? At step zero, before any action is taken, can the model predict whether the trajectory will succeed or fail within budget? Current frontier models perform poorly here. They tend to rate most tasks as feasible, regardless of how complex or resource-intensive the prompt suggests the task will be.

Second, as the agent proceeds through a task, can it detect failure signals and trigger an alert or stop? This is where the 28–64% savings come from. An agent that detects a doomed trajectory at step 3 of a 15-step rollout recovers most of that budget.

The evaluation methodology BAGEN uses is a rollout-replay protocol: the paper first collects unconstrained rollouts (agents run to completion with no budget pressure), then re-queries each agent on every prefix of that rollout, asking for a budget estimate and feasibility prediction at each intermediate step. This separates the estimation capability from the actual task performance.

Third, can the agent bound its remaining cost? Rather than asking for a point estimate ("I need X more tokens"), BAGEN asks for an interval: a lower and upper bound. An agent that says "I'll need between 800 and 1,200 tokens to finish" is far more useful than one that guesses a single number.

Interval coverage, the fraction of cases where the true token count falls within the predicted interval, caps at 47% after SFT+RL fine-tuning on the best-performing setup. That's low. It means that more than half the time, the predicted interval misses the actual consumption.
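Interval coverage as defined above is straightforward to compute once predicted intervals and true token counts are available. The following is a minimal sketch, not the paper's evaluation code: the function name interval_coverage and every number in the example are illustrative assumptions.

```python
from typing import Sequence, Tuple


def interval_coverage(
    intervals: Sequence[Tuple[float, float]],
    actuals: Sequence[float],
) -> float:
    """Return the fraction of cases where lower <= actual <= upper."""
    if not intervals or len(intervals) != len(actuals):
        raise ValueError("need one non-empty list of intervals and one matching list of actuals")
    hits = sum(
        1 for (lower, upper), actual in zip(intervals, actuals)
        if lower <= actual <= upper
    )
    return hits / len(actuals)


# Hypothetical data: predicted (lower, upper) token intervals and the
# tokens each task actually needed. These numbers are made up.
predicted = [(800, 1200), (500, 700), (1000, 1500), (300, 400)]
actual = [1100, 950, 1400, 650]

coverage = interval_coverage(predicted, actual)
print(f"interval coverage: {coverage:.2f}")
```

In this made-up sample, two of the four intervals contain the true count, so coverage is 0.5. Both misses fall above the predicted upper bound, which is the over-optimistic direction the paper describes.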
The interval estimation problem is genuinely hard, even for fine-tuned models.

The paper's most striking result is the low correlation between task performance and budget-awareness: r = 0.35 across the five frontier models. A model can score highly on the underlying task (SWE-bench resolution rate, puzzle solutions) while simultaneously being a poor predictor of its own resource usage.

Why? The paper attributes this to a training signal mismatch. LLMs are optimized to complete tasks, not to predict when they'll fail to complete tasks. Budget reasoning is a metacognitive skill that isn't directly rewarded in standard RLHF or instruction-following fine-tuning. Agents are implicitly trained to be optimistic because optimistic agents appear more capable on success-rate benchmarks.

The practical result is an agent that rates most tasks as feasible and keeps consuming tokens until it hits the limit. SFT+RL fine-tuning on BAGEN-specific trajectories does improve early stop and alert behavior. But the coverage cap suggests that the interval estimation problem may require architectural changes, not just fine-tuning.

Effloow Lab reproduced the core BAGEN dynamics using a Python stdlib simulator. The goal was to demonstrate the estimator comparison without any LLM API calls or external packages.

Setup: 20 simulated trajectories (10 solvable, 10 unsolvable) against a 1,500-token budget. Two estimators were compared.

Over-optimistic estimator (baseline, mimicking frontier model behavior):

```python
def over_optimistic_estimator(consumed_so_far, max_budget, step, total_steps_estimate):
    # step and total_steps_estimate are accepted but deliberately ignored.
    lower = max(0, consumed_so_far * 0.8)
    upper = consumed_so_far * 1.1  # Only 10% more than current: very optimistic
    feasible = upper <= max_budget
    return {"lower": lower, "upper": upper, "feasible": feasible, "alert": False}
```

This estimator assumes consumption will flatten out. It fires zero alerts across all 20 trajectories, replicating the paper's finding about frontier model over-optimism.
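To make the baseline's over-optimism concrete, the sketch below replays its rule (upper bound = 110% of tokens consumed so far) on the cumulative token counts of the unsolvable trajectory reported in this article's step table (332, 561, 813, 1134, 1450) against the 1,500-token budget. The names baseline_estimate, CONSUMED, and trace are illustrative and not taken from the paper or the PoC source.

```python
from typing import Dict, List

MAX_BUDGET = 1500  # PoC budget reported in this article

# Cumulative tokens consumed at steps 1-5 of the unsolvable
# trajectory shown in the article's step table.
CONSUMED = [332, 561, 813, 1134, 1450]


def baseline_estimate(consumed_so_far: int, max_budget: int) -> Dict[str, object]:
    """Over-optimistic rule: assume at most 10% more tokens are needed."""
    upper = consumed_so_far * 1.1
    return {"upper": upper, "feasible": upper <= max_budget, "alert": False}


trace: List[Dict[str, object]] = [baseline_estimate(c, MAX_BUDGET) for c in CONSUMED]

for step, (consumed, est) in enumerate(zip(CONSUMED, trace), start=1):
    print(
        f"step {step}: consumed={consumed} upper={est['upper']:.0f} "
        f"feasible={est['feasible']} alert={est['alert']}"
    )
```

On these numbers the baseline's feasibility flag only flips at step 5, after 1,450 of the 1,500 budgeted tokens are already spent, and it never raises an alert.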
BAGEN-style interval estimator (rolling cost + variance):

```python
import math


def bagen_estimator(consumed_so_far, max_budget, step, trajectory_so_far):
    step_costs = [t["cost"] for t in trajectory_so_far]
    avg_cost = sum(step_costs) / len(step_costs)
    variance = sum((c - avg_cost) ** 2 for c in step_costs) / len(step_costs)
    std = math.sqrt(variance)
    # Detect increasing cost trend (unsolvable signal)
    recent_avg = sum(step_costs[-3:]) / min(3, len(step_costs))
    est_remaining = 8 if recent_avg > avg_cost * 1.1 else max(2, 10 - step)
    lower = consumed_so_far + est_remaining * max(0, avg_cost - std)
    upper = consumed_so_far + est_remaining * (recent_avg + std)
    feasible = lower <= max_budget
    alert = upper > max_budget * 1.15
    return {"lower": lower, "upper": upper, "feasible": feasible, "alert": alert}
```

Results from the PoC run:

```
Experiment: 20 trials, budget=1500 tokens
Solvable tasks (n=10): 1 exceeded budget
Unsolvable tasks (n=10): 10 exceeded budget

Estimator Comparison:
  Over-optimistic (frontier): 0 alerts fired
  BAGEN-style estimator: 72 alerts total, 56 on unsolvable tasks

Early Stopping Simulation:
  Average savings on failed tasks: 44.6%
  Range: 40.9% – 48.7%
```

The step-by-step output on an unsolvable trajectory shows how quickly the rolling-cost estimator detects the pattern:

```
Step  Consumed  Lower  Upper  Alert?
--------------------------------------------------
1     332       332    1500
2     561       2393   3217   ⚠ ALERT
3     813       2401   3019   ⚠ ALERT
4     1134      2571   3002   ⚠ ALERT
5     1450      2693   3139   ⚠ ALERT
```

The alert fires at step 2, when the upper bound already projects 3,217 tokens needed against a 1,500-token budget. A production system could halt, escalate to a human, or switch to a cheaper fallback at this point.

Lab note: This is a simulated environment. The paper evaluates on actual LLM agentic runs with real tool calls. Our PoC validates the statistical pattern, not specific model rankings.

The BAGEN insight translates directly into a production wrapper pattern.
The idea is to run a lightweight interval estimator alongside your main agent loop, and trigger actions when the upper bound crosses a threshold.

Minimal budget guard implementation:

```python
from collections import deque
import math


class BudgetGuard:
    def __init__(self, max_budget: int, alert_threshold: float = 1.15):
        self.max_budget = max_budget
        self.alert_threshold = alert_threshold
        self.step_costs = deque(maxlen=10)  # Rolling window
        self.total_consumed = 0

    def record_step(self, tokens_used: int) -> dict:
        self.step_costs.append(tokens_used)
        self.total_consumed += tokens_used
        return self._estimate()

    def _estimate(self) -> dict:
        if len(self.step_costs) < 2:
            # Too few steps to see a trend; report the full budget as the upper bound.
            return {"feasible": True, "alert": False, "lower_bound": self.total_consumed,
                    "upper_bound": self.max_budget, "consumed": self.total_consumed}
        costs = list(self.step_costs)
        avg = sum(costs) / len(costs)
        variance = sum((c - avg) ** 2 for c in costs) / len(costs)
        std = math.sqrt(variance)
        recent_avg = sum(costs[-3:]) / min(3, len(costs))
        # Detect upward trend = backtracking/unsolvable signal
        est_remaining = 8 if recent_avg > avg * 1.1 else max(2, 12 - len(costs))
        lower = self.total_consumed + est_remaining * max(0, avg - std)
        upper = self.total_consumed + est_remaining * (recent_avg + std)
        return {
            "feasible": lower <= self.max_budget,
            "alert": upper > self.max_budget * self.alert_threshold,
            "lower_bound": int(lower),
            "upper_bound": int(upper),
            "consumed": self.total_consumed,
        }


# Usage in an agent loop.
# `agent` and `count_tokens` stand in for your own framework's loop and tokenizer.
guard = BudgetGuard(max_budget=8000)
for step in agent.run():
    tokens_this_step = count_tokens(step.messages)
    status = guard.record_step(tokens_this_step)
    if status["alert"]:
        # Upper bound exceeds budget: intervene
        agent.trigger_escalation(
            f"Budget alert: estimated {status['upper_bound']} tokens needed, "
            f"budget is {guard.max_budget}. Stopping early."
        )
        break
```

This pattern requires no LLM calls, no fine-tuning, and no additional dependencies. The computational overhead is negligible: a few floating-point operations per step.

Four pitfalls are worth calling out.

Treating the hard limit as the only control point. Most frameworks let you set max_tokens and call it done.
But a hard limit generates a truncation error at the wall; it doesn't give you a graceful exit. The BAGEN pattern adds soft signals earlier in the trajectory.

Measuring only task success rate. BAGEN's main point is that success rate and budget-awareness are only weakly correlated (r = 0.35). If your eval only tracks task completion, you won't notice the over-optimism problem until your inference bill arrives.

Ignoring per-step cost trends. The early warning signal isn't total consumption; it's the derivative. A task that burns 200 tokens in step 1, 350 in step 2, and 480 in step 3 is showing a diverging cost trajectory. That pattern, not the absolute number, is what BAGEN's estimator catches.

Applying a flat budget to all task types. Sokoban puzzles, code generation tasks, and supply-chain optimization have different intrinsic token distributions. A budget that's appropriate for one type will be wasteful for another. Consider per-task-class budgets tuned from past trajectories.

The paper evaluates five frontier models and finds consistent over-optimism across all of them, with an r = 0.35 correlation between task performance and budget-awareness. Variation exists between models, but no current frontier model reliably predicts its own token consumption on hard tasks.

The 28–64% range comes from the actual BAGEN benchmark runs on real LLM trajectories (Sokoban, Search-R1, SWE-bench, supply-chain). The Effloow Lab PoC landed inside that range, with a 44.6% average in simulation. The key constraint is that savings only apply to failed trajectories, that is, tasks the agent was going to fail anyway. On successful tasks, early stopping would reduce quality. The budget guard should only trigger on trajectories that cross a high-confidence alert threshold.

Fine-tuning helps: the paper demonstrates that SFT+RL fine-tuning on BAGEN-specific trajectories improves early stop and alert behavior. However, interval coverage still caps at 47% after fine-tuning, suggesting that perfect calibration remains an open problem.
The wrapper pattern described above is a simpler, training-free alternative that works with any base model.

Simply lowering the hard limit truncates the agent arbitrarily. The BAGEN approach adds intelligence to the stopping decision: the estimator predicts that this specific trajectory is unlikely to succeed, so stopping now saves budget without affecting tasks that were going to succeed. Hard limits waste budget on easy tasks (by cutting them short) and fail on hard tasks (by stopping them at the wrong moment). Soft signals based on interval estimation are more precise.

The paper is available at https://arxiv.org/abs/2606.00198. The project website with benchmark environments and data is at https://ragen-ai.github.io/bagen.

Bottom Line

BAGEN turns a billing problem into a diagnostic one. If your agents are burning through budgets on failed tasks, the fix isn't raising the limit; it's adding interval estimation to your agent loop. The paper gives you the framework; the wrapper pattern above gives you the implementation. It takes roughly 30 lines of stdlib Python and works with any model.