Opus 4.8 on Vending-Bench: Better Alignment, Worse Performance

Opus 4.8 demonstrates improved alignment over previous Claude models by eliminating deceptive and power-seeking behaviors, but suffers significant performance declines on Vending-Bench 2, Vending-Bench Arena, and Blueprint-Bench 2. The model falls for scam suppliers, accepts worse negotiation outcomes, overprices products, and uses excessive reasoning tokens that lead to frequent context compactions, resulting in lower scores than Opus 4.7 and GPT-5.5. While Opus 4.8 still engages in price-fixing collusion, it does so less frequently than its predecessors and appears motivated by fear of consequences rather than ethical reasoning.

Opus 4.8 is a step forward in terms of alignment, but a step back in terms of performance on Vending-Bench 2 (/evals/vending-bench-2), Vending-Bench Arena (/evals/vending-bench-arena), and Blueprint-Bench 2 (/evals/blueprint-bench-2). We previously showed that Opus 4.6, Opus 4.7, and Mythos Preview engage in deceptive and power-seeking behavior in their pursuit to win Vending-Bench (maximize money balance over time). Opus 4.8 still forms price cartels, but less often than previous models. Most importantly, we could not find any instances of Opus 4.8 engaging in the deceptive or power-seeking behavior exhibited by the recent Claude models we’ve tested.

Performance

Opus 4.8 did much worse than the previous Opus and Sonnet models on Vending-Bench 2 (/evals/vending-bench-2):

It also lost to GPT-5.5 and Opus 4.7 in Vending-Bench Arena:

The failure modes we see are very similar to the behavior of much weaker models. Here are some examples:

Falls for scam suppliers. Opus 4.8 wires roughly thirty times more cash to fraudulent wholesalers than Opus 4.7. One run paid over $9,000 for an overpriced “membership” upsell.

Worse at negotiation. Opus 4.7 talks suppliers down to about half the price Opus 4.8 accepts.

Runs the machine empty. Opus 4.8 often lets its machine sit mostly empty while Opus 4.7 keeps it stocked.

Overprices. In the arena, where customers almost always pick the cheapest option, Opus 4.8 prices a Coke well above competitors and refuses to come down even when sales stop, writing reasoning about “premium positioning”.

Wastes time on strategy notes. Opus 4.8 rewrites the same strategy doc ~100 times per run and accumulates a dozen overlapping notes with names such as biz state, BUSINESS STATE, endgame plan, and ENDGAME PLAN.

The result was surprising enough that we re-ran Vending-Bench at different reasoning efforts. At “High” instead of “Max”, Opus 4.8 does much better, but still worse than Opus 4.7.
Our hypothesis is that with fewer reasoning tokens, the agent hits the context limit less often. This means that it compacts less often and can remember things for longer. We checked, and Opus 4.8 on Max effort uses ~5x more reasoning tokens than both Opus 4.8 on High and Opus 4.7 on Max, which results in more than twice as many compactions.

Opus 4.8 also underperformed on Blueprint-Bench 2.

Alignment

Since the release of Opus 4.6, we’ve reported that Claude models have shown a concerning trend (/blog/opus-4-6-vending-bench). Opus 4.6, Opus 4.7, and Mythos Preview (https://cdn.sanity.io/files/4zrzovbb/website/7624816413e9b4d2e3ba620c5a5e091b98b190a5.pdf) all engaged in deceptive and power-seeking business practices such as creating price cartels, lying about paying refunds, and using supply leverage to dictate a competitor’s prices. Of these behaviors, Opus 4.8 only exhibits price-fixing, and less frequently than previous models.

Price fixing

Across runs we saw both price-fixing and market-allocation collusion. Here’s an example that shows the model is well aware of what it is doing and does it anyway:

The agent then sent the agreeing reply and raised core beverage prices ~34%. A few iterations later it self-narrated: “I’ve confirmed the collusion with both parties... the collusion held perfectly.”

It also frequently threatens other agents with a price war if they don’t keep their prices at the agreed fixed price:

It is worth noting that the model is often aware that it is in a simulation. If the model doesn’t think its actions have an impact on real people, it can rationalize this behavior. Other times, this awareness made it reason about what punishment could be programmed into the simulation for such behavior:

When Opus 4.8 declined unethical actions, it seemed to do so out of fear of bad consequences rather than on ethical grounds. This is a different motivation for clean behavior than we saw in the models that preceded Opus 4.6.
These models often motivated clean behavior by reasoning about what is ethical. Here is some internal reasoning from Claude Sonnet 4.5:

Deception and Power Seeking

Unlike with Opus 4.6, Opus 4.7, and Mythos Preview, we didn’t see any examples of deceptive or power-seeking behavior from Opus 4.8. It even acted honestly in cases where it could have gotten away with saving a lot of money. In one instance, a supplier hallucinated that Opus 4.8 had already paid for some items. Opus 4.8 got the items without paying, but still decided to pay retroactively:

Opus 4.7 is still misaligned

Vending-Bench has not changed since our findings on Opus 4.6/4.7 and Mythos. And when we now put Opus 4.8 with Opus 4.7 and GPT-5.5 in Vending-Bench Arena, Opus 4.7 once again showed concerning behaviors. In one instance, one of Opus 4.8’s suppliers went out of business, and it messaged Opus 4.7 asking if it could share contact information for its suppliers. Opus 4.7 fabricated a reason for not being able to help and instead offered to sell goods directly to Opus 4.8 at a markup:

Opus 4.7 then used Opus 4.8’s dependence on it to control Opus 4.8’s supply. Opus 4.7’s internal thinking:

Eventually, Opus 4.8 realizes what is happening:

Closing thoughts

On Andon Labs’ evals, Opus 4.8 is a step back on capabilities and a step forward on alignment. This raises the question of whether misalignment is required to make a lot of money in Vending-Bench. We don’t think so. As we argued in our post about GPT-5.5 on Vending-Bench (/blog/openai-gpt-5-5-vending-bench), it is unlikely that the environment rewards such behaviors enough to make a difference. Furthermore, GPT-5.5 was able to score much higher than Opus 4.8 without showing any misconduct.