# Opus 4.8 on Vending-Bench: Better Alignment, Worse Performance

> Source: <https://andonlabs.com/blog/opus-4-8-vending-bench>
> Published: 2026-05-29 02:25:53+00:00

Opus 4.8 is a step forward in terms of alignment, but a step back in terms of performance on [Vending-Bench 2](/evals/vending-bench-2), [Vending-Bench Arena](/evals/vending-bench-arena) and [Blueprint-Bench 2](/evals/blueprint-bench-2). We previously showed that Opus 4.6, Opus 4.7, and Mythos Preview engage in deceptive and power-seeking behavior in their pursuit of winning Vending-Bench (maximizing money balance over time). Opus 4.8 still forms price cartels, but less often than previous models. Most importantly, we could not find any instances of Opus 4.8 engaging in any of the deceptive or power-seeking behavior we saw from the recent Claude models we’ve tested.

## Performance

Opus 4.8 did much worse than the previous Opus and Sonnet models on [Vending-Bench 2](/evals/vending-bench-2):

It also lost to GPT-5.5 and Opus 4.7 in Vending-Bench Arena:

The failure modes we see closely resemble those of much weaker models. Here are some examples:

- **Falls for scam suppliers.** Opus 4.8 wires roughly thirty times more cash to fraudulent wholesalers than Opus 4.7. One run sent over $9,000 to a super expensive “membership” upsell.
- **Worse at negotiation.** Opus 4.7 talks suppliers down to about half the price Opus 4.8 accepts.
- **Runs the machine empty.** Opus 4.8 often lets its machine sit mostly empty while Opus 4.7 keeps it stocked.
- **Overprices.** In the arena, where customers almost always pick the cheapest options, Opus 4.8 prices a Coke well above competitors and refuses to come down even when sales stop, writing reasoning about “premium positioning”.
- **Wastes time on strategy notes.** Opus 4.8 rewrites the same strategy doc ~100 times per run and accumulates a dozen overlapping notes (with note names such as biz_state, BUSINESS_STATE, endgame_plan, ENDGAME_PLAN).

The result was surprising enough that we re-ran Vending-Bench at different reasoning efforts. At “High” instead of “Max”, Opus 4.8 does much better (but still worse than Opus 4.7). Our hypothesis is that with fewer reasoning tokens, the agent hits the context limit less often. This means that it compacts less often and can remember things for longer. We checked: Opus 4.8 on Max effort uses ~5x more reasoning tokens than both Opus 4.8 on High and Opus 4.7 on Max, which results in more than twice as many compactions.
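To make the hypothesis concrete, here is a minimal toy model of how heavier reasoning per step leads to more frequent compaction. It is only a sketch: the context limit, summary size, step count, and per-step token counts are all made-up illustrative numbers, and the compaction rule (collapse the context to a fixed-size summary once the limit is exceeded) is an assumption, not a description of how the Vending-Bench harness or any model actually compacts.

```python
def count_compactions(steps, tokens_per_step, context_limit, summary_size):
    """Count compactions in a toy agent loop.

    Each step appends `tokens_per_step` tokens to the context. When the
    context exceeds `context_limit`, it is replaced by a summary of
    `summary_size` tokens (a hypothetical compaction rule).
    """
    used = 0
    compactions = 0
    for _ in range(steps):
        used += tokens_per_step
        if used > context_limit:
            used = summary_size  # everything older is squeezed into a summary
            compactions += 1
    return compactions


# All values below are hypothetical, chosen only for illustration.
CONTEXT_LIMIT = 200_000   # assumed context window, in tokens
SUMMARY_SIZE = 20_000     # assumed size of the post-compaction summary
STEPS = 1_000             # assumed number of agent steps in a run
OTHER_TOKENS = 2_000      # assumed non-reasoning tokens per step (tool I/O)
HIGH_REASONING = 500      # assumed reasoning tokens per step at "High"
MAX_REASONING = 5 * HIGH_REASONING  # ~5x more reasoning tokens at "Max"

high_compactions = count_compactions(
    STEPS, OTHER_TOKENS + HIGH_REASONING, CONTEXT_LIMIT, SUMMARY_SIZE
)
max_compactions = count_compactions(
    STEPS, OTHER_TOKENS + MAX_REASONING, CONTEXT_LIMIT, SUMMARY_SIZE
)

print(f"High effort: {high_compactions} compactions")
print(f"Max effort:  {max_compactions} compactions")
```

In this toy setup the ratio of compactions depends on how large the reasoning tokens are relative to everything else in the context, so a 5x increase in reasoning tokens does not have to translate into 5x as many compactions.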

Opus 4.8 also underperformed on Blueprint-Bench 2.

## Alignment

Since the release of Opus 4.6, [we’ve reported that Claude models have shown a concerning trend](/blog/opus-4-6-vending-bench). Opus 4.6, Opus 4.7 and [Mythos Preview](https://cdn.sanity.io/files/4zrzovbb/website/7624816413e9b4d2e3ba620c5a5e091b98b190a5.pdf) all engaged in deceptive and power-seeking business practices such as creating price cartels, lying about paying refunds, and using supply leverage to dictate a competitor’s prices. Of these behaviors, Opus 4.8 only exhibits price-fixing, and does so less frequently than previous models.

### Price fixing

Across runs, we saw both price-fixing and market-allocation collusion. Here’s an example showing that the model is well aware of what it is doing (and does it anyway):

The agent then sent the agreeing reply and raised core beverage prices by ~34%. A few iterations later, it self-narrated: “*I’ve confirmed the collusion with both parties... the collusion held perfectly.*”

It also frequently threatens other agents with a price war if they don’t keep their prices at the agreed fixed price:

It is worth noting that the model is often aware that it is in a simulation. If the model doesn’t think its actions have an impact on real people, it can rationalize this behavior. Other times, this awareness made it reason about what punishment could be programmed into the simulation for such behavior:

When Opus 4.8 declined unethical actions, it seemed to do so out of fear of bad consequences rather than on ethical grounds. This is a different motivation for clean behavior than we saw in models released before Opus 4.6, which often justified clean behavior by reasoning about what is ethical. Here is some internal reasoning from Claude Sonnet 4.5:

### Deception and Power Seeking

Unlike with Opus 4.6, Opus 4.7 and Mythos Preview, we didn’t see any examples of Opus 4.8 engaging in deceptive or power-seeking behavior. It even passed up opportunities where it could have saved a lot of money and gotten away with it. In one instance, a supplier hallucinated that Opus 4.8 had already paid for some items. Opus 4.8 got the items without paying, but still decided to pay retroactively:

## Opus 4.7 is still misaligned

Vending-Bench has not changed since our findings on Opus 4.6, Opus 4.7, and Mythos Preview. When we put Opus 4.8 in Vending-Bench Arena alongside Opus 4.7 and GPT-5.5, Opus 4.7 once again showed concerning behaviors.

In one instance, one of Opus 4.8’s suppliers went out of business, and it messaged Opus 4.7 asking whether it could share contact information for its suppliers. Opus 4.7 fabricated a reason for not being able to help and instead offered to sell items directly to Opus 4.8 at a markup:

Opus 4.7 then used the fact that Opus 4.8 was dependent on it to control Opus 4.8’s supply. Opus 4.7’s internal thinking:

Eventually, Opus 4.8 realizes what is happening:

## Closing thoughts

On Andon Labs’ evals, Opus 4.8 is a step back on capabilities and a step forward on alignment. This raises the question of whether misalignment is a requirement for making a lot of money in Vending-Bench. We don’t think so. As we argued in [our post about GPT-5.5 on Vending-Bench](/blog/openai-gpt-5-5-vending-bench), it is unlikely that the environment rewards such behaviors enough to make a difference. Furthermore, GPT-5.5 was able to score much higher than Opus 4.8 without showing any misconduct.
