We Audited the Same Codebase with Claude Opus 4.8 and MiniMax M3

wpnews.pro

Anthropic released Claude Opus 4.8 on May 28, 2026, and MiniMax shipped MiniMax M3 on June 1, 2026. The two models sit at very different price points, and we wanted to see how they perform compared to each other. We gave the same code audit task to Claude Opus 4.8 at four reasoning levels and to MiniMax M3, then tracked tokens, cost, time, and how many known issues each run found.

TL;DR: MiniMax M3 surfaced 13 of 17 known issues for about $0.07, the same count as Claude Opus 4.8 at medium and high and two behind Claude Opus 4.8 at xhigh and max. Claude Opus 4.8 caught the most issues at its higher settings, but every Claude Opus 4.8 run cost more than ten times as much as the MiniMax M3 run.

The two models sit far apart on per-token price. Claude Opus 4.8 is about 8x higher on input and 10x higher on output.

Price and performance do not always move together, so the two are worth looking at side by side rather than on their own. We took one concrete code audit job and measured the cost of each run and how many known issues it found.

The codebase we audited is a webhook delivery service written in TypeScript, running on Bun with SQLite for storage. It accepts events over an HTTP API, stores them, and delivers them to subscriber URLs with signed payloads. We left known bugs in place rather than fixing them. The result is a small but realistic service with genuine problems in it, which makes it a good fixture for an audit.

We asked each model to review the code. Each run got the same prompt:

Treat this webhook delivery service as production-bound code and audit it for security, reliability, correctness, and test coverage, without editing any files. Write your report to audit.md.

The only output from each run was that audit report.

We picked a review task for this experiment. Reviewing a fixed codebase keeps the input the same for every run and leans the work toward reading and writing a report rather than editing files, which makes the token usage and performance easier to compare across runs.

We ran Claude Opus 4.8 at four reasoning levels: medium, high, xhigh, and max. MiniMax M3 does not expose the same reasoning control, so we ran it once at its default setting and treat that as a single data point next to the four Claude Opus 4.8 runs.

Every run used the Kilo Code CLI, each in its own session with no shared state, and we recorded the token count, cost, and wall-clock time from the CLI’s run summary.

Results From Our Test Runs

MiniMax M3 used fewer tokens than any Claude Opus 4.8 run. It spent 41% fewer tokens than Claude Opus 4.8 at medium and 53% fewer than Claude Opus 4.8 at xhigh. Its cost is the part that stands out. At $0.07 it came in well under a tenth of the cheapest Claude Opus 4.8 run.

The reason the cost difference is so wide is the combination of fewer tokens and a much lower per-token rate. Claude Opus 4.8 reads and writes more to produce its reports, and it bills each of those tokens at several times the MiniMax M3 rate.

Token count and cost tell you the price. The other half is what each run caught. Before running this test, we reviewed the codebase ourselves and cataloged 17 issues across security, reliability, correctness, and test coverage. We used that list as the answer key and counted how many of the 17 each run surfaced. The issues ranged from security problems (an endpoint that returned a stored secret, an outbound request guard that was effectively switched off) to reliability problems (a background worker that could send the same webhook twice) to missing test coverage.

We counted an issue only when a run named it as its own finding, so partial mentions did not count.
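The counting rule above can be sketched in a few lines. This is only an illustration of the rule, not the scoring script used for the article: the issue IDs, the "finding" and "mention" labels, and the `score_run` helper are all hypothetical, and the real answer key has 17 entries rather than four.

```python
# Hypothetical issue IDs standing in for the article's 17-item answer key.
ANSWER_KEY = {
    "missing-auth",
    "secret-returned",
    "non-constant-time-compare",
    "double-delivery",
}


def score_run(findings: dict[str, str], answer_key: set[str]) -> int:
    """Count known issues that a report names as a finding of its own.

    `findings` maps an issue ID to how the report treated it:
    "finding" (listed as its own finding) or "mention" (touched on in passing).
    """
    return sum(
        1
        for issue, kind in findings.items()
        if issue in answer_key and kind == "finding"
    )


example_report = {
    "missing-auth": "finding",
    "secret-returned": "mention",  # partial mention, so it does not count
    "double-delivery": "finding",
    "style-nit": "finding",        # not in the answer key, so it does not count
}

print(score_run(example_report, ANSWER_KEY))  # 2
```

Scoring this way is deliberately strict: a report that only gestures at a problem gets no credit for it.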

Every run caught the major blockers. All five flagged the missing authentication on every route, the unsafe outbound-request handling, the signature that was computed over a different byte string than the one actually sent, the signature check that was not constant-time, the worker that could pick up the same delivery twice, and the missing event idempotency.
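Two of those blockers, the signature computed over different bytes than the ones sent and the non-constant-time signature check, are easier to picture with a small example. The sketch below is not the audited service's code (which is TypeScript and is not reproduced in this article). It is a Python illustration that assumes HMAC-SHA256 signing, and the function names `sign`, `build_delivery`, and `verify` are made up for the example.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the exact body bytes."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_delivery(secret: bytes, event: dict) -> tuple[bytes, str]:
    # Serialize once, then sign and send those same bytes. Signing one
    # serialization and sending another is the mismatch described above.
    body = json.dumps(event, separators=(",", ":")).encode("utf-8")
    return body, sign(secret, body)


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    expected = sign(secret, body)
    # compare_digest avoids the early exit of a plain == comparison,
    # so timing does not reveal how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


body, signature = build_delivery(b"demo-secret", {"id": 1})
print(verify(b"demo-secret", body, signature))         # True
print(verify(b"demo-secret", body + b" ", signature))  # False
```

The second call shows why the byte string matters: a single extra space in the body is enough to invalidate the signature.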

MiniMax M3 held its own on the more specific issues. It caught the endpoint that returned a stored secret, the delivery-list filter that dropped a condition when two filters were combined, subscriber deletion failing once delivery history exists, and the replay path accepting deliveries in the wrong state. Those are the kinds of code-path findings that separated the stronger reports from the weaker ones.

MiniMax M3 missed three issues that one or more Claude Opus 4.8 runs caught. It did not flag invalid JSON returning a 500, the database setup running at import time, or an async callback running inside a synchronous transaction in the event route. Those are smaller than the blockers it did catch, but they are the reason it landed at 13 rather than 15.

Claude Opus 4.8 at xhigh and max led on coverage with 15 of 17 each. Both caught the d-subscriber backlog and the delivery-list filter bug that medium and high missed. Claude Opus 4.8 at xhigh was the only Claude Opus 4.8 run to flag the secret-returning endpoint and still cover everything else its tier caught.

More reasoning did not move in one direction. Claude Opus 4.8 medium and high both flagged the async callback inside a synchronous transaction, which neither xhigh nor max mentioned. Claude Opus 4.8 at max missed the secret-returning endpoint that xhigh caught. Raising the reasoning level changed where the model spent its attention more than it changed how much the model checked.

The cost per run broadly follows the token counts, though not exactly.

The Claude Opus 4.8 max run is worth a closer look. It cost $3.39, which is 67% more than Claude Opus 4.8 at xhigh, while using slightly fewer total tokens than the xhigh run. The token total alone does not set the price. A different mix of output and cached tokens can push the bill up even when the total holds steady, and on this task that extra spend did not buy a better report.
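A small calculation shows how the mix can matter more than the total. The rates below are placeholders chosen for readability, not the published prices of either model, and the token splits are invented. The only point is that two runs with the same total can bill differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Dollars per million tokens. All values used below are placeholders."""
    input: float
    cached_input: float
    output: float


def run_cost(rates: Rates, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Bill each token class at its own rate and return the total in dollars."""
    return (
        input_tokens * rates.input
        + cached_tokens * rates.cached_input
        + output_tokens * rates.output
    ) / 1_000_000


# Hypothetical rates: cached input is cheap, output is expensive.
HYPOTHETICAL = Rates(input=10.0, cached_input=1.0, output=50.0)

# Two invented runs with the same 1,000,000-token total but a different mix.
mostly_cached = (100_000, 850_000, 50_000)
more_output = (200_000, 700_000, 100_000)

for label, mix in [("mostly cached", mostly_cached), ("more output", more_output)]:
    print(f"{label}: {sum(mix):,} tokens, ${run_cost(HYPOTHETICAL, *mix):.2f}")
```

Under these made-up rates the second run costs noticeably more despite the identical total, which is the same shape as the xhigh and max comparison.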

The cost difference becomes more noticeable next to what each run produced. We took the cost and divided it by the number of known issues each run surfaced.

MiniMax M3 had the lowest cost per issue by a wide margin. Claude Opus 4.8 at max had the highest. The two Claude Opus 4.8 settings that found the most, xhigh and max, were not the most efficient per dollar.
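The division is simple enough to reproduce from the figures quoted in this article. The sketch below uses the rounded costs and issue counts from the text (the MiniMax M3 figure is "about $0.07"), so the per-issue values are approximate.

```python
# (cost in dollars, known issues surfaced), as quoted in the article.
RUNS = {
    "MiniMax M3": (0.07, 13),
    "Claude Opus 4.8 medium": (1.30, 13),
    "Claude Opus 4.8 high": (1.93, 13),
    "Claude Opus 4.8 xhigh": (2.03, 15),
    "Claude Opus 4.8 max": (3.39, 15),
}


def cost_per_issue(cost: float, issues: int) -> float:
    """Dollars spent per known issue surfaced."""
    return cost / issues


# Cheapest per issue first.
ranked = sorted(RUNS, key=lambda name: cost_per_issue(*RUNS[name]))

for name in ranked:
    print(f"{name}: ${cost_per_issue(*RUNS[name]):.3f} per issue")
```

With these rounded inputs, MiniMax M3 comes out at well under a cent per issue, medium is the most efficient Claude Opus 4.8 setting, and max is last at roughly 23 cents per issue.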

Wall-clock time tracked token usage more than the model itself.

MiniMax M3 sat in the middle at 5m 03s, slower than Claude Opus 4.8 at medium and high and faster than Claude Opus 4.8 at xhigh and max. It was not the fastest run, so its advantage is cost rather than speed.

One caveat on the timing: MiniMax says it plans to release the M3 weights publicly. The weights are not public today, but once they are, other inference providers can host the model, and some may run it at higher throughput than we saw here. The time above reflects the current hosting, so it could change as more providers offer the model.

Inside the Claude Opus 4.8 runs, higher reasoning levels took longer, roughly in step with the tokens spent. Claude Opus 4.8 went from 3m 53s at medium to 7m 26s at xhigh, then to 9m 24s at max, which was about 2.4 times the medium run. The extra two minutes at max did not produce a better report than xhigh did.
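The ratios between those run times can be checked directly. This sketch uses only the three Claude Opus 4.8 durations quoted above; the `seconds` helper is just a convenience for parsing the "3m 53s" format.

```python
def seconds(stamp: str) -> int:
    """Parse a duration written like '3m 53s' into whole seconds."""
    minutes, secs = stamp.split()
    return int(minutes.rstrip("m")) * 60 + int(secs.rstrip("s"))


# Wall-clock times quoted in the article for Claude Opus 4.8.
TIMES = {"medium": "3m 53s", "xhigh": "7m 26s", "max": "9m 24s"}

baseline = seconds(TIMES["medium"])
for level, stamp in TIMES.items():
    elapsed = seconds(stamp)
    print(f"{level}: {elapsed}s, {elapsed / baseline:.2f}x medium")
```

The max run works out to about 2.4 times the medium run, and xhigh to about 1.9 times.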

The four Claude Opus 4.8 runs did not improve in a straight line as we turned the reasoning up.

Medium and high both surfaced 13 of 17. Going from medium to high added only about 6% more tokens but 48% more cost, and the high report was more precise on a few code-level issues without finding more of them overall. Going from high to xhigh added 17% more tokens and only 5% more cost, but it added almost three minutes and moved the count to 15. The xhigh setting was where Claude Opus 4.8 produced its best report.

Max added the most cost and the most time of any run and returned the least for it here. It matched xhigh’s count of 15, dropped one finding that xhigh caught, and cost 67% more.
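The step-by-step cost increases follow from the per-run costs quoted in this article. A quick check, using those rounded dollar figures:

```python
# Cost per Claude Opus 4.8 run in dollars, as quoted in the article.
COSTS = [("medium", 1.30), ("high", 1.93), ("xhigh", 2.03), ("max", 3.39)]


def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100


# Compare each reasoning level with the one just below it.
for (prev_name, prev_cost), (name, cost) in zip(COSTS, COSTS[1:]):
    print(f"{prev_name} -> {name}: +{pct_increase(prev_cost, cost):.0f}% cost")
```

This reproduces the 48%, 5%, and 67% steps, and makes it easy to see that the two expensive jumps (medium to high, xhigh to max) were the ones that did not raise the issue count.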

For this audit, the reasoning level changed the bill as much as it changed the findings, and the most expensive setting was not the most useful one.

The choice here is less about which model is better and more about matching the run to the job.

For low-cost or high-volume audits, MiniMax M3 is the value pick. It surfaced 13 of 17 issues for about $0.07 and finished in about five minutes. It caught most of the serious problems, including the secret-returning endpoint and the delivery-list filter bug that the cheaper Claude Opus 4.8 runs missed. Its five-minute time reflects the current hosting and could improve once MiniMax releases the M3 weights and other providers start running the model.

For a fast Claude Opus 4.8 pass, medium is the cheapest setting. It surfaced 13 of 17 in under four minutes for $1.30, and it caught the async-transaction issue that Claude Opus 4.8 xhigh and max both missed.

For a more precise Claude Opus 4.8 review without the longest waits, high works. It found the same count as medium for $1.93 and was sharper on several code-level findings, though it took just over four and a half minutes.

For the most thorough single pass, Claude Opus 4.8 at xhigh produced the best report. It surfaced 15 of 17 issues for $2.03 and was the only Claude Opus 4.8 run to flag the secret-returning endpoint on top of everything else its tier caught.

Claude Opus 4.8 at max was the hardest setting to justify on this task. It was the most expensive and the slowest run, matched xhigh’s count, and dropped one finding that xhigh caught.

The broader trend worth watching is that cheaper models, including open-weight ones, are improving quickly and getting close to proprietary models like Claude Opus at a much lower price. They are not all the way there yet. On this task MiniMax M3 matched Claude Opus 4.8 at medium and high but still came in two issues behind its higher settings. The practical approach is to test a few models on the kind of work you actually do, look at how they perform, and pick based on your requirements, budget, and how much coverage the task needs.

Source: blog.kilo.ai (original article)
