Agentic coding notes from Galapogos Island

I've been using AI fairly heavily since last November and the whole thing is a funny experience. An agent will do something that, if a human did it, you'd immediately fire them. My reaction, of course, is to act as if this is great and spin up a thousand agents so they can do even more of that.

Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn't have tests and git bisect wouldn't work, and it was a UI interaction bug that I'm not even really qualified to write a test for, so I asked codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn't possibly be correct). On telling codex this was wrong, it then told me some commit that was obviously also not the offending commit once or twice. On telling it those were wrong, it then told me the offending commit was some plausible looking commit. When I asked it to prove or disprove its theory, it told me that it wrote a test and confirmed that the alleged commit was the breaking commit.

I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn't have permissions to do that (which was a lie), but it could make a video of the execution of the repro before and after the commit in playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing to work after the commit. Something about this didn't feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.

Like I said, because this was non-ironically such a great experience, I immediately thought to myself, "how can I get more of this?" and started using agents more and more heavily until I was using coding agents heavily mid-late last year.

Since this post covers a relatively disparate set of topics, here's a brief outline.

Testing background
Some details on testing
Caveman mode
LLM variance
Misc
Agentic loops and writing this post
Some reasons people talk past each other

Testing background

LLMs are highly leveraged when it comes to testing. In terms of the amount of effort it takes, it's easier than ever to hit a particular quality bar and yet, software seems to be lower quality than ever. A decade ago, we looked at the bugs I ran into in an arbitrary week. There were quite a few bugs then and I run into more bugs now, but I don't think this has to be the case.

For one thing, after a bug has been shipped, it's easier than it's ever been to use a data-driven approach to find and fix the bug. Just for example, at work, I tried creating a pipeline that goes from support ticket (chat or email) to pull request (PR). As far as I can tell, this works ok. Since I work for a company that has a traditional workflow, all of these fixes get reviewed by a human and, so far, we've had no known false positives. Per unit of time invested, it's also possible to do more thorough testing. Personally, I think this can be effective enough that I'm fairly comfortable trying to ship a large volume of code via a "software factories" workflow because I've seen a testing-heavy no-review workflow that results in much higher quality than any review-reliant workflow I've seen or even heard of.

Like everybody, I have biases that fall out of my experiences. It just so happens that I spent the first decade of my career at a company whose test processes happen to work well in today's LLM environment. I talked about fuzzing as a default testing methodology on Mastodon, and a skeptic tried it out and immediately found some bugs:

so I reread the blog post and was very "dubious face" but no yeah, Claude fuzzing found several classes of bugs that are worth fixing

A number of other folks I've talked to have also tried adopting something like the testing flow we'll discuss here and they've all immediately found bugs in the software they work on, including bugs that don't get surfaced by just asking Codex or Claude to audit the code for bugs, find bugs, "test", "test more", etc. For example, Dennis Snell mentioned that he and a teammate, Jon Surrell, not only found bugs in the code they're working on, but also "in upstream dependencies, including the HTML specification, big-three browsers, and other open-source projects" with fairly low effort.

In general, when I talk to software folks about testing, I'm coming from such a different place that they immediately look at me like I'm an alien, so let's talk about how we tested at this hardware company I worked for, Centaur, which informs my biases about how I like to work. Some of the things that we did that were or are unorthodox in the software world are:

1. Dedicated QA / test engineers, with that being a first-class career path
2. No code review by default
3. Virtually no hand-written tests
4. Constant testing via what programmers sometimes called property based testing, randomized testing, fuzzing, etc., although we just called those tests (hand-written tests were called "hand tests")
5. Regression tests take too long to wait for (3 months)
6. No unit tests

Just to give you an idea of the general structure, when I left (in 2013), we had about 1000 machines generating and running tests at all times for roughly 20 logic designers and 20 test engineers. This was on prem and the machines took up half a floor of the building we were in.

The general structure was that we had maybe 20% of machines running regression tests, and 80% generating and running new tests. Three months of regression tests is too much to gate commits on, so there was a much shorter list of tests that took maybe 10 minutes or so to run that people would run before committing. Those commit tests would run on a special setup to run as quickly as possible, with overclocked machines that were the fastest machines money could buy, as well as a different simulator setup.

New failures would get found and reported as they happened and one to two engineers had a job of sorting through failures and triaging them (rejecting false positives, fixing issues in the test generator that caused them to generate false positives, etc.). In terms of the magnitude of the impact, unless you count culture as a separate item, (1) was probably the biggest difference between us and a typical software company, but also the most irrelevant for readers here, so I'll relegate the discussion to a footnote 1, except for this brief comment that testing is like any other skill; spending more time doing it improves skill and, since testing isn't a first-class career path at most major tech companies, people generally don't have the same level of testing skills at software companies as you see in some career CPU test engineers. In the same way that an engineer who spends 20 years working on distributed systems or UX is going to be much better at it than an equally talented engineer who spends 5% of their time on distributed systems or UX, someone who spends 20 years working on testing is going to be much better at it than somebody who spends 5% of their time on testing.

(2) is one of the things that makes some of the test practices we used at the chip company suited to AI workflows. We didn't review code by default because we trusted our test practices enough that review didn't, in general, add much reliability. We were shipping fewer than 1 significant user-visible bug per year, and review was done on an as-needed basis when someone wanted an extra set of eyes on something they thought was particularly tricky 2. With AI coding workflows, it's easy for one person to generate more code than any human or even any ten humans can review by hand. People have different levels of comfort with shipping code without review. Personally, I'm very comfortable shipping code without human review because I've seen it done on products that are technically more challenging than most software at most software companies.

I often see people say things like, "that's too much risk; we have millions of users" but, empirically, they're talking about a workflow that ships bugs at a rate that's maybe a thousand times higher per capita on raw count, with the ratio being much higher if you adjust for severity. If a company were shipping bugs at, say, a hundredth the rate we were at Centaur while relying primarily on review to catch bugs, then I could see their point, but that's not what's happening at the typical software company where people don't want to move away from human review because of the perceived risk of shipping bugs.

(3) and (4) go hand in hand. Almost every software group I know of that's serious about reliability (various teams that ship reliable databases, distributed databases etc.) is at least directionally doing the same thing, although they might have a larger fraction of hand-written tests. For the same reason it's considered a bad idea to rely on testing by interacting with the software yourself and observing whether or not the software appeared to work, it's a bad idea to rely on directly typing out the inputs to a test and the expected outputs. As previously discussed, it's just really inefficient to write tests by hand. For any given level of reliability, you'll get there more quickly if you prefer randomized test generation over hand-written tests.

(5) fell out of having a lot of tests find a lot of bugs. In general, if a test found a bug that we later fixed, we'd keep the test in our regression test suite forever. It turns out, if you find a lot of bugs with good tests, you'll end up with a large test suite. But putting that aside, just looking at it from a test efficiency standpoint, the standard setup in software of having the same set of tests run in CI for each PR is extraordinarily inefficient if you think about what's more likely to find a bug: running the same test a thousand times in a day or, in the same amount of test time, running a thousand different tests.
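The "same test a thousand times vs. a thousand different tests" comparison can be made concrete with a toy probability model. This is only a sketch under loud assumptions that are not from the text: tests are deterministic (so rerunning one adds no new information), each distinct randomly generated test independently triggers a given bug with probability `bug_rate`, and the name `p_find` and the rate 0.001 are made up for illustration.

```python
def p_find(bug_rate: float, distinct_tests: int) -> float:
    """Chance that at least one of `distinct_tests` independent tests hits the bug."""
    return 1.0 - (1.0 - bug_rate) ** distinct_tests

# Assumed, illustrative rate: one random test in a thousand triggers the bug.
bug_rate = 0.001

# A deterministic test rerun 1000 times is still only one distinct test.
same_test_rerun = p_find(bug_rate, 1)

# 1000 freshly generated tests in the same amount of test time.
fresh_tests = p_find(bug_rate, 1000)

print(f"rerun one test 1000x: {same_test_rerun:.3f}")
print(f"1000 different tests: {fresh_tests:.3f}")
```

Under these assumptions, the rerun case stays at the single-test rate while the fresh tests find the bug well over half the time. Real test suites have flaky tests and correlated inputs, so treat the numbers as directional only.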

(6) came out of test efficiency concerns as well, in that we had a much smaller team than our competitors. That was a reason the company managed to survive for so long. While Intel was putting every x86 designer out of business other than AMD, our operating cost was low enough that the company survived until 2021, at which point it was acquired by Intel for $125M. With the company's tiny team size, it wouldn't have been possible to get reasonable test coverage with unit tests and hiring enough to do unit tests probably would've meant the company would've gone the way of the x86 efforts of Transmeta, Rise, Cyrix, TI, UMC, NEC, VM, etc., a decade or two sooner. From an efficiency standpoint, unit testing does pretty poorly.

To sum it up, we did quite a few things that most software people tell me are bad ideas (dedicated test engineers, no unit tests, no code review, etc.) and we had much higher quality than any software company I've worked for or any software I've used. Whenever I talk about this, people will say that this doesn't apply to software because CPUs only have X concerns and you can't do the same thing with Y. When I first switched from CPU design to software I thought that might be true, but I've since tried this testing methodology with every kind of Y that someone has mentioned this can't work for and it's worked for every single one, so I no longer find this very plausible (and the Xs generally involved incorrect assumptions of what hardware development is like). While there are real differences between hardware and software, when I’ve seen people lean on that as a reason that testing techniques don’t carry over, it’s been the case that the person is relying on some imagined factor that only seems relevant because the person doesn’t know much about hardware development.

One significant difference was the ratio of effort that went into testing vs. development, but the fixed costs of fuzzing are fairly low, so this is scalable to any level of effort and the efficiency gains are still there. And, due to the gains in test efficiency, the ratio of effort wasn't as large as software engineers generally imagine. We had about a 1:1 ratio of test engineers to developers and then spent maybe 10% of our time in a "freeze" state, where the goal was to find bugs and not ship new features, so a zeroth order estimate for the overhead here is that we spent 55% of our effort on testing and 45% on development, or we could've put 2.2x the effort into development if we spent zero effort on testing. If you look at a software company that's shipping significant bugs many times faster than we did and you declare an emergency and get people to spend 55% of their effort on testing, I don't think the ratio changes too much. Maybe they get to half the previous ratio or something, but the level of effort isn't really what's making the difference.

Nowadays, another thing people will say is, why bother with fuzzing when you can just ask an LLM to find bugs? I've tried doing both quite a few times now and my experience has been that fuzzing generally wins on latency to find a bug, and it dominates on finding more bugs and having a lower false positive rate. LLMs have fairly high variance (more on this later), so just asking Codex or Claude to find a bug can sometimes win but, on average, fuzzing has won.

Some details on testing

Despite the very positive things I've said about LLMs and testing, LLMs seem pretty bad at testing. It's less that LLMs are good at testing and more that they let you apply testing effort a lot more easily than before.

An extreme example of this is that everybody I've talked to who cares about quality or testing at all finds the tests LLMs generate by default, or if you tell them "Write tests", "Write more tests", etc., to be poor. People tend to rate the tests as somewhere between worthless and marginally useful, depending on their standards.

For example, Em Chu (a compiler engineer) says:

The existing tests I'm working with aren't perfect, but are still above the bar LLMs seem to aim for, which I would describe as "thorough enough to smuggle a feature through human code review." For a compiler (compared to e.g. UI), where I'm guessing it's easier to write the average test, but a higher bar of correctness is generally expected of the end product, LLMs just suck. They are painfully bad at the adversarial "now, what if I do this" or "let's try the cross-product of everything" process humans use to write tests that actually find bugs

At the same time, I've seen a number of folks rave at how amazingly good LLMs are at testing when you tell them "Write tests", "Write more tests", etc. When I've looked into why people say LLMs are great at testing, what I've found is that people who did essentially no testing at all find LLMs to be great at testing. Well, that makes sense. If you go from basically zero testing effort to a tiny bit of testing effort, that's a huge win.

As of June 2026, directing LLMs to do fuzzing / randomized testing feels similar. I've tried using an LLM to generate a fuzzer and, for most projects, this will turn up real and often serious bugs within minutes. However, on looking at what the LLM-created fuzzer is trying to test, I have the same reaction as a normal programmer who cares about quality looking at LLM-created tests. The coverage of the LLM-generated fuzzer is curiously bad and misses all kinds of basic things you'd expect a hastily human-written fuzzer to cover. Depending on whether you're a glass half empty or a glass half full person, you might say that this says something about the test coverage of most projects, or that it says something about the unreasonable effectiveness of fuzzing.

At a high level, LLM-generated fuzzers from SOTA models today don't do a good job of "thinking about" how inputs should be varied to elicit bugs. Then, if you naively tell it about how inputs should be varied and to combine these, it will also not combine bug ingredients in a reasonable way. It's possible to give instructions that will work well, but this heavily relies on the user to provide direction.

If you're using randomized testing as "extra credit", to catch a few more bugs, or to replace traditional software testing processes, you can just tell an LLM to look for risky areas of the code and find invariants that might be violated and fuzz them. This works ok. When I've convinced people to try some randomized testing, they usually start here and find quite a few bugs they're happy to have found. Due to the nature of who's interested in trying out novel-to-them test techniques, this is often from people who've worked on some of the most well-tested and reliable code at the company and they can find bugs in their own relatively well-tested code.
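For readers who haven't seen this style of test, here's a minimal sketch of "find invariants and fuzz them". Everything in it is hypothetical and chosen only for illustration: `merge_intervals` stands in for the code under test, `merge_intervals_buggy` is a deliberately broken variant, and the generator's size and width choices are arbitrary. It isn't the setup used at Centaur or in the workflows described here.

```python
import random

def merge_intervals(intervals):
    """Hypothetical code under test: merge overlapping closed integer intervals."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(iv) for iv in merged]

def merge_intervals_buggy(intervals):
    """Same as above, but forgets max(), so nested intervals shrink the result."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = hi
        else:
            merged.append([lo, hi])
    return [tuple(iv) for iv in merged]

def covered_points(intervals):
    """Slow but obviously correct oracle: the set of integers covered."""
    return {x for lo, hi in intervals for x in range(lo, hi + 1)}

def random_intervals(rng):
    """Bias inputs toward edge cases: empty lists, duplicates, nested ranges."""
    n = rng.choice([0, 1, 2, 5, 20])
    return [(lo, lo + rng.choice([0, 0, 1, 3]))
            for lo in (rng.randint(-5, 5) for _ in range(n))]

def check_invariants(intervals, merge):
    result = merge(intervals)
    # Invariant 1: merging never changes which points are covered.
    assert covered_points(result) == covered_points(intervals)
    # Invariant 2: output is sorted and no two output intervals overlap.
    assert all(a[1] < b[0] for a, b in zip(result, result[1:]))

def fuzz(merge, iterations=2000, seed=0):
    """Return the first failing input, or None if every case passes."""
    rng = random.Random(seed)
    for _ in range(iterations):
        case = random_intervals(rng)
        try:
            check_invariants(case, merge)
        except AssertionError:
            return case
    return None

print("correct version:", fuzz(merge_intervals))
print("buggy version failing input:", fuzz(merge_intervals_buggy))
```

The part that matters is `random_intervals`: how inputs are varied and combined decides which bugs are reachable, and this is where direction from the user tends to be needed.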

If you want to use randomized testing to keep an agentic "software factories" workflow honest, then you need to have a way to deal with gaps in SOTA models because, when you're shipping the equivalent of hundreds or thousands of PRs a day into a project, everything that's not constrained from degrading will rapidly degrade.

At a high level, the entire system needs some kind of feedback that finds gaps and instructs whatever loop is making adjustments to the fuzzer to close the gaps. Recently, I've been testing things where I don't understand the domain and don't understand the project or the code, so I've been flying relatively blind and know that there will be a lot of gaps in what I come up with (and, as noted above, LLMs are terrible at this). But even in areas where I'm familiar with the domain and understand the code relatively well, there will still be some gaps because humans miss things and make mistakes, so there always needs to be some kind of feedback into the test setup that can find gaps and allow you or an agent to close the gaps.

I've been playing with various ways to have agents convene and reconvene to get agentic loops running better and, while that kind of thing helps, I haven't figured out a way to do this to create some kind of agentic software quality improvement loop that doesn't rely on some kind of outside feedback, whether that's occasional human input, or shipping something (ideally only fractionally and with staged rollout) and then having the system monitor metrics/logs/traces/support tickets/whatever to use that as feedback. The support ticket to PR pipeline I mentioned above is one such feedback loop. The pipeline not only tries to generate a PR, it also tries to get the test setup to add test coverage that will find the bug and possibly surface other bugs, or will re-find the bug if there's a future regression. This seems to work ok-ish, in that it finds real bugs and improves test coverage, but I'm sure there's a lot of room for improvement.

Relatedly, I've been wondering why LLMs are so bad at writing tests. On asking around a bit, I'm told that this is because the capabilities that LLMs have come out of people building RL environments which allow models to improve at tasks, sometimes in a generalizable way and sometimes not. I'm also told that there's a market for selling RL envs, but it's fairly thin because there aren't all that many buyers for them, and you really want to know someone at a lab who's a buyer or close to it. If you are such a person or can connect me with such a person, could you do me a favor and reach out to me (I'm fairly easily reachable on X, Mastodon, email, etc.). I'm curious about how this works and how plausible it is to sell an RL env for something like testing, optimization, or the longer horizon tasks discussed in this post, where it's easy to observe significant gaps.

Back on the topic of testing, when fuzzing or doing any kind of bug auditing, detecting false positives is a critical part of the process. At least for now, having access to a model that's better than anything you can publicly use won't save you. A while back, Dennis Snell told me, frustratedly, that he spent the day wading through AI slop forwarded to his employer by Anthropic that came from their vaunted Mythos model that's too dangerous to release to the public. Anthropic was apparently doing the company some kind of favor or maybe doing some kind of EA security improvement, except that they didn't bother with having a reasonable false positive rejection process so they were just forwarding garbage to us. At the time, I was using a model that, if Fable is any indication, appears to be moderately less capable than Mythos, but I had no problem generating an endless stream of bugs (some of which were security issues) with no known false positives, which seems to indicate that having a reasonable setup around the model is at least as important as having the latest and greatest model.

I've been trying custom workflows on a per-project / per-problem basis and don't exactly have a generic false positive rejection scheme, but there are various things that seem to be semi-generalizable. If you don't mind spending tokens, having independent I mentioned that I had good luck using different "personas" for reviews as well as for managing agentic loops and I got some responses with theoretical reasons this doesn't work well but, in practice, it seems to work fairly well. My workflow changes regularly and maybe a week after that discussion I started adding "contrarian" personas to the mix, which improved performance given the same wall clock or token budget.

For anything human reviewed, having some kind of artifact (e.g., a video if it's a bug that's expected to be apparent in the UI) is necessary. Without really explicitly trying to have the agent review this, just producing this at all seems to reduce the false positive rate somewhat, and then having the agent review the artifact reduces the false positive rate further. Asking agents to independently review the artifact (e.g., looking at the test code that produces the video vs. looking at the video itself) also reduces the false positive rate further. In general, getting independent perspectives seems to help a lot with reducing false positives. In the experiments I've run, this has been less effective than having agents with different personas / perspectives per unit of wall clock time or dollar, but just asking the same question multiple times improves results, which we'll discuss in more detail in the section on LLM variance. Pretty much everything I've tried to reduce false positive rate has worked, so if you're not scaling up a workflow to the point where optimizing the costs matters, doing anything remotely reasonable seems to work fairly well.

Caveman mode

I keep getting various tool and workflow recommendations and, when I look into it, I can almost never find good information on whether or not it makes sense to adopt the recommendation. Just for example, I've seen "caveman mode" recommended multiple times at work. Caveman mode allegedly reduces token usage and speeds up prompt resolution (the README claims a 75% reduction in token usage, a 65% reduction in token usage, and 2x fewer tokens used, as well as a 3x speed increase).

Searching for information (just googling 'caveman mode', no quotes), one of the top hits that wasn't a link to caveman mode was this reddit thread where, of the three top comments, one is a joke and the other two highly recommend it:

Just extreme brevity in a refreshing way... and dramatically lowered token count without any seeming impact on the analytical thinking... but i have no way to benchmark before and after.

Someone at work is testing it and it seems to actually save tokens AND work just as well.

Most of the rest of the top hits were also positive recommendations for caveman mode that purported to do some kind of eval (although they read like unfiltered LLM text) and the top hit on YouTube was one of the biggest programming YouTubers saying:

it actually works; it actually works quite well ... no, I'm not exaggerating

In a slack thread at work where people were recommending caveman mode, I asked if anyone had done a comparison, noting that the creator of caveman mode responded to the HN thread about caveman mode by saying it's a joke. Someone linked to an analysis of caveman mode that claims a significant win, but the analysis was an LLM-generated SEO spam article with numerous errors. When I politely pointed this out, the person who posted the link said "I only skimmed it".

At that point, I decided to spend about 15 seconds apiece generating some caveman mode benchmarks (it seems like people call benchmarks evals now, so I should call these evals?) (previously discussed in more detail here).

To start with, I'd been using a lot of GPT-5.5 xhigh when this discussion came up a couple months ago and I benchmarked this thing, so let's look at how this looks for GPT-5.5 xhigh on the first benchmark, a simple benchmark where we ask the agent to optimize some code in wasm. Since this is something I spent 15 seconds prompting an agent to generate, I don't think it's worth spending a ton of time discussing the details, but one thing to note is that it's possible to do much better than any of the results an agent achieved here. I would expect a human doing this by hand or a human who's being prescriptive about what the agent should do to get results that are literally off the charts in the charts below. For the optimization chart, 1.0 is no speedup and higher is better (below 1.0 means the "optimization" slowed things down). To give you an idea of what this looked like when running the experiments, you can click the buttons to follow along interactively or just play an animation.

We can see that, for the first benchmark (optimize a non-trivial algorithm in wasm), after one run, caveman is looking good. We get 1.027 speedup vs. 0.987, $12.10 vs. $23.10, and a wall clock time (of how long the agent took) of 8m51s vs. 14m9s. But we know that LLMs are stochastic, so we should probably run again. After a second run, we can see a big change in the results, with an average of 1.0 speedup for both, but caveman mode coming in at $12.45 in 8m64s vs. $40.38 in 17m57s. That's a huge cost savings that's in line with the claimed savings from caveman mode. It's a bit silly to narrate each step of the animation, but if we skip to the end, we can see that the average after 50 runs is in favor of caveman, with 1.03 vs. 1.01 speedup, and $17.97 in 13m46s vs. $24.21 in 16m52s. That's not as good as what we saw after two points, but that's still solidly in favor of caveman mode.

I asked GPT-5.5 xhigh to do classical and Bayesian statistics on this and it produced a script that says that, for the Optimization 1 benchmark, the p-values for caveman having better speedup, cost, and wall clock time, are 0.1, 0.005, and 0.001, respectively. With Bayesian stats, we have P(caveman better) 0.958, 0.999, and 1.000, respectively. We can look at the plots for the other two benchmarks I spent 15 seconds on as well. Optimization 2 is another "optimize this code in wasm" benchmark, and Game AI is a task where the agent is asked to implement a board game AI for the game Lost Cities with a deadline of 10ms per move.
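The script itself isn't shown here, so the following is not the analysis that produced the numbers above. It's a minimal sketch of one classical option, a one-sided permutation test on mean cost, with a made-up function name and made-up per-run costs that are not data from these benchmarks.

```python
import random

def permutation_p_value(treatment, baseline, trials=5000, seed=0):
    """One-sided p-value for 'treatment has a lower mean than baseline'."""
    rng = random.Random(seed)
    n = len(treatment)
    m = len(baseline)
    observed = sum(baseline) / m - sum(treatment) / n
    pooled = list(treatment) + list(baseline)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = sum(pooled[n:]) / m - sum(pooled[:n]) / n
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical per-run costs in dollars, NOT data from the post.
caveman_cost = [12.0, 13.5, 19.5, 15.0, 22.5, 17.5, 14.5, 20.0]
standard_cost = [23.0, 41.0, 18.5, 21.5, 26.0, 16.5, 27.5, 22.5]

p = permutation_p_value(caveman_cost, standard_cost)
print(f"one-sided p-value for lower mean cost: {p:.3f}")
```

A permutation test makes no assumption about the shape of the cost distribution, which seems useful when a single expensive run can dominate an average, but it answers only the classical question. The Bayesian P(caveman better) figures need a separate model.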

The results with these benchmarks look mixed. For Optimization 2, we have P(caveman better) = 0.17, 0.999, and 1.000, respectively, and for Game AI, we have P(caveman better) = 0.04, 0.79, and 0.73, respectively, so caveman actually gives worse results for Optimization 2 and Game AI, which is the opposite of what we saw for Optimization 1. BTW, I didn't cherry pick the order of these results to present some kind of surprising narrative reversal. If we were to stop here, we might think that caveman gives worse outcomes but saves money, or maybe it gives better outcomes on some tasks and worse outcomes on some tasks and saves money.

If we try a few more models (GPT-5.4 mini, GPT-5.4, GPT-5.5) at every effort level, we get the following averages, which makes the overall picture less clear (in the graphs below, up and to the left is better, down and to the right is worse; the arrows point from the baseline to caveman):

What are the patterns here? To name a few, for Optimization 1, caveman generally has better results than standard, but for Optimization 2 and Optimization 3, it's mostly the other way around, although there are exceptions. For cost, we can see a variety of patterns as well. There's enough variance between conditions (tasks as well as models and effort levels) that it's clear that we'd have to run a lot more conditions to get a clear picture of what's going on and, overall, the difference averages out to be small enough that it doesn't seem worth using caveman mode.

LLM Variance

Recently, when new models have been released, I've done a search to see what people are saying about them. In general, there are a lot of contradictory comments out there. For example, when GPT-5.5 was released people said, variously, GPT-5.4 is better than 5.5 because it's better at staying on task while 5.5 wanders off and overthinks the problem, making 5.5 much more expensive and pointless; 5.5 is so much better than 5.4 that it's cheaper to use because it doesn't mess up and then get stuck fixing its own issues as much; 5.5 is cheaper than 5.4 because it works so well you can run at a lower effort level; 5.5 "just works" while 5.4 often fails and needs handholding, etc. Often, someone will run a benchmark and show that their statement is true.

Looking at these benchmarks, we can see support for all of these statements that I saw on reddit when searching for comments on GPT-5.5 shortly after release. In Optimization 1, GPT-5.4 has better results than 5.5 and is much cheaper. But in Game AI, GPT-5.5 is substantially better than 5.4, so much so that 5.5 high costs about as much as 5.4 xhigh, but with better results, and 5.5 medium is cheaper than high with significantly better results. With just these three evals, you can find support for every statement I saw people making about GPT-5.5 on release because all of the statements are sometimes true. And that's when we're averaging out variance with quite a few more runs than any reasonable person is going to make to support some comment they're throwing out on the internet.

In general, this kind of thing is why, when I see a metric or graph that summarizes a set of benchmarks, I think, "show me the distribution". Benchmarks of models often reduce to a single, nice, neat number, where you see that X is better than Y, which is better than Z. I find these to be basically meaningless, in that, if we're looking at the latest and greatest from OpenAI and Anthropic, we know there are reasonable benchmarks where X is better than Y and vice versa. If the set of benchmarks had a few more benchmarks that favored Y instead of X, the results would be flipped. For some kind of summary metric like that to be useful to me, it would have to be the case that the set of benchmarks perfectly mirrors the distribution and weight of tasks I do and that I can only choose a single model to use for all tasks. Since neither of those is true, it’s not clear what actionable information I can take away from these benchmarks.

If we look at public benchmarks in more detail, the situation seems worse than it appears from the abstract argument above. Results are generally presented in fairly high precision, as if that's meaningful. For example, on one benchmark, we might see that GPT-5.5-xhigh is 1% worse than Fable 5 medium, but at 19% lower cost. And then if we compare to Opus 4.8 maybe it's 13% worse than GPT-5.5 xhigh at 11% higher cost. If we want to know what this means, we can dig into the data and see that we have some benchmark that claims to be meaningful because it has this big set of diverse tasks, but they're all pass/fail tasks that get run 4 times and most tasks are either very easy and get 4/4 with the best models (except, due to some random noise, you sometimes see a random 3/4) or are very hard and mostly get 0/4. Then there's some small subset of tasks that actually determine the relative scores of these SOTA models. If you change out one of these for a different one, the results between the two highest scoring models can get flipped. If you change a few tasks (out of about 100), then you can see the apparently much worse Opus 4.8 move ahead of GPT-5.5. Change a few more tasks and GLM-5.2 can pull ahead.

When I see things like this, it reminds me of Miguel Indurain, who was enough of a household name when I was a kid that I'd heard of him even though I don't follow cycling. A few years ago, I was curious why household names in cycling since Indurain are all different archetypes from Indurain and it turns out the answer is that it's arbitrary. For arbitrary reasons, the Tour de France has become the most famous cycling race in the world and someone who has a dominant streak can become famous enough that they become known outside of cycling circles. For other arbitrary reasons, there was a period of time where the TdF had much longer time trial stages than it does now, which suits someone of Indurain's archetype. You tweak the benchmark a bit and Miguel Indurain goes from being a once household name to an all-time great time trialist that pretty much nobody has heard of unless they follow cycling.
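To get a feel for how fragile a ranking from this kind of benchmark can be, here's a small simulation of the run-to-run noise only (it doesn't model swapping tasks in and out). It's a sketch with invented numbers: the suite composition (60 easy, 30 hard, 10 discriminating tasks), the pass probabilities, and the function names are all assumptions, not measurements of any real benchmark or model.

```python
import random

def benchmark_score(pass_probs, runs_per_task, rng):
    """Fraction of (task, run) attempts that pass for one benchmark execution."""
    passes = sum(rng.random() < p
                 for p in pass_probs
                 for _ in range(runs_per_task))
    return passes / (len(pass_probs) * runs_per_task)

def flip_rate(model_a, model_b, runs_per_task=4, repeats=1000, seed=0):
    """How often model_b scores strictly higher than model_a across reruns."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(repeats):
        score_a = benchmark_score(model_a, runs_per_task, rng)
        score_b = benchmark_score(model_b, runs_per_task, rng)
        if score_b > score_a:
            flips += 1
    return flips / repeats

# Hypothetical 100-task suite: most tasks are near-certain passes or failures.
easy = [0.98] * 60
hard = [0.02] * 30
# Only these 10 tasks separate the two hypothetical models.
model_a = easy + hard + [0.6] * 10   # "truly" better on the tasks that matter
model_b = easy + hard + [0.5] * 10

rate = flip_rate(model_a, model_b)
print(f"model_b outscored model_a in {rate:.0%} of simulated benchmark runs")
```

Varying the 10 discriminating tasks, or the number of runs per task, shows how much of a headline difference comes from a small slice of the suite.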

Back on the topic of coding agents, it's not clear who really needs to pay attention to these benchmarks that present summary metrics of how models are doing. As we noted above, as a user of models, these benchmarks don't meaningfully tell me what model I should use. But if many other users based their decisions on these benchmarks, then AI labs would need to care about their results on these benchmarks.

But even though GPT-5.5 has been handily beating the various Opus 4.x models during 5.5's tenure on most of these benchmarks, Anthropic's business grew much faster than OpenAI's during the time period that the best publicly available models were GPT-5.5 and Opus 4.6/4.7/4.8, so much so that OpenAI has been giving companies free tokens to try to convince people to use GPT. My company was one of many to get months of free tokens and, during that time period, most people still primarily used Claude and Opus. Anthropic's revenue trajectory is incompatible with these benchmarks being major determinants of user choice, so I don't know why anyone should really care what these summary metrics show.

The last set of graphs we looked at shows how much variance we see across tasks, but from the prior set of graphs, we also saw a lot of variance within tasks with the same model and effort level. If we look at a small number of individual runs, then pretty much any conclusion is possible due to the variance between runs. Just for example, if we look at Optimization 1, for GPT-5.5 xhigh, one standard deviation between runs is 0.075 (i.e., 7.5% performance increase / decrease). If we look at the average difference between the best and worst tested GPT, that's 1.055 (GPT-5.4 xhigh caveman) vs. 0.986 (GPT 5.4 mini low), a gap of 0.069, which is less than 1 standard deviation (SD) across GPT-5.5 xhigh runs. For the actual graphs and not just summary statistics, we have:
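One way to read those numbers is to ask how many runs it takes before a gap of that size stands out from the noise. The sketch below uses the standard formula for the standard error of a difference between two sample means, and it assumes (this is not from the data) that every configuration has the same run-to-run SD as GPT-5.5 xhigh and that runs are independent. The function name is made up.

```python
import math

def se_of_mean_difference(sd: float, runs_per_condition: int) -> float:
    """Standard error of the difference of two sample means (equal SD, equal n)."""
    return sd * math.sqrt(2.0 / runs_per_condition)

sd = 0.075            # run-to-run SD quoted for GPT-5.5 xhigh on Optimization 1
gap = 1.055 - 0.986   # best vs. worst average speedup quoted in the text

for n in (1, 5, 50):
    se = se_of_mean_difference(sd, n)
    print(f"{n:>2} runs per condition: gap is {gap / se:.2f} standard errors")
```

Under these assumptions, a single run per condition leaves the best-vs-worst gap at well under one standard error, while 50 runs per condition puts it at several.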

source & further reading

danluu.com — original article Against essential and accidental complexity (2020)
