For many years, I had wanted to bring Direct3D virtualization to QEMU. I have tried and given up multiple times because the problem felt intractable. A proper solution required expertise in many niche areas of system design including: virtualization, GPU, Windows kernel, graphics APIs, etc. Each of these topics is deep enough on its own to build an entire career. Now with AI agents getting better each week, I decided to give it another shot. I know using AI is seen as an albatross in some circles. There are loud rich AI CEOs who say that programmers will completely be replaced by AI. Meanwhile, concerned open source maintainers talk about low quality slop being spammed in projects that create extra work for maintainers already stretched thin. In fact, we have experienced this increased stream of low quality submissions to UTM with one egregious example of a “security researcher” submitting a “bug report” as a publicity stunt for their new company without taking any time to understand the project’s threat model or trying to communicate with the maintainers before publishing. However, my belief is that AI tools should not replace human thought but instead be used to amplify creativity. AI can complement the gaps in my knowledge and I can direct it with my systems experience. I want this post to both be a technical document describing the history that led up to Neptune as well as a case study on how AI was used to bring that idea to life.
Neptune is an extension of virglrenderer for transporting Direct3D APIs over Virtio GPU, the device used by QEMU to provide para-virtualized graphics. It joins vrend (OpenGL), vDRM (Linux DRM), and Venus (Vulkan) as a protocol which virglrenderer can speak. Currently, Neptune only works on Linux host (DirectX 11 through DXVK) and Linux guest. This was a choice made to simplify the bring-up process and to get early feedback from the community. The next phase is to add macOS host support and Windows guest support. After that, Neptune will be extended to support DirectX 12.
Results
I want to start with the results to highlight what is already working. The following benchmarks were selected:
- **3DMark Fire Strike**: a classic D3D11 benchmark which does optimised rendering
- **Unigine Heaven**: an older benchmark that is one of the few games that feature heavy tessellation
- **Final Fantasy XIV Dawntrail**: features a submit-bound workload
- **Civilization VI**: one of the few games that features multi-threaded rendering
The main reason these games were selected though is because they are all free (and I already own Civ 6) and each has a clear score which can be compared to native DXVK (running on Venus). This is important to make sure the performance is within expectation. While correctness can be checked visually, it is more difficult to check performance. Thanks to Venus though we have a lower bound on what is “acceptable” performance. When DXVK is used as the Wine D3D back-end, DXVK does state tracking and creates a combined Vulkan command buffer. The Venus driver sends the command buffer to the host to execute on the GPU. Neptune does not do D3D11 state tracking and passes the API calls directly to the host. This means that there is a lot more traffic over the ring buffer shared by the guest and host. Therefore, I had expected a small but non-negligible drop in performance when running Neptune vs DXVK+Venus.
However, I was surprised to see in each case, the performance increased. Here is Claude’s theory as to why:
> Neptune moves DXVK from competing for 4 guest vCPUs (next to the game) to the host’s many cores running against native radv with zero Vulkan-side virtualization, and that win dwarfs the extra ring chatter whenever GPU work or DXVK CPU work — not app-thread work — is the actual ceiling. DXVK is CPU-heavy. Hazard tracking, render-pass building, descriptor-set diffing — all hot. Moving that work out of the guest VM is a big deal even before you count the Venus saving.
[Benchmark result charts: 3DMark Fire Strike, Unigine Heaven, Final Fantasy XIV, Civilization VI]
The test machine was a 2018 Intel NUC Hades Canyon with AMD Polaris graphics. QEMU is run on Ubuntu 24.04 with `-accel kvm -cpu host -smp 4 -m 16G`. For the reasons explained above, the performance differences should not be extrapolated to mean that Neptune will outperform Venus in general. Instead, the takeaway should be that the performance is no worse than Venus in typical workloads, which is an important data point to have once we start porting Neptune to platforms where it will not be as easy to get side-by-side comparisons.
Alternatives
Before going into the details on Neptune’s design, I want to detail the history of my previous attempts at bringing accelerated graphics to QEMU. This highlights the difficulty of the problem at hand.
VirtualBox
The first idea was to port VirtualBox’s SVGA device acceleration to QEMU. VirtualBox has DirectX 11 support through DXVK and QEMU has base support for the VMware device that VirtualBox emulates. Both the device and the Windows driver are open source. The big win here is the Windows WDDM driver, which in theory we can take as-is. This is one of the few open source implementations of Direct3D that exist and it may still be good reference material in the future. However, the complexity of fitting the device into QEMU required a lot of work (back when I first looked at it, there were no AI coding assistants to help). There may also be unforeseen issues bringing the device to ARM64 guests, and I did not want to maintain a large amount of video device code in QEMU.
gfxstream
gfxstream (previously Vulkan Cereal) is the Vulkan backend for Google’s Android Emulator (and some other, possibly abandoned, projects). In theory it can be paired with DXVK to get DirectX support, and I was informed by a very smart individual that they had an easy time porting guest support to Windows (they sadly cannot release the source code because of legal issues). This seemed like an attractive path because Android Emulator is based on QEMU (forked from a much older version) and, more recently, support was added to upstream QEMU as well. The remaining work was to port the host side to macOS (which I worked on). I got pretty far with this idea: I was able to get kmscube to render on a Linux guest with a macOS host.

But in the end there were a couple of blocking issues. First, although Android Emulator supported macOS hosts, the Android graphics stack is completely different from Linux (I am not even thinking about Windows yet). The Linux Mesa driver was pretty new and still unstable. Second, MoltenVK, the Vulkan driver on macOS, does not have all the features needed to run Zink, the OpenGL implementation over Vulkan. Why do we need Zink? Because most popular Linux distros (e.g. Ubuntu LTS) use the GNOME desktop, which uses Mutter for compositing, and the accelerated path uses Cogl… which is OpenGL based. Did you follow all that? The short version is that the current gfxstream guest driver paired with MoltenVK running on a macOS host does not support hardware accelerated window compositing for Ubuntu LTS.

I also looked at forgoing Zink and using a different OpenGL driver, but that is not currently supported and would require work both on the guest driver side (something I wanted to avoid in the first place) and on the macOS host side. For example, if I used the existing OpenGL forwarding path in gfxstream (supported by Android guests), I would have to bring up a Gallium driver for it. If I tried using the VirGL driver (which is already supported by QEMU on macOS), then I would need to plumb compositing of gfxstream and VirGL contexts together through virglrenderer, which would also require QEMU-level changes. Overall the complexity of all the changes needed quickly added up (meaning maintenance costs as well) and this is all before even considering what to do on the Windows side.
Gallium for Windows
Let’s put aside getting a modern graphics stack (Direct3D 11 or Vulkan) working on Windows and focus instead on getting any accelerated graphics on Windows. For more than a decade, there have been multiple attempts to get a VirGL guest working on Windows. The idea here is that virglrenderer is already a mature protocol for transferring Gallium between guest and host, so we just need to implement the guest driver for Windows. Once a virtio driver for Gallium works, OpenGL acceleration is possible. But not just that: Direct3D 10 can also be supported thanks to an existing frontend (currently only for software rendering).
Quick detour of terminology for those who are less familiar with the Linux graphics stack:
- Mesa: started as an open source implementation of OpenGL APIs and then expanded to GLES, EGL, then Vulkan, and even D3D9/D3D10 (swrast only). Mesa handles the complex state tracking required by different graphics APIs and translates them to Gallium.
- Gallium: various legacy graphics APIs (OpenGL) are translated into the Mesa internal API, Gallium. Then various downstream drivers implement Gallium for different hardware. The advantage is that driver developers can target one API set instead of half a dozen. virtio is one such driver which serialises Gallium calls across the virtualization boundary.
- virglrenderer: a library for deserialising Gallium commands and implementing them with OpenGL. In a way you can think of it as an inverse of Mesa. Why do this instead of just transferring the OpenGL calls directly over the wire? I can only speculate but my guess is that you do not need a separate OpenGL driver in the Linux guest (avoiding multiple OpenGL libraries is one of the reasons why Mesa exists) and the guest can also be agnostic to what APIs are actually supported on the host. In fact we take advantage of this second point with UTM on iOS. We use (the Google project) ANGLE, a GLES implementation on top of Metal (the Apple native graphics API). virglrenderer supports rendering with GLES and therefore we can support desktop OpenGL 2.1 on Linux guests without ever needing to implement it on iOS.
So in order to bring OpenGL to Windows, we just need to implement a Gallium driver on Windows. Luckily, software rendering already exists for Windows so the only work is to communicate with a kernel mode virtio driver in order to efficiently transfer buffers from guest to host. Indeed, max8rr8 started work on this three years back and was able to get it to a proof-of-concept state. This was very impressive work because as stated earlier, knowledge of Windows WDDM drivers is few and far between. Having looked extensively into this topic myself, I am disappointed by the opaque nature of the MSDN documentation and the dearth of public discussion in places like community.osr.com. It feels like most knowledge is siloed between Microsoft and the three large chip vendors. However, as monumental of an achievement as max8rr8’s driver was, there was little guidance from the corporate backed maintainers of the project to help the new contributor upstream their work and the effort fizzled out. I expressed my disappointment but unfortunately this is a common theme in the big tech takeover of open source.
Not all hope is lost though because as of recently, work has continued from two other contributors. anonymix007 and arehnman have both independently picked up where max8rr8 left off to bring not only Gallium support to Windows but Vulkan as well. Which brings us to…
Venus
Venus is another project maintained by Google and is part of virglrenderer. Vulkan support on Linux is also handled by Mesa but not through Gallium. That means virglrenderer needs to have a separate back-end for transferring Vulkan commands over the virtualization boundary. Whereas Gallium commands are transferred directly over the virtio-gpu device through the virtio transport with the `VIRTIO_GPU_CMD_SUBMIT_3D` command, Venus is handled “out of band” by a shared buffer of memory between the guest and host which acts as a ring buffer for serialised commands to be written to. This improves latency and allows for more efficient batching of commands. This is very similar to how gfxstream works as well, but one advantage of Venus is that there is already support for import/export of Vulkan resources into the vrend (OpenGL) context. This means that a Cogl based window compositor on Linux can display a Vulkan window without any CPU side buffer copying.
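To make the "ring buffer for serialised commands" idea concrete, here is a minimal sketch of a single-producer/single-consumer byte ring, written in Python purely for illustration. It is not Venus or Neptune code: the `CommandRing` class, the record layout (opcode plus length prefix), and the opcodes are all assumptions, and a real implementation would live in shared memory with atomic head/tail updates.

```python
import struct


class CommandRing:
    """Toy single-producer/single-consumer byte ring (hypothetical layout)."""

    def __init__(self, size: int):
        # A power-of-two size lets positions wrap with a mask.
        assert size > 0 and size & (size - 1) == 0
        self.buf = bytearray(size)  # stands in for guest/host shared memory
        self.mask = size - 1
        self.head = 0  # total bytes consumed by the "host"
        self.tail = 0  # total bytes produced by the "guest"

    def _free(self) -> int:
        return len(self.buf) - (self.tail - self.head)

    def _put(self, data: bytes) -> None:
        for byte in data:
            self.buf[self.tail & self.mask] = byte
            self.tail += 1

    def _get(self, count: int) -> bytes:
        out = bytes(self.buf[(self.head + i) & self.mask] for i in range(count))
        self.head += count
        return out

    def submit(self, opcode: int, payload: bytes) -> bool:
        """Guest side: append one length-prefixed command, or fail if full."""
        record = struct.pack("<II", opcode, len(payload)) + payload
        if len(record) > self._free():
            return False  # a real guest would wait for the host to drain
        self._put(record)
        return True

    def drain(self):
        """Host side: yield every complete command currently in the ring."""
        while self.tail - self.head >= 8:
            opcode, length = struct.unpack("<II", self._get(8))
            yield opcode, self._get(length)
```

The guest can queue many commands before the host looks at any of them, which is where the batching benefit comes from. The cost is that the producer has to handle a full ring, which the sketch reports by returning `False`.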
Ultimately, we were able to port Venus to macOS and iOS in the latest UTM beta. This brings Vulkan acceleration to Linux guests without needing any guest side driver patches. Everything should “just work”.
Mission accomplished, right? Not quite. Having modern graphics working on a Linux guest was a major milestone for us but our north star is full graphics acceleration on Windows. A stepping stone to that is for DirectX to work on Linux through DXVK (DirectX 11 implemented on top of Vulkan) and we ran into some roadblocks there. Again, many of the issues come back to MoltenVK, the macOS/iOS implementation of Vulkan APIs over Apple’s Metal APIs. MoltenVK does not support all the Vulkan features that DXVK requires. CrossOver (a commercial distribution of Wine) for macOS has forks of MoltenVK and DXVK that are tuned for higher game compatibility on the Mac. We tried to integrate their forks into UTM but ran into various issues running Linux guests (the forks were based on older versions of the projects that lack support for features we need for virtualization). After merging some of their changes together with upstream MoltenVK, some games started to boot. However, there are still graphical issues in-game.
Can we do better? The biggest hurdle is the fact that when we stack multiple layers of API translation, compatibility is limited to the least supported component. This is what the stack currently looks like:
and for macOS hosts:
Windows guests would have a similar picture. On macOS the weak link is currently MoltenVK which limits the full graphics capability of the system. In the time since Venus started working on macOS, KosmicKrisp came out and is steadily improving. However, it is still currently not at feature parity with MoltenVK and does not currently run DXVK.
To recap, for Linux hosts, DXVK + Venus + host Vulkan is a fine combination with good compatibility. However, on macOS, Vulkan is not a native citizen. Apple’s own graphics API is Metal and recently, Apple introduced the Game Porting Toolkit which comes with a framework called D3DMetal. D3DMetal translates D3D11 and D3D12 directly to Metal and is already used by CrossOver to gain support for games previously unplayable even on their tuned DXVK. If we can integrate D3DMetal with virglrenderer and we remove DXVK on the guest side and transfer D3D commands directly over VirtIO GPU, then we can skip Vulkan/MoltenVK altogether.
You can imagine other useful combinations: Windows guests on Linux hosts (DXVK on host side) or Windows guests on Windows hosts (native D3D) or even Linux guests on Windows hosts (Wine on the guest with no DXVK). However, the scope is large enough as it is already so none of this is currently planned. In this first phase, the emphasis is to get the least useful pairing working: Linux guest (no DXVK) on Linux host (DXVK). Why this pairing? Because this is the most mature and fleshed out starting configuration. It also provides a quick way of checking results by comparing directly to DXVK + Venus (as seen in the results above).
With the “why” out of the way, the next part will detail the “how.” This is the first large project I’ve worked on where I heavily depended on AI tools (specifically Claude Code). To give more insight on how we worked together, I asked Claude to analyse our chat transcripts to provide insight on how the collaboration worked.
Neptune is a GPU virtualization back end that lets Windows D3D11 applications run inside a guest VM and render on the host's GPU. It is parallel to Venus (which does the same for Vulkan): the guest serialises COM method calls into a shared-memory ring buffer; the host deserialises them and runs them through `dxvk` (D3D11→Vulkan); rendered frames come back to the guest as dma-bufs and are presented through X11 DRI3.
The work spanned four tightly coupled repos:
- `mesa/src/virtio/neptune/`
- `virglrenderer/src/neptune/`
- `neptune-protocol/`
- (forked) `dxvk/`
Neptune is not a "rewrite the world" project — it's deliberately a Venus clone with D3D11 substituted for Vulkan. That structural similarity is load-bearing for everything that follows: most of the time, the right answer to "how should we handle X?" was "look at Venus."
I had a general design direction in mind but the amount of code that needed to be brought up, tested, and optimised was daunting. My own estimate for the amount of work (without any AI tools) was 6-8 months (this includes time needed to learn new components). From the previous experiences bringing up VirGL on ANGLE, gfxstream, and Venus, I knew that the bulk of the work would be in debugging. The debugging task is painful because often you are crossing both user/kernel boundaries as well as guest/host boundaries. Issues with incorrect drawing are especially difficult because they involve the GPU as well. In the past, I’ve had to attach GDB in the guest and LLDB in the host to step through a single draw command. Combine all this with the multiple layers of API translations, cache coherency, and race conditions that disappear upon any profiling and you find yourself spending weeks upon weeks tracking down bugs. I wanted to put the AI in the best possible position to debug issues without my intervention, so I came up with the following rules.
- The project needs to be broken down into multiple parts. Each part needs to be broken down into smaller goals. Each goal should have a clear pass condition and that condition needs to reflect the complexity of the next goal.
- The overall design needs to be anchored to an existing design that is already proven to work. In this case, that is Venus. Anchoring to a working design allows the design space to be more constrained and helps avoid issues where the AI picks a sub-optimal implementation choice due to not considering a better option.
- The AI should be able to get feedback on what the “correct” behaviour is without consulting the human. Knowing that DXVK is in a good working condition on the test machine is crucial for this.
The numbers
| Metric | Value |
|---|---|
| Tokens read from cache (≈ 7× The Lord of the Rings per minute, sustained over 32 days) | 22.5B |
| Tokens written by Claude | 53.6M |
| Tool calls (Bash, Read, Edit, …) | 34,023 |
| Sessions (80 primary + 228 subagent) | 308 |
| Hand-written-code commits across 4 repos | 657 |
| Lines of hand-written runtime C code (mesa + virglrenderer) | 22.9K |
The timeline
I discovered with great dismay that Claude Code automatically deletes transcripts older than 30 days. Therefore this analysis does not include the first month of work (starting March 8). As such, I will give a brief summary here. I spent the first day (a whole day) crafting the initial prompt. I wanted to give a high level structure of the work without being too prescriptive. I have noticed that Claude Code’s plans are usually best if you accurately describe WHAT you want without too many details on HOW you want it. If you don’t like the plan, you can always discard it and try again, but I have found that Claude’s plans are usually pretty solid. That being said, a major mistake in this first iteration was under-specifying how I wanted the command serialisation to be done. I had a vague idea that “gfxreconstruct has some way of doing it so you can steal its homework.” However, that ultimately did not work because it differed so much from how Venus handled serialisation that everything else became an uphill battle.
It was Claude who came up with the idea of first parsing the SDK header files into a JSON database and then generating serialisation functions from that database. This decoupled the “header parsing” from the “code generation.” I liked this idea but thought it could be improved by looking at the `.idl` files that Microsoft provides rather than the `.h` files, which are themselves generated from the MIDL. Claude wrote a Python script that parsed the MIDL files into JSON but, upon testing, there were mistakes all over the place. It was worth taking yet another step backwards and focusing on just parsing MIDL in isolation. The end result was midl-classic, a Python parser which converts MIDL into an AST. I gave Claude a copy of the MIDL specification documents that Microsoft publishes and a TypeScript syntax highlighter that Microsoft open sourced and asked Claude to implement the parser fully to the specs.
With the MIDL parser in hand, the next step was to design neptune-protocol, which uses midl-classic to convert the SDK MIDL files to a JSON registry, combines it with a manual overlay, and then generates code from it. Claude was asked to use venus-protocol (Vulkan already has its APIs in a machine friendly XML format) as inspiration to create the protocol generator for serialisation. One issue from the first failed attempt was that it was difficult to check the correctness of the generator because there is no easy way to ask “do the generated structs match the layout in the SDK headers?” You can ask the AI for that, but you will never know whether it looked at every structure or whether it made a mistake. Instead you need to ask it to create tests that exercise all the functions and collect coverage data on them.
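The registry-plus-overlay pipeline described above can be sketched in a few lines. This is only an illustration of the shape of such a generator, not neptune-protocol itself: the JSON schema, the `needs_reply` overlay key, and the emitted `encode_*` helper names are all hypothetical. The `Draw` signature matches the public `ID3D11DeviceContext::Draw` declaration.

```python
import json

# Hypothetical registry, as a MIDL parser might emit it.
REGISTRY_JSON = """
{"ID3D11DeviceContext": {"methods": [
    {"name": "Draw", "params": [
        {"name": "VertexCount", "type": "UINT"},
        {"name": "StartVertexLocation", "type": "UINT"}]}
]}}
"""

# Hypothetical manual overlay: hints that the IDL alone cannot express.
OVERLAY = {"ID3D11DeviceContext": {"Draw": {"needs_reply": False}}}


def load_registry(text: str, overlay: dict) -> dict:
    """Parse the registry and merge per-method overlay entries into it."""
    registry = json.loads(text)
    for iface, methods in overlay.items():
        for method in registry[iface]["methods"]:
            method.update(methods.get(method["name"], {}))
    return registry


def generate_encoder(iface: str, method: dict) -> str:
    """Emit a C-like encoder stub that serialises each parameter in order."""
    params = ["struct encoder *enc"]
    params += [f"{p['type']} {p['name']}" for p in method["params"]]
    lines = [f"static void encode_{iface}_{method['name']}({', '.join(params)})", "{"]
    for p in method["params"]:
        lines.append(f"    encode_{p['type']}(enc, {p['name']});")
    lines.append("}")
    return "\n".join(lines)
```

Because the generator is a pure function of the registry, its output can be tested exhaustively: iterate over every method, generate the stub, and assert on it. That is the kind of coverage-driven check a spot inspection cannot provide.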
Once the protocol generator was working, Claude was asked to re-design the virglrenderer code to use the new generator. This was a massive rewrite and really demonstrates one benefit of AI coding: large rewrites and refactors are now cheap, which means the opportunity cost of trying one way, figuring out you went in the wrong direction, and starting over is no longer days of work but just hours.
Next, we extended DXVK-native to support a headless WSI that exports dmabuf instead of rendering the frames directly to screen. The advantage of doing this is that virglrenderer already understands dmabuf as it is used in Venus and other back-ends. dmabufs are also a cheap way to move data from host to guest without needing to copy data from GPU to CPU each time. Initial smoke tests were also created: a single static triangle and a spinning cube. No matter how hard I tried, I was not able to get Claude to understand what a cube looks like. It gave me some kind of spinning geometry with 6 sides but it definitely was not how a cube would look in 3D space. My own lack of background in 3D graphics means that I was not even able to describe to Claude what the issue was and when I attempted to give it screenshots of frames it was also unable to glean any insights. In the end, I gave up because it doesn’t matter for the future tasks if the 3D cube was indeed a cube. It just had to have geometry and animation.
The guest side was uneventful. There was some initial struggle trying to get Unix-only facilities (like `SCM_RIGHTS` file descriptor passing) to work through Wine. Claude kept coming up with progressively more complicated and fragile hacks until I gave it the Wine source code, where it was able to figure out both how Unix libraries are loaded and how to call into them. At that point, we were able to get the triangle and cube to render through vtest (a test back-end that talks directly with virglrenderer on the same host without QEMU or KVM). Then we set up a VM, copied the built libraries to it, and got the smoke tests to work across the VM boundary as well. All of this was straightforward thanks to existing Venus code, which can be used in Neptune mostly unmodified because the transport layer is essentially the same.
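For readers who have not met `SCM_RIGHTS` before: it is the control-message type that lets one process hand an open file descriptor to another over a Unix domain socket. The sketch below shows the mechanism on plain POSIX using Python's `socket.send_fds` and `socket.recv_fds` (Python 3.9+). It says nothing about how Neptune calls it through Wine, and the function name `pass_fd_roundtrip` is made up for the example.

```python
import os
import socket


def pass_fd_roundtrip(message: bytes) -> bytes:
    """Send a pipe's read end over a Unix socket pair, then read through the copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()
    try:
        # send_fds wraps sendmsg() with an SCM_RIGHTS control message.
        socket.send_fds(sender, [b"fd"], [read_fd])
        _, fds, _, _ = socket.recv_fds(receiver, 16, 1)
        received_fd = fds[0]  # a new descriptor referring to the same pipe
        os.write(write_fd, message)
        data = os.read(received_fd, len(message))
        os.close(received_fd)
        return data
    finally:
        os.close(read_fd)
        os.close(write_fd)
        sender.close()
        receiver.close()
```

The receiving side gets a different descriptor number that refers to the same underlying object, which is why reading through it returns what was written to the original pipe.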
A condensed narrative of what actually happened, drawn from the first user message of each major session:
| Date | Phase | What happened |
|---|---|---|
| Apr 14 | bringup | “We finished implementing virtio transport for virglrenderer and mesa Neptune. Now it's time to test it on a real game.” — first session of the test era; Crash Bandicoot N. Sane Trilogy is the target. |
| Apr 15 | bringup | Apitrace setup, Wine integration, win32 handles for events/fences. First real crash debugging. B/R channel swap mystery. |
| Apr 16 | review | First “code reviewing the newly implemented Neptune backend” pass — Venus is the reference, every divergence is suspect. |
| Apr 17 | perf | First gameplay-aware perf analysis. Discovery: encode + reply waits dominate. Plan for “Venus-style per-thread encode batching.” |
| Apr 18 | perf | The TLS-ring saga. Multi-ring stalls. UAF bugs surface only when multi-ring is on. “Keep going and don't stop until you are able to run the game 10 times, each for 5 minutes without any hang/freeze/crash/deadlock.” By end of day: 2.37× throughput on Crash. |
| Apr 19 | perf | Texture-map fast path (P4 → P1 → P2 → P3). Heuristics for multi-ring default-on. Wider games analysis. |
| Apr 20–21 | review | COM-type cleanup. Wrapper consolidation. Override macro work. |
| May 1–2 | review | Wine-only consolidation (drop native-Linux paths). Future-fence feedback. Comments cleanup: “all the comments in the Neptune backend has been written by Claude. Much of it is too verbose, duplicated, or useless.” |
| May 3 | debug | dmabuf WSI rearchitecture and 5-hour freeze hunt. Root cause: 32-bit seqno wrap in `npt_ring_seqno_status`. The biggest single session in tool calls (4,059). |
| May 4–5 | perf | Native-DXVK vs Neptune deep comparison. WC + ring-ordering puzzle. memcpy attribution drama (it was in game.exe all along, not in Neptune). |
| May 7 | review | Big code review pass with Venus parity checks. /loop-driven iteration. Profiling Wine library code. |
| May 8–9 | perf | Apitrace integration crash fix. Frame-pacing rewrite. xcap custom capture tool. vtest-vs-wine perf parity achieved. |
| May 10–11 | debug | Lockless seqno fast path. FFXIV Dawntrail and 3DMark Fire Strike bringup. New games expose latent bugs immediately. |
| May 12–13 | debug | Out-of-order present FIFO bug (the “blink” visible to the eye but not in dmabuf capture). Variant analysis on protocol generator bugs. |
| May 14–15 | debug | 3DMark termination stall (`sc_wsi_stop`'s INFINITE wait). “Stop hook”–driven autonomous fix loop. 588 s → 421 s on a single test. |
| May 16 | cleanup | Squash to upstream branch. This report. |
Three things are worth pointing out about the shape of this timeline:
- **It is not a feature-build curve.** Most of Neptune's runtime code existed before Apr 14 from earlier sessions on a different machine. What we see here is the much harder phase: turning a thing that compiles into a thing that actually runs a 3D game, and then a thing that runs *well*.
- **Review and debug dominate.** Of 39 major sessions, 15 were code-review/refactor, 9 were performance work, 6 were bug-hunts, 1 was the bringup, and the rest mixed. Bringing the code up to quality took more iterations than writing the code.
- **Each new game broke something.** Crash worked first; then FFXIV exposed protocol-generator NULL-derefs; then 3DMark exposed the WSI-thread shutdown stall. The bug rate didn't trend toward zero — it shifted into rarer corners each time.
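The May 3 entry in the timeline attributes a five-hour freeze hunt to a 32-bit seqno wrap. The actual `npt_ring_seqno_status` code is not shown here, so the following is only a generic illustration of that class of bug, modelled in Python with explicit 32-bit masking. Both function names are hypothetical.

```python
U32 = 0xFFFFFFFF


def naive_passed(current: int, target: int) -> bool:
    """Plain comparison: gives the wrong answer once the counter wraps."""
    return current >= target


def wrap_safe_passed(current: int, target: int) -> bool:
    """Serial-number comparison: treat the 32-bit difference as signed."""
    diff = (current - target) & U32
    return diff < 0x80000000


# A wait target issued just before the wrap, checked just after it.
target = 0xFFFFFFF0
current = (target + 0x15) & U32  # the counter has wrapped around to 5
```

With these values `naive_passed(current, target)` is false even though the counter is 21 steps past the target, so a waiter using it would block indefinitely. The wrap-safe form reports the target as reached, provided the two values are never more than 2^31 apart.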
Where Claude excels
Several capabilities showed up over and over in the transcripts and are easy to point at concretely.
1. Long-running autonomous debug loops with verification
Given a clear pass condition, Claude can stay on a problem for hours: form a hypothesis, instrument, run, read logs, revise, repeat. The single best example is the multi-ring stall hunt on April 18. Here is the actual prompt:
That kicked off ~10 hours of autonomous work. Claude implemented per-instance rings, hit a UAF, debugged it via gdb attached to the renderserver, rebuilt, ran ten validation runs in a row using the `Monitor` tool, and reported back:
> **10/10 Crash Bandicoot 5-minute runs all passed** — every run reached `LEVEL: 'Crash1/C1_StartScreen/C1_StartScreen'` with zero watchdog, ring fatal, decoder fatal, reply mismatch, or hang. […]

And another, after gdb work on the same day:

> `npt_d3d11_buffer_rotate_slot` waited on `dev->ring` for a seqno that was recorded on `dc_sc_ring` — classic wait-on-wrong-ring bug. Winedbg backtrace (once I got it working via the VM's sudo access) resolved to `ctx_Map_override → rotate_slot → wait_seqno → Sleep`, making the bug obvious. Fix: each slot now remembers the ring its Unmap went out on.
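The fix described in that report, a slot remembering which ring its work was submitted on, can be modelled in a few lines. This is a simplified sketch and not the Neptune code: the `Ring` and `Slot` classes and their fields are invented for illustration, and sequence-number wraparound is ignored because Python integers do not overflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ring:
    name: str
    submitted: int = 0  # last seqno handed out on this ring
    completed: int = 0  # last seqno the host has finished

    def submit(self) -> int:
        self.submitted += 1
        return self.submitted

    def is_done(self, seqno: int) -> bool:
        return self.completed >= seqno


@dataclass
class Slot:
    ring: Optional[Ring] = None  # ring the last Unmap went out on
    seqno: int = 0

    def record_unmap(self, ring: Ring) -> None:
        # Store the ring together with the seqno; a seqno alone is ambiguous.
        self.ring, self.seqno = ring, ring.submit()

    def is_idle(self) -> bool:
        return self.ring is None or self.ring.is_done(self.seqno)
```

Sequence numbers only mean something relative to the ring that issued them. Checking a slot's seqno against a different ring, as in the bug above, can wait on a value that ring may never reach.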
Same pattern fired again a month later when a `/goal` stop hook was used to chase the 3DMark termination stall. The user set a verifiable pass condition (“recover ≥ 90% of the 160s gap”) and Claude ran through three hypotheses, instrumenting `npt_device_destroy` step-by-step until it found the right one:
> **Achieved 168 s recovery (105 %).** Timing-instrumented `npt_device_destroy` step-by-step: `drain=29.5 s`, everything else <70 ms […]
>
> **Root cause** (the *third* hypothesis was right): `sc_wsi_stop`'s `WaitForSingleObject(wsi_thread, INFINITE)` was the call sitting there. The WSI worker is inside `npt_renderer_wsi_present → xcb_present_pixmap` blocked on an X11 round-trip that won't return.

The shape that makes loops work, every time:
- A clear, machine-checkable termination condition (“10 runs of 5 min”, “recover ≥ 90 %”, “match DXVK frame rate within 1 %”).
- Permission to instrument freely (add logging, attach gdb, modify scripts).
- An external feedback signal that's faster than the cycle time — `Monitor` tailing a log, `ScheduleWakeup` after a perf run, `PushNotification` on hang.
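A "machine-checkable termination condition" can be as small as a function that turns a run log or a timing into a boolean. The sketch below is an assumption about how such a check might look, not the harness used in the project. The failure markers are taken from the report quoted earlier, while the function names and the idea of passing in a `run_once` callable are illustrative.

```python
import re

# Failure markers named in the validation report quoted above.
FAILURE_RE = re.compile(
    r"\b(watchdog|ring fatal|decoder fatal|reply mismatch|hang)\b", re.IGNORECASE
)


def run_passed(log_text: str) -> bool:
    """A run passes when its log contains none of the failure markers."""
    return FAILURE_RE.search(log_text) is None


def all_runs_pass(run_once, runs: int = 10) -> bool:
    """Call run_once() `runs` times and stop at the first failing log."""
    for _ in range(runs):
        if not run_passed(run_once()):
            return False
    return True


def recovery_met(gap_s: float, recovered_s: float, threshold: float = 0.90) -> bool:
    """True when at least `threshold` of the timing gap has been recovered."""
    return recovered_s / gap_s >= threshold
```

With the numbers from the 3DMark example, `recovery_met(160, 168)` is true because 168 / 160 = 1.05, the 105 % in the report. The value of writing the condition down this way is that the agent can evaluate it without asking a human whether the result is good enough.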
2. Big-picture analysis from raw metrics
When given dump files or perf output, Claude is consistently good at building the table-of-metrics-and-paragraph-of-prose that explains where the time goes — and crucially, it will tell the user when its own prior hypothesis was wrong. Two examples:
> *Guest game thread (232 s wallclock):* 20.7% in protocol encode, 16.2% blocked on sync replies, 63.1% outside neptune.
>
> *Host ring thread:* 8.8% actual dispatch work, <1% idle wait, the rest pulling from ring queues…

> `call_us` — the guest's view of how long the host's `IDXGISwapChain::Present()` takes: `call_us` median: vtest 13,329 µs vs virtio 23 µs — 579×.
>
> And there's a stale-comment issue layered on top…
The "tell the user when prior hypothesis was wrong" part shows up nicely in:
> **What I was conflating** — I said "the host has to do N D3D11 method calls per draw and that's the floor." That's wrong as a bottleneck argument because — exactly as you point out — the wine path eventually does the same D3D11 work too…
This kind of in-conversation walking-back of a prior model, with an explicit why, was very common and very valuable. Where the human gave good pushback (next section), Claude was usually willing and able to re-think.
3. Mechanical cross-codebase refactoring
Several of the largest sessions (200–700 messages each) were not bug-hunts but mass refactors: renaming a function and chasing every call site across two repos and the generator; collapsing duplicate header structs; extracting subgroups out of a 3000-line dispatcher into per-family files; bulk comment cleanup.
Two stats anchor this. On May 9, a single virglrenderer session landed 22 commits renaming and restructuring the host's dispatch layer (one commit per subgroup: COM, WSI, RING, RESOURCE, FEEDBACK). On May 14, a single mesa session landed 11 commits mechanically replacing `npt_sizeof_T(&(const T*){0})` patterns with explicit max-sizing for unions only, after the user spotted the over-allocation.
This is the kind of work where the “keep going” loop pays off without philosophical risk: each step compiles, each step is verifiable, the failure mode is “build broke” — easy to detect, easy to fix.
Constantly circling back to simplify, de-duplicate, and refactor is a necessary part of working with large amounts of AI-written code. Just like with human-written code, as you fix bugs, handle edge cases, and add new features, the nice pristine structure of the original design rots away. As I’ve said in a previous section, refactors are (technically) cheap, so they should be done every so often as changes add up and you realise there is a better structure to be had.
4. Subagent fan-out for review and search
Of 34K tool calls, 171 were `Agent` launches, almost always to parallelise file-by-file scrutiny during code reviews. The pattern looked like: “there are 50 generated dispatcher files; spawn five subagents to audit them in parallel, then merge the findings.” This kept the parent context from drowning in file contents and surfaced more issues per hour than serial reading.
228 subagent transcripts averaged ~50 messages each, totalling 58 MB. They were used for: variant analysis (“find all the places that do X and check for Y”), independent code-review (“here's the diff, what do you think?”), and bounded research (“how does Venus handle this?”).
5. Memory as institutional knowledge
Over the course of the project, 22 long-lived memory files accumulated across the four projects' memory folders. The most valuable were the “feedback” entries: codified lessons from a single hard-won moment that prevented the same mistake later. A representative one (saved after the user caught Claude using `pgrep -f` twice in one session and getting bogus results both times):
`pgrep -f <pattern>` walks every process's full command line, including the bash that's running the pgrep itself. If the pattern appears anywhere in the bash invocation that the harness wraps around the command, pgrep returns its own PID and the picture is wrong; the user pointed this out twice in one session.
How to apply: For an executable basename use `pidof /full/path/to/binary` (matches the resolved exe, not the cmdline string); for longer names, `ps -eo pid,comm,args | awk '$2=="qemu-system-x86_64"'`; or the `[q]emu` bracket trick.
By May, sessions opened by reading the memory index and immediately knew to use `run_qemu.sh` (not invoke QEMU manually), to commit after every step, to avoid manual `cp` into the sysroot, to use `pidof` not `pgrep -f`, and to keep gameplay-required perf runs rare. Earlier sessions had to be told each of these things multiple times.
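The failure mode in that memory entry can be reproduced without touching a real process table. The following is a minimal sketch, assuming a simplified model in which matching is a plain substring test (real `pgrep -f` uses a regular expression); the `Proc` type, the helper names, and the process snapshot are all hypothetical.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    comm: str  # executable name, like the `comm` column of ps
    args: str  # full command line, like the `args` column of ps


def match_cmdline(procs: List[Proc], pattern: str) -> List[int]:
    """Substring match on the full command line (simplified `pgrep -f`)."""
    return [p.pid for p in procs if pattern in p.args]


def match_comm(procs: List[Proc], name: str) -> List[int]:
    """Exact match on the executable name, like the awk filter on `comm`."""
    return [p.pid for p in procs if p.comm == name]


# Hypothetical snapshot: QEMU is not running, but the harness wraps the
# query in a shell whose own command line contains the pattern.
snapshot = [
    Proc(1, "systemd", "/sbin/init"),
    Proc(4242, "bash", "bash -c 'pgrep -f qemu-system-x86_64'"),
]

false_hits = match_cmdline(snapshot, "qemu-system-x86_64")  # the wrapper shell
real_hits = match_comm(snapshot, "qemu-system-x86_64")      # nothing running
```

Matching on the executable name rather than the command line is what makes the second query immune to the wrapper shell.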
Shortcomings: the recurring failure modes
I think this is the most important part of the post. When I hear stories of 1M-line Rust rewrites being merged or a team of Claudes writing a C compiler autonomously, I know that those are short-sighted decisions that will be repaid in years or decades of technical debt. My experience is that Claude is really good at understanding what you want and coming up with clever ways to solve the problem. But Claude is equally good at trying to convince you that its poor solution is indeed proper. It is your job as the human engineer to find those instances and push back accordingly. Claude is a very smart but very lazy programmer who at the end of the day just wants you to sign off on its work.
This section is what the report was commissioned to expose. The transcripts contain 166 user messages whose tone is some shade of corrective. Reading them as a sequence, a few patterns appear over and over.
1. Skipping straight to a workaround instead of finding the root cause
This was the single most common complaint. Whenever a test failed, Claude's first instinct was often to add a band-aid: a finite timeout, a try/catch, a graceful fallback. The human consistently pulled it back.
*(after a hang)* The VM is unreachable again from the hang. Let me share what I found and where I'd recommend looking next.
no hacks or workarounds). let me know what additional permissions you need to be unblocked.
*(after an arg-truncation bug)* Honest answer: I didn't pick them principled; I truncated from the longer command line that was in your original launch …
use the FULL command, don't abbreviate.
*(proposing a 10-second timeout on the WSI worker join)* Good catches. Let me address both.
The deeper issue: when Claude formed a hypothesis early it would keep building around the hypothesis instead of falsifying it. The fix in practice was a stock human prompt: “don't form a hypothesis until you have enough data.”
You're thinking of hypothesis too early. You haven't gotten enough data yet to understand the issue. As previously stated do NOT blindly try fixing and focus on figuring out why the perf is different. Look at the time spent between each present. Figure out precisely where that time goes into. Do not stop until you fully understand.
2. Drift inside long sessions — losing the original goal
Several times, in a session lasting many hours, Claude would do the work and then keep validating it indefinitely after the user wanted to move on to the next step:
*"Keep debugging the stall."* I want everything working.
No hacks no workarounds no "future work". Everything needs to work period. The full plan completed.
The mirror failure was stopping too early on the easier rails:
don't stop until you get to the bottom of it.
This is — to put a finer point on it — a calibration problem about when a task is done. In a multi-step debugging arc, the user's mental model of done was “the original bug is gone and validated”; Claude's mental model often slid to either “I've reported the bug, asking for next steps” or “I'll keep validating until something tells me to stop.”
3. Phantom progress reports
The least defensible failure mode: claiming an action was in flight when nothing had been done. A handful of these appear in the transcripts and one became a quotable moment:
The root cause appears to be a mismatch between what the tool-call sequence actually did and what the assistant summary asserted. These are particularly damaging because they break the user's trust in any status update, not just the wrong one.
4. Defaulting to defensive code that the spec forbids
Several rounds of pushback were about Claude treating internal callers as untrusted — checking for NULL on inputs that the IDL declared non-optional, adding fallback paths for cases that “can't happen”:
`sizeof_input_param` and `encode_input_param` to pass `inline_storage=False` …
`optional: true`, then we can assume the caller will not pass in a NULL reference. remember the IDL is the source of truth. no need to be defensible.
`optional=true` case. It needs to be updated to only care about optional=true. We assume the caller follows DirectX specs.
Claude's defensive instinct is reasonable for boundary code, but in Neptune almost everything is internal to a generated stack with a tight type contract. The general lesson the human kept enforcing: trust your invariants; validate only at the system boundary.
5. Comment over-writing
This deserves its own subsection because it triggered a multi-session refactor.
all the comments in the Neptune backend has been written by Claude. Much of it is too verbose, duplicated, or useless. Go through every single comment and decide if it is worth keeping. Some patterns: functions should not comment about what the caller is doing (unless it is explicitly about what the caller SHOULD be doing to keep a contract), comments that note about a previously deleted chunk of code should be removed, comments that just declare where something is used should be removed…
The user codified this as a feedback memory:
doesn't do, what was removed, why a previous version looked different, or "Intentionally no X" defensive narratives. Examples to avoid:
/* Intentionally no in-process signal handler — kernel core + parent covers it. */
/* Was a workaround for X, no longer needed. */
/* This used to do Y; now it... */
Even after the memory landed, the pattern returned occasionally:
// Intentionally no in-process signal handler…
exactly the kind of useless narrative comments that you've been told to avoid
Two layers of memory plus one explicit setting later, the pattern still leaked through sometimes. This is a real limit: behavioural preferences that aren't checkable by build or test are harder to enforce than mechanical rules.
6. Environment / workflow misunderstandings
Claude consistently preferred to do things its own way rather than use the human's already-set-up scripts. The most-repeated complaint:
`/tmp/run_qemu.sh`? Why not use the `.../Downloads/` one as I specified? so I told you a million times do not try to boot it manually. use `…/run_qemu.sh` or you'll waste tokens. AND I still don't see QEMU at all NEVER try to form qemu args yourself.
This is a recurring tension: Claude has plenty of Linux knowledge and will happily synthesise an invocation from first principles, but the user's environment has carefully tuned scripts that encode invariants (display configuration, networking, sysroot paths) that the synthesised invocation misses. The fix was eventually a memory entry — but it took multiple frustrated reminders before the memory got written.
To add to that: we are bringing up a brand new Linux environment. I know that you know this is how to launch QEMU but we are building our own QEMU and every time the context compacts, Claude seems to forget this fact.
7. Unauthorised refactor scope
A few times, Claude went beyond the user's stated scope while doing review work:
I never said to touch the host backend, please undo that
`compute_wire_version()` hashes the registry/overlay/interface-ids JSON inputs (path-name-sorted, SHA-256 → first 32 bits) and emits `#define NPT_PROTOCOL_WIRE_VERSION 0x514ee4dcu` in `npt_protocol_defs.h`
… drop the commit, don't revert, it's unnecessary). I merely meant that host and guest should not hard code the wire version…
Both classes (touching code outside the request, adding generator features the user didn't want) have the same root cause: Claude saw a related improvement and took it, instead of restraining itself to exactly the requested scope.
In the `NPT_PROTOCOL_WIRE_VERSION` case, I was not clear enough about the intention. While Claude identified this category of errors as the most common, this is also the category of errors that’s the lowest cost (technical cost, not token cost). It is easy to undo a change when you catch it. As long as you catch it. That means that I do have to casually skim the conversation history as it is making changes. The reason why this issue pops up so often is precisely because I am constantly monitoring and immediately calling it out when it goes astray.
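For readers curious what the dropped generator change computed, here is a rough sketch of a "path-name-sorted, SHA-256 → first 32 bits" version hash. It is an illustration only: the function names, the length-prefix framing, the big-endian truncation, and the idea of passing the inputs as a path-to-bytes mapping are all assumptions of this sketch, not details taken from the real `compute_wire_version()`.

```python
import hashlib
from typing import Mapping


def wire_version(inputs: Mapping[str, bytes]) -> int:
    """Hash inputs in path-sorted order; keep the first 32 bits of SHA-256."""
    h = hashlib.sha256()
    for path in sorted(inputs):
        data = inputs[path]
        # Length-prefix each field so adjacent fields cannot run together
        # (an assumption of this sketch, not a documented detail).
        for field in (path.encode("utf-8"), data):
            h.update(len(field).to_bytes(8, "big"))
            h.update(field)
    return int.from_bytes(h.digest()[:4], "big")


def emit_define(version: int) -> str:
    """Format the value the way the quoted header line looks."""
    return f"#define NPT_PROTOCOL_WIRE_VERSION 0x{version:08x}u"
```

Sorting by path makes the result independent of the order in which the inputs are discovered, so host and guest builds derive the same constant from the same inputs.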
Aggregate frustration heatmap
166 corrective user messages, sorted into the recurring categories:
| Category | Count | What it looked like |
|---|---|---|
| Unauthorised changes / wrong scope | 28 | Touched files not asked about; reverted on demand. |
| Stopping prematurely / over-validating | 18 | Asked for next steps too soon, or kept running checks too long. |
| Wrong root cause / wrong claim | 19 | "That's not right", "doesn't make sense", "I observed it again." |
| Wrong tool/path/script | 11 | Manual QEMU invocations, custom tools when stock existed. |
| Comment over-writing | 9 | Narrative comments that don't survive future edits. |
| Over-complex solution | 7 | Three configurations of caches when one would do. |
| Premature hypothesis | 6 | "You're thinking of a hypothesis too early." |
| Hard-coded constants | 4 | Fixed-size arrays that should grow. |
The human’s role
Reading the transcripts cold, you would call this a 50/50 collaboration in effort and at least 70/30 in direction. The user does five things that the model could not have done on its own:
I would call it a 20/20/60 collaboration. 20% of the effort is already done by the contributors to Venus. That sets the foundation and constrains the design space. 20% by me, acting as a technical lead, using my own learned experiences in this space to make decisions on direction and scope. 60% is the actual grunt work of writing the code and the more difficult part: debugging issues. However, I believe the 20% of work that I did is the technically rewarding work. It is the part that makes me feel that “I” made this even though line by line, it was mostly written by Claude.
1. Bounding the search
When the user gave Claude a complex problem, the prompt almost always carried three kinds of information: the goal, the search bounds, and the verification criteria. The Apr 18 multi-ring prompt is the canonical shape:
`docs/performance-analysis-2026-04-17-gameplay-5min.md`) showed that there's possible gains in "Venus-style per-thread encode batching". Can you implement this? Use venus as a guide and make your implementation as close to venus as possible.
Verification: first make sure the different `run*.sh` works…
The “use Venus as a guide” bound was repeated dozens of times. Whenever there was a design question, the implicit answer was “copy whatever Venus does.” This was a brilliant move: it gave Claude a stable reference, dramatically reduced the search space, and made code review almost mechanical.
2. Triangulating evidence
The user — an experienced systems engineer — almost never trusted a single data point. The pattern: Claude reports a result; user asks for a second confirmation from a different angle.
`+sync`, why did you ignore it? we were precisely looking for this issue is it a second render of a stale frame or is it the first time it is rendered? I think it is important to have all the data first
This is a discipline Claude does not have on its own. Left to itself, Claude will declare done when it has any evidence consistent with done. The human forces a second axis of evidence before agreeing.
3. Demanding causal explanations
Whenever Claude proposed a fix without a clean explanation of why the bug was that bug, the user pushed back:
WHY this is the case? Because since x86 caches are physically tagged, why does it matter if you're writing to some HVA or some GVA?
So why would it call Release and it still be used? Either we're miscounting or there's an underlying issue.
That second prompt — “why would it call Release and it still be used?” — is what unlocked the right framing for the TLS-ring debug. The bug wasn't “TLS rings cause UAF” (Claude's first reading); it was “the slow async dispatch under TLS rings exposes a latent UAF that was already there.” The fix moved from “disable TLS rings” to “rotate-slot must remember the ring its Unmap went out on,” which is a real bug fix rather than a workaround.
This is another area that really demands that the operator have domain knowledge in the space they are working in to produce good results.
4. Curating scope
The user repeatedly trimmed Claude's commit scope. Claude's instinct was to ship as much improvement as a session would yield; the user wanted minimum-coherent units that could be reviewed in isolation:
I don't think they belong in the current branch for now
revert commit `c535ac7cd919952d770796e85de48800b20847c6` in Mesa. We can bring it back if it helps. We also had a separate `COM_RELEASE` batching feature that I think should be either reverted or removed. Look for any other changes in the same vein (batching some set of commands). I want there to be one "batching" feature and not a bunch of almost-batching.
The result is a far cleaner commit history than Claude would have produced unsupervised.
A note on that: I think that during development you should let Claude commit freely and often (including reverts). Its own lengthy commit messages and narratives in the comments form a long term memory that is very helpful for future sessions. I have found that often it tries to go down the same path and I would ask it to see if this has already been attempted. For example, it really likes to go down the route of trying to batch commands even though we attempted it and did not measure any performance benefit and only added latency. Just remember to squash all the commits and ask it to clean up the comments before merging it upstream.
5. Setting the autonomy dial explicitly
The user actively used the `/loop`, `/goal`, and stop-hook features when long-running autonomy was wanted, and pulled back when not:
try to fix the issue with your recommended fixes. test after each one and if it does not address the full issue, revert and try the next idea. do not stop until you get at least 90% of the time back.
The /goal hook installs a stop-hook that blocks the model from ending its turn until a stated condition holds. This is the operationalisation of “keep going until done.” The user used it sparingly — only when the success condition was measurable — but when it fired it produced the most impressive autonomous runs.
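The "test after each one, revert if it falls short, stop at 90%" instruction quoted above has a simple control-flow shape. The sketch below is only a model of that loop, not of how Claude Code implements `/goal` or stop hooks; the function name, the `(name, apply, revert)` tuple layout, and the toy numbers are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# (name, apply, revert): hypothetical shape for a candidate fix.
Fix = Tuple[str, Callable[[], None], Callable[[], None]]


def first_fix_meeting_goal(
    fixes: Iterable[Fix],
    measure_gap: Callable[[], float],
    baseline_gap: float,
    target: float = 0.90,
) -> Optional[str]:
    """Apply fixes one at a time; keep the first that recovers `target`
    of the baseline gap, and revert every fix that falls short."""
    for name, apply, revert in fixes:
        apply()
        recovered = (baseline_gap - measure_gap()) / baseline_gap
        if recovered >= target:
            return name
        revert()
    return None


# Toy stand-in for "run the benchmark": a single measured gap in seconds.
state = {"gap": 160.0}


def set_gap(value: float) -> Callable[[], None]:
    return lambda: state.__setitem__("gap", value)


candidates = [
    ("candidate-a", set_gap(150.0), set_gap(160.0)),  # recovers ~6%, reverted
    ("candidate-b", set_gap(5.0), set_gap(160.0)),    # recovers ~97%, kept
]
winner = first_fix_meeting_goal(candidates, lambda: state["gap"], 160.0)
```

The important property is that the pass condition is a number the loop can compute by itself, which is why this style only suits goals that are measurable.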
How review actually happened
Reading the transcripts, three review modes were in use:
| Mode | Trigger | What it looked like |
|---|---|---|
| Live drive-by | Continuous, throughout coding | User reads each summary message, pushes back on anything they don't agree with — same turn or next. |
| Targeted re-review | After a feature lands | "Code review feedback for the new Neptune backend… start by going through the code…" — Claude does a fresh pass, often spawning subagents. |
| Diff-against-Venus | Whenever something feels wrong | "Are you sure about that? I'm pretty sure you copied those from Venus in a previous session." Forces a fact-check against the reference codebase. |
The interesting one is the third. Several times, the user challenged Claude's claim that “these wire opcodes have no Venus counterpart” and was right. Claude had copied them from Venus in an earlier session and forgotten. Memory and the source repo became the corrective.
Reviewing AI-written code is orders of magnitude more difficult than reviewing human-written code. LLMs are by design good at mimicking good code so most mistakes you do end up finding are usually subtle and sometimes come with comments that attempt to gaslight you into believing that there is no issue. As more and more code is written by AI, this is where the work will be and where the bottleneck for development will show up. The more that a code-base is polluted by these subtle bugs, the harder they will be for future maintainers to find, especially if the AI builds upon faulty assumptions. I have found that `/review` and similar prompts can be helpful in a first, second, and even third pass, but no matter how many times I have asked it to review (often from fresh sessions and often from other frontier models), I have always found more bugs myself.
Patterns for complex problem solving
Aggregating across the project, several patterns produced repeatable wins.
Pattern A — Goal + bounds + verification, every time
Tasks of the form “debug X” or “optimize Y” rarely succeeded without all three parts. A working template, distilled from the best-performing prompts:
1. CONTEXT — what is the current state and where to look first
"perf analysis from 2026-05-08 shows 1% lows dominated by wsi_us"
2. GOAL — what success looks like, measurable
"reduce p99 frame time variance to within 1% of native DXVK"
3. BOUNDS — what's in scope and what reference to use
"use Venus as a guide; do not modify the protocol generator"
4. VERIFICATION — how to confirm progress
"run 5 benches in a row; if any crashes, debug it; if perf
matches DXVK within 1% across all 5 runs, task is done"
5. CONSTRAINTS — what hacks are forbidden
"no finite timeouts; no game-specific workarounds; commit
diagnostics separately from the actual fix"
Pattern B — Validate, then fix, then re-validate
The user almost never let Claude jump to a fix. The pattern was always:
1. **Reproduce**: add instrumentation, run the failing case, capture enough data to identify the failure point precisely.
2. **Form a hypothesis** with a concrete falsification test.
3. **Try the smallest possible fix** that implements only the falsification test's positive case.
4. **Re-run** the same instrumentation: did the metric move? If not, revert and pick a different hypothesis.
5. **Commit the diagnostic separately** from the fix (so you can roll them in/out independently later).
When the user wrote “do not commit any diagnostics but commit the changes” or “do not stack many small improvements” — they were enforcing step 5.
Pattern C — Variant analysis after every bug fix
This phrase shows up dozens of times in the transcripts: “can you do a variant analysis to make sure no similar issue exists elsewhere?” The discipline is: when you fix one instance of a class of bug, immediately search the codebase (or generator) for every other place the same class might exist, and fix them all together. Some of the bugs found this way:
- The 32-bit-seqno comparison fix in `npt_ring_seqno_status` found two other comparators with the same wrap-window issue.
- The `ID3D11View::GetResource` wrapper-type fix triggered a sweep for every method whose return type was a base class with multiple concrete subtypes.
- The `OMGetBlendState` NULL-output fix triggered a generator-side audit for every other reply encoder that wrote to `args->field` without a NULL-check on optional-fixed-array parameters.
Claude is good at this when asked. It is not good at thinking to ask itself. The human asking was load-bearing.
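To make the first bullet's "wrap-window" bug class concrete: a 32-bit sequence number eventually wraps to zero, so a plain `>=` gives the wrong answer near the boundary. The sketch below shows the standard serial-number comparison; it is not the actual `npt_ring_seqno_status` code, which is not shown in this post, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
HALF_WINDOW = 0x80000000


def naive_passed(current: int, target: int) -> bool:
    """Buggy comparator: ignores wraparound."""
    return current >= target


def seqno_passed(current: int, target: int) -> bool:
    """True if `current` is at or past `target` modulo 2**32.

    Only meaningful while the two values are less than 2**31 apart.
    """
    return ((current - target) & MASK32) < HALF_WINDOW


# A consumer waiting on 0xFFFFFFF0 when the ring counter has wrapped to 5.
wrapped_naive = naive_passed(5, 0xFFFFFFF0)  # False: waits forever
wrapped_safe = seqno_passed(5, 0xFFFFFFF0)   # True: target already passed
```

Variant analysis for this class amounts to searching for every direct `<`, `<=`, `>`, or `>=` between two sequence numbers and checking whether it needs the same treatment.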
Pattern D — Memory as institutional knowledge
The 22 memory files are the only reason later sessions did not re-make earlier mistakes. The most valuable type was `feedback_*.md`: corrections framed as “rule + why + how to apply,” written immediately after the user caught a mistake. The user explicitly asked Claude to save several of these:
never step away. never ask to give up. never leave for another session. remember this in your memory. ALWAYS keep going until goal is completed.
remember to not fall for the pgrep -f issue again
When Claude wrote the memory entry on the same turn the mistake was caught, the entry held. When it was written later (or from a vague recollection), it tended to be vague and didn't bind future behaviour as well.
Pattern E — Subagent fan-out for shallow-but-wide work
Anything that looks like “audit N files for X” benefits from subagent fan-out. The 171 `Agent` calls in the transcripts were almost all of this shape: a parent session sees a task that doesn't need the full context, dispatches it to a subagent with a self-contained brief, and gets back a structured summary. The subagents averaged 79 tool calls each; they did real work, not just one-shot lookups.
What does not work as well: dispatching a subagent for an open-ended “think about this problem and come back with a plan” — those returned summaries the parent then had to re-read and re-reason about, with the full repo no longer paged in.
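The shallow-but-wide shape maps onto an ordinary split, dispatch, and merge. Below is a minimal sketch that uses a thread pool as a stand-in for subagents; the helper names, the file names, and the toy audit rule are hypothetical, and nothing here reflects how Claude Code schedules its `Agent` calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split(items: List[str], parts: int) -> List[List[str]]:
    """Split items into `parts` contiguous, near-equal groups."""
    size, extra = divmod(len(items), parts)
    groups, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def fan_out(
    files: List[str],
    audit: Callable[[str], List[str]],
    workers: int = 5,
) -> Dict[str, List[str]]:
    """Audit each group in its own worker, then merge non-empty findings."""

    def audit_group(group: List[str]) -> Dict[str, List[str]]:
        return {name: audit(name) for name in group}

    merged: Dict[str, List[str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(audit_group, split(files, workers)):
            merged.update({n: found for n, found in partial.items() if found})
    return merged


# Hypothetical workload: 50 generated files, a toy rule that flags a few.
files = [f"dispatch_{i:02d}.c" for i in range(50)]


def toy_audit(name: str) -> List[str]:
    return ["missing NULL check"] if name.endswith("0.c") else []


findings = fan_out(files, toy_audit)
```

Each worker only ever sees its own slice and returns a compact summary, which mirrors why the parent context stays small.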
Pattern F — Two-axis verification before claiming done
The user repeatedly forced Claude to verify a claim from two unrelated angles. For frame-rate parity work, the second axis was a frame-by-frame visual capture (the custom `xcap` tool); for ring-ordering bugs, the second axis was a pixel-counter embedded in the rendered frame's top-left pixel that the consumer side could check; for fence-stall investigations, the second axis was wall-clock end-to-end runs from a freshly booted VM.
The general shape: any time the metric you optimise is the same one you measure, you risk Goodharting. A second, unrelated signal catches that.
This particular debug story was a painful one. I observed frame skips and sometimes a frame was rendered out of order. Telling Claude of these observations was not enough. It made some hypotheses but none of them affected the bug. Something that had worked in a previous session was to tell Claude to design a way to see what the rendered frame was (for example, when the triangle demo was not showing up, it would pixel peep and look for the shape of a triangle). This time, it was not enough for two reasons. First, adding code to dump the frame slowed the rendering pipeline enough that I could no longer observe the issue myself. Second, the skipped or out-of-order frames were usually so similar that the ordering problem was obvious to the human eye but not to Claude. The “match DXVK” trick doesn’t work here because the rendered frames are not pixel precise each time.
In the end, I had Claude spin on this problem for two whole days trying different things until I stepped in and asked it to stuff an increasing counter at the top left pixel of each image. Then on the guest side, it can read that pixel and check that the counter increments by exactly 1 each frame. This allowed it to finally “see” the issue I observed visually. It was able to give quantitative data: the skips happened every few seconds and, whenever the counter went backwards, it was always off by exactly 2.
I went through the render code myself looking for possible ordering issues. One suspicious area was how the fence FDs were handled. The fence FDs were paired with images. There are 4 images and when each fence FD signals, the WSI knows to draw the next image. However, there was no mechanism to match a specific fence with a specific image. I asked Claude to see if it is possible for a fence to be skipped and then subsequent frames were out of order. It replied that this had already been considered and that a single skipped frame would not result in out of order frames every couple of seconds as we observed. I pointed out that since there are only 4 frames, if you skip a single fence then the pairing goes out of phase and when you are supposed to draw image 3 you end up drawing image 0. That instantly unblocked Claude as it confirmed this hypothesis and fixed it in one turn.
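The counter trick is small enough to sketch. The code below is an illustration under assumptions: it packs the counter into the RGB channels of one pixel and flags every step that is not exactly +1. The function names, the 24-bit RGB packing, and the sample sequence are hypothetical; the post does not say how the real counter was encoded.

```python
from typing import List, Sequence, Tuple


def stamp(counter: int) -> Tuple[int, int, int]:
    """Encode the low 24 bits of a frame counter as an (R, G, B) pixel."""
    c = counter & 0xFFFFFF
    return ((c >> 16) & 0xFF, (c >> 8) & 0xFF, c & 0xFF)


def read_stamp(pixel: Sequence[int]) -> int:
    """Decode the counter from the first three channels of a pixel."""
    r, g, b = pixel[:3]
    return (r << 16) | (g << 8) | b


def find_anomalies(counters: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) for every step that is not +1."""
    pairs = zip(counters, counters[1:])
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(pairs, start=1)
        if cur != prev + 1
    ]


# Hypothetical capture: the consumer reads the top-left pixel of each
# presented frame; one frame shows a counter that went backwards by 2.
presented = [stamp(n) for n in (7, 8, 9, 10, 8, 9, 10, 11)]
observed = [read_stamp(p) for p in presented]
anomalies = find_anomalies(observed)
```

Because the check reads a single pixel and compares integers, it avoids both problems described above: it does not dump whole frames, and it does not depend on telling near-identical frames apart.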
Takeaways for future projects
If you take only one thing from this report it should be that Claude is most effective when the human treats it as an extremely fast, extremely literal junior engineer with no memory of past sessions and no judgement about when to stop. Everything in the “excels” section is what you get from that junior given a good brief; everything in the “shortcomings” section is what you get from them with a bad one.
Claude is underselling itself here. I would not say “junior engineer.” I think Claude is an extremely knowledgeable and obedient engineer with sometimes poor judgement.
Investments that paid off
- **A reference codebase to copy.** Venus made every design decision tractable. Without it, sessions would have ended in long arguments about taste.
- **Memory files written immediately.** Same-turn memory entries shaped behaviour for weeks. Vague after-the-fact entries did not.
- **Goal-with-stop-hook for big debugs.** The two single-day debug arcs that delivered the most (Apr 18 multi-ring; May 15 termination stall) both used explicit pass conditions and stop hooks.
- **Subagents for breadth, parent for depth.** Fan out for “find all instances of X”; never fan out for “design the next phase.”
- **Verification scripts that any session can run.** `run_triangle.sh`, `run_cube.sh`, `deploy_to_vm.sh` made smoke-testing free; sessions that didn't smoke-test landed bugs that next sessions had to fix.
Anti-patterns that cost time
- **Hypothesis-first debugging.** Every time Claude started with a hypothesis and looked for confirming evidence, it eventually found the wrong cause. Forced-data-first was the corrective.
- **Defensive code on internal interfaces.** Adding NULL checks to methods that the IDL declares non-optional, etc. Wastes review cycles and hides real bugs.
- **Workarounds with finite timeouts.** Almost always wrong: they hide the underlying race, fail differently on slow machines, and rot.
- **Narrative comments.** Even after multiple corrections, comments narrating “Intentionally not X” or “Was used for Y” kept reappearing. If you don't want them, you may need to grep for them periodically.
- **Synthesised invocations of complex tools.** Whenever there is a working script, Claude should be told (and re-told) to use it. Synthesising args from first principles consistently lost.
What 22.5 billion tokens of cache reads actually bought
To put the scale in human terms: 22.5 billion cached tokens, sustained over 32 days, is the model re-reading the project's working state hundreds of times per session. That's what makes the long debug arcs possible: Claude has the whole repo, the whole prior conversation, the whole memory index in context every time it makes a move. The cost is real (cache misses on session restarts are visible spikes in cache_creation), but the benefit is also real: by week two, sessions could open with “continue the analysis from `docs/performance-analysis-2026-05-08-wc-ring-ordering.md`” and Claude would pick up exactly where the previous session ended, including the diagnostic flags it had added and the hypotheses it had ruled out.
Including the days missing from the transcript, claude-usage reports the equivalent API token cost to be over $11,000. This would be a good time to disclose that we are part of the Claude for Open Source program and have access to the Max 20x plan for free. We are grateful to Anthropic for this and they had no say in the contents of this post nor was making this post part of any agreement with them. This raises the question though: how much worse would the results be if I didn’t have near unlimited tokens to burn? Some of the most useful ways to improve the output quality (lots of redesigns and refactors in the beginning, rounds of code review after any major change, and variant analysis when a bug is found) all burn through tokens. If I were more cost-conscious, I might have to spend more time and effort manually reviewing the changes.
Another note about my own usage patterns. I have not spent much time researching all the different tools, skills, agents, MCPs, etc. When Claude Code surfaces a plugin that it recommends, I usually take it but otherwise I have not spent much time looking for ways to “optimise” or “improve” my AI workflow. I would rather spend my time thinking about the problems I want to solve than the tools I want to use. I am sure a lot of the patterns observed here can be improved with better tooling but my belief is that Anthropic will likely steal the good ideas and integrate them into Claude Code anyways.
Was it faster than doing it solo?
Probably yes, on the order of 3–5× for the work captured in these transcripts, with the speedup concentrated in:
- Mass refactor / rename work (10× or more — Claude can rename a function and update 200 call sites in minutes).
- Code review breadth (subagent fan-out reads more files per hour than any human can).
- Performance analysis writeups (the “table of metrics + paragraph of prose” format is something Claude is genuinely good at and is tedious for humans).
The speedup is much smaller, or negative, for:
- Architectural decisions where the right answer isn't “copy Venus.”
- Bugs that require cross-checking against ground truth the model doesn't have (hardware behaviour, kernel internals, vendor-specific Vulkan quirks).
- Anywhere the user's correctness bar is “no workarounds” and Claude's default is “ship a workaround.”
My opinion is that the speedup is even larger if you consider the debugging effort. While Claude was not particularly fast in much of the debugging (taking multiple hours per session across multiple days), it is one task that I do not have to monitor while it is working. That means that I was able to do other tasks (or even sleep) while it was off debugging. I found that I can often shift more grind-y tasks, like debugging performance issues (which involves running 5-10 minute benchmarks, finding hot-spots, etc.), to times when I would be working on something else. Another observation I’ve made while watching it debug is that manually debugging Wine can be a huge pain without additional scripts/tools. However, Claude was able to, of its own volition, take raw addresses from a JIT region in GDB, locate the right address in the PE binary, and locate the symbol from there. If I were doing it myself, I would have wasted many hours figuring out how to get Wine symbols to show up.
What the human contributed that the AI did not
Three things, in roughly decreasing order of impact:
1. **The model of done.** Knowing when a fix is good enough to commit, when a perf result is robust enough to trust, when a refactor is in a coherent state. Claude's default is to over-validate *or* under-validate; the human kept dialling that in.
2. **The taste call.** “Don't use finite timeouts.” “Mirror Venus.” “No backwards compatibility shims for unreleased code.” “One batching feature, not many.” These are values, not facts. The model can be told them (once they're memory entries, it mostly follows) but it doesn't generate them on its own.
3. **The hypotheses Claude couldn't form.** “What if frame pacing is fighting GetDesc?” “What if the WSI present is blocked on a compositor round-trip?” “What if the pixmap is being overwritten before it's been read?” The best hypotheses in the transcripts came from the user; Claude was much better at *testing* them.
One question you might have is how much of this experience is limited by the current models and tools. Could further improvements to these models address the limitations and get us to truly autonomous coding? I have personally been playing with these coding tools since GPT-3 and they have come a long way. There was a time when I would have said that AI can only write code it has seen before and is bad at inventing new code. I don’t believe that anymore. Although Neptune is demonstrably not original thought, getting all these large, complex projects to work in conjunction means that Claude Opus is able to create code that isn’t just something in its data set. However, I don’t think human software engineers are ever going away. I think that as the models get better, humans must get better as well. Claude has enabled me to work on a problem I thought was intractable (for me) just years ago. I can’t wait to see what problems I can tackle with better models.
One last anecdote
On May 15 the user installed a stop-hook with the condition “recover ≥ 90% of the 160s gap.” Claude tried two fixes; both failed. The third attempt instrumented npt_device_destroy step-by-step, found that sc_wsi_stop's WaitForSingleObject(INFINITE) was the actual blocker, and shipped a fix that recovered 168s out of 160. The stop hook auto-cleared. The session ended.
That's roughly the project in miniature: a measurable goal, the latitude to instrument freely, two hypotheses gracefully discarded, and a third one that actually fit the data. The model did the work. The framing came from the human. Neither part would have worked alone.
Code
Building
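The build steps in this section refer to `$SYSROOT`, `$QEMU`, `$VIRGL`, `$DXVK`, and `$MESA` without defining them. Presumably these point at the install prefix and the respective source checkouts. As a small sketch, a guard like the following fails early if any of them is unset; the function name `require_build_env` is hypothetical and not part of any of the projects.

```shell
# Hypothetical guard: report every build variable that is unset or empty.
# The variable names are the ones used by the build commands in this section.
require_build_env() {
    local name value missing
    missing=0
    for name in SYSROOT QEMU VIRGL DXVK MESA; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `require_build_env || exit 1` before the first build step avoids a half-configured build directory caused by an empty path.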
QEMU
cd $QEMU
mkdir -p build && cd build
../configure \
    --prefix=$SYSROOT \
    --disable-werror \
    --disable-docs \
    --enable-plugins
ninja
ninja install
virglrenderer
cd $VIRGL
meson setup builddir \
    -Dneptune=true \
    -Dvenus=true \
    -Dtests=false \
    -Dprefix=$SYSROOT/usr/local
ninja -C builddir
ninja -C builddir install
DXVK
cd $DXVK
meson setup build-dmabuf \
    -Dnative_dmabuf=true \
    -Dnative_sdl2=disabled \
    -Dnative_sdl3=disabled \
    -Dnative_glfw=disabled \
    -Dbuildtype=release \
    -Db_ndebug=true \
    -Dc_args=-U_GLIBCXX_ASSERTIONS \
    -Dcpp_args=-U_GLIBCXX_ASSERTIONS \
    -Dprefix=$SYSROOT/usr/local
ninja -C build-dmabuf
ninja -C build-dmabuf install
Mesa
cd $MESA
meson setup builddir-win64 \
    --cross-file=$DXVK/build-win64.txt \
    -Dneptune=true \
    -Dgallium-drivers= \
    -Dvulkan-drivers= \
    -Dplatforms= \
    -Dglx=disabled \
    -Dbuild.pkg_config_path= \
    -Db_lto=false \
    -Dc_args=-fno-omit-frame-pointer \
    -Dcpp_args=-fno-omit-frame-pointer
ninja -C builddir-win64
Running
A helpful run script to use once you have built everything and copied it onto the VM.