His AI Said 'Swap the PSU.' He Said 'One More Test.' A homelab engineer known as Marco spent weeks debugging an RTX 3090 that hard-reset his entire machine every time it ran inference, leaving no logs behind. The GPU crash, triggered by a routine NVIDIA driver update from the 590 to 595 series, appeared identical to a previous power-supply failure — but the real culprit was something entirely different. Marco eventually solved the mystery by pair-debugging with an AI agent that set up out-of-band kernel log capture via `netconsole`, streaming the dying machine's last words to a second host before the reset could erase them. How a homelab engineer and his AI pair-debugger cornered an RTX 3090 that hard-reset the entire machine the instant it ran inference — and why neither of them could have solved it alone. The first thing Marco noticed was the silence. Not an error. Not a kernel panic scrolling up the screen. Not even a flicker in the logs. Just — the machine, gone. One moment his homelab box was answering an embeddings request for the little self-hosted knowledge base he'd been building; the next, the fans spun down, the screen went black, and the box rebooted as if someone had yanked the cord. He did what any engineer does. He went to read the logs. There were none. journalctl showed a clean boot, then nothing, then the next clean boot. No Xid . No NVRM . No call trace. The kernel ring buffer had no last words. Whatever killed the machine had killed it so completely that the CPU never got to write a single line to disk. A crash with no body. And it happened every single time he asked the GPU to think. Fig. II — The machine that left no record: a hard reset that wrote nothing, anywhere. The box was nothing exotic: a single RTX 3090, 24 GB of VRAM, Ubuntu, and a stack of local models served through llama.cpp . Marco ran everything on-prem on purpose — a 26B chat model and a 4B embedding model, feeding a personal notes-and-search system he was rebuilding from scratch. The whole point was that nothing left the box. Which meant the embeddings service was the project. If the GPU died every time it embedded a sentence, the project was dead too. And there was a second, quieter problem. For a long time the machine had been perfectly stable on the 590-series NVIDIA driver. Then a routine system update pulled the kernel forward, and the driver came with it — up to the 595 series . The crashes started after that. The obvious move was to roll back to 590. Marco tried. It wasn't a driver swap; it was a trapdoor. The 590 module was only ever built for a kernel two versions behind where the update had landed, and the distro had already retired the 590 branch entirely. "Going back to 590" really meant pinning the kernel at an old release forever — no security updates, a frozen island he could never sail off of. He didn't want an island. He wanted his machine. So the real task wasn't "revert." It was: make 595 work. Here's the part that made everything harder: Marco had killed a crash on this exact machine before. Months earlier, the same box had a different death — it would shut completely off under sustained inference. That one turned out to be brutally physical. An RTX 3090 doesn't draw power smoothly; it throws microsecond transient spikes that can momentarily hit nearly twice its rated draw. The original power supply's over-current protection saw those spikes, decided something was wrong, and cut the rails. A beefier PSU fixed it for good. Afterward the box happily pulled 350 W sustained without so much as a hiccup. That fix was real . The PSU genuinely was the culprit, and swapping it genuinely solved it. But it left a fingerprint on how Marco — and later his AI — would think. "Box dies under GPU load" now had an obvious prior: it's power again. Usually a good instinct. This time, a trap with the safety off. Marco had been pair-debugging with an AI agent — the kind that can read his shell, write code, edit configs, and reason about systems out loud. He'd come to treat it less like a chatbot and more like a tireless junior engineer with an encyclopedic memory and zero ego about grunt work. The first thing it did was solve the "crash with no body" problem — by refusing to rely on the body at all. You cannot read the logs of a machine that reboots before it can flush them. So the agent set up out-of-band capture : it streamed the kernel log over the network to a second machine via netconsole — anywhere the dying box's last words might land instead of dying with it — and added a one-hertz telemetry trail power, clocks, P-state, VRAM, PCIe , flushed to disk every line so the final sample before a reset couldn't be lost. It even verified the whole pipe end-to-end: test lines and a live kernel stream arrived intact at the second machine. On the box that crashes: stream kernel messages off-machine BEFORE the crash. netconsole=