# His AI Said 'Swap the PSU.' He Said 'One More Test.'

> Source: <https://dev.to/lenriqueotero/his-ai-said-swap-the-psu-he-said-one-more-test-2i7g>
> Published: 2026-06-04 07:08:11+00:00

*How a homelab engineer and his AI pair-debugger cornered an RTX 3090 that hard-reset the entire machine the instant it ran inference — and why neither of them could have solved it alone.*

The first thing Marco noticed was the silence.

Not an error. Not a kernel panic scrolling up the screen. Not even a flicker in the logs. Just — the machine, gone. One moment his homelab box was answering an embeddings request for the little self-hosted knowledge base he'd been building; the next, the fans spun down, the screen went black, and the box rebooted as if someone had yanked the cord.

He did what any engineer does. He went to read the logs.

There were none.

`journalctl`

showed a clean boot, then nothing, then the next clean boot. No `Xid`

. No `NVRM`

. No call trace. The kernel ring buffer had no last words. Whatever killed the machine had killed it so completely that the CPU never got to write a single line to disk.

A crash with no body. And it happened *every single time* he asked the GPU to think.

*Fig. II — The machine that left no record: a hard reset that wrote nothing, anywhere.*

The box was nothing exotic: a single RTX 3090, 24 GB of VRAM, Ubuntu, and a stack of local models served through `llama.cpp`

. Marco ran everything on-prem on purpose — a 26B chat model and a 4B embedding model, feeding a personal notes-and-search system he was rebuilding from scratch. The whole point was that nothing left the box.

Which meant the embeddings service *was* the project. If the GPU died every time it embedded a sentence, the project was dead too.

And there was a second, quieter problem. For a long time the machine had been perfectly stable on the **590-series** NVIDIA driver. Then a routine system update pulled the kernel forward, and the driver came with it — up to the **595 series**. The crashes started after that.

The obvious move was to roll back to 590. Marco tried. It wasn't a driver swap; it was a trapdoor. The 590 module was only ever built for a kernel two versions behind where the update had landed, and the distro had already retired the 590 branch entirely. "Going back to 590" really meant *pinning the kernel at an old release forever* — no security updates, a frozen island he could never sail off of.

He didn't want an island. He wanted his machine. So the real task wasn't "revert." It was: **make 595 work.**

Here's the part that made everything harder: Marco had killed a crash on this exact machine before.

Months earlier, the same box had a different death — it would shut *completely off* under sustained inference. That one turned out to be brutally physical. An RTX 3090 doesn't draw power smoothly; it throws microsecond transient spikes that can momentarily hit nearly twice its rated draw. The original power supply's over-current protection saw those spikes, decided something was wrong, and cut the rails. A beefier PSU fixed it for good. Afterward the box happily pulled 350 W sustained without so much as a hiccup.

That fix was **real**. The PSU genuinely was the culprit, and swapping it genuinely solved it.

But it left a fingerprint on how Marco — and later his AI — would think. "Box dies under GPU load" now had an obvious prior: *it's power again.* Usually a good instinct. This time, a trap with the safety off.

Marco had been pair-debugging with an AI agent — the kind that can read his shell, write code, edit configs, and reason about systems out loud. He'd come to treat it less like a chatbot and more like a tireless junior engineer with an encyclopedic memory and zero ego about grunt work.

The first thing it did was solve the "crash with no body" problem — by refusing to rely on the body at all.

You cannot read the logs of a machine that reboots before it can flush them. So the agent set up **out-of-band capture**: it streamed the kernel log over the network to a *second* machine via `netconsole`

— anywhere the dying box's last words might land instead of dying with it — and added a one-hertz telemetry trail (power, clocks, P-state, VRAM, PCIe), flushed to disk *every line* so the final sample before a reset couldn't be lost. It even verified the whole pipe end-to-end: test lines and a live kernel stream arrived intact at the second machine.

```
# On the box that crashes: stream kernel messages off-machine BEFORE the crash.
# netconsole=<srcport>@<src-ip>/<dev>,<dstport>@<listener-ip>/<gateway-or-listener-mac>
sudo modprobe netconsole \
  netconsole=6666@192.0.2.10/eth0,6666@192.0.2.20/aa:bb:cc:dd:ee:ff

# On the listener machine: catch it.
nc -u -l -p 6666 | tee netconsole-capture.log
# A telemetry trail that survives a hard reset: flush every sample to disk.
nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,pstate,memory.used,pcie.link.gen.current \
           --format=csv,noheader -l 1 \
| while IFS= read -r line; do
    printf '%s\n' "$line" >> /var/log/gpu-crash-trail.log
    sync   # force it to disk NOW — the box may not get another chance
  done
```

Then the agent did the second indispensable thing: it turned "random" into a **deterministic trigger**. With careful repetition it nailed down that the crash wasn't intermittent at all. It fired *reliably* on a real inference request — even a single one, even from the small 4 GB embedding model alone.

```
# The whole repro. Fire one real inference request; watch the box die.
curl -s http://127.0.0.1:8082/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input":"trigger the crash"}'
```

And here's the twist Marco didn't expect: the capture rig never caught the crash. Not once. Every time they pulled the trigger, the box died — and the `netconsole`

stream, the one they'd just *proven* worked, fell dead silent at the instant of death. The on-disk trail caught only the calm before: the GPU idling cool at 41 °C, a modest 116 W, an unremarkable P-state — and then a clean reboot, no last line.

That silence was the first real clue. A capture you've verified works, catching *absolutely nothing* at the moment of failure, isn't a failed experiment — it's a result. It meant the machine was dying faster than the CPU could write a single character: no panic, no `Xid`

, no driver complaint, because the kernel never got the chance. This wasn't software crashing the system. The system was being switched off from *below* the software — a hardware-level reset. The absence of evidence was the evidence, and it quietly demolished an entire class of theories before they began.

A reproducible crash isn't a solved crash. But it's the difference between hunting a ghost and running an experiment. For the first time, Marco felt like they were *doing science* instead of lighting candles.

*Fig. III — A capture rig, verified working, recording the silence of a hardware-level reset.*

What followed was two nights of the agent methodically walking up to every hypothesis and shooting it in the head. Marco would propose; the agent would build the test, run it against the deterministic trigger, and read the result off the out-of-band trail.

One by one, they fell:

| Hypothesis | Test | Result |
|---|---|---|
| GSP firmware bug (Ampere classic) | Disable it (`NVreg_EnableGpuFirmware=0` ) |
❌ crashed anyway |
BAR1 / VA-space exhaustion (`open-gpu-kernel-modules` #1134) |
Would emit `Xid 31/154`
|
❌ none ever captured |
| Thermal | Read junction temp at crash | ❌ died at 41 °C
|
| High core power / compute | cuBLAS burn at ~284 W | ❌ stable 6+ minutes |
| Deep-idle cold-wake | Keepalive pinned at P5 | ❌ crashed at steady P5 |
| VRAM pressure | Oversubscribe 24 GB+ | ❌ survived 30+ min |
Power magnitude
|
Cap to 100 W (firmware floor) | ❌ crashed anyway |
| Lost BIOS settings (the CMOS battery had been replaced) | Re-apply the entire power-management lever set, confirmed live |
❌ crashed anyway |

*Fig. IV — A catalogue of falsified causes: every tunable lever, ruled out one by one.*

That last one stung. The migration to the new case had reset the BIOS, and "a lost BIOS setting" was a beautiful theory — it would have explained the whole "stable, then not" arc. The agent restored every relevant lever — PCIe link speed pinned to Gen3, Resizable BAR off, ASPM off, clock gating off, C-states clamped — and confirmed each one was actually in effect. The box still hard-reset on a single embed request.

Meanwhile, every read-only health check came back *pristine*: zero PCIe replays, zero AER errors, a full Gen3 x16 link under load, no pending channel repairs, a healthy VBIOS. The card looked perfect. It just kept committing suicide whenever it thought.

Here's where the story turns, and where it gets honest.

After the gauntlet, the agent reached a conclusion — and it stated it clearly, more than once, with a genuinely solid argument behind it:

Software is exhausted. Every tunable lever has been falsified. This is a hardware fault. The next steps are physical: reseat the GPU power cables with no daisy-chaining, then swap or test the PSU, then cross-test the card in another machine.

And you can see why. They'd ruled out firmware, thermals, VRAM, BIOS, and raw power level. The crash left no software trace. The machine *had* a real power-delivery fault in its past. Pattern-match complete: **it's power again, go physical.**

It was the reasonable conclusion. It was *sound given the evidence.*

It was also wrong.

This is the moment that decides these investigations. The tooling had done everything right and pointed confidently at the door marked *Buy New Hardware*. Marco's hand was on the screwdriver. Reseating cables, swapping a known-good PSU, eventually RMA'ing a card — days of work and real money, on the word of a very convincing diagnosis.

He stopped. Something nagged. The cuBLAS burn had pulled **284 watts of pure compute for six minutes and never flinched** — but a tiny embedding request, drawing a fraction of that, killed the box instantly. If this were a gross power-delivery fault, the brutal sustained burn should have been *more* dangerous, not less. The magnitude story didn't fit.

So instead of asking *"how do I fix the hardware,"* he asked a different question: **"are we even sure this is about inference being heavy? What if it's about inference being weird?"**

Not "fix it." **Characterize it.**

The agent took the new framing and ran with it — and this is the other half of the story, the half where the human alone is helpless. Marco could have the *instinct* that "heavy vs. weird" mattered. He could not, on a side project at 1 a.m., have hand-written a suite of raw-CUDA reproducers to prove it. The agent could, and did, in minutes.

The idea: strip away `llama.cpp`

entirely and probe the GPU with pure, hand-shaped CUDA workloads, each isolating one *flavor* of load.

```
// burn.cu — sustained, SMOOTH cuBLAS SGEMM. ~280W of clean compute.
// Build: nvcc -O3 burn.cu -lcublas -o burn
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);

    cublasHandle_t h; cublasCreate(&h);
    const float alpha = 1.f, beta = 0.f;

    for (long i = 0; ; ++i) {            // run until it crashes — or you give up waiting
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();
    }
}
```

They built variations: sustained bursty SGEMM with 90-watt swings; a *cold-burst* version that slammed from warm-idle to full power with razor-sharp `di/dt`

edges; a PCIe stress test that saturated the bus at 12.7 GB/s. Each one targeted a specific bogeyman — power swings, current slew rate, bus traffic.

Every one of them **survived**.

Then the decisive pair. The agent ran the *real* embedding model two ways. The difference was a single flag — `--n-gpu-layers`

, how much of the model lives on the GPU versus shuttling between CPU and GPU per layer:

| Workload | Peak power | Result |
|---|---|---|
| Bursty SGEMM (smooth compute) | 277 W | ✅ survives |
| Cold-burst, sharp di/dt | 122 → 278 W | ✅ survives |
| PCIe saturation | 12.7 GB/s | ✅ survives |
Embedding, full offload (`-ngl 99` ) |
286 W | ✅ survives, 60 requests, rock-solid |
Embedding, partial offload (`-ngl 20` ) |
low | ❌ instant reset, every time
|

There it was. Same card. Same model. *Higher* power on the stable config. The only thing that reliably killed the machine was **partial offload** — and it died on a single request while drawing far less than the burn that ran for six minutes.

*Fig. V — The shape of the load, not its size: full offload runs smooth; partial offload stutters and kills the box.*

The signature finally made sense.

Partial offload doesn't produce a heavy, smooth load. It produces a *stutter*: the GPU spins up to compute one small per-layer operation, then stalls waiting on a PCIe round-trip to fetch the next layer from system RAM, then spins up again — a high-frequency sawtooth of micro-bursts synced to bus latency. Full offload keeps the whole model resident and runs a smooth, continuous stream. Pure cuBLAS is smoother still.

So the killer wasn't power *magnitude* (the 100 W cap crashed; 286 W didn't). It wasn't generic `di/dt`

(the cold-burst test was sharper and survived). It wasn't PCIe bandwidth (saturation survived). It wasn't compute. It was the specific **waveform** of partial-offload inference — fine compute micro-bursts interleaved with PCIe-synced idle, a pattern nothing else they'd thrown at the card reproduced.

Most likely: either a marginal, almost *resonant* power-delivery transient that only that particular stutter excites — or a power-management bug in the 595 driver around rapid P-state transitions on Ampere. Possibly both, feeding each other. The "swap the PSU" verdict wasn't crazy; it was just aimed at the wrong layer. The PSU could deliver the *energy* fine. It was the *shape* of the demand that nobody's hardware or firmware liked.

The workaround fell straight out of the diagnosis: **never run the model in partial offload.** Put the whole embedding model on the GPU and don't co-load the big chat model that had been forcing the cramped VRAM budget in the first place.

```
- llama-server --model embed.gguf --n-gpu-layers 20 ...   # partial → the crash waveform
+ llama-server --model embed.gguf --n-gpu-layers 99 ...   # full offload → smooth, stable
```

The validation wasn't shy. The agent fired sustained, concurrent embedding load at the server: **2,272 requests, zero failures, peak 312 watts, not a single crash.** Higher power than the workload that used to nuke the box on request one — and it didn't even blink. Marco enabled the service permanently.

No PSU swap. No RMA. No frozen-island kernel downgrade. The machine stayed on the maintained 595 driver, and the project that depended on it came back to life.

*Fig. VI — 2,272 requests at 312 W, without fault.*

The root cause is still not *nailed*. The workaround sidesteps the failure mode; it doesn't explain it down to the transistor. The clean next experiment — boot the known-good 590 driver and re-run the partial-offload trigger — would finally separate "595 driver bug" from "marginal hardware transient." Marco left it for another night. A characterized, validated workaround that lets him keep shipping beats a perfect post-mortem he doesn't have time to write.

That's allowed. Engineering is not forensics. Sometimes "I know exactly how to avoid it and I've proven the avoidance holds 2,272 times" is the right place to stop.

It's tempting to tell this as "the AI cracked a hard bug." It's also tempting to tell it as "the AI was wrong and the human saved the day." Both are too neat. The truth is more interesting, and more useful.

**Without the agent, this was unsolvable on a side project.** The out-of-band capture rig, the deterministic trigger, the entire falsification gauntlet, and — above all — a suite of bespoke raw-CUDA reproducers conjured on demand at midnight: that is days of specialist work, compressed into hours, executed without fatigue or shortcuts. No solo hobbyist grinds through all of that. Most would have swapped parts on a guess and either gotten lucky or given up.

**And without the human, the agent would have bought a power supply it didn't need.** Its "it's hardware, go physical" verdict was well-reasoned and confidently delivered — and premature. What broke the case wasn't more analysis. It was a human refusing the confident conclusion, sitting with a detail that didn't fit (six minutes at 284 W, but dead on a 4 W embed?), and changing the *question* from *fix it* to *characterize it.*

The agent was the instrument: precise, tireless, encyclopedic. The human was the one who held the line when the instrument pointed at the wrong door. Neither half solves this. The pairing does.

*Fig. VII — Neither alone: the instrument and the judge.*

`printk`

. Don't waste days grepping `dmesg`

for a fault that never gets to speak.`printk`

can reach.**Setup**

`595.71.05`

(proprietary). Last known-good: `6.17.0-22`

and the branch is retired, so reverting means pinning an old kernel permanently (no security updates).`llama.cpp`

(`llama-server`

)`--n-gpu-layers 20`

(**Symptom**

`Xid`

, `NVRM`

, panic, MCE, or PCIe AER. A verified-working `netconsole`

+ fsync'd on-disk telemetry caught nothing at the crash ⇒ hardware-level reset; the CPU never reaches a `printk`

.**Diagnostic — falsified, each with direct evidence**

`NVreg_EnableGpuFirmware=0`

) → still crashed.`open-gpu-kernel-modules`

#1134) → would emit `Xid 31/154`

; none ever captured. `pci=realloc=off`

couldn't shrink BAR1 (kernel forces 32 GB at enumeration).`Xid`

, `Channel Repair Pending: No`

.`-ngl 99`

`-ngl 20`

**Conclusion**

**Workaround**

```
- llama-server --model embed.gguf --n-gpu-layers 20 ...   # partial offload → crashes
+ llama-server --model embed.gguf --n-gpu-layers 99 ...   # full offload → stable
```

*Originally published on dev.to.*
