I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won.


[ARTICLE · art-35122] src=dev.to ↗ pub=2026-06-20T21:24Z topic=large-language-models verified=true sentiment=· neutral

A developer spent two weeks optimizing a homelab with four RTX 3090s (96GB VRAM) for local LLM inference, achieving improvements like 40% throughput gain and 4x VRAM savings, but ultimately found that paid APIs were more cost-effective for interactive work due to low GPU utilization (6%) caused by sequential dispatch in llama.cpp. The developer concluded local setups are better for privacy, batch jobs, or uncensored experimentation, but not as a general cloud replacement.

1 min read · published Jun 20, 2026

I run a homelab with four RTX 3090s — 96 GB of VRAM, 44 CPU cores. For two weeks I tried to make it my daily driver for local LLM inference instead of paying for cloud APIs. I got it working. Then I looked at the numbers and subscribed to a paid API anyway.

Here's the uncomfortable part, and the optimizations that still made it worth doing.

The setup

The 6% problem

The wall wasn't compute. GPU utilization sat at 6%. The bottleneck was CPU orchestration — llama.cpp dispatches across multiple GPUs sequentially, so the cards spent 94% of the time idle waiting on each other. Throwing more VRAM at it does nothing for this.
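To see why extra VRAM cannot fix this, here is a minimal toy model of strictly sequential dispatch. It is a sketch under stated assumptions, not a description of llama.cpp internals: the function name and the per-stage timings are hypothetical, chosen only to show how a figure like 6% can arise. If only one card works at a time, four cards cap out at 25% utilization each, and in this model any host-side overhead between hand-offs pushes the number lower.

```python
def gpu_utilization(n_gpus: int, gpu_ms: float, cpu_overhead_ms: float) -> float:
    """Fraction of wall-clock time each GPU is busy under strictly sequential dispatch.

    Toy model (an assumption, not measured behavior): per token, every GPU runs
    its slice of the model for gpu_ms, one GPU at a time, and each hand-off
    costs cpu_overhead_ms on the host.
    """
    if n_gpus < 1 or gpu_ms <= 0 or cpu_overhead_ms < 0:
        raise ValueError("need n_gpus >= 1, gpu_ms > 0, cpu_overhead_ms >= 0")
    wall_ms_per_token = n_gpus * (gpu_ms + cpu_overhead_ms)
    return gpu_ms / wall_ms_per_token


if __name__ == "__main__":
    # Hypothetical timings, not taken from the article.
    print(f"{gpu_utilization(4, 2.4, 0.0):.0%}")  # ceiling with zero host overhead
    print(f"{gpu_utilization(4, 2.4, 7.6):.0%}")  # host overhead dominating
```

In this model, reaching 6% on four cards requires host overhead roughly three times the GPU time per stage, which is consistent with the bottleneck being orchestration rather than compute.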

What actually moved the needle

| Change | Effect |
|---|---|
| `--ubatch-size 512` | +40% throughput |
| KV cache quantization (Q4_0) | 4× VRAM savings |
| Speculative decoding (n-gram) | 2.5× speedup on repetitive tasks |
| YaRN rope scaling | context extended to 1M tokens |
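As a rough illustration of how three of these rows could map onto a `llama-server` invocation, the sketch below builds an argument list and works out the nominal per-element saving from a Q4_0 KV cache. Treat it as an assumption-laden example, not the author's actual config: the model path and context size are hypothetical placeholders, the n-gram speculative decoding row is left out because the article does not say how it was enabled, and quantizing the V cache may also require flash attention, whose flag spelling has varied between builds (check `llama-server --help`).

```python
import shlex

# Q4_0 packs 32 values into 18 bytes (16 bytes of 4-bit values + one fp16 scale).
Q4_0_BITS_PER_VALUE = 18 * 8 / 32  # 4.5 bits
F16_BITS_PER_VALUE = 16


def kv_cache_compression_ratio() -> float:
    """Nominal size ratio of an f16 KV cache to a Q4_0 one, per stored value."""
    return F16_BITS_PER_VALUE / Q4_0_BITS_PER_VALUE


def build_server_args(model_path: str, ctx_size: int) -> list[str]:
    """Assemble a llama-server command line for the table's flag-level changes."""
    return [
        "llama-server",
        "--model", model_path,        # hypothetical path supplied by the caller
        "--ctx-size", str(ctx_size),  # hypothetical value supplied by the caller
        "--ubatch-size", "512",       # physical batch size, from the table
        "--cache-type-k", "q4_0",     # quantize the K half of the KV cache
        "--cache-type-v", "q4_0",     # quantize the V half of the KV cache
        "--rope-scaling", "yarn",     # YaRN context extension
    ]


if __name__ == "__main__":
    args = build_server_args("models/example.gguf", 131072)  # placeholder values
    print(shlex.join(args))
    print(f"nominal KV saving: {kv_cache_compression_ratio():.2f}x")
```

The nominal ratio comes out a little under 4× because each Q4_0 block also stores a scale; the exact saving observed in practice depends on the baseline it is measured against.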

Two things surprised me:

The conclusion I didn't want

At ~11 kWh/day, plus hardware depreciation, against current API pricing, the math doesn't favor local for interactive work. The single biggest improvement to my daily AI workflow was paying for an API. Local still wins for privacy, high-volume batch jobs, or uncensored experimentation — but not as a general cloud replacement. It's an economics problem, not a capability one.
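The comparison can be sketched as simple arithmetic. Only the 11 kWh/day figure comes from the text above; the electricity price, hardware cost, depreciation period, and API spend below are hypothetical placeholders to be replaced with your own numbers, and the function names are mine.

```python
def local_cost_per_day(kwh_per_day: float, price_per_kwh: float,
                       hardware_cost: float, lifetime_years: float) -> float:
    """Daily cost of a local rig: energy plus straight-line depreciation."""
    energy = kwh_per_day * price_per_kwh
    depreciation = hardware_cost / (lifetime_years * 365)
    return energy + depreciation


def cheaper_option(local_per_day: float, api_per_month: float) -> str:
    """Compare a 30-day month of local running cost against monthly API spend."""
    return "local" if local_per_day * 30 < api_per_month else "api"


if __name__ == "__main__":
    KWH_PER_DAY = 11.0      # from the article
    PRICE_PER_KWH = 0.30    # placeholder: your electricity tariff
    HARDWARE_COST = 4000.0  # placeholder: what the rig cost you
    LIFETIME_YEARS = 3.0    # placeholder: depreciation period
    API_PER_MONTH = 100.0   # placeholder: your actual API bill

    daily = local_cost_per_day(KWH_PER_DAY, PRICE_PER_KWH, HARDWARE_COST, LIFETIME_YEARS)
    print(f"local: {daily * 30:.2f} per month, cheaper option: {cheaper_option(daily, API_PER_MONTH)}")
```

The sketch assumes the rig draws its 11 kWh whether or not it is serving requests, which is why low utilization hurts the local side of the comparison for interactive work.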

I wrote up the full cost breakdown and the exact llama.cpp router configs [on aipster.com](https://aipster.com/four-gpus-two-weeks-and-the-uncomfortable-truth-about-local-llms/).

If you're weighing a local rig, I also benchmarked [GLM 5.2's open weights](https://aipster.com/glm-5-2-open-weight-top-four-model-hugging-face/), and it changed my view on what's worth running at home.

What's your GPU utilization actually sitting at? Curious if anyone solved the sequential-dispatch problem.


Read original on dev.to → dev.to/azaiats/i-spent-two-weeks-optimizing-96gb…
