Fitting WhisperX large-v3 + a 24B LLM on one 3090: a reproducible context-capping recipe

A developer ran both WhisperX large-v3 (7.7GB) and a 24B-parameter LLM (Devstral Small 2) simultaneously on a single 24GB RTX 3090 by reducing the LLM's context window from 40,960 to 8,192 tokens. This cut the KV cache from 6.1GB to 1.25GB, bringing total VRAM usage to 21.9GB and eliminating the CUDA out-of-memory errors that occurred whenever the two services overlapped. The fix uses the model's native 8K context length, which covers all real-world triage prompts without quality degradation from rope extrapolation.

This is the technical, reproducible version of a fix I shipped on my own homelab. If you want the narrative version, that's on Medium. This one is the recipe: the measurements, the math, the Modelfile, and the exact prompt I gave Claude Code to generate it. Copy-paste friendly.

Repo for the dashboard used throughout: https://github.com/SikamikanikoBG/homelab-monitor

- `num_ctx` at 40960 → KV cache ~6.1GB → model footprint ~18.3GB → 18.3 + 7.7 = 26GB → CUDA OOM whenever they overlapped.
- `num_ctx` capped to 8192 → KV cache drops from ~6.1GB to ~1.25GB → model footprint ~14.2GB → 14.2 + 7.7 = 21.9GB → both resident, zero OOM, no quality loss.

- Host: openSUSE, Xeon (56 threads), 125GB RAM, 1x RTX 3090 24GB
- GPU svc: WhisperX large-v3 (speech-to-text)
- GPU svc: Ollama - devstral-small-2 24B, Q4_K_M, for background email triage

Both services run all the time. The OOM only happened when I dictated to my assistant (WhisperX) while the triage loop was active.

nvidia-smi shows instantaneous VRAM. It can't show you which service spiked or when two of them overlapped, and an intermittent OOM is a timing problem. You need per-service VRAM history.

I use my own dashboard (homelab-monitor) for this. The relevant view is "AI Models", which attributes VRAM per model server and per loaded model, over a time range, with OOM markers and a capacity ceiling line.
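If you only want a rough feel for what "per-service VRAM history" means before installing anything, the idea can be sketched in a few lines of Python. This is a minimal stand-in under stated assumptions, not what homelab-monitor does internally: it polls `nvidia-smi --query-compute-apps` once per interval and prints per-process usage with a timestamp. The function names, the 1-second interval, and the 24576 MiB ceiling (24 GiB) are my own illustrative choices. Rows whose memory field is not a plain number are skipped, and processes running inside containers may not be attributed the way you expect.

```python
import subprocess
import time

# Standard nvidia-smi query for per-process GPU memory, one CSV row per process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into {(pid, process_name): used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip malformed rows and rows where memory is reported as e.g. "[N/A]".
        if len(parts) != 3 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        pid, name, used_mib = parts
        usage[(int(pid), name)] = int(used_mib)
    return usage


def sample_once():
    """Run nvidia-smi once and return the parsed per-process usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


def log_forever(interval_s=1.0, ceiling_mib=24576):
    """Print a timestamped per-process sample every interval_s seconds."""
    while True:
        usage = sample_once()
        total = sum(usage.values())
        flag = " OVER CEILING" if total > ceiling_mib else ""
        print(time.strftime("%H:%M:%S"), f"total={total}MiB", usage, flag)
        time.sleep(interval_s)
```

Calling `log_forever()` and redirecting the output to a file gives you a crude timeline you can grep around the moment of an OOM. It lacks the per-model attribution and OOM markers of the dashboard view, so treat it as a sanity check rather than a replacement.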
What the history showed at the overlap window:

| Service | Peak VRAM |
|---|---|
| Devstral 24B triage | ~18.3 GB |
| WhisperX large-v3 | 7.7 GB |
| Total | ~26 GB on a 24 GB card |

If you want to reproduce the measurement, the dashboard runs as a single container:

    git clone https://github.com/SikamikanikoBG/homelab-monitor
    cd homelab-monitor
    docker compose up -d --build
    open http://