{"slug": "fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping", "title": "Fitting WhisperX large-v3 + a 24B LLM on one 3090: a reproducible context-capping recipe", "summary": "A developer successfully ran both WhisperX large-v3 (7.7GB) and a 24B parameter LLM (Devstral Small 2) simultaneously on a single 24GB RTX 3090 by reducing the LLM's context window from 40,960 to 8,192 tokens. This cut the KV cache from 6.1GB to 1.25GB, bringing total VRAM usage to 21.9GB and eliminating CUDA out-of-memory errors that occurred when both services overlapped. The fix uses the model's native 8K context length, which covers all real-world triage prompts without quality degradation from rope extrapolation.", "body_md": "This is the technical, reproducible version of a fix I shipped on my own homelab. If you want the narrative version, that's on Medium. This one is the recipe: the measurements, the math, the Modelfile, and the exact prompt I gave Claude Code to generate it. Copy-paste friendly.\n\nRepo for the dashboard used throughout: [https://github.com/SikamikanikoBG/homelab-monitor](https://github.com/SikamikanikoBG/homelab-monitor)\n\nThe problem: the triage LLM (~18.3GB) plus WhisperX (7.7GB) adds up to `18.3 + 7.7 = 26GB` on a 24GB card → CUDA OOM whenever they overlapped.\n\nThe fix: cap `num_ctx` to 8192 → KV cache drops from ~6.1GB to ~1.25GB → model footprint drops from ~18.3GB to ~14.2GB → `14.2 + 7.7 = 21.9GB` → both resident, zero OOM, no quality loss.\n\n```\nHost:    openSUSE, Xeon (56 threads), 125GB RAM, 1x RTX 3090 (24GB)\nGPU svc: WhisperX large-v3  (speech-to-text)\nGPU svc: Ollama -> devstral-small-2 (24B, Q4_K_M) for background email triage\n```\n\nBoth services run all the time. The OOM only happened when I dictated to my assistant (WhisperX) *while* the triage loop was active.\n\n`nvidia-smi` shows instantaneous VRAM. It can't show you *which* service spiked or *when* two of them overlapped, and an intermittent OOM is a timing problem. You need per-service VRAM history.\n\nI use my own dashboard (homelab-monitor) for this. 
The relevant view is \"AI Models\", which attributes VRAM per model server and per loaded model, over a time range, with OOM markers and a capacity ceiling line.\n\nWhat the history showed at the overlap window:\n\n| Service | Peak VRAM |\n|---|---|\n| Devstral 24B (triage) | ~18.3 GB |\n| WhisperX large-v3 | 7.7 GB |\n| Total | ~26 GB on a 24 GB card |\n\nIf you want to reproduce the measurement, the dashboard runs as a single container:\n\n```\ngit clone https://github.com/SikamikanikoBG/homelab-monitor\ncd homelab-monitor\ndocker compose up -d --build\n# open http://<host>:9800  -> AI Models / GPU views\n```\n\n(NVIDIA Container Toolkit required for GPU metrics. Remote hosts are monitored over SSH, no agent.)\n\nWeights are a fixed cost (~15GB for Devstral 24B at Q4_K_M). The variable cost is the **KV cache**, which scales linearly with `num_ctx`. So the question is: how much context does background email triage actually use?\n\nI pulled the triage pipeline's request traces from Langfuse.\n\nReal prompts never exceeded ~5–8k tokens. The model was loaded with a 40k window, so ~32k tokens of reserved KV cache were doing nothing.\n\nDevstral Small is a `mistral3` model. 
Pull the architecture straight from Ollama:\n\n```\ncurl -s http://localhost:11434/api/show -d '{\"name\":\"devstral-small-2:latest\"}' \\\n  | python -c \"import sys,json;mi=json.load(sys.stdin)['model_info'];\\\nprint({k:v for k,v in mi.items() if 'head_count' in k or 'block_count' in k or 'length' in k})\"\n```\n\nRelevant values:\n\n```\nblock_count (layers)      = 40\nattention.head_count_kv   = 8\nattention.key_length      = 128\nattention.value_length    = 128\ncontext_length (native)   = 8192   # rope-extended to 393216\n```\n\nKV cache per token (f16) = `2 (K+V) × layers × kv_heads × head_dim × 2 bytes`:\n\n```\n2 × 40 × 8 × 128 × 2  =  163,840 bytes  ≈  0.156 MB / token\n```\n\nSo:\n\n| num_ctx | KV cache (f16) |\n|---|---|\n| 40,960 | ~6.1 GB |\n| 16,384 | ~2.5 GB |\n| 8,192 | ~1.25 GB |\n| 4,096 | ~0.6 GB |\n\n8192 is the sweet spot: it's above the real worst-case prompt (~5–8k) **and** it's the model's native context length, so there's no rope extrapolation quality hit. I rejected 4096: a 10-email batch with 2k generation can brush up against it.\n\nOllama lets you inherit existing weights and override parameters in a Modelfile, so this costs no extra disk and no re-download.\n\n`Modelfile.triage`:\n\n```\nFROM devstral-small-2:latest\n\n# Native 8K window: covers every triage prompt (10-email batches + 2K generation)\n# while keeping the KV cache ~1.25GB so the model + WhisperX fit on one 24GB GPU.\nPARAMETER num_ctx 8192\nPARAMETER temperature 0\nPARAMETER num_predict 2048\n\nSYSTEM \"\"\"You are a background email-triage engine. Follow the exact output\nformat in each request. Output only the requested label(s) or field(s). Never\nadd explanations, preamble, or commentary. When uncertain, pick the closest\nvalid option. 
Be terse and deterministic.\"\"\"\n```\n\nBuild it:\n\n```\nollama create devstral-small-2:triage -f Modelfile.triage\n```\n\nThe optional `SYSTEM` block is a small bonus: triage prompts want terse, structured output, and pinning that behaviour cuts stray preamble (fewer reparse/retry calls = less GPU time).\n\nI let Claude Code do the measuring and the Modelfile generation. The prompt, roughly:\n\n> Analyze my background email triage. Pull the Langfuse traces to find the real prompt/context sizes the triage job uses, decide a safe `num_ctx` cap that won't truncate worst-case batches, confirm the KV-cache savings against the model's actual architecture, and generate an Ollama Modelfile for a context-capped `:triage` variant. Then tell me the expected VRAM footprint.\n\nIt came back with: traces show ≤8k tokens, cap at 8192 (native window), ~5GB KV saved, expected footprint ~14–16GB. That matched what the dashboard measured after I deployed it.\n\n```\n# load it\ncurl -s http://localhost:11434/api/generate \\\n  -d '{\"model\":\"devstral-small-2:triage\",\"prompt\":\"ping\",\"stream\":false}' >/dev/null\n\n# check resident VRAM + context\ncurl -s http://localhost:11434/api/ps \\\n  | python -c \"import sys,json;[print(m['name'],round(m['size_vram']/1e9,1),'GB ctx',m['context_length']) for m in json.load(sys.stdin)['models']]\"\n```\n\nResult: the triage model holds ~14GB resident at `ctx=8192`, down from ~18GB.\n\n| | Before | After |\n|---|---|---|\n| Triage LLM | ~18.3 GB | ~14.2 GB |\n| WhisperX large-v3 | 7.7 GB | 7.7 GB |\n| Combined | ~26 GB → OOM | ~21.9 GB → fits |\n\nBoth services now sit on the card together. Full STT quality, email triage in parallel, ~2GB headroom. 
No quant change, no CPU offload, no smaller Whisper.\n\n`nvidia-smi` can't diagnose an intermittent OOM: get per-service VRAM history.\n\nSize `num_ctx` to the real workload.\n\nDashboard used for the per-service VRAM history: **https://github.com/SikamikanikoBG/homelab-monitor**. It's open source, runs in one container, and exists because I needed exactly this view and `nvidia-smi` wouldn't give it to me.", "url": "https://wpnews.pro/news/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping", "canonical_source": "https://dev.to/sikamikanikobg/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping-recipe-22g0", "published_at": "2026-06-03 03:35:14+00:00", "updated_at": "2026-06-03 03:41:56.496528+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-infrastructure", "ai-tools", "mlops"], "entities": ["WhisperX", "Ollama", "devstral-small-2", "Claude Code", "SikamikanikoBG", "RTX 3090", "NVIDIA"], "alternates": {"html": "https://wpnews.pro/news/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping", "markdown": "https://wpnews.pro/news/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping.md", "text": "https://wpnews.pro/news/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping.txt", "jsonld": "https://wpnews.pro/news/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping.jsonld"}}