Fitting WhisperX large-v3 + a 24B LLM on one 3090: a reproducible context-capping recipe


Source: dev.to · Published 2026-06-03T03:35Z

A developer successfully ran both WhisperX large-v3 (7.7GB) and a 24B parameter LLM (Devstral Small 2) simultaneously on a single 24GB RTX 3090 by reducing the LLM's context window from 40,960 to 8,192 tokens. This cut the KV cache from 6.1GB to 1.25GB, bringing total VRAM usage to 21.9GB and eliminating CUDA out-of-memory errors that occurred when both services overlapped. The fix uses the model's native 8K context length, which covers all real-world triage prompts without quality degradation from rope extrapolation.

5 min read · published Jun 3, 2026

This is the technical, reproducible version of a fix I shipped on my own homelab. If you want the narrative version, that's on Medium. This one is the recipe: the measurements, the math, the Modelfile, and the exact prompt I gave Claude Code to generate it. Copy-paste friendly.

Repo for the dashboard used throughout: https://github.com/SikamikanikoBG/homelab-monitor

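Before: Devstral 24B with a 40,960-token window (~18.3GB) plus WhisperX large-v3 (7.7GB): 18.3 + 7.7 = 26GB on a 24GB card → CUDA OOM whenever they overlapped.

After: cap num_ctx to 8192 → KV cache drops from ~6.1GB to ~1.25GB → model footprint drops from ~18.3GB to ~14.2GB → 14.2 + 7.7 = 21.9GB → both resident, zero OOM, no quality loss.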

Host:    openSUSE, Xeon (56 threads), 125GB RAM, 1x RTX 3090 (24GB)
GPU svc: WhisperX large-v3  (speech-to-text)
GPU svc: Ollama -> devstral-small-2 (24B, Q4_K_M) for background email triage

Both services run all the time. The OOM only happened when I dictated to my assistant (WhisperX) while the triage loop was active.

nvidia-smi shows instantaneous VRAM. It can't show you which service spiked or when two of them overlapped, and an intermittent OOM is a timing problem. You need per-service VRAM history.

I use my own dashboard (homelab-monitor) for this. The relevant view is "AI Models", which attributes VRAM per model server and per loaded model, over a time range, with OOM markers and a capacity ceiling line.
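If you only need a rough per-process history and don't want to deploy anything, a polling loop around nvidia-smi gets part of the way there. This is a minimal sketch, not how homelab-monitor works internally. The function names are hypothetical, mapping a PID to "WhisperX" or "Ollama" is left to you, and some container setups report per-process memory as N/A, which the parser skips.

```python
import csv
import io
import subprocess
import time

# Per-process GPU memory in MiB, one CSV row per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse nvidia-smi CSV rows into {(pid, process_name): used_MiB}."""
    usage = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # blank or malformed line
        pid, name, used = (field.strip() for field in row)
        if not (pid.isdigit() and used.isdigit()):
            continue  # e.g. "[N/A]" when per-process memory is hidden
        usage[(int(pid), name)] = int(used)
    return usage


def sample_forever(interval_s=2.0):
    """Print one timestamped line per GPU process every interval_s seconds."""
    while True:
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        now = time.strftime("%H:%M:%S")
        for (pid, name), mib in sorted(parse_compute_apps(out).items()):
            print(f"{now} pid={pid} {name} {mib} MiB")
        time.sleep(interval_s)
```

Calling sample_forever() and redirecting the output to a file gives a timeline you can grep around the moment of an OOM. It has no OOM markers and no per-model attribution, which is what the dashboard view adds.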

What the history showed at the overlap window:

Service	Peak VRAM
Devstral 24B (triage)	~18.3 GB
WhisperX large-v3	7.7 GB
Total	~26 GB on a 24 GB card

If you want to reproduce the measurement, the dashboard runs as a single container:

git clone https://github.com/SikamikanikoBG/homelab-monitor
cd homelab-monitor
docker compose up -d --build

(NVIDIA Container Toolkit required for GPU metrics. Remote hosts are monitored over SSH, no agent.)

Weights are a fixed cost (~15GB for Devstral 24B at Q4_K_M). The variable cost is the KV cache, which scales linearly with num_ctx. So the question is: how much context does background email triage actually use?

I pulled the request traces from Langfuse. The triage pipeline:

Real prompts never exceeded ~5–8k tokens. The model was loaded with a 40k window — ~32k tokens of reserved KV cache doing nothing.

Devstral Small is mistral3. Pull the architecture straight from Ollama:

curl -s http://localhost:11434/api/show -d '{"name":"devstral-small-2:latest"}' \
  | python -c "import sys,json;mi=json.load(sys.stdin)['model_info'];\
print({k:v for k,v in mi.items() if 'head_count' in k or 'block_count' in k or 'length' in k})"

Relevant values:

block_count (layers)      = 40
attention.head_count_kv   = 8
attention.key_length      = 128
attention.value_length    = 128
context_length (native)   = 8192   # rope-extended to 393216
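The same lookup can be done by exact key instead of substring matching. The sketch below assumes the model_info keys are prefixed with the architecture name reported under general.architecture (for example mistral3.block_count), which is how GGUF metadata is usually laid out. The sample dictionary is hand-written to mirror the values listed above, not captured from a live server, and kv_shape is a hypothetical helper name.

```python
def kv_shape(model_info):
    """Extract the four values that determine KV-cache size per token."""
    arch = model_info["general.architecture"]

    def get(suffix):
        # Assumption: keys look like "<arch>.<suffix>".
        return model_info[f"{arch}.{suffix}"]

    return {
        "layers": get("block_count"),
        "kv_heads": get("attention.head_count_kv"),
        "key_dim": get("attention.key_length"),
        "value_dim": get("attention.value_length"),
    }


# Hand-written sample mirroring the values listed above (an assumption).
sample_model_info = {
    "general.architecture": "mistral3",
    "mistral3.block_count": 40,
    "mistral3.attention.head_count_kv": 8,
    "mistral3.attention.key_length": 128,
    "mistral3.attention.value_length": 128,
}

print(kv_shape(sample_model_info))
```

If a key is missing for your model, the function raises KeyError, which is a useful signal that the architecture uses different metadata names.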

KV cache per token (f16) = 2 (K+V) × layers × kv_heads × head_dim × 2 bytes

2 × 40 × 8 × 128 × 2  =  163,840 bytes  ≈  0.156 MB / token

So:

num_ctx	KV cache (f16)
40,960	~6.1 GB
16,384	~2.5 GB
8,192	~1.25 GB
4,096	~0.6 GB
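To re-run the arithmetic for a different model or cache precision, the formula fits in a few lines. This is a sketch using the architecture values listed above as defaults. It assumes an f16 cache (2 bytes per element) and reports GiB (2^30 bytes); the function names are illustrative.

```python
def kv_bytes_per_token(layers=40, kv_heads=8, key_dim=128, value_dim=128,
                       bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers."""
    # K and V are stored separately, hence key_dim + value_dim.
    return layers * kv_heads * (key_dim + value_dim) * bytes_per_elem


def kv_cache_gib(num_ctx, **shape):
    """KV cache size in GiB for a context window of num_ctx tokens."""
    return num_ctx * kv_bytes_per_token(**shape) / 2**30


for num_ctx in (40_960, 16_384, 8_192, 4_096):
    print(f"{num_ctx:>6}  {kv_cache_gib(num_ctx):.3f} GiB")
```

With these defaults the per-token cost is 163,840 bytes and the 8,192-token window comes to 1.25 GiB. The 40,960-token window works out to 6.25 GiB, slightly above the ~6.1 GB quoted in the table, so treat the table as approximate. Passing bytes_per_elem=1 gives a rough figure for an 8-bit cache, ignoring any quantization overhead.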

8192 is the sweet spot: it's above the real worst-case prompt (~5–8k) and it's the model's native context length, so there's no rope extrapolation quality hit. I rejected 4096 — a 10-email batch with 2k generation can brush up against it.
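The selection rule can be written down so it is easy to re-check when the workload changes: take the largest observed prompt, add the generation budget, and pick the smallest candidate window that still covers it. The token counts below are hypothetical placeholders, not the real Langfuse numbers, and smallest_fitting_ctx is an illustrative name.

```python
def smallest_fitting_ctx(prompt_tokens, max_new_tokens, candidates):
    """Return the smallest candidate num_ctx covering prompt + generation.

    Returns None when no candidate is large enough.
    """
    needed = max(prompt_tokens) + max_new_tokens
    for ctx in sorted(candidates):
        if ctx >= needed:
            return ctx
    return None


# Hypothetical per-request prompt sizes; substitute your own trace maxima.
observed_prompts = [1800, 3100, 5900]
choice = smallest_fitting_ctx(
    observed_prompts,
    max_new_tokens=2048,  # matches num_predict in the Modelfile
    candidates=[4096, 8192, 16384, 40960],
)
print(choice)
```

Note that the check counts prompt and generation together. If a prompt close to 8k were ever combined with the full 2,048-token budget, this function would pick the next window up, so it is worth feeding it the real worst case rather than an estimate.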

Ollama lets you inherit existing weights and override parameters in a Modelfile, so this costs no extra disk and no re-download.

Modelfile.triage

FROM devstral-small-2:latest

PARAMETER num_ctx 8192
PARAMETER temperature 0
PARAMETER num_predict 2048

SYSTEM """You are a background email-triage engine. Follow the exact output
format in each request. Output only the requested label(s) or field(s). Never
add explanations, preamble, or commentary. When uncertain, pick the closest
valid option. Be terse and deterministic."""

Build it:

ollama create devstral-small-2:triage -f Modelfile.triage

The optional SYSTEM block is a small bonus: triage prompts want terse, structured output, and pinning that behaviour cuts stray preamble (fewer reparse/retry calls = less GPU time).

I let Claude Code do the measuring and the Modelfile generation. The prompt, roughly:

Analyze my background email triage. Pull the Langfuse traces to find the real prompt/context sizes the triage job uses, decide a safe num_ctx cap that won't truncate worst-case batches, confirm the KV-cache savings against the model's actual architecture, and generate an Ollama Modelfile for a context-capped :triage variant. Then tell me the expected VRAM footprint.

It came back with: traces show ≤8k tokens, cap at 8192 (native window), ~5GB KV saved, expected footprint ~14–16GB. That matched what the dashboard measured after I deployed it.

curl -s http://localhost:11434/api/generate \
  -d '{"model":"devstral-small-2:triage","prompt":"ping","stream":false}' >/dev/null

curl -s http://localhost:11434/api/ps \
  | python -c "import sys,json;[print(m['name'],round(m['size_vram']/1e9,1),'GB ctx',m['context_length']) for m in json.load(sys.stdin)['models']]"

Result: the triage model holds ~14GB resident at ctx=8192, down from ~18GB.
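If you want this check in a deploy script rather than read by eye, the /api/ps response can be verified in Python. The sketch below reads the same fields as the one-liner above (name, size_vram, context_length). It uses .get for context_length on the assumption that older Ollama versions may not return it. The sample payload is hand-written for illustration, and the helper names are hypothetical.

```python
import json
import urllib.request


def fetch_ps(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models from a local Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp)


def summarize_ps(payload):
    """Return (name, VRAM in GB, context length or None) per loaded model."""
    return [
        (m["name"], round(m["size_vram"] / 1e9, 1), m.get("context_length"))
        for m in payload.get("models", [])
    ]


def ctx_matches(payload, model, expected_ctx):
    """True if the named model is loaded with exactly expected_ctx tokens."""
    return any(
        name == model and ctx == expected_ctx
        for name, _, ctx in summarize_ps(payload)
    )


# Hand-written sample payload (an assumption, not captured output).
sample = {"models": [{"name": "devstral-small-2:triage",
                      "size_vram": 14_200_000_000,
                      "context_length": 8192}]}
print(summarize_ps(sample))
print(ctx_matches(sample, "devstral-small-2:triage", 8192))
```

Against a live server, replace sample with fetch_ps(). A False result usually means the client is still requesting the untagged model or overriding num_ctx per request.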

Service	Before	After
Triage LLM	~18.3 GB	~14.2 GB
WhisperX large-v3	7.7 GB	7.7 GB
Combined	~26 GB → OOM	~21.9 GB → fits

Both services now sit on the card together. Full STT quality, email triage in parallel, ~2GB headroom. No quant change, no CPU offload, no smaller Whisper.
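The before/after budget is simple enough to keep as a check next to the service definitions. This sketch uses the measured figures from the table. The 24.0 capacity is the nominal card size, which is an assumption: the usable amount is somewhat lower once the driver and any desktop session take their share.

```python
def vram_budget(services_gb, capacity_gb=24.0):
    """Return (total GB, headroom GB, fits?) for a set of resident services."""
    total = round(sum(services_gb.values()), 1)
    headroom = round(capacity_gb - total, 1)
    return total, headroom, total <= capacity_gb


before = {"triage_llm": 18.3, "whisperx": 7.7}
after = {"triage_llm": 14.2, "whisperx": 7.7}

print("before:", vram_budget(before))
print("after: ", vram_budget(after))
```

With the measured numbers, the before case is 2GB over the card and the after case leaves about 2.1GB spare, which matches the ~2GB headroom quoted above.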

nvidia-smi can't diagnose an intermittent OOM; get per-service VRAM history. Then size num_ctx to the real workload.

Dashboard used for the per-service VRAM history: https://github.com/SikamikanikoBG/homelab-monitor. It's open source, runs in one container, and exists because I needed exactly this view and nvidia-smi wouldn't give it to me.


Read original on dev.to → dev.to/sikamikanikobg/fitting-whisperx-large-v3-…
