{"slug": "running-ltx-2-3-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start", "title": "Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture", "summary": "Technical solution for running the LTX-2.3 audio-to-video model (22B parameters) alongside TTS and other models on a single 96GB GPU by switching from a persistent server architecture to a cold-start design. The author reduced VRAM usage by loading the Gemma-3-12B text encoder in 4-bit quantization using bitsandbytes, dropping its footprint from 22.78 GiB to 7.26 GiB, but still faced OOM issues with a total persistent footprint of ~86 GiB. The final solution uses a cold-start approach where the pipeline object is held in memory but components are built, run, and freed per request, allowing the system to idle at 0 GiB and peak at only 40 GiB during generation.", "body_md": "When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.\n\nHardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: [LTX-2 official repo](https://github.com/Lightricks/LTX-2) and bitsandbytes 0.49.1.\n\n## What I Was Trying to Do\n\nA2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses `A2VidPipelineTwoStage`:\n\n```\nprompt + audio_path + image\n   ↓ stage_1 (generate video latent at low resolution, audio fixed)\n   ↓ spatial upsample 2x\n   ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)\n   ↓ video VAE decode + embed original input audio\nmp4 output\n```\n\nThe official pipeline builds → runs → frees each component inside every `__call__`, which means ~50 seconds of disk I/O per request. 
I wanted to keep everything resident in memory.\n\n## Dead-End 1: VRAM Breakdown in Persistent Mode\n\nLoading every LTX-2 component into VRAM at once (all bf16):\n\n| Component | VRAM |\n|---|---|\n| embeddings processor | 5.91 GiB |\n| Gemma3-12B text encoder | 22.78 GiB |\n| stage_1 transformer | 35.38 GiB |\n| stage_2 transformer (distilled LoRA applied) | 35.38 GiB |\n| video VAE encoder | 0.60 GiB |\n| audio VAE encoder | 0.04 GiB |\n| spatial upsampler | 0.92 GiB |\n| video decoder | 0.76 GiB |\n| Total | 101.77 GiB |\n\n102 GiB doesn't fit in 96 GiB. It died mid-way through loading the stage_2 transformer with `CUDA out of memory. Tried to allocate 128.00 MiB.`\n\n## Dead-End 2: \"Gemma Is Small\" Is a Misconception\n\nMy intuition was \"a 12B text encoder can't be that heavy\", but it actually loads at 22.78 GiB. With 12B parameters in bf16, that's exactly what you'd expect.\n\nThe model filename is `gemma-3-12b-it-qat-q4_0-unquantized`. Here, `qat-q4_0` means it was trained with Quantization-Aware Training for q4_0, and `unquantized` means the weights are stored as pre-quantization bf16. **If you're using it as intended, you should load it in q4_0.** Loading it in bf16 is technically valid but wasteful, like running a quantized model at full precision.\n\n## Fix 1: 4-bit Loading with bitsandbytes\n\nLTX-2's Gemma loader uses `transformers.Gemma3ForConditionalGeneration` internally, so bnb 4-bit works cleanly. 
I bypass the LTX-2 custom loader path and use `from_pretrained` directly:\n\n``` python\nimport torch\nfrom transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration\n\nquant_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_compute_dtype=torch.bfloat16,\n    bnb_4bit_use_double_quant=True,\n    bnb_4bit_quant_type=\"nf4\",\n)\nmodel = Gemma3ForConditionalGeneration.from_pretrained(\n    gemma_root,\n    quantization_config=quant_config,\n    device_map={\"\": \"cuda:0\"},\n    torch_dtype=torch.bfloat16,  # ← dtype for non-quantized layers (embeddings, etc.)\n    local_files_only=True,\n)\n```\n\nIf you omit `torch_dtype`, embeddings load as fp16 and clash with `Linear4bit`'s `bnb_4bit_compute_dtype` (bf16): `mat1 and mat2 must have the same dtype, but got Half and BFloat16`. I hit that too.\n\nThe patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine; just call `create_and_populate(encoder)`. Since bnb quantization only replaces `nn.Linear`, Embedding layers and buffers pass through untouched.\n\nResult: Gemma's VRAM drops from **22.78 GiB → 7.26 GiB**. That's about 15.5 GiB freed.\n\n## Dead-End 3: Even With That, Persistent Mode Can't Coexist\n\nWith Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, `nvidia-smi` shows 91 GiB). Headroom: 4 GiB. Inference workspace during generation (with CFG, roughly +5 GiB) blows past that, peaking at 91 GiB. 
Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes **OOM inevitable no matter how you slice it**.\n\nThree options:\n\n- Offload TTS+Ditto (voice chat unavailable while A2V runs)\n- Keep only one transformer resident (still leaves OOM risk)\n- **Cold-start: build → run → free all weights per request**\n\nSince I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a \"cinematic\" feature, I went with option 3.\n\n## Fix 2: Cold-Start Architecture\n\nThe key insight: the pipeline object itself is lightweight, because the Builder only mmaps and doesn't load actual weights into VRAM. So I hold the `A2VidPipelineTwoStage` instance in memory, and let the official implementation's context-manager-per-component build → run → free on every `__call__`.\n\n``` python\nclass PersistentA2VPipeline:\n    def __init__(self, ..., cold_start: bool):\n        self.pipeline = A2VidPipelineTwoStage(...)  # builder only, nearly zero VRAM\n        if cold_start:\n            return  # done here\n        # persistent mode only: start preloading components from here\n\n    def _generate_cold(self, ...):\n        # pipeline.__call__ handles component build/free internally\n        video, audio = self.pipeline(prompt=..., audio_path=..., images=...)\n        encode_video(video, audio, output_path, ...)\n```\n\nSince stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: **39.50 GiB**. After generation completes, everything is freed, back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).\n\n```\n[mode] cold-start: components load per-request (slow first call, low idle VRAM)\n[cuda] cold-start startup (no preload): allocated=0.00GiB\n...\n[cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB\n```\n\nWhile voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. 
When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later, so allocation is fully dynamic.\n\n## Gotcha: Audio VAE Preprocessing\n\nThe A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono gives you `expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead` from Conv2d.\n\nAlso, if the input audio is shorter than `num_frames / frame_rate`, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.\n\nBoth handled with a single ffmpeg call:\n\n```\n# mono → stereo + silence padding in one pass\nffmpeg -y -i input.wav -ac 2 -af apad -t 2.041667 output.wav\n```\n\nOn the server side, check channels and duration with `av`, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.\n\n## Numbers and Tradeoffs\n\n| Metric | Persistent | Cold-Start |\n|---|---|---|\n| Idle VRAM | 86 GiB | 0 GiB |\n| Peak VRAM during generation | 91 GiB | 40 GiB |\n| Time per request | ~17s (inference only) | ~60s (including disk I/O) |\n| TTS+Ditto coexistence | Impossible (OOM) | Possible |\n| OS page cache effect | None | ~25-30s from 2nd request onward |\n\nThe cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for \"one cinematic shot every 1-2 minutes\" or \"inserted at scene transitions.\"\n\n## Strategic Role\n\nI originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed, but when I tested 256×256, quality fell apart (out of the training bucket distribution). 
AI upscaling from degraded input can't restore lip-sync accuracy.\n\nThe revised split:\n\n- **Real-time conversation**: MuseTalk + multilingual TTS (TTFA ~930ms, already running)\n- **Async cinematic moments**: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars, or anywhere else a 60-second generation wait is acceptable\n\nThe cold-start design only makes sense under the premise that \"the wait is part of the production value.\" That's what this architecture is built around.\n\nWe're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems. Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at [/articles](https://kotonia.ai/articles/).", "url": "https://wpnews.pro/news/running-ltx-2-3-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/running-ltx-23-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start-architecture-2ee3", "published_at": "2026-05-22 11:23:07+00:00", "updated_at": "2026-05-22 11:34:27.170893+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "hardware", "research"], "entities": ["LTX-2.3", "TTS", "Ditto", "MuseTalk", "RTX Pro 6000 Blackwell Max-Q", "bitsandbytes", "Gemma-3-12b-it-qat-q4_0-unquantized", "NVIDIA"], "alternates": {"html": "https://wpnews.pro/news/running-ltx-2-3-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start", "markdown": "https://wpnews.pro/news/running-ltx-2-3-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start.md", "text": "https://wpnews.pro/news/running-ltx-2-3-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start.txt", "jsonld": "https://wpnews.pro/news/running-ltx-2-3-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start.jsonld"}}