Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture

Technical solution for running the LTX-2.3 audio-to-video model (22B parameters) alongside TTS and other models on a single 96GB GPU by switching from a persistent server architecture to a cold-start design. The author reduced VRAM usage by loading the Gemma-3-12B text encoder in 4-bit quantization using bitsandbytes, dropping its footprint from 22.78 GiB to 7.26 GiB, but still faced OOM issues with a total persistent footprint of ~86 GiB. The final solution uses a cold-start approach where the pipeline object is held in memory but components are built, run, and freed per request, allowing the system to idle at 0 GiB and peak at only 40 GiB during generation.

When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.

Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: [LTX-2 official repo](https://github.com/Lightricks/LTX-2) and bitsandbytes 0.49.1.

## What I Was Trying to Do

A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses `A2VidPipelineTwoStage`:

```
prompt + audio path + image
  ↓ stage 1 (generate video latent at low resolution, audio fixed)
  ↓ spatial upsample 2x
  ↓ stage 2 (refinement at high resolution, distilled LoRA-384 applied)
  ↓ video VAE decode + embed original input audio
mp4 output
```

The official pipeline builds → runs → frees each component inside every `__call__`, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.

## Dead-End 1: VRAM Breakdown in Persistent Mode

Loading every LTX-2 component into VRAM at once (all bf16):

| Component | VRAM |
|---|---|
| embeddings processor | 5.91 GiB |
| Gemma3-12B text encoder | 22.78 GiB |
| stage 1 transformer | 35.38 GiB |
| stage 2 transformer (distilled LoRA applied) | 35.38 GiB |
| video VAE encoder | 0.60 GiB |
| audio VAE encoder | 0.04 GiB |
| spatial upsampler | 0.92 GiB |
| video decoder | 0.76 GiB |
| Total | 101.77 GiB |

Roughly 102 GiB doesn't fit on a 96 GB card (94.97 GiB usable). It died mid-way through loading the stage 2 transformer with `CUDA out of memory. Tried to allocate 128.00 MiB.`

## Dead-End 2: "Gemma Is Small" Is a Misconception

My intuition was "a 12B text encoder can't be that heavy", but it actually loads at 22.78 GiB. With 12B parameters at 2 bytes each in bf16, that is about what you'd expect (12 × 10^9 parameters × 2 bytes ≈ 22.4 GiB). The model filename is `gemma-3-12b-it-qat-q4_0-unquantized`.
Here, `qat-q4_0` means it was trained with Quantization-Aware Training for q4_0, and `unquantized` means the weights are stored as pre-quantization bf16. If you're using it as intended, you should load it in 4-bit. Loading it in bf16 is technically valid but wasteful: you pay full-precision memory for a model that was trained to survive 4-bit quantization.

## Fix 1: 4-bit Loading with bitsandbytes

LTX-2's Gemma loader uses `transformers.Gemma3ForConditionalGeneration` internally, so bnb 4-bit works cleanly. (NF4 is not the same format as q4_0, so this approximates the intended 4-bit deployment rather than reproducing it.) I bypass the LTX-2 custom loader path and use `from_pretrained` directly:

```python
import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    gemma_root,  # local directory holding the Gemma checkpoint
    quantization_config=quant_config,
    device_map={"": "cuda:0"},
    torch_dtype=torch.bfloat16,  # dtype for non-quantized layers (embeddings, etc.)
    local_files_only=True,
)
```

If you omit `torch_dtype`, embeddings load as fp16 and clash with `Linear4bit`'s `bnb_4bit_compute_dtype` (bf16): `mat1 and mat2 must have the same dtype, but got Half and BFloat16`. I hit that too.

The patches LTX-2 applies to Gemma (RoPE `inv_freq` / `embed_scale` / `position_ids` `register_buffer`) still work fine: just call `create_and_populate_encoder`. Since bnb quantization only replaces `nn.Linear`, `Embedding` layers and buffers pass through untouched.

Result: Gemma's VRAM drops from 22.78 GiB → 7.26 GiB. That's about 15.5 GiB freed.

## Dead-End 3: Even With That, Persistent Mode Can't Coexist

With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, nvidia-smi shows 91 GiB). Headroom: about 4 GiB, measured against what nvidia-smi reports. The inference workspace during generation (with CFG, roughly +5 GiB) blows past that, with allocated memory peaking at about 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes OOM inevitable no matter how you slice it.
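The budget arithmetic behind this dead end is easy to check by hand. The sketch below is only a back-of-the-envelope calculation using the figures quoted in this post; the dictionary keys, constants, and the helper name `fits` are made up for illustration, and it ignores allocator fragmentation and CUDA context overhead, which is why the measured numbers run slightly higher.

```python
# Back-of-the-envelope VRAM budget using the figures quoted in this post (GiB).
# All names here are illustrative; none of them come from LTX-2.
GPU_TOTAL = 94.97

components_bf16 = {
    "embeddings processor": 5.91,
    "gemma3-12b text encoder": 22.78,
    "stage 1 transformer": 35.38,
    "stage 2 transformer": 35.38,
    "video VAE encoder": 0.60,
    "audio VAE encoder": 0.04,
    "spatial upsampler": 0.92,
    "video decoder": 0.76,
}
GEMMA_4BIT = 7.26        # Gemma after bnb 4-bit loading
WORKSPACE = 5.0          # rough inference workspace with CFG
NEIGHBOURS = 3.4 + 3.0   # TTS + Ditto sharing the GPU
COLD_START_PEAK = 39.50  # measured peak of the cold-start design


def fits(*usages: float, total: float = GPU_TOTAL) -> bool:
    """Return True if the summed usages stay within the card's memory."""
    return sum(usages) <= total


all_bf16 = sum(components_bf16.values())
persistent_4bit = all_bf16 - components_bf16["gemma3-12b text encoder"] + GEMMA_4BIT

print(f"all bf16:                {all_bf16:.2f} GiB")
print(f"persistent, Gemma 4-bit: {persistent_4bit:.2f} GiB")
print(f"persistent + voice fits: {fits(persistent_4bit, WORKSPACE, NEIGHBOURS)}")
print(f"cold-start + voice fits: {fits(COLD_START_PEAK, NEIGHBOURS)}")
```

Summing the table gives about 86.25 GiB for the 4-bit persistent case, within rounding of the measured 86.26 GiB. Adding the workspace and the voice stack lands near 97.7 GiB, which is over the 94.97 GiB limit even before allocator overhead is counted.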
Three options:

- Offload TTS+Ditto (voice chat unavailable while A2V runs)
- Keep only one transformer resident (still leaves OOM risk)
- Cold-start: build → run → free all weights per request

Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930 ms) running while using LTX-2 as a "cinematic" feature, I went with the third option.

## Fix 2: Cold-Start Architecture

The key insight: the pipeline object itself is lightweight. The Builder only mmaps; it doesn't load actual weights into VRAM. So I hold the `A2VidPipelineTwoStage` instance in memory, and let the official implementation's context-manager-per-component build → run → free on every `__call__`.

```python
# Sketch only: `...` marks arguments elided in this post.
class PersistentA2VPipeline:
    def __init__(self, ..., cold_start: bool):
        self.pipeline = A2VidPipelineTwoStage(...)  # builder only, nearly zero VRAM
        if cold_start:
            return  # done here
        # persistent mode only: start preloading components from here

    def generate_cold(self, ...):
        # pipeline.__call__ handles component build/free internally
        video, audio = self.pipeline(prompt=..., audio_path=..., images=...)
        encode_video(video, audio, output_path, ...)
```

Since stage 1 and stage 2 run sequentially, only one transformer is in VRAM at a time. Measured peak: 39.50 GiB. After generation completes, everything is freed, and usage returns to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).

```
mode cold-start: components load per-request slow first call, low idle VRAM
cuda cold-start startup (no preload): allocated=0.00GiB ...
cuda after cold-start generate: allocated=0.01GiB peak=39.50GiB
```

While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later: fully dynamic allocation.

## Gotcha: Audio VAE Preprocessing

The A2V audio VAE encoder expects a 2-channel stereo waveform, but TTS output is typically mono. Passing mono makes `Conv2d` fail with `expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead`.
Also, if the input audio is shorter than `num_frames / frame_rate`, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input. Both are handled with a single ffmpeg call:

```
# mono → stereo + silence padding in one pass
ffmpeg -y -i input.wav -ac 2 -af apad -t 2.041667 output.wav
```

On the server side, check channels and duration with `av`, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.

## Numbers and Tradeoffs

| Metric | Persistent | Cold-Start |
|---|---|---|
| Idle VRAM | 86 GiB | 0 GiB |
| Peak VRAM during generation | 91 GiB | 40 GiB |
| Time per request | ~17 s (inference only) | ~60 s (including disk I/O) |
| TTS+Ditto coexistence | Impossible (OOM) | Possible |
| OS page cache effect | None | ~25-30 s from 2nd request onward |

The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60 s. After the OS page cache warms up: ~25-30 s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."

## Strategic Role

I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed, but when I tested 256×256, quality fell apart (out of the training bucket distribution). AI upscaling from degraded input can't restore lip-sync accuracy.

The revised split:

- **Real-time conversation**: MuseTalk + multilingual TTS (TTFA ~930 ms, already running)
- **Async cinematic moments**: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars; anywhere a 60-second generation wait is acceptable

The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.

We're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems.
Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at [/articles](https://kotonia.ai/articles/).
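For reference, the "run ffmpeg only when needed" decision from the audio preprocessing gotcha might be sketched roughly as follows. This is a hypothetical helper, not code from the actual server: the function name `plan_audio_fix` and its parameters are assumptions, and the caller is expected to supply the channel count and duration (the post reads them with `av`). The 49 frames at 24 fps in the usage example are also an assumption, chosen only because they reproduce the 2.041667 value in the ffmpeg command.

```python
from typing import List, Optional


def plan_audio_fix(
    channels: int,
    duration_s: float,
    num_frames: int,
    frame_rate: float,
    src: str,
    dst: str,
) -> Optional[List[str]]:
    """Return an ffmpeg argv that fixes the audio, or None if src is usable as-is."""
    target_s = num_frames / frame_rate
    needs_stereo = channels != 2
    needs_padding = duration_s < target_s
    if not (needs_stereo or needs_padding):
        return None  # pass the original file through untouched
    # -ac 2 converts to stereo, apad appends silence, and -t cuts the padded
    # stream at target_s (so longer mono input is also trimmed to target_s).
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "2", "-af", "apad",
        "-t", f"{target_s:.6f}", dst,
    ]


# Mono clip shorter than the video: needs both fixes.
cmd = plan_audio_fix(1, 1.5, 49, 24.0, "input.wav", "output.wav")
print(" ".join(cmd))

# Stereo clip that is already long enough: nothing to do.
print(plan_audio_fix(2, 3.0, 49, 24.0, "input.wav", "output.wav"))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Returning `None` for the already-valid case keeps the zero-copy path explicit at the call site.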