# Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture

> Source: <https://dev.to/shinji_shimizu_bb51276a5e/running-ltx-23-alongside-tts-on-a-single-96gb-gpu-with-a-cold-start-architecture-2ee3>
> Published: 2026-05-22 11:23:07+00:00

When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.

Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: [LTX-2 official repo](https://github.com/Lightricks/LTX-2) and bitsandbytes 0.49.1.

## What I Was Trying to Do

A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses `A2VidPipelineTwoStage`:

```
prompt + audio_path + image
   ↓ stage_1 (generate video latent at low resolution, audio fixed)
   ↓ spatial upsample 2x
   ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)
   ↓ video VAE decode + embed original input audio
mp4 output
```

The official pipeline builds → runs → frees each component inside every `__call__`, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.

## Dead-End 1: VRAM Breakdown in Persistent Mode

Loading every LTX-2 component into VRAM at once (all bf16):

| Component | VRAM |
|---|---|
| embeddings processor | 5.91 GiB |
| Gemma3-12B text encoder | 22.78 GiB |
| stage_1 transformer | 35.38 GiB |
| stage_2 transformer (distilled LoRA applied) | 35.38 GiB |
| video VAE encoder | 0.60 GiB |
| audio VAE encoder | 0.04 GiB |
| spatial upsampler | 0.92 GiB |
| video decoder | 0.76 GiB |
| Total | 101.77 GiB |

102 GiB doesn't fit in 96 GiB. It died mid-way through loading the stage_2 transformer with `CUDA out of memory. Tried to allocate 128.00 MiB.`
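The total is easy to re-derive from the table. A minimal sketch that sums the measured figures and compares them with the card's 94.97 GiB; the dictionary and variable names are only labels chosen for illustration:

```python
# Component sizes in GiB, copied from the table above (all bf16).
components_gib = {
    "embeddings processor": 5.91,
    "Gemma3-12B text encoder": 22.78,
    "stage_1 transformer": 35.38,
    "stage_2 transformer": 35.38,
    "video VAE encoder": 0.60,
    "audio VAE encoder": 0.04,
    "spatial upsampler": 0.92,
    "video decoder": 0.76,
}
gpu_capacity_gib = 94.97  # RTX Pro 6000 Blackwell Max-Q, as reported above

total_gib = sum(components_gib.values())
shortfall_gib = total_gib - gpu_capacity_gib

print(f"total: {total_gib:.2f} GiB")
print(f"over capacity by: {shortfall_gib:.2f} GiB")
```

The two transformers alone account for 70.76 GiB, which is why the load fails while the second one is still coming in.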

## Dead-End 2: "Gemma Is Small" Is a Misconception

My intuition was "a 12B text encoder can't be that heavy" — but it actually loads at 22.78 GiB. With 12B parameters in bf16, that's exactly what you'd expect.

The model filename is `gemma-3-12b-it-qat-q4_0-unquantized`. Here, `qat-q4_0` means it was trained with Quantization-Aware Training for q4_0, and `unquantized` means the weights are stored as pre-quantization bf16. **If you're using it as intended, you should load it in q4_0.** Loading it in bf16 is technically valid but wasteful, like running a quantized model at full precision.
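A back-of-envelope estimate shows where the 22.78 GiB comes from and what a 4-bit load could save. This is a rough sketch under two assumptions that are not from the measurements above: "12B" is taken literally as 12e9 parameters, and q4_0 is taken in the GGML sense (blocks of 32 four-bit weights plus one fp16 scale, so 4.5 bits per weight). The helper name `weight_footprint_gib` is made up for this example.

```python
GIB = 1024 ** 3


def weight_footprint_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring buffers and activations."""
    return num_params * bits_per_param / 8 / GIB


nominal_params = 12e9  # assumption: "12B" taken literally

bf16_gib = weight_footprint_gib(nominal_params, 16)
# Assumption: GGML-style q4_0 stores 32 four-bit weights + one fp16 scale
# per block, i.e. (32 * 4 + 16) / 32 = 4.5 bits per weight.
q4_0_gib = weight_footprint_gib(nominal_params, 4.5)

print(f"bf16: {bf16_gib:.2f} GiB")
print(f"q4_0: {q4_0_gib:.2f} GiB")
```

The bf16 estimate lands a little below the measured 22.78 GiB, presumably because the real checkpoint holds somewhat more than 12e9 parameters plus buffers.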

## Fix 1: 4-bit Loading with bitsandbytes

LTX-2's Gemma loader uses `transformers.Gemma3ForConditionalGeneration` internally, so bnb 4-bit works cleanly. I bypass the LTX-2 custom loader path and use `from_pretrained` directly:

``` python
import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    gemma_root,
    quantization_config=quant_config,
    device_map={"": "cuda:0"},
    torch_dtype=torch.bfloat16,  # ← dtype for non-quantized layers (embeddings, etc.)
    local_files_only=True,
)
```

If you omit `torch_dtype`, embeddings load as fp16 and clash with `Linear4bit`'s `bnb_4bit_compute_dtype` (bf16): `mat1 and mat2 must have the same dtype, but got Half and BFloat16`. I hit that too.

The patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine: just call `create_and_populate(encoder)`. Since bnb quantization only replaces `nn.Linear`, Embedding layers and buffers pass through untouched.

Result: Gemma's VRAM drops from **22.78 GiB → 7.26 GiB**. That's about 15.5 GiB freed.

## Dead-End 3: Even With That, Persistent Mode Can't Coexist

With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, `nvidia-smi` shows 91 GiB). Headroom: 4 GiB. Inference workspace during generation (with CFG, roughly +5 GiB) blows past that, peaking at 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes **OOM inevitable no matter how you slice it**.
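The budget can be checked with the figures quoted above. A small sketch, using the allocated number (the most favorable one) and treating the +5 GiB workspace as the rough estimate it is; the variable names are illustrative:

```python
gpu_capacity_gib = 94.97

# Figures quoted in this section, in GiB.
ltx_persistent_allocated = 86.26
inference_workspace = 5.0  # rough estimate with CFG
tts = 3.4
ditto = 3.0

required_gib = ltx_persistent_allocated + inference_workspace + tts + ditto
deficit_gib = required_gib - gpu_capacity_gib

print(f"required: {required_gib:.2f} GiB, capacity: {gpu_capacity_gib:.2f} GiB")
if deficit_gib > 0:
    print(f"short by {deficit_gib:.2f} GiB")
else:
    print("fits")
```

Swapping in the reserved figure (88.27 GiB) instead of the allocated one only widens the gap.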

Three options:

- Offload TTS+Ditto (voice chat unavailable while A2V runs)
- Keep only one transformer resident (still leaves OOM risk)
- **Cold-start: build → run → free all weights per request**

Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a "cinematic" feature, I went with option 3.

## Fix 2: Cold-Start Architecture

The key insight: the pipeline object itself is lightweight. The Builder only mmaps; it doesn't load actual weights into VRAM. So I hold the `A2VidPipelineTwoStage` instance in memory, and let the official implementation's context-manager-per-component build → run → free on every `__call__`.

``` python
class PersistentA2VPipeline:
    def __init__(self, ..., cold_start: bool):
        self.pipeline = A2VidPipelineTwoStage(...)  # builder only, nearly zero VRAM
        if cold_start:
            return  # done here
        # persistent mode only: start preloading components from here

    def _generate_cold(self, ...):
        # pipeline.__call__ handles component build/free internally
        video, audio = self.pipeline(prompt=..., audio_path=..., images=...)
        encode_video(video, audio, output_path, ...)
```
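The build → run → free pattern itself can be illustrated without any LTX-2 code. The sketch below is not the official implementation: `run_then_free`, `FakeTransformer`, and `live_components` are hypothetical names, and the fake model holds no weights. It only shows the shape of the idea, where each stage's component exists only for the duration of its own call.

```python
import gc
import weakref


def _release_cuda_cache():
    """Return cached CUDA blocks to the driver, if torch and a GPU are present."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_then_free(build_fn, use_fn):
    """Build a component, use it once, then drop the only reference to it."""
    component = build_fn()
    try:
        return use_fn(component)
    finally:
        del component
        gc.collect()
        _release_cuda_cache()


live_components = weakref.WeakSet()  # tracks which fake models are still alive


class FakeTransformer:
    """Hypothetical stand-in for a large model."""

    def __init__(self, name):
        self.name = name
        live_components.add(self)

    def __call__(self, latent):
        return f"{self.name}({latent})"


latent = run_then_free(lambda: FakeTransformer("stage_1"), lambda m: m("noise"))
after_stage_1 = len(live_components)
video = run_then_free(lambda: FakeTransformer("stage_2"), lambda m: m(latent))
after_stage_2 = len(live_components)

print(video, after_stage_1, after_stage_2)
```

The important detail is that `use_fn` must return something that does not keep the component alive; otherwise the `del` inside `run_then_free` frees nothing.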

Since stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: **39.50 GiB**. After generation completes, everything is freed — back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).

```
[mode] cold-start: components load per-request (slow first call, low idle VRAM)
[cuda] cold-start startup (no preload): allocated=0.00GiB
...
[cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB
```

While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later — fully dynamic allocation.

## Gotcha: Audio VAE Preprocessing

The A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono gives you `expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead` from Conv2d.

Also, if the input audio is shorter than `num_frames / frame_rate`, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.

Both handled with a single ffmpeg call:

```
# mono → stereo + silence padding in one pass
ffmpeg -y -i input.wav -ac 2 -af apad -t 2.041667 output.wav
```

On the server side, check channels and duration with `av`, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.
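The check-then-convert logic might look roughly like this. To keep the sketch dependency-free it probes the file with the standard-library `wave` module instead of `av`, so it only handles PCM WAV input. `probe_wav`, `build_ffmpeg_fix`, and `write_silence` are hypothetical helper names, and the 49-frame, 24 fps setting is a guess that happens to reproduce the `-t 2.041667` above, not a value stated in this post.

```python
import os
import tempfile
import wave


def probe_wav(path):
    """Return (channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getnframes() / wav.getframerate()


def build_ffmpeg_fix(src, dst, num_frames, frame_rate):
    """Return an ffmpeg argv that repairs `src`, or None if it is usable as-is."""
    target_seconds = num_frames / frame_rate
    channels, duration = probe_wav(src)
    if channels == 2 and duration >= target_seconds:
        return None  # both conditions hold: pass the original file through
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "2",                     # mono -> stereo
        "-af", "apad",                  # pad the tail with silence
        "-t", f"{target_seconds:.6f}",  # cut the padded stream at the target length
        dst,
    ]


def write_silence(path, channels, seconds, rate=16000):
    """Demo helper: write a silent 16-bit PCM WAV."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    mono = os.path.join(tmp, "tts_mono.wav")
    stereo = os.path.join(tmp, "long_stereo.wav")
    write_silence(mono, channels=1, seconds=1.0)
    write_silence(stereo, channels=2, seconds=3.0)

    # Assumed settings: 49 frames at 24 fps.
    cmd_mono = build_ffmpeg_fix(mono, os.path.join(tmp, "fixed.wav"), 49, 24)
    cmd_stereo = build_ffmpeg_fix(stereo, os.path.join(tmp, "unused.wav"), 49, 24)

print(cmd_mono)
print(cmd_stereo)
```

A non-`None` result would be handed to `subprocess.run(cmd, check=True)` and the pipeline pointed at `dst`; a `None` result means the original path can be passed straight through.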

## Numbers and Tradeoffs

| Metric | Persistent | Cold-Start |
|---|---|---|
| Idle VRAM | 86 GiB | 0 GiB |
| Peak VRAM during generation | 91 GiB | 40 GiB |
| Time per request | ~17s (inference only) | ~60s (including disk I/O) |
| TTS+Ditto coexistence | Impossible (OOM) | Possible |
| OS page cache effect | None | ~25-30s from 2nd request onward |

The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."
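The timings above hang together under simple arithmetic. This sketch assumes that inference itself takes the same ~17 s in both modes, which is an assumption rather than something measured separately here:

```python
weights_read_gb = 73.0    # read from NVMe per cold request
read_seconds = 40.0       # approximate disk I/O time
inference_seconds = 17.0  # persistent-mode time per request

# Effective sequential read throughput implied by the two figures above.
throughput_gb_per_s = weights_read_gb / read_seconds

# First cold request: disk I/O plus inference.
cold_first_estimate = read_seconds + inference_seconds

# Warm page cache: whatever remains after subtracting inference time
# is load overhead that the cache does not remove.
warm_low, warm_high = 25.0, 30.0
warm_overhead = (warm_low - inference_seconds, warm_high - inference_seconds)

print(f"read throughput: {throughput_gb_per_s:.3f} GB/s")
print(f"estimated first cold request: {cold_first_estimate:.0f} s")
print(f"residual load overhead when warm: {warm_overhead[0]:.0f}-{warm_overhead[1]:.0f} s")
```

The estimate for the first request comes out close to the observed ~60 s, and the warm-cache numbers suggest that several seconds of load overhead remain even when nothing is read from disk.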

## Strategic Role

I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed — but when I tested 256×256, quality fell apart (out of the training bucket distribution). AI upscaling from degraded input can't restore lip-sync accuracy.

The revised split:

- **Real-time conversation**: MuseTalk + multilingual TTS (TTFA ~930ms, already running)
- **Async cinematic moments**: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars; anywhere a 60-second generation wait is acceptable

The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.

We're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems. Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at [/articles](https://kotonia.ai/articles/).
