{"slug": "cutting-ltx-2-22b-peak-vram-by-40-with-fp8-cast-and-why-optimum-quanto-was-a", "title": "Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap", "summary": "The author successfully reduced peak VRAM usage of the LTX-2 22B video generation model from 40 GiB to 24 GiB using the model's native `fp8_cast` quantization method. In contrast, the author found that `optimum-quanto` quantization (int8/fp8) was incompatible with the LTX-2 transformer, causing the model to crash during inference. The post documents the debugging process and explains why the native `fp8_cast` approach was chosen over the `optimum-quanto` alternative.", "body_md": "## Introduction\n\n[LTX-2.3](https://github.com/Lightricks/LTX-Video) is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes **a single image + audio + prompt** and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models like MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.\n\nThe catch: the base checkpoint is 22B parameters / 43 GB, and keeping it resident in bf16 with `transformer × 2 stage` burns **~86 GiB at idle**. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves almost nothing for the TTS / Ditto-TalkingHead / Qwen3-TTS-vLLM services running alongside it.\n\nAfter testing quantization approaches, I got **LTX-2's native fp8_cast to compress peak VRAM from 40 GiB → 24 GiB** (A2V cold-start, 768×512 / 97f). Meanwhile, **`optimum-quanto` int8/fp8 has a compatibility issue with the LTX-2 transformer** and simply doesn't work. 
This post documents the debugging and the decisions made along the way.\n\n## Environment\n\n- **GPU**: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96 GiB)\n- **PyTorch**: 2.9.1 + CUDA 12.8\n- **Models**: LTX-2.3 22B-dev (base) + 22B-distilled-lora-384 (stage_2) + Gemma-3-12B text encoder (bnb 4bit)\n- **Deployment**: A2V served via `scripts/persistent_a2v_server.py --cold-start`. Each request does `build → run → free`; idle is 0 GiB.\n\nI use cold-start because A2V is called occasionally while conversation is the main workload, and it must coexist with TTS and Ditto. Details in a separate post.\n\n## Four Candidates\n\nLooking at the LTX-2 codebase, there are actually two quantization paths:\n\n### 1. LTX-2 Native: `QuantizationPolicy`\n\n`packages/ltx-core/src/ltx_core/quantization/policy.py`:\n\n```\n@dataclass(frozen=True)\nclass QuantizationPolicy:\n    sd_ops: SDOps | None = None              # weight transform at state dict load\n    module_ops: tuple[ModuleOps, ...] = ()   # module rewrite after load\n\n    @classmethod\n    def fp8_cast(cls) -> \"QuantizationPolicy\":\n        \"\"\"Load weights as float8_e4m3fn, upcast to bf16 during forward\"\"\"\n        return cls(\n            sd_ops=TRANSFORMER_LINEAR_DOWNCAST_MAP,\n            module_ops=(UPCAST_DURING_INFERENCE,),\n        )\n\n    @classmethod\n    def fp8_scaled_mm(cls) -> \"QuantizationPolicy\":\n        \"\"\"FP8 scaled MM (requires tensorrt_llm)\"\"\"\n```\n\nThe implementation behind `fp8_cast` is `Fp8CastLinear`:\n\n``` python\nclass Fp8CastLinear(torch.nn.Linear):\n    def forward(self, input):\n        w_up = _upcast_and_round(self.weight, input.dtype, ...)\n        b_up = _upcast_and_round(self.bias, input.dtype, ...) 
if self.bias is not None else None\n        return torch.nn.functional.linear(input, w_up, b_up)\n```\n\nIt uses the `__class__` reassignment pattern to swap out instances. Weights are stored in fp8 and upcast to bf16 on every forward pass. The fp8 → bf16 cast cost is essentially noise on Blackwell.\n\n### 2. optimum-quanto\n\nThe LTX-2 trainer package (`packages/ltx-trainer`) has a general-purpose quantization path using optimum-quanto, supporting `int8-quanto` / `int4-quanto` / `fp8-quanto`:\n\n``` python\ndef quantize_model(model, precision, ...):\n    if hasattr(model, \"transformer_blocks\"):\n        _quantize_blockwise(model, ...)   # move one block at a time to GPU, quantize → freeze → CPU\n    else:\n        quantize(model, weights=..., exclude=EXCLUDE_PATTERNS)\n        freeze(model)\n    return model\n```\n\nThis looks like it could slot right in after `_build_transformer()`.\n\n### Candidate Matrix\n\n| Mode | Path | Expected |\n|---|---|---|\n| `fp8-cast` | LTX-2 native, sd_ops loads as float8_e4m3fn | ~50% memory reduction, near-identical speed |\n| `fp8-scaled-mm` | LTX-2 native, requires tensorrt_llm | Faster throughput |\n| `int8-quanto` | optimum-quanto, post-build | ~50% memory reduction, speed ± |\n| `fp8-quanto` | Same, fp8 variant | Potential to hit native FP8 on Blackwell |\n\n`fp8-scaled-mm` is out: no tensorrt_llm in this environment. I implemented the remaining three.\n\n## Stepping on a Mine with `int8-quanto`\n\nThe implementation is straightforward:\n\n``` python\nfrom ltx_trainer.quantization import quantize_model\n\ntransformer_1 = self.pipeline.stage_1._build_transformer()\ntransformer_1 = quantize_model(transformer_1, \"int8-quanto\", device=self.device)\nself.transformer_stage_1 = _freeze(transformer_1)\n```\n\nThe server starts fine. 
Idle VRAM looks promising:\n\n```\n[load] stage_1 transformer (no distilled LoRA)\n[quantize] stage_1 -> int8-quanto\n[quantize] stage_1 done in 0.71s\n[cuda] after stage_1 transformer: allocated=31.28GiB ...\n[load] stage_2 transformer (with distilled LoRA)\n[quantize] stage_2 -> int8-quanto\n[quantize] stage_2 done in 0.52s\n[cuda] after stage_2 transformer: allocated=49.40GiB ...\n[server] A2V listening on http://127.0.0.1:8892\n```\n\nResident memory: **51.7 GiB** (estimated 40% reduction from bf16's 86 GiB). Looks good.\n\nThen the first `/generate` request:\n\n```\n[timing] prompt_encode=0.75s\n[timing] audio_encode=0.39s\n  0%|          | 0/30 [00:00<?, ?it/s]\n[http] POST /generate 400\n```\n\nCrashes at step 0/30. The error:\n\n```\n{\"error\": \"linear(): argument 'weight' (position 2) must be Tensor, not NoneType\"}\n```\n\nSomething is calling `torch.nn.functional.linear(input, weight=None, bias=None)`. After quanto's `freeze()`, **`self.weight` is being referenced as None somewhere in a Linear layer**.\n\n### Why Does `weight` Become None?\n\nTwo rough hypotheses:\n\n- **LTX-2's Linear layers assume `__class__` reassignment.** Just like `Fp8CastLinear`, the pattern relies on keeping instance state intact while swapping the class-level `forward`. quanto's `quantize()` → `freeze()` **replaces** `nn.Linear` with its own `QLinear` wrapper, and that replacement likely breaks the `weight` attribute reference somewhere in the process.\n- **LTX-trainer's `EXCLUDE_PATTERNS` doesn't work in the blockwise path.** `_quantize_blockwise` pulls out one `transformer_block` at a time and calls `quantize(block, exclude=EXCLUDE_PATTERNS)`. 
But `EXCLUDE_PATTERNS` uses glob patterns like `patchify_proj`, `*adaln*`, `time_proj`, which are relative to the whole model, not to a single block. **They won't match relative paths inside a block**, so layers that should be excluded end up getting quantized.\n\nEither way, fixing this properly means reading through quanto's wrapper implementation plus all the forward paths in the LTX-2 transformer. The cost isn't worth it. **I decided to cut my losses and switch to LTX-2 native fp8_cast.**\n\n## Switching to `fp8_cast`\n\nThree lines of code:\n\n```\n# Just pass the quantization policy when building the pipeline\npipeline_quantization = None\nif transformer_quantization == \"fp8-cast\":\n    from ltx_core.quantization import QuantizationPolicy\n    pipeline_quantization = QuantizationPolicy.fp8_cast()\n\nself.pipeline = A2VidPipelineTwoStage(\n    ...,\n    quantization=pipeline_quantization,\n    ...\n)\n```\n\n`fp8_cast` **downcasts weights to fp8 during the load phase**. Since `sd_ops` hooks into state_dict loading, the 43 GB safetensors file gets fp8-converted during streaming load. Unlike quanto, which fully expands bf16 in memory before quantizing, **peak VRAM never spikes**, which is a nice property.\n\nOn startup:\n\n```\n[load] A2VidPipelineTwoStage builders (pipeline_quantization=QuantizationPolicy(sd_ops=...fp8_cast...))\n...\n[cuda] after stage_1 transformer: allocated=31.30GiB reserved=35.18GiB\n[cuda] after stage_2 transformer: allocated=49.43GiB reserved=53.64GiB\n[server] A2V listening on http://127.0.0.1:8892\n```\n\nResident allocated (51.7 GiB) is on par with int8-quanto, but **reserved is only 53.6 GiB, dramatically lower** (int8-quanto was 70.9 GiB). 
Lower reserved means more headroom for activations.\n\nAnd the first `/generate`:\n\n```\n{\n  \"elapsed_seconds\": 39.367,\n  \"peak_vram_gib\": 57.918,\n  \"width\": 768, \"height\": 512, \"num_frames\": 97\n}\n```\n\n**It works.** Back on track.\n\n## Benchmarks\n\nFixed conditions, persistent + fp8-cast, 3 resolutions × 3 runs each:\n\n- Image: 1024×512 portrait\n- Audio: 9.08-second Japanese sample generated with Irodori-TTS\n- Prompt: \"A young woman speaks calmly to the camera in a softly lit room.\"\n- num_frames: 97 (= 4.04s @ 24fps)\n- seed: 42 fixed\n\n| Resolution | Avg elapsed (s) | Peak VRAM (GiB) |\n|---|---|---|\n| 768×512 / 97f | 39.84 | 57.92 |\n| 1024×768 / 97f | 66.71 | 59.06 |\n| 1280×768 / 97f | 84.02 | 58.30 |\n\nKey observations:\n\n- **Near-zero variance across 3 runs** (fixed seed → byte-identical output mp4)\n- **Peak VRAM is almost independent of resolution** (57.9–59.1 GiB). Resident weights dominate; activation memory is only ~7 GiB\n- **1280×768 now works stably in persistent mode.** This resolution was effectively impossible with bf16 persistent (~91 GiB peak)\n\n## Cold-Start Also Wins\n\nProduction runs in cold-start mode (A2V fires once or twice every few minutes and must coexist with TTS). Since the `fp8_cast` policy is applied via `sd_ops` at pipeline construction time, it carries over naturally to per-request cold-start builds.\n\nCold-start + fp8-cast, single run (768×512 / 97f):\n\n```\n{\n  \"elapsed_seconds\": 88.775,\n  \"peak_vram_gib\": 23.901\n}\n```\n\n| | bf16 cold-start | fp8-cast cold-start |\n|---|---|---|\n| Per-request time | ~60–90s | 88.8s (disk I/O bound, same order) |\n| Peak VRAM | ~40 GiB | 23.9 GiB (~40% reduction) |\n| Idle | 0 GiB | 0 GiB |\n| Coexistence (TTS+Ditto+Qwen3+MuseTalk) | Possible | Comfortable (~30 GiB peak) |\n\nSpeed is bottlenecked by disk I/O so fp8 doesn't hurt, but **freeing up 16 GiB of peak headroom matters**. 
Qwen3-TTS-vLLM (7 GiB) and MuseTalk warmup can now run concurrently with A2V generation without OOM.\n\n## Decision Matrix\n\n| Use case | Recommended mode | Rationale |\n|---|---|---|\n| Conversation-first, A2V occasionally | cold-start + fp8-cast | Idle 0, peak 24 GiB, comfortable coexistence with TTS/Ditto |\n| Frequent A2V (batch generation, automated direction) | persistent + fp8-cast | Pay the 52 GiB resident cost, get 40s/req |\n| 1024+ resolution, quality focus | persistent + fp8-cast | 1280×768 stable (impossible with bf16 persistent) |\n| Single GPU hosting everything | cold-start + fp8-cast | Persistent eats 52 GiB; depends on budget allocation across services |\n\nProduction decision: **cold-start + fp8-cast for now since conversation is primary. Switch to persistent fp8-cast if paying users drive enough A2V volume to justify the idle cost.**\n\n## Summary\n\n- LTX-2 22B at bf16 idle (86 GiB) nearly monopolizes a single GPU. Quantization is close to mandatory.\n- **`optimum-quanto` is incompatible with the LTX-2 transformer.** It dies with `F.linear(weight=None)`. Root cause is likely the `__class__` reassignment pattern and/or `EXCLUDE_PATTERNS` not working correctly in the blockwise path. Not worth digging into.\n- **LTX-2 native `QuantizationPolicy.fp8_cast()` is the right answer.** fp8 at load time, bf16 upcast during forward. Three lines of code to enable.\n- cold-start + fp8-cast: peak 40 → 24 GiB. 
persistent + fp8-cast: 1280×768 becomes usable.\n- LTX-2 also has `fp8_scaled_mm` (requires tensorrt_llm), worth trying if you're willing to set up TRT.\n\n## Appendix: Launch Command and Reproduction\n\nProduction cold-start + fp8-cast launch:\n\n```\nPYTORCH_CUDA_ALLOC_CONF=expandable_segments:True nohup uv run python scripts/persistent_a2v_server.py \\\n  --port 8892 \\\n  --checkpoint-path models/LTX-2.3/ltx-2.3-22b-dev.safetensors \\\n  --distilled-lora-path models/loras/ltx-2.3-22b-distilled-lora-384-1.1.safetensors \\\n  --spatial-upsampler-path models/LTX-2.3/ltx-2.3-spatial-upscaler-x2-1.1.safetensors \\\n  --gemma-root models/gemma-3-12b-it-qat-q4_0-unquantized \\\n  --output-dir outputs/a2v_server \\\n  --transformer-quantization fp8-cast \\\n  --cold-start \\\n  > /tmp/ltx_a2v_server.log 2>&1 &\n```\n\n`persistent_a2v_server.py` is the official LTX-2 repo script extended for A2V. The `--transformer-quantization fp8-cast` flag was added via a local patch.\n\nImplementation patch (key parts):\n\n```\n# scripts/persistent_a2v_server.py\npipeline_quantization = None\nif transform", "url": "https://wpnews.pro/news/cutting-ltx-2-22b-peak-vram-by-40-with-fp8-cast-and-why-optimum-quanto-was-a", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/cutting-ltx-2-22b-peak-vram-by-40-with-fp8cast-and-why-optimum-quanto-was-a-trap-1o8d", "published_at": "2026-05-22 11:23:06+00:00", "updated_at": "2026-05-22 11:34:52.283429+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "open-source", "research", "developer-tools"], "entities": ["Lightricks", "LTX-2", "MuseTalk", "RTX PRO 6000 Blackwell", "NVIDIA", "Qwen3-TTS-vLLM", "optimum-quanto", "Ditto-TalkingHead"], "alternates": {"html": "https://wpnews.pro/news/cutting-ltx-2-22b-peak-vram-by-40-with-fp8-cast-and-why-optimum-quanto-was-a", "markdown": "https://wpnews.pro/news/cutting-ltx-2-22b-peak-vram-by-40-with-fp8-cast-and-why-optimum-quanto-was-a.md", "text": 
"https://wpnews.pro/news/cutting-ltx-2-22b-peak-vram-by-40-with-fp8-cast-and-why-optimum-quanto-was-a.txt", "jsonld": "https://wpnews.pro/news/cutting-ltx-2-22b-peak-vram-by-40-with-fp8-cast-and-why-optimum-quanto-was-a.jsonld"}}