# Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast, and Why optimum-quanto Was a Trap

The author successfully reduced peak VRAM usage of the LTX-2 22B video generation model from 40 GiB to 24 GiB using the model's native `fp8_cast` quantization method. In contrast, the author found that `optimum-quanto` quantization (int8/fp8) was incompatible with the LTX-2 transformer, causing the model to crash during inference. The post documents the debugging process and explains why the native `fp8_cast` approach was chosen over the `optimum-quanto` alternative.

## Introduction

[LTX-2.3](https://github.com/Lightricks/LTX-Video) is a video generation model from Lightricks that includes audio support. In A2V (Audio-to-Video) mode, it takes a single image + audio + prompt and generates lip sync, facial expressions, and head/hair motion all at once. Unlike lip-sync-only models like MuseTalk, it can animate an entire scene, which makes it a powerful tool for directing.

The catch: the base checkpoint is 22B parameters / 43 GB, and keeping both transformer stages resident in bf16 burns ~86 GiB at idle. On an RTX PRO 6000 Blackwell with 96 GiB, that leaves almost nothing for the TTS / Ditto-TalkingHead / Qwen3-TTS-vLLM services running alongside it.

After testing quantization approaches, I got LTX-2's native fp8 cast to compress peak VRAM from 40 GiB → 24 GiB (A2V cold-start, 768×512 / 97f). Meanwhile, optimum-quanto int8/fp8 has a compatibility issue with the LTX-2 transformer and simply doesn't work. This post documents the debugging and the decisions made along the way.

## Environment

- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96 GiB)
- PyTorch: 2.9.1 + CUDA 12.8
- Models: LTX-2.3 22B-dev (base) + 22B-distilled-lora-384 (stage 2) + Gemma-3-12B text encoder (bnb 4-bit)
- Deployment: A2V served via `scripts/persistent_a2v_server.py --cold-start`. Each request does build → run → free; idle is 0 GiB.
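As a quick sanity check on the memory budget, the figures quoted so far can be plugged into a few lines of Python. This is only arithmetic on numbers already stated in this post (a 96 GiB card, ~86 GiB resident in bf16, a 40 GiB → 24 GiB cold-start peak). The helper names `headroom` and `reduction_pct` are made up for illustration, and nothing here measures real VRAM.

```python
# Figures quoted in this post, in GiB. Nothing is measured by this snippet.
GPU_TOTAL_GIB = 96.0
RESIDENT_BF16_IDLE_GIB = 86.0  # both transformer stages kept loaded in bf16
COLD_START_IDLE_GIB = 0.0      # build -> run -> free on every request
PEAK_BEFORE_GIB = 40.0         # A2V cold-start peak before quantization
PEAK_FP8_CAST_GIB = 24.0       # A2V cold-start peak with native fp8 cast


def headroom(total_gib: float, used_gib: float) -> float:
    """VRAM left over for the other services while A2V holds `used_gib`."""
    return total_gib - used_gib


def reduction_pct(before_gib: float, after_gib: float) -> float:
    """Saving expressed as a percentage of the starting value."""
    return 100.0 * (before_gib - after_gib) / before_gib


print("idle headroom, resident bf16:", headroom(GPU_TOTAL_GIB, RESIDENT_BF16_IDLE_GIB), "GiB")
print("idle headroom, cold-start:   ", headroom(GPU_TOTAL_GIB, COLD_START_IDLE_GIB), "GiB")
print("peak headroom, before:       ", headroom(GPU_TOTAL_GIB, PEAK_BEFORE_GIB), "GiB")
print("peak headroom, fp8 cast:     ", headroom(GPU_TOTAL_GIB, PEAK_FP8_CAST_GIB), "GiB")
print("peak reduction:              ", reduction_pct(PEAK_BEFORE_GIB, PEAK_FP8_CAST_GIB), "%")
```

With both stages resident, only about 10 GiB would remain for the other services at all times; with cold-start the card is free between requests, and the peak during a request is what matters.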
I use cold-start because A2V is called only occasionally, conversation is the main workload, and A2V must coexist with TTS and Ditto. Details are in a separate post.

## Four Candidates

Looking at the LTX-2 codebase, there are actually two quantization paths:

### 1. LTX-2 Native: QuantizationPolicy

`packages/ltx-core/src/ltx_core/quantization/policy.py`:

```python
@dataclass(frozen=True)
class QuantizationPolicy:
    sd_ops: SDOps | None = None             # weight transform at state dict load
    module_ops: tuple[ModuleOps, ...] = ()  # module rewrite after load

    @classmethod
    def fp8_cast(cls) -> "QuantizationPolicy":
        """Load weights as float8_e4m3fn, upcast to bf16 during forward"""
        return cls(
            sd_ops=TRANSFORMER_LINEAR_DOWNCAST_MAP,
            module_ops=(UPCAST_DURING_INFERENCE,),
        )

    @classmethod
    def fp8_scaled_mm(cls) -> "QuantizationPolicy":
        """FP8 scaled MM (requires tensorrt_llm)"""
```

The implementation behind `fp8_cast` is `Fp8CastLinear`:

```python
class Fp8CastLinear(torch.nn.Linear):
    def forward(self, input):
        # Weights (and bias, if present) are stored in fp8 and upcast to the input dtype per call.
        w_up = upcast_and_round(self.weight, input.dtype, ...)
        b_up = upcast_and_round(self.bias, input.dtype, ...) if self.bias is not None else None
        return torch.nn.functional.linear(input, w_up, b_up)
```

It uses the `__class__` reassignment pattern to swap out existing instances in place. Weights are stored in fp8 and upcast to bf16 on every forward pass. The fp8 → bf16 cast cost is essentially noise on Blackwell.

### 2. optimum-quanto

The LTX-2 trainer package (`packages/ltx-trainer`) has a general-purpose quantization path using optimum-quanto, supporting `int8-quanto` / `int4-quanto` / `fp8-quanto`:

```python
def quantize_model(model, precision, ...):
    if hasattr(model, "transformer_blocks"):
        # Move one block at a time to GPU, quantize → freeze → CPU
        quantize_blockwise(model, ...)
    else:
        quantize(model, weights=..., exclude=EXCLUDE_PATTERNS)
        freeze(model)
    return model
```

This looks like it could slot right in after `_build_transformer()`.
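Going back to the first path for a moment: the "store low precision, upcast on every forward" idea and the class-reassignment swap can be illustrated without PyTorch. The sketch below is a dependency-free analogy, not LTX-2's implementation. It uses float16 (via the standard library's `struct` format `e`) as a stand-in for `float8_e4m3fn`, plain Python lists instead of tensors, and leaves the bias at full precision for brevity. The names `Linear`, `CastLinear`, `linear`, and `swap_to_cast` are hypothetical.

```python
import struct


def linear(x, weight, bias):
    """y = W x + b on plain Python sequences."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


class Linear:
    """Tiny stand-in for a linear layer with full-precision weights."""

    def __init__(self, weight, bias=None):
        self.weight = weight  # list of rows
        self.bias = bias

    def forward(self, x):
        return linear(x, self.weight, self.bias)


class CastLinear(Linear):
    """Weights are stored packed at 2 bytes each and decoded on every call."""

    def forward(self, x):
        rows, cols = self.shape
        flat = struct.unpack(f"<{rows * cols}e", self.packed)
        weight = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
        return linear(x, weight, self.bias)


def swap_to_cast(layer):
    """Downcast the stored weights, then retarget the existing instance."""
    rows, cols = len(layer.weight), len(layer.weight[0])
    flat = [v for row in layer.weight for v in row]
    layer.packed = struct.pack(f"<{rows * cols}e", *flat)  # float16 stand-in for fp8
    layer.shape = (rows, cols)
    del layer.weight              # drop the full-precision copy
    layer.__class__ = CastLinear  # existing references now dispatch to CastLinear.forward


layer = Linear([[0.5, -1.25], [2.0, 0.125]], bias=[0.0, 1.0])
before = layer.forward([2.0, 4.0])
swap_to_cast(layer)
after = layer.forward([2.0, 4.0])
print(type(layer).__name__, before, after, len(layer.packed), "bytes")
```

The example weights are chosen to be exactly representable in float16, so the output is unchanged after the swap; with arbitrary weights the downcast would introduce rounding error, which is the trade-off any cast-based scheme accepts in exchange for smaller storage.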
## Candidate Matrix

| Mode | Path | Expected |
|---|---|---|
| `fp8-cast` | LTX-2 native, `sd_ops` loads as `float8_e4m3fn` | ~50% memory reduction, near-identical speed |
| `fp8-scaled-mm` | LTX-2 native, requires `tensorrt_llm` | Faster throughput |
| `int8-quanto` | optimum-quanto, post-build | ~50% memory reduction, speed ± |
| `fp8-quanto` | Same, fp8 variant | Potential to hit native FP8 on Blackwell |

`fp8-scaled-mm` is out: there is no `tensorrt_llm` in this environment. I implemented the remaining three.

## Stepping on a Mine with int8-quanto

The implementation is straightforward:

```python
from ltx_trainer.quantization import quantize_model

transformer_1 = self.pipeline.stage_1._build_transformer()
transformer_1 = quantize_model(transformer_1, "int8-quanto", device=self.device)
self.transformer_stage_1 = freeze(transformer_1)
```

The server starts fine. Idle VRAM looks promising:

```text
load stage 1 transformer no distilled LoRA
quantize stage 1 - int8-quanto
quantize stage 1 done in 0.71s
cuda after stage 1 transformer: allocated=31.28GiB ...
load stage 2 transformer with distilled LoRA
quantize stage 2 - int8-quanto
quantize stage 2 done in 0.52s
cuda after stage 2 transformer: allocated=49.40GiB ...
server A2V listening on http://127.0.0.1:8892
```

Resident memory: 51.7 GiB (an estimated 40% reduction from bf16's 86 GiB). Looks good.

Then the first `/generate` request:

```text
timing prompt encode=0.75s
timing audio encode=0.39s
0%| | 0/30 00:00
```