Local AI: 775 tok/s, DiffusionGemma (BF16) on Nvidia RTX 6000 Pro

A developer achieved 775 tokens per second running the full BF16 DiffusionGemma model on an Nvidia RTX 6000 Pro using a Red Hat fork of vLLM, demonstrating extremely fast local AI inference at short contexts. However, performance degrades sharply with longer inputs, with a time-to-first-token of 22 seconds at 100,000 tokens. The result highlights the trade-off between speed and context length for locally deployed large language models.

Insanely Fast Local AI: 775 tokens per second 🤯 I was able to run the new DiffusionGemma full BF16 model by @googlegemma on the vLLM fork by Red Hat on an Nvidia RTX 6000 Pro. It's blazing fast at short contexts, but gets slow very quickly. At 100k tokens, TTFT is 22s. Leave a comment for the setup and command to run the model.
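
To put the two reported figures side by side, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions the post does not state: that the 22 s time-to-first-token (TTFT) at 100k tokens is spent almost entirely on processing the prompt, and that generation at that context length could still reach the 775 tok/s measured at short contexts. The post says speed drops quickly with context, so the second assumption is optimistic and the total below is a lower bound on latency. The function names and the 500-token reply length are hypothetical, chosen only for illustration.

```python
def implied_prompt_rate(prompt_tokens: int, ttft_seconds: float) -> float:
    """Prompt tokens processed per second, assuming TTFT is all prompt processing."""
    return prompt_tokens / ttft_seconds


def estimated_response_time(ttft_seconds: float, output_tokens: int,
                            generation_tok_per_s: float) -> float:
    """Simplified latency model: wait for the first token, then generate at a fixed rate."""
    return ttft_seconds + output_tokens / generation_tok_per_s


# Figures reported in the post.
PROMPT_TOKENS = 100_000
TTFT_SECONDS = 22.0
SHORT_CONTEXT_TOK_PER_S = 775.0

# Hypothetical reply length, not from the post.
OUTPUT_TOKENS = 500

prompt_rate = implied_prompt_rate(PROMPT_TOKENS, TTFT_SECONDS)

# Optimistic: reuses the short-context generation rate at 100k context.
total_seconds = estimated_response_time(
    TTFT_SECONDS, OUTPUT_TOKENS, SHORT_CONTEXT_TOK_PER_S
)
ttft_share = TTFT_SECONDS / total_seconds

print(f"Implied prompt processing rate: {prompt_rate:,.0f} tok/s")
print(f"Estimated time for a {OUTPUT_TOKENS}-token reply: {total_seconds:.2f} s")
print(f"Share of that time spent waiting for the first token: {ttft_share:.1%}")
```

Under these assumptions, nearly all of the wait at 100k tokens comes before the first token appears, which is why a high generation rate alone does not make long-context use feel fast.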