Sharded Inference of a 229B-Parameter MoE over the Internet at Interactive Speed

A new technical report describes sharded inference of a 229-billion-parameter mixture-of-experts (MoE) model across five consumer GPUs in five countries over the public internet. It reports 12.6 tokens per second in interactive use and 194 tokens per second in batch mode, with a cryptographic receipt on every request.
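To put the two throughput figures in perspective, here is a minimal arithmetic sketch in Python. It only converts the reported numbers; the function name is hypothetical, and reading the batched figure as aggregate throughput across concurrent requests is an assumption the summary does not confirm.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/second to the average time per token in ms."""
    return 1000.0 / tokens_per_second


# Figures quoted in the report summary.
INTERACTIVE_TOK_S = 12.6
BATCHED_TOK_S = 194.0

# Average wall-clock budget per generated token in interactive use.
interactive_ms = ms_per_token(INTERACTIVE_TOK_S)

# Ratio of batched to interactive throughput.
# Assumption: the batched number is aggregate across requests, not per request.
batch_gain = BATCHED_TOK_S / INTERACTIVE_TOK_S

print(f"{interactive_ms:.1f} ms per token, {batch_gain:.1f}x throughput when batched")
```

At 12.6 tok/s, each token has an average budget of about 79 ms, which has to cover both computation and any network round trips between the five machines.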

1/ We published our first technical report today. We ran a 229B model split across five consumer GPUs in five countries over the public internet and measured 12.6 tok/s interactive, 194 tok/s batched. With cryptographic receipts on every request. doi.org/10.5281/zenodo…