BeeLlama v0.2.0: 164 tok/s on a 27B model, one RTX 3090
BeeLlama v0.2.0 demonstrates that speculative decoding can achieve a 4.4x to 4.93x throughput multiplier on a single RTX 3090, running 27B and 31B parameter models at baselines of 37 and 36 tokens per second versus 164-178 token…
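
The multiplier quoted above is simply speculative throughput divided by baseline throughput. The sketch below reproduces that arithmetic from the rounded figures in the text. The pairing of 37 with 164 tok/s for the 27B model and 36 with 178 tok/s for the 31B model is an assumption inferred from the order in which the numbers are listed, and `speedup` is a hypothetical helper, not part of BeeLlama.

```python
def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """Return the throughput multiplier: speculative tok/s over baseline tok/s."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return speculative_tps / baseline_tps


# Assumed pairing of the rounded figures quoted in the text:
# model label -> (baseline tok/s, speculative-decoding tok/s).
runs = {
    "27B": (37.0, 164.0),
    "31B": (36.0, 178.0),
}

for model, (baseline, speculative) in runs.items():
    print(f"{model}: {speedup(baseline, speculative):.2f}x")
```

With these rounded inputs the ratios come out near 4.43x and 4.94x, in line with the 4.4x to 4.93x range reported; the small difference at the upper end would be expected if the published figures were computed from unrounded measurements.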