Most of us aren’t training frontier models — we’re trying to fit a good one onto the
hardware we actually have. The research that makes that possible (quantization, LoRA/PEFT,
mixture-of-experts, FlashAttention, KV-cache tricks, Mamba/SSMs) is scattered across
hundreds of arXiv papers, and it’s some of the fastest-moving work in ML right now.
So I assembled it into one dataset: fineset-io/efficient-llm-papers.
I find it useful as a “what’s the current state of the art for making this cheaper”
reference, and as a clean corpus if you’re fine-tuning a model to reason about
efficiency techniques.
Happy to take suggestions on gaps or answer questions about how the pipeline works.