AI & ML approaches side by side — what each does, the real numbers, and when to use which.
Transformers vs Mamba
All-pairs attention versus a selective state-space recurrence: quadratic recall against linear-time throughput.

FlashAttention vs PagedAttention
Two attention optimizations that solve different problems, and are used together, not instead of each other.

Dense vs Mixture-of-Experts
Activate every parameter for every token, or route each token to a few of many experts.

ReAct vs Toolformer vs ToolRL
Three eras of teaching a model to use a tool: prompt the loop, filter the data on its own loss, or reward the policy.

PPO vs DPO vs GRPO
Three ways to turn preferences into a better policy: a full RL loop, a single classification loss, or group-relative RL without a critic.

MHA vs GQA vs MLA
Three points on the attention-memory curve: how much of the KV cache you keep decides how long a context you can afford to serve.

GAN vs VAE vs Diffusion
Three ways to learn a distribution and sample from it: an adversarial game, a probabilistic autoencoder, and an iterative denoiser.

FlashAttention vs FlashAttention-3
The same exact-attention algorithm, rebuilt for a new generation of GPU: IO-aware tiling, then Hopper-era asynchrony and FP8.

Speculative Decoding vs Medusa vs EAGLE
Three ways to draft tokens for a target model to verify in parallel: a separate draft model, self-drafting heads, or feature-level autoregression.

Scaling Laws vs Chinchilla
Two readings of the same power laws: one prescribed bigger models, one showed compute-optimal training needs far more data per parameter.

BERT vs GPT vs T5
Three ways to pretrain the same transformer: read both directions, predict the next token, or cast every task as text-to-text.

AWQ vs GPTQ vs BitNet
Three ways to shrink an LLM: scale the salient weights, compensate the rounding with second-order math, or train ternary so the matmul becomes addition.

S4 vs Mamba vs RWKV
The post-Transformer sequence lineage: a structured state space, a selective one, and a linear-attention RNN, all chasing linear cost without losing quality.

CoT vs Self-Consistency vs Tree-of-Thoughts
One chain, many chains, or a searched tree of chains: three rungs of a reasoning ladder, none of which touch the weights.

DDPM vs Flow Matching vs Consistency Models
One family, three answers to the same question: how should a model walk from noise to data?