Comparisons — AI & ML approaches side by side | Rudrite Research

Rudrite Research maintains a set of 15 side-by-side comparisons of AI and ML approaches, including Transformers vs Mamba, FlashAttention vs PagedAttention, and PPO vs DPO vs GRPO. Each comparison covers how the methods work, their reported performance numbers, and when to use which, as a practical guide for practitioners.

Comparisons

AI & ML approaches side by side: what each does, the real numbers, and when to use which.

Transformers vs Mamba (/compare/transformers-vs-mamba): All-pairs attention versus a selective state-space recurrence: quadratic recall against linear-time throughput.
FlashAttention vs PagedAttention (/compare/flashattention-vs-pagedattention): Two attention optimizations that solve different problems, and are used together, not instead of each other.
Dense vs Mixture-of-Experts (/compare/dense-vs-mixture-of-experts): Activate every parameter for every token, or route each token to a few of many experts.
ReAct vs Toolformer vs ToolRL (/compare/react-vs-toolformer-vs-toolrl): Three eras of teaching a model to use a tool: prompt the loop, filter the data on its own loss, or reward the policy.
PPO vs DPO vs GRPO (/compare/ppo-vs-dpo-vs-grpo): Three ways to turn preferences into a better policy: a full RL loop, a single classification loss, or group-relative RL without a critic.
MHA vs GQA vs MLA (/compare/mha-vs-gqa-vs-mla): Three points on the attention-memory curve: how much of the KV cache you keep decides how long a context you can afford to serve.
GAN vs VAE vs Diffusion (/compare/gan-vs-vae-vs-diffusion): Three ways to learn a distribution and sample from it: an adversarial game, a probabilistic autoencoder, and an iterative denoiser.
FlashAttention vs FlashAttention-3 (/compare/flashattention-vs-flashattention-3): The same exact-attention algorithm, rebuilt for a new generation of GPU: IO-aware tiling, then Hopper-era asynchrony and FP8.
Speculative Decoding vs Medusa vs EAGLE (/compare/speculative-decoding-vs-medusa-vs-eagle): Three ways to draft tokens for a target model to verify in parallel: a separate draft model, self-drafting heads, or feature-level autoregression.
Scaling Laws vs Chinchilla (/compare/scaling-laws-vs-chinchilla): Two readings of the same power laws: one prescribed bigger models, one showed compute-optimal training needs far more data per parameter.
BERT vs GPT vs T5 (/compare/bert-vs-gpt-vs-t5): Three ways to pretrain the same transformer: read both directions, predict the next token, or cast every task as text-to-text.
AWQ vs GPTQ vs BitNet (/compare/awq-vs-gptq-vs-bitnet): Three ways to shrink an LLM: scale the salient weights, compensate the rounding with second-order math, or train ternary so the matmul becomes addition.
S4 vs Mamba vs RWKV (/compare/s4-vs-mamba-vs-rwkv): The post-Transformer sequence lineage: a structured state space, a selective one, and a linear-attention RNN, all chasing linear cost without losing quality.
CoT vs Self-Consistency vs Tree-of-Thoughts (/compare/cot-vs-self-consistency-vs-tot): One chain, many chains, or a searched tree of chains: three rungs of a reasoning ladder, none of which touch the weights.
DDPM vs Flow Matching vs Consistency Models (/compare/ddpm-vs-flow-matching-vs-consistency): One family, three answers to the same question: how should a model walk from noise to data?
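
The MHA vs GQA vs MLA entry ties KV-cache size to how long a context you can afford to serve. As a rough illustration of that arithmetic, here is a minimal Python sketch. The model shape (32 layers, 32 query heads, head dimension 128, 8192-token context, 2-byte elements) is a hypothetical configuration chosen for round numbers, not a figure from the comparison pages, and `kv_cache_bytes` is an illustrative helper name. MLA caches a compressed latent rather than per-head keys and values, so it is not modeled here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (assumption, for illustration only).
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

# MHA: every query head has its own KV head.
mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# GQA: groups of query heads share a KV head (8 KV heads assumed here).
gqa_bytes = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)

GIB = 1024 ** 3
print(f"MHA: {mha_bytes / GIB:.1f} GiB per sequence")
print(f"GQA: {gqa_bytes / GIB:.1f} GiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, so the same memory budget holds proportionally more tokens of context.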