# Comparisons — AI & ML approaches side by side | Rudrite Research

> Source: <https://research.rudrite.com/compare>
> Published: 2026-06-13 00:00:00+00:00

# Comparisons

AI & ML approaches side by side — what each does, the real numbers, and when to use which.

- [Transformers vs Mamba](/compare/transformers-vs-mamba): All-pairs attention versus a selective state-space recurrence; quadratic recall against linear-time throughput.
- [FlashAttention vs PagedAttention](/compare/flashattention-vs-pagedattention): Two attention optimizations that solve different problems, and are used together, not instead of each other.
- [Dense vs Mixture-of-Experts](/compare/dense-vs-mixture-of-experts): Activate every parameter for every token, or route each token to a few of many experts.
- [ReAct vs Toolformer vs ToolRL](/compare/react-vs-toolformer-vs-toolrl): Three eras of teaching a model to use a tool: prompt the loop, filter the data on its own loss, or reward the policy.
- [PPO vs DPO vs GRPO](/compare/ppo-vs-dpo-vs-grpo): Three ways to turn preferences into a better policy: a full RL loop, a single classification loss, or group-relative RL without a critic.
- [MHA vs GQA vs MLA](/compare/mha-vs-gqa-vs-mla): Three points on the attention-memory curve; how much of the KV cache you keep decides how long a context you can afford to serve.
- [GAN vs VAE vs Diffusion](/compare/gan-vs-vae-vs-diffusion): Three ways to learn a distribution and sample from it: an adversarial game, a probabilistic autoencoder, and an iterative denoiser.
- [FlashAttention vs FlashAttention-3](/compare/flashattention-vs-flashattention-3): The same exact-attention algorithm, rebuilt for a new generation of GPU: IO-aware tiling, then Hopper-era asynchrony and FP8.
- [Speculative Decoding vs Medusa vs EAGLE](/compare/speculative-decoding-vs-medusa-vs-eagle): Three ways to draft tokens for a target model to verify in parallel: a separate draft model, self-drafting heads, or feature-level autoregression.
- [Scaling Laws vs Chinchilla](/compare/scaling-laws-vs-chinchilla): Two readings of the same power laws: one prescribed bigger models, one showed compute-optimal training needs far more data per parameter.
- [BERT vs GPT vs T5](/compare/bert-vs-gpt-vs-t5): Three ways to pretrain the same transformer: read both directions, predict the next token, or cast every task as text-to-text.
- [AWQ vs GPTQ vs BitNet](/compare/awq-vs-gptq-vs-bitnet): Three ways to shrink an LLM: scale the salient weights, compensate the rounding with second-order math, or train ternary so the matmul becomes addition.
- [S4 vs Mamba vs RWKV](/compare/s4-vs-mamba-vs-rwkv): The post-Transformer sequence lineage: a structured state space, a selective one, and a linear-attention RNN, all chasing linear cost without losing quality.
- [CoT vs Self-Consistency vs Tree-of-Thoughts](/compare/cot-vs-self-consistency-vs-tot): One chain, many chains, or a searched tree of chains: three rungs of a reasoning ladder, none of which touch the weights.
- [DDPM vs Flow Matching vs Consistency Models](/compare/ddpm-vs-flow-matching-vs-consistency): One family, three answers to the same question: how should a model walk from noise to data?