# Show HN: Landmark AI and ML research explained, redrawn, animated

> Source: <https://research.rudrite.com/>
> Published: 2026-06-14 17:29:26+00:00

# Rudrite Research — the frontier, made legible

Interactive, animated, visual explainers of landmark AI & ML papers — the systems and ideas behind the models you use, redrawn and made legible. Free and open.

[Browse all 100 explainers](/library) · [Guided reading tracks](/tracks)

- [Attention Is All You Need](/attention)
- [FlashAttention](/flash-attention)
- [PagedAttention (vLLM)](/paged-attention)
- [Megatron-LM](/megatron-lm)
- [DeepSeek-R1](/deepseek-r1)
- [GPT-3: Language Models are Few-Shot Learners](/gpt-3)
- [ZeRO: Zero Redundancy Optimizer](/zero)
- [Mixtral of Experts](/mixtral)
- [Training Compute-Optimal Large Language Models](/chinchilla)
- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](/mamba)
- [BERT: Pre-training of Deep Bidirectional Transformers](/bert)
- [DeepSeek-V3](/deepseek-v3)
- [Qwen3](/qwen3)
- [OLMo 2](/olmo-2)
- [MiniMax-01](/minimax-01)
- [Gemma 4](/gemma-4)
- [Scaling Laws for Neural Language Models](/scaling-laws)
- [Adam: A Method for Stochastic Optimization](/adam)
- [Deep Residual Learning for Image Recognition](/resnet)
- [Denoising Diffusion Probabilistic Models](/ddpm)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](/switch-transformers)
- [LoRA: Low-Rank Adaptation of Large Language Models](/lora)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](/gpipe)
- [GSPMD: General and Scalable Parallelization for ML Computation Graphs](/gspmd)
- [Pathways: Asynchronous Distributed Dataflow for ML](/pathways)
- [Ring Attention with Blockwise Transformers for Near-Infinite Context](/ring-attention)
- [Efficiently Scaling Transformer Inference](/scaling-inference)
- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](/mooncake)
- [Fast Inference from Transformers via Speculative Decoding](/speculative-decoding)
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](/chain-of-thought)
- [Training language models to follow instructions with human feedback](/instructgpt)
- [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](/dpo)
- [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](/deepseekmath)
- [Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters](/test-time-compute)
- [Constitutional AI: Harmlessness from AI Feedback](/constitutional-ai)
- [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](/dapo)
- [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](/tree-of-thoughts)
- [ReAct: Synergizing Reasoning and Acting in Language Models](/react)
- [FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision](/flash-attention-3)
- [Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](/mamba-2)
- [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](/deepseek-v2)
- [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](/eagle)
- [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](/awq)
- [RoFormer: Enhanced Transformer with Rotary Position Embedding](/rope)
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](/vision-transformer)
- [Learning Transferable Visual Models From Natural Language Supervision](/clip)
- [High-Resolution Image Synthesis with Latent Diffusion Models](/latent-diffusion)
- [Scalable Diffusion Models with Transformers](/dit)
- [Robust Speech Recognition via Large-Scale Weak Supervision](/whisper)
- [Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention](/native-sparse-attention)
- [Group Sequence Policy Optimization](/gspo)
- [DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving](/distserve)
- [CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion](/cacheblend)
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](/gshard)
- [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](/gqa)
- [YaRN: Efficient Context Window Extension of Large Language Models](/yarn)
- [Efficient Streaming Language Models with Attention Sinks](/streaming-llm)
- [Generative Adversarial Networks](/gan)
- [Segment Anything](/segment-anything)
- [Visual Instruction Tuning](/llava)
- [s1: Simple test-time scaling](/s1)
- [Tülu 3: Pushing Frontiers in Open Language Model Post-Training](/tulu-3)
- [Let's Verify Step by Step](/lets-verify)
- [Self-Consistency Improves Chain of Thought Reasoning in Language Models](/self-consistency)
- [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](/rag)
- [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](/swe-bench)
- [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](/bitnet)
- [KAN: Kolmogorov–Arnold Networks](/kan)
- [Differential Transformer](/differential-transformer)
- [Mixture-of-Depths: Dynamically allocating compute in transformer-based language models](/mixture-of-depths)
- [RWKV: Reinventing RNNs for the Transformer Era](/rwkv)
- [Titans: Learning to Memorize at Test Time](/titans)
- [Byte Latent Transformer: Patches Scale Better Than Tokens](/byte-latent-transformer)
- [The Llama 3 Herd of Models](/llama-3)
- [Mistral 7B](/mistral-7b)
- [Phi-4 Technical Report](/phi-4)
- [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](/flash-attention-2)
- [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](/medusa)
- [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](/stable-diffusion-3)
- [Flow Matching for Generative Modeling](/flow-matching)
- [Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty](/rlcr)
- [Rewarding Doubt: Calibrated Confidence Expression of LLMs](/rewarding-doubt)
- [Why Language Models Hallucinate](/why-llms-hallucinate)
- [τ-bench: Tool-Agent-User Interaction in Real-World Domains](/tau-bench)
- [ToolRL: Reward is All Tool Learning Needs](/toolrl)
- [Group-in-Group Policy Optimization for LLM Agent Training](/gigpo)
- [MiniMax-M1: Scaling Test-Time Compute with Lightning Attention](/cispo)
- [ProRL: Prolonged RL Expands Reasoning Boundaries](/prorl)
- [The Entropy Mechanism of RL for Reasoning Language Models](/entropy-mechanism)
- [Spurious Rewards: Rethinking Training Signals in RLVR](/spurious-rewards)
- [GenPRM: Generative Process Reward Models](/genprm)
- [From Hard Refusals to Safe-Completions](/safe-completions)
- [Proximal Policy Optimization Algorithms](/ppo)
- [Efficiently Modeling Long Sequences with Structured State Spaces](/s4)
- [Auto-Encoding Variational Bayes](/vae)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](/t5)
- [Toolformer: Language Models Can Teach Themselves to Use Tools](/toolformer)
- [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](/gptq)
- [Muon is Scalable for LLM Training](/muon)
- [Consistency Models](/consistency-models)