Interactive, animated, visual explainers of landmark AI & ML papers — the systems and ideas behind the models you use, redrawn and made legible. Free and open.
Browse all 100 explainers · Guided reading tracks

Attention Is All You Need
FlashAttention
PagedAttention (vLLM)
Megatron-LM
DeepSeek-R1
GPT-3: Language Models are Few-Shot Learners
ZeRO: Zero Redundancy Optimizer
Mixtral of Experts
Training Compute-Optimal Large Language Models
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
BERT: Pre-training of Deep Bidirectional Transformers
DeepSeek-V3
Qwen3
OLMo 2
MiniMax-01
Gemma 4
Scaling Laws for Neural Language Models
Adam: A Method for Stochastic Optimization
Deep Residual Learning for Image Recognition
Denoising Diffusion Probabilistic Models
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
LoRA: Low-Rank Adaptation of Large Language Models
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GSPMD: General and Scalable Parallelization for ML Computation Graphs
Pathways: Asynchronous Distributed Dataflow for ML
Ring Attention with Blockwise Transformers for Near-Infinite Context
Efficiently Scaling Transformer Inference
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Fast Inference from Transformers via Speculative Decoding
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Constitutional AI: Harmlessness from AI Feedback
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
ReAct: Synergizing Reasoning and Acting in Language Models
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
RoFormer: Enhanced Transformer with Rotary Position Embedding
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Learning Transferable Visual Models From Natural Language Supervision
High-Resolution Image Synthesis with Latent Diffusion Models
Scalable Diffusion Models with Transformers
Robust Speech Recognition via Large-Scale Weak Supervision
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Group Sequence Policy Optimization
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
YaRN: Efficient Context Window Extension of Large Language Models
Efficient Streaming Language Models with Attention Sinks
Generative Adversarial Networks
Segment Anything
Visual Instruction Tuning
s1: Simple test-time scaling
Tülu 3: Pushing Frontiers in Open Language Model Post-Training
Let's Verify Step by Step
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
KAN: Kolmogorov–Arnold Networks
Differential Transformer
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
RWKV: Reinventing RNNs for the Transformer Era
Titans: Learning to Memorize at Test Time
Byte Latent Transformer: Patches Scale Better Than Tokens
The Llama 3 Herd of Models
Mistral 7B
Phi-4 Technical Report
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Flow Matching for Generative Modeling
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Rewarding Doubt: Calibrated Confidence Expression of LLMs
Why Language Models Hallucinate
τ-bench: Tool-Agent-User Interaction in Real-World Domains
ToolRL: Reward is All Tool Learning Needs
Group-in-Group Policy Optimization for LLM Agent Training
MiniMax-M1: Scaling Test-Time Compute with Lightning Attention
ProRL: Prolonged RL Expands Reasoning Boundaries
The Entropy Mechanism of RL for Reasoning Language Models
Spurious Rewards: Rethinking Training Signals in RLVR
GenPRM: Generative Process Reward Models
From Hard Refusals to Safe-Completions
Proximal Policy Optimization Algorithms
Efficiently Modeling Long Sequences with Structured State Spaces
Auto-Encoding Variational Bayes
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Toolformer: Language Models Can Teach Themselves to Use Tools
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Muon is Scalable for LLM Training
Consistency Models