Frontier Model Training Methodologies

Seven open-weight frontier models, including Hugging Face's SmolLM3, DeepSeek-R1, and OpenAI's gpt-oss-120b, were analyzed to distill common training methodologies for multi-billion parameter models, with a focus on architecture, stability, data curation, and optimization techniques. The analysis reveals that frontier model training is primarily a systems engineering challenge, where data mixture, architecture choices, and stability measures outweigh algorithmic tweaks, and that most training failures stem from high learning rates, problematic data batches, or load imbalance in mixture-of-experts models. The findings provide a minimal training playbook emphasizing early evaluation locking, baseline architecture selection, and rigorous data pipeline construction to guide future large-scale model development.

frontier model training methodologies How do labs train a frontier, multi-billion parameter model? We look towards seven open-weight frontier models: Hugging Face’s SmolLM3 https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook , Prime Intellect’s Intellect 3 https://arxiv.org/abs/2512.16144 , Nous Research’s Hermes 4 https://arxiv.org/abs/2508.18255 , OpenAI’s gpt-oss-120b https://arxiv.org/pdf/2508.10925 , Moonshot’s Kimi K2 https://arxiv.org/pdf/2507.20534 , DeepSeek’s DeepSeek-R1 https://arxiv.org/pdf/2501.12948 , and Arcee’s Trinity series https://github.com/arcee-ai/trinity-large-tech-report/blob/main/Arcee%20Trinity%20Large.pdf . This blog is an attempt at distilling the techniques, motivations, and considerations used to train their models with an emphasis on training methodology over infrastructure. These notes are largely structured based on Hugging Face’s SmolLM3 report https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook due to its extensiveness, and it is currently supplemented with notes from other reports including Intellect-3, gpt-oss-120b, Hermes 4, DeepSeek, and Kimi. While this blog explores some infrastructure-related ideas like in-flight weight updates and multi-client orchestrators, there are many other ideas mentioned throughout those posts/blogs like expert parallelism and quantization. Hugging Face writes more about gpt-oss-120b’s infrastructure here https://huggingface.co/blog/faster-transformers . table of contents tl;dr tldr architecture and set-up architecture-and-set-up stability stability tokenizer tokenizer optimizers and training hyperparameters optimizers-and-training-hyperparameters data curation and pre-training data-curation-and-pre-training mid-training mid-training post-training post-training behaviors and safety behaviors-and-safety the training marathon the-training-marathon tl;dr - Frontier training is a systems problem: data mixture, architecture, and stability choices dominate most algorithmic tweaks. - Start from a strong baseline and ablate fast and reliably; derisk changes and avoid multi-variable edits. - For long context, document masking + RNoPE/YaRN-style scaling is a robust default; attention variants trade compute for reach. - GQA with small groups 2/4/8 groups typically outperforms MHA and MQA in ablations at similar model scales; MLA cuts KV cache but raises implementation complexity. - MoE is efficient when it is load-balanced; routing, auxiliary or bias balancing, and global stats are non-negotiable. - Tokenizer design should mirror target data; vocab size trades embedding cost against token compression and KV cache. - AdamW is still the default; Muon can help but needs careful infra all-to-all, padding, scaling quirks . - Scaling laws guide, but many frontier models overtrain; inference cost and sparsity tradeoffs often drive final choices. - Data scheduling matters: multi-stage mixtures and late-stage high-quality injection shape final behavior. - Mid-training and post-training SFT + preference/RL/distillation often determine reasoning and tool-use behavior. - Training ops are frequent failure points: dataloader design, throughput, seeds in TP, and checkpointing. - Most training failures stem from common causes: high learning rates, problematic data batches, load imbalance in MoE models, or storage/infrastructure issues see “the usual suspects” section for details . a minimal training playbook - Define the product goal and lock evals early across knowledge, math, code, long-context, and instruction following. - Pick a baseline architecture with known failure modes; default to dense + GQA + RoPE/RNoPE unless MoE is essential. - Choose a tokenizer matched to your target languages and domains; freeze vocab and special tokens early. - Build the data pipeline with deduplication, filtering, and contamination checks; measure data quality explicitly. - Run small ablations for attention, positional encoding, optimizer, and learning rate schedule; change one variable at a time. - Plan a multi-stage data mixture; delay the best data and reasoning-heavy data toward the end. - Add stability guardrails: logit softcapping preferred, per Gemma or z-loss/QK-norm, gradient clipping, precision policy, loss spike alerts. - Validate throughput on long runs and confirm dataloader behavior packing, shuffling, random access . - Run the main training with interval evals and consistent seeds, especially for tensor parallelism. - Mid-train for domain gaps if SFT reveals them; extend context length gradually 4k → 32k → 64k → 128k . - Post-train with SFT, then choose preference/RL/distillation based on verifiable rewards and tool-use goals. - Re-evaluate, run safety checks, and lock a release checkpoint with full logs and configs. general practices - “ Learn to identify what’s worth testing, not just how to run tests. Perfect ablations on irrelevant choices waste as much compute as sloppy ablations on important ones.”- Ablations need to be fast faster iteration $\rightarrow$ more hypotheses tested and reliable need strong discriminative power because otherwise, it may be noise - “The real value of a solid ablation setup goes beyond just building a good model. When things inevitably go wrong during our main training run and they will, no matter how much we prepare , we want to be confident in every decision we made and quickly identify which components weren’t properly tested and could be causing the issues. This preparation saves debugging time and keeps our sanity intact. There’s nothing worse than staring at a mysterious training failure with no idea where the bug could be hiding.” - Ablations need to be Choose an established baseline with good architecture and training setup design . These take years of iteration, and people have discovered common failure modes and instabilities.- There are a plethora of modifiable components attention mechanisms and positional encodings to name a few , but follow the principle of derisking : “never change anything unless you’ve tested that it helps.” - There are a plethora of modifiable components attention mechanisms and positional encodings to name a few , but follow the principle of In evals, look for monotonicity score improvement , low noise e.g. score resistance to random seeds , above-random performance random-level performance for extended time frames isn’t useful , and ranking consistency ranking of approaches should remain stable throughout training .- Prioritize evals Between pre-training and post-training, core evals should be preserved, and their implementation should be finished long before the base model is finished training. Balance exploration and execution. For methods, choose flexibility and stability over peak performance, set a deadline for exploration. architecture and set-up Architecture decisions fundamentally determine a model’s efficiency, capabilities, and training dynamics. Model families like DeepSeek, gpt-oss-120b, Kimi, and SmolLM have vastly different architectures dense vs MoE , attention mechanisms MHA vs MLA vs GQA , position encodings RoPE, partial RoPE, NoPE , among many others. Not all information about the models is publicly available, so some are chosen: | Kimi-K2 | Trinity Large | gpt-oss-120b | OLMo 3 | SmolLM | | |---|---|---|---|---|---| | Parameter Count | 1.06T | 400B | 116.83B | 32B | 3B | | Attention | MLA | GQA 8 groups | GQA 8 groups | GQA ? | GQA 4 groups | | Positional Embedding | RoPE ? + YARN | RoPE + YARN | RoPE + YARN | RoPE + YARN | RNoPE + YARN | | Architecture | MoE | MoE | MoE | dense | dense | | Tokenizer | | o200k harmony https://github.com/openai/tiktoken cl 100k https://github.com/openai/tiktoken When choosing architecture, Hugging Face suggests following a decision tree such that if one of these is true, choose a dense architecture: - memory-constrained since MoEs must have all experts loaded - new to LLM training focus on basics - tighter timeline simpler training with well-documented recipes architecture decision heuristics - If you are memory- or infra-constrained, default to a dense model with GQA and RoPE/RNoPE. - If you need inference efficiency at scale and can manage routing complexity, consider MoE with strong load balancing. - If long context is a core requirement, plan for document masking plus RoPE scaling ABF/YaRN or RNoPE variants. - If you need simpler kernels and faster iteration, avoid novel attention variants unless you can ablate them cleanly. attention Multi-head attention MHA uses separate query, key, and value projections for each attention head, but this creates a large KV-cache that becomes an inference bottleneck and GPU memory hoarder. To address this, researchers developed multi-query attention https://arxiv.org/abs/1911.02150 MQA and grouped query attention https://arxiv.org/abs/2305.13245 GQA . In MQA, KV values are shared across all heads, but this comes at a cost of leaking attention capacity because heads can’t store information specialized to each head’s role. GQA softens this issue by sharing KV values across a small group of heads e.g. 4 . Another alternative is multi-latent attention MLA which stores a compressed latent variable that can be decompressed/projected into KV values at runtime. The latent variable is typically much smaller than the full KV cache often achieving 4-8x compression , and this results in a KV-cache parameter count more comparable to GQA while maintaining performance stronger than MQA. When ablating for variables that change the parameter count such as changing MHA to GQA, they occasionally adjust other hyperparameters to keep model sizes roughly the same , Hugging Face found that GQA with small groups 2/4/8 outperformed MHA in their ablations and that MHA outperformed MQA and GQA with 16 groups . Across benchmarks like HellaSwag, MMLU, and ARC, GQA with 2/4/8 groups performed best in their experiments. gated attention Gated attention applies an elementwise gating mechanism to the scaled dot-product attention output before the output projection. A gate vector $\mathbf{g} t = \sigma \mathbf{W}^G \mathbf{x} t $ is computed from the input, where $\mathbf{x} t$ is the input at position $t$, $\sigma$ is the sigmoid function, and $\mathbf{W}^G$ is a learned gate projection matrix. This gate is split across $h q$ attention heads where $h q$ is the number of query heads , and each head’s attention output is elementwise multiplied by its corresponding gate segment: $\tilde{\mathbf{o}} {t,i} = \mathbf{o}^{\text{sdpa}} {t,i} \odot \mathbf{g} {t,i}$ where $\mathbf{o}^{\text{sdpa}} {t,i}$ represents the scaled dot-product attention output for head $i$ at position $t$, $\odot$ denotes elementwise multiplication, and $\mathbf{g} {t,i}$ is the gate segment for head $i$. The gated outputs are then concatenated and projected through the output matrix $\mathbf{W}^O$ to produce the final output. Gated attention reduces attention sinks tokens receiving disproportionately high attention , reduces large activations that destabilize training, and improves performance on evaluations and long-sequence generalization. Critically, it stabilizes training and reduces loss spikes, making it valuable for large-scale training. document masking When pre-training, a common consideration is fixed sequence lengths since training uses tensors of the form batch, sequence length, hidden , so with regards to batching and distributed training, GPUs are most happy when every example has the same sequence length. But due to variable document length and wanting to avoid padding which wastes compute, packing enables shuffling and concatenating documents within the same sequence to achieve the sequence length. Causal masking means that for unrelated files $A$ and $B$ in the same batch, the tokens in $B$ can attend to the tokens in $A$, which degrades performance. With intra-document masking , the attention mask is modified so tokens can only attend to previous tokens within the same document. Many papers have found benefits relating to long-context extension https://arxiv.org/abs/2407.21783 and some short context benchmarks https://arxiv.org/abs/2410.02660 as well as shortening the average context length https://arxiv.org/abs/2503.15450 . Figure 1 : Comparison of causal masking left and intra-document masking right . Causal masking allows tokens to attend to all preceding tokens regardless of document boundaries, while intra-document masking restricts attention to tokens within the same document. From @PMinervini https://x.com/PMinervini/status/1777596492351422866 . When implementing document masking, Hugging Face saw small improvements on PIQA but otherwise no noticeable impact on short context tasks. But in line with aforementioned research, they observed that it became crucial for scaling from 4k to 64k tokens. The decision of whether to use intra-document attention masking can depend on model scale. For smaller models, some implementations choose to omit intra-document masking, finding that the additional complexity and potential reduction in cross-document learning doesn’t justify the benefits at those scales. However, for larger models, intra-document masking becomes more critical as the model’s capacity to learn from cross-document attention patterns diminishes relative to the benefits of cleaner document boundaries. embedding sharing Input embeddings token-to-vector lookup and output embeddings hidden states to vocab logits are typically represented as separate matrices, so the total embedding parameters are $2 \times \text{vocab size} \times \text{hidden dim}$. In small language models, this can account for up to 20% of total parameters, as is the case with Llama 3.2 1B in larger models, the embeddings represent a much smaller fraction of the parameter count, only 3% in Llama 3.1 70B . The issue with tying them is that input/output embeddings still represent different geometries, and frequent tokens like “the” can dominate representation learning due to getting gradients from both the input stream and the predicted output. Figure 2 : Comparison of untied embeddings separate input and output matrices vs tied embeddings shared matrix . Tied embeddings reduce parameter count while maintaining comparable performance. From PyTorch Blog https://pytorch.org/blog/advancing-low-bit-operators-in-pytorch-and-executorch-dynamic-kernel-selection-kleidiai-and-quantized-tied-embeddings/ . Hugging Face found that on a 1.2B model, tied embeddings did comparably well despite having 18% fewer parameters down from 1.46B , and that compared to an untied model also with 1.2B parameters fewer layers , untied showed higher loss and lower downstream eval scores. positional encodings Without positional encoding, transformers have no sense of word order, akin to the bag of words idea. Initially, absolute position embeddings https://arxiv.org/abs/1706.03762 were used by learning a lookup table that mapped the position index to a vector added to token embeddings, but the maximum input sequence length was limited by the sequence length it was trained on. Relative position encodings followed since capturing distance between tokens matters more than capturing their absolute positions. The most commonly used technique is rotary position embedding RoPE https://arxiv.org/abs/2104.09864 , which encodes position information by rotating query and key vectors in 2D planes. RoPE encodes relative position as rotation angles: based on the dimensionality of the query/key vector, RoPE splits it into pairs since they rotate in 2D space and rotates depending on the absolute position of a token and a base frequency. During attention, the dot product between their rotated positions directly encodes their relative distance via the phase difference in their rotation angles, where tokens $x$ positions apart always maintain the same angular relationship. Figure 3 : RoPE splits query/key vectors into pairs and rotates each pair by an angle proportional to position. From Su et al., 2021 https://arxiv.org/abs/2104.09864 . During pre-training, models are trained on shorter context lengths similar ideas to document masking, and quadratic attention is expensive to learn short range correlation between words. But as sequence length grows, the rotation angles grow via $\theta= \text{position} \times \frac1{\text{base}^{\frac{k}{\text{dim}/2}}}$. This can be fixed by increasing the base frequency as the sequence length increases using methods like ABF Adaptive Base Frequency https://arxiv.org/abs/2309.16039 or YaRN https://arxiv.org/abs/2309.00071 , which applies a more granular interpolation of frequencies on different components and includes other techniques like dynamic attention scaling and temperature adjustment. For extremely long contexts, YaRN does best, and in gpt-oss-120b, it was used to extend the context length of dense layers up to 131k tokens. More recently, with the emphasis on long contexts, NoPE https://arxiv.org/abs/2305.19466 no position embedding and RNoPE https://arxiv.org/abs/2501.18795 , a hybrid method, have emerged. NoPE uses only causal masking and attention patterns, so it doesn’t bump into the issue of extrapolating beyond training lengths but shows weaker performance on short context reasoning and knowledge-based tasks. RNoPE alternates applying RoPE and NoPE on attention blocks, where RoPE handles local context and NoPE helps with longer-range information retrieval. Another idea is Partial RoPE, which applies RoPE/NoPE within the same layer. Hugging Face ran ablations using RoPE, RNoPE removing positional encoding every 4th layer , and RNoPE with document masking. They found that all achieve similar performance on short-context tasks, so they adopt RNoPE + document masking because it provides the foundation for long-context handling. attention for long contexts Figure 4 : five common types of attention. From Hugging Face https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook . Note: This section covers attention pattern modifications which change which tokens can attend to which other tokens . These are distinct from positional encoding scaling methods like ABF/YaRN discussed in the “positional encodings” section , which adjust how position information is encoded without changing attention patterns. The following methods modify attention patterns to reduce computational cost: Chunked Attention : divides the sequence into fixed-sized chunks where tokens can only attend within their chunk. Llama 4 https://ai.meta.com/blog/llama-4-multimodal-intelligence/ pairs RNoPE specifically the RoPE layers which also reduces the KV cache size per layer, but its performance on long context tasks degraded. Sliding Window Attention SWA : every token can see up to $p$ positions back, creating a sliding window that maintains local context. Gemma 3 https://arxiv.org/abs/2503.19786 combined SWA with full attention every other layer. Dual Chunk Attention DCA : $K$ tokens are chunked into $M$ groups. Within each group like chunked attention , tokens attend normally. Between successive chunks, there is a local window to preserve locality, and more broadly, inter-chunk attention allows queries to attend to previous chunks with a capped relative position cap. Qwen-2.5 https://arxiv.org/pdf/2412.15115 used DCA to support context windows of up to 1 million tokens. Interleaving local and global attention alternates between layers that use local attention restricted to nearby tokens and global attention full sequence . This pattern balances computational efficiency with the ability to capture both local and long-range dependencies. Local layers reduce quadratic complexity while maintaining local context, and global layers ensure that distant relationships aren’t lost. When training encounters instability or loss spikes, adjusting the ratio of global layers for example, increasing their frequency can result in quicker loss recovery, as the model regains access to long-range information that may be crucial for certain patterns. The interleaving strategy is particularly effective for long-context models where full global attention would be computationally prohibitive. MoE MoEs mixture of experts , analogous to our brain activating different regions for different tasks, provide an alternative to dense models. At inference, only certain “experts” are activated based on the input, dramatically reducing compute compared to dense models where all parameters are active. The MoE works by replacing the feed forward layer with multiple MLPs experts and adding a learnable router before the MLPs to select the experts. The router typically uses top-k gating, selecting the $k$ experts with highest affinity scores for each token, where $k$ is usually much smaller than the total number of experts e.g., 8 out of 384 . Figure 5 : Comparison of dense architecture and MoE architecture. From Sebastian Raschka https://sebastianraschka.com/ . In general, for fixed number and size of active experts, increasing the total number of experts improves loss, and high sparsity improves performance https://arxiv.org/abs/2507.20534 and benefits more from increasing compute https://arxiv.org/abs/2507.17702 . Recent models are much more sparse, with over 100 experts and around 10 active per token. To determine how large each expert should be, a common metric is granularity, defined by $G = 2 \cdot \frac{d \text{model}}{d \text{expert}}$, where a higher granularity corresponds to more experts with a smaller dimension; this can be interpreted as a number proportional to the experts needed to match the dense MLP width. Recent models have granularity anywhere from 2 gpt-oss-120b to 8 qwen3-next-80b-a3b . Ant Group https://arxiv.org/pdf/2507.17702 showed that granularity doesn’t significantly change loss but does drive efficiency leverage the ratio of flops needed for an MoE to achieve the same loss as a dense model . And overall, MoEs present a good alternative to dense models in terms of compute for training and inference. Shared experts are always-on experts, which absorb the basic, recurring patterns so that other experts can more aggressively specialize; one is often enough DeepSeek-V2 https://arxiv.org/abs/2405.04434 uses two, which adds a bit of complexity . Load balancing is crucial in that if it fails, not only do training and inference efficiency plummet, but so do effective learning capacity. The routing mechanism typically uses top-k gating : for each token, the router computes affinity scores often via a learned linear projection followed by softmax , selects the top $k$ experts, and routes the token to those experts. To ensure balanced expert utilization, this can be addressed by adding a loss-based load balancer LBL given by $\mathcal{L} = \alpha \sum {i=1}^{N r} f i P i$ where $N r$ is the total number of experts, $\alpha$ determines the strength of the balancing term, $f i$ is the fraction of tokens routed to expert $i$, and $P i$ is the probability mass average routing probability for expert $i$; so in perfect load balancing, $f i=P i=\frac1{N r}$. Also, $\alpha$ should not be so large that routing uniformity overwhelms the primary training objective. These should be monitored using global statistics , not local statistics which may suffer from a local batch being narrow, biasing the routing statistics. DeepSeek-V3 https://arxiv.org/abs/2412.19437 does loss-free load balancing differently, by adding a bias term to affinity scores going into the routing softmax. Beyond bias-based approaches, several other routing and load balancing strategies have emerged. Some implementations use learnable routing functions that adapt during training, while others incorporate expert capacity constraints that prevent any single expert from being overwhelmed. The key insight across these methods is that effective load balancing must operate using global statistics aggregated across multiple batches, as local batch statistics can be misleadingly narrow and bias routing decisions. Sequence-wise auxiliary loss extends traditional auxiliary losses to promote balance within a sequence. Here, $T$ is the sequence length, $\alpha$ is a small coefficient, $\mathbb{1} \cdot $ is the indicator function which is 1 if its argument is true and 0 otherwise , and $K r$ is the number of active experts per token. Here, for each token at each position $t$ in the sequence, each expert $i$ is assigned a routing score $s {i,t}$, which is normalized so that $\tilde{s} {i,t}$ captures the proportion of the routing probability assigned to expert $i$ at position $t$. Averaging this over the whole sequence gives $P i$, which represents, on average, how often expert $i$ is considered for routing across the sequence. The $f i$ term furthers this by reflecting the fraction of times expert $i$ is actually selected i.e., is among the top $K r$ experts for a token, after bias terms $b i$ are added . The loss $\mathcal{L} {\text{Bal}}$ encourages the product $f i P i$ to be similar across different experts, pushing the model toward evenly distributing routing decisions and load; if any expert is used much more or less than others, the loss will increase, nudging the model back toward balanced expert activation. Auxiliary-loss free load balancing methods avoid introducing interference gradients by maintaining a bias vector $\mathbf{b}= b 1, \cdots, b {N r} $ which is updated in a decoupled fashion. Let $n i$ be the number of tokens routed to expert $i$ in the current step and $\bar{n}=\frac1{N r} \sum {i=1}^{N r} n i$ the mean load across all experts. $b i$ is updated by where $\gamma$ is the bias update speed, a sort of learning rate. This particular version includes the additional recentering of expert bias updates. Sequence-wise MoE Balancing with Uniformity SMEBU load balancing operates at the sequence level rather than the token level, ensuring that expert utilization remains balanced across entire sequences. The normalized per-expert violation is calculated by $v i=\frac{\bar{n}-n i}{\bar{n}}$ and $\tilde{v} i=\tanh \kappa v i $, which makes the scale independent of sequence length and batch size. Then $b i$ is updated using a momentum buffer with momentum factor $\beta$: $\tanh$ applies the soft-clamping, with tunable scale $\kappa$ to control saturation speed; $\tanh$ over $\text{sign} \cdot $ maintains the continuity and stability needed during training whereas $\text{sign}$ forces updates to be $\pm \lambda$, making the update step oscillate. Momentum also is introduced as a form of noise dampening, analogous to momentum SGD reducing variance in noisy gradient updates hybrid models Because transformers don’t deal efficiently with long context while RNNs can, one idea is to combine both to get the best of both worlds. By dropping the softmax from the output for token $t$: \ \mathbf{o} t = \sum {j=1}^t \frac{\exp \mathbf{q} t^\top \mathbf{k} j \mathbf{v} j}{\sum {l=1}^t \exp \mathbf{q} t^\top \mathbf{k} l } \Longrightarrow \mathbf{o} t = \sum {j=1}^t \mathbf{q} t^\top \mathbf{k} j \mathbf{v} j = \left \sum {j=1}^t \mathbf{v} j \mathbf{k} j^\top\right \mathbf{q} t\ where $\mathbf{q} t$, $\mathbf{k} j$, and $\mathbf{v} j$ are the query, key, and value vectors at positions $t$ and $j$, respectively, and $\mathbf{o} t$ is the output at position $t$. By defining $S t :=\sum {j=1}^t \mathbf{k} j \mathbf{v} j^\top$, then we get a recurrent relation where $S t$ summarizes all past $ k j, v j $ pairs: \ S t=S {t-1}+\mathbf{k} t \mathbf{v} t^\top \Longrightarrow \mathbf{o} t = S t \mathbf{q} t = S {t-1}\mathbf{q} t+\mathbf{v} t\left \mathbf{k} t^\top \mathbf{q} t\right \ where $S {t-1}$ is the state from the previous time step. While this gets us closer to an RNN-esque structure, in practice, softmax stabilizes training, and the linear form can cause instability without normalization. With RNNs, it is sometimes helpful to forget the past, by introducing a gate $\mathbf{G} t$ for the previous state \ \mathbf{S} t=\mathbf{G} t \odot \mathbf{S} {t-1} + \mathbf{v} t\mathbf{k} t^\top\ where $\odot$ denotes elementwise multiplication and $\mathbf{G} t$ is a learned gating mechanism. Mamba-2 https://arxiv.org/abs/2405.21060 is among the most popular, being used in hybrid models like Nemotron-H https://arxiv.org/abs/2504.03624 and Falcon H1 https://arxiv.org/abs/2507.22448 . Hybrid models are becoming increasingly popular, notably in Qwen3-Next https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list with a gated DeltaNet update and Kimi’s next model, likely using their “kimi delta attention.” https://github.com/fla-org/flash-linear-attention/pull/621 architecture takeaways - Use a proven dense baseline unless you have strong reasons and infra to support MoE. - GQA with small groups is a robust default; MQA is cheapest but tends to underperform. - For long context, plan for RNoPE/YaRN plus document masking early in the recipe. - Hybrid architectures are promising but still harder to reason about and operationalize. stability Training stability is crucial for successful large-scale model training. Several techniques help prevent training failures, including regularization methods, careful initialization, and architectural choices. The following sections cover key stability mechanisms: $z$-loss $z$-loss is a regularization term added to the standard cross entropy loss that keeps logits from drifting to large magnitudes. The softmax denominator is $Z = \sum {i=1}^V e^{z i}$, and by adding $\mathcal{L} = \lambda \cdot \log^2 Z $ to the loss, we penalize based on $\log Z $ which represents the overall logit scale. On their 1B model, Hugging Face found that adding $Z$-loss didn’t impact training loss or downstream performance, so they chose not to include it due to training overhead. For logit stabilization, logit softcapping see below is generally preferred in modern recipes, following the Gemma 2 https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf and Gemma 3 https://arxiv.org/abs/2503.19786 models. logit softcapping Logit softcapping prevents logits from growing excessively large by mapping them into a bounded range via a smooth, differentiable transformation. Unlike hard clipping which has zero gradient at the boundaries and can cause training instability , softcapping uses $\tanh$ to compress values smoothly. The Gemma 2 report https://arxiv.org/abs/2408.00118 introduces the formulation used in production models: cap logits such that values stay within $ -\texttt{soft cap}, +\texttt{soft cap} $ using where $\texttt{soft cap}$ is the threshold hyperparameter controlling the output range. The division normalizes inputs before $\tanh$ and the multiplication by $\texttt{soft cap}$ rescales to the desired interval. Unlike $z$-loss which adds a regularization term to the loss , softcapping operates directly on activations in the forward pass Gemma 2 https://arxiv.org/abs/2408.00118 applies softcapping to both attention logits pre-softmax and the final language modeling head. They set $\texttt{soft cap}=50.0$ for attention layers and $\texttt{soft cap}=30.0$ for the final layer. The technique traces back to Bello et al., 2016 https://arxiv.org/abs/1609.08144 in the context of neural machine translation. However, one caveat is that logit softcapping is incompatible with Flash Attention / SDPA during training because those fused kernels assume standard attention. The Hugging Face Gemma 2 blog https://huggingface.co/blog/gemma2 notes that for stable fine-tuning, you must use attn implementation="eager" ; inference can still use SDPA with minimal quality difference. This writeup https://danieldk.eu/Machine-Learning/Building-Blocks/Logit-Softcapping gives a concise technical overview. weight decay and embeddings Despite being a regularization technique, removing weight decay from embeddings can improve training stability. Weight decay causes embedding norm to decrease, but this can lead to larger gradients in earlier layers since the LayerNorm Jacobian has a $\frac1{\sigma}$ term coming from normalization which is inversely proportional to the input norm $\sigma$. Hugging Face tested this using a weight decay baseline, a no weight decay baseline, and another combining all previous adopted changes and found no significant loss or eval results, so they included no weight decay. QK-norm Similar to $z$-loss, QK-norm helps prevent attention logits from becoming too large by applying LayerNorm to both the query and key vectors before computing attention. However, the same paper which proposed RNoPE https://arxiv.org/abs/2501.18795 found that it hurts long-context tasks because the normalization de-emphasizes relevant tokens and emphasizes irrelevant tokens by stripping the query-key dot product of its magnitude. RMSNorm RMSNorm maintains comparable performance to LayerNorm while being computationally simpler, due to avoiding the mean-centering computation. A variant called depth-scaled sandwich norm applies normalization both before and after the attention/MLP blocks, with the normalization scale adjusted based on the layer depth: where $\mathbf{x} \ell$ and $\mathbf{y} \ell$ are input/output of layer $\ell$, $\mathcal{M} \ell$ is the sublayer module like attention, FFN, or MoE . The RMSNorm gain, $\gamma$, is a multiplicative factor applied after the RMS normalization, given by $\bar{a} i = \gamma\frac{a i}{\text{RMSNorm a }}$. In Arcee’s case, they initialize $\gamma\left \text{RMSNorm} \ell^{ 1 }\right =1$ and $\gamma\left \text{RMSNorm} \ell^{ 2 }\right =\frac1{\sqrt{L}}$. This depth-dependent scaling accounts for the fact that activations evolve differently across layers. The sandwich pattern pre-norm and post-norm provides additional stability, especially in very deep networks where gradient flow can be challenging. Arcee also applies RMSNorm before the language modeling head stabilizes the final hidden states to ensure consistent output activation scales before they are transformed into token probabilities. other design considerations Parameter initialization : either normalization initialization with $\mu=0$ and clipping as TruncDNormal initialization does often with $\pm 2-3 \sigma$ or a scheme like $\mu\text{P}$ maximal update parametrization https://arxiv.org/abs/2011.14522 which dictates how weights and learning rates should scale with width so that training dynamics stay comparable.- The clipping prevents extreme initialization values that could destabilize training, which is particularly important for embedding layers where large initial activations can propagate through the network. - Another heuristic is setting $\sigma=\frac{0.5}{\sqrt{d}}$ where $d$ is model dimension, although the exact coefficient can vary. - During the forward pass, the embedding layer’s activations are scaled by $\sqrt{d}$: $\mathbf{e} T=\sqrt{d} E \text{tok} t $. This keeps embedding magnitudes in a stable range relative to the residual stream and is common in several transformer implementations. Notably, Grok-1 and Grok-2 checkpoints as well as Trinity Large and the first two generations of the Gemma models implement this. Activation Function : SwiGLU is what most modern LLMs use, not ReLU or GeLU; for example, gpt-oss-120b uses gated SwiGLU. Some exceptions are Gemma2 using GeGLU and nvidia using $\text{relu}^2$. Width vs Height : deeper models tend to outperform equally sized wider ones on language modeling and compositional tasks. In smaller models, this is more pronounced, but larger models make use of wider models for faster inference due to modern architectures supporting better parallelism. stability takeaways - Stabilization is mostly about sane defaults, not exotic tricks. Logit softcapping Gemma-style is the preferred method for attention/LM-head logit stabilization; $z$-loss and QK-norm are alternatives.- QK-norm can hurt long-context tasks; don’t assume it’s “always good.” - Initialization and normalization details matter more as depth grows. - Track loss spikes early; many “mystery failures” are configuration or data issues. tokenizer There are a few considerations that typically guide tokenizer design: domains : in domains like math and code, digits and other special characters require careful treatment. Most tokenizers do single-digit splitting, which helps with arithmetic patterns more effectively and prevents memorization of numbers. Some tokenizers like Llama3 https://arxiv.org/abs/2407.21783 further encode numbers 1 to 999 as unique tokens. supported languages : a tokenizer trained on english text would be extremely inefficient if it encountered another language, say mandarin or farsi. target data mixture : when training a tokenizer from scratch, we should train on samples that mirror our final training mixture. Larger vocabularies can compress text more efficiently, but they come at the cost of a larger embedding matrix, which as mentioned in the embeddings section, can take up a sizable portion of the parameter count. For english-only models, 50k is often enough, while multilingual models need over 100k. There is an optimal size that exists since compression gains from larger vocabularies decrease exponentially https://arxiv.org/abs/2402.01035 . Large models benefit from large vocabularies since the extra compression saves more on the forward pass project to QKV, attention, and MLP than the additional embedding tokens during softmax. For memory, larger vocab means fewer tokens, so a smaller KV cache. BPE byte-pair encoding https://arxiv.org/abs/1508.07909 still remains the de facto choice. Starting with tiny units e.g. characters or bytes , the BPE algorithm repeatedly merges the most common adjacent pair into a new token. To evaluate a tokenizer’s performance, fertility is a common metric, measuring the average number of tokens needed to encode a word alternatively, characters-to-tokens ratio or bytes-to-tokens ratio, but these have limitations due to word length variability and byte representations . Another is proportion of continued words , describing what percentage of words get split into multiple pieces. For both, smaller metrics indicate more efficient tokenizers. There are many strong existing tokenizers, like GPT4’s tokenizer https://arxiv.org/abs/2303.08774 and Gemma3’s tokenizer. Often, using existing tokenizers is enough; only when we want to train for low-resource languages or have a different data mixture should we continue training our own tokenizer. optimizers and training hyperparameters Choosing optimizers and tuning hyperparameters is notoriously time-consuming and significantly impacts convergence speed and training stability. While we may be tempted to distill those from models of larger labs albeit a useful prior , it may not fit the use case. adamW Despite being invented over 10 years ago, AdamW still stands the test of time. Adam adaptive momentum estimation updates weights individually based on an exponential weighted average of gradients $g t$ and an exponential weighted average of squared gradients $g t^2$, along with weight decay the “W” . The exponential moving averages provide adaptive learning rates per parameter: parameters with consistently large gradients get smaller effective learning rates via the squared gradient term , while parameters with small or noisy gradients get larger effective learning rates. This adaptivity helps stabilize training and converge faster: \ \begin{align } \theta &\leftarrow 1-\alpha \lambda \theta - \alpha \frac{\hat{m} t}{\sqrt{v t}+\epsilon} \\ \hat{m} t &= \frac{m t}{1-\beta 1^t}, \quad m t = \beta 1 m {t-1} + 1-\beta 1 g t \\ \hat{v} t &= \frac{v t}{1-\beta 2^t}, \quad v t = \beta 2 v {t-1} + 1-\beta 2 g t^2 \end{align }\ where $\theta$ denotes the model parameters, $\alpha$ is the learning rate, $\lambda$ is the weight decay coefficient, $g t$ is the gradient at step $t$, $m t$ and $v t$ are the first and second moment estimates exponentially weighted averages , $\hat{m} t$ and $\hat{v} t$ are bias-corrected versions, $\beta 1$ and $\beta 2$ are exponential decay rates for the moment estimates, and $\epsilon$ is a small constant typically $10^{-8}$ to prevent division by zero. Even for modern LLMs, the hyperparameters remain largely unchanged: weight decay factor $\lambda=0.1$ or $\lambda=0.01$, $\beta 1=0.9$, and $\beta 2=0.95$. muon Unlike AdamW which updates per-parameter, muon treats the weight matrix as a singular object and updates based on matrix-level operations. This approach reduces axis-aligned bias where optimization favors certain coordinate directions and encourages exploration of directions that would otherwise be suppressed. By considering the entire weight matrix structure rather than individual parameters, muon can better capture correlations between parameters: \ \begin{align } g t &\leftarrow \nabla \theta \mathcal{L} t \theta {t-1} \\ B t &\leftarrow \mu B {t-1} + G t \\ O t &\leftarrow \text{NewtonSchulz5} B t \\ \theta t &\leftarrow \theta {t-1} - \eta O t \end{align }\ where $\theta t$ denotes the model parameters at step $t$, $\mathcal{L} t$ is the loss function, $g t$ is the gradient matrix, $G t$ is the normalized gradient matrix typically $G t = g t / |g t|$ , $B t$ is a momentum buffer matrix with $B 0=0$, $\mu$ is the momentum coefficient, $\eta$ is the learning rate, and $\text{NewtonSchulz5}$ applies the odd function $f x =3.4445x-4.7750x^3+2.0315x^5$. This blog https://docs.modula.systems/algorithms/newton-schulz/ and this blog https://kellerjordan.github.io/posts/muon/ describe the algebra of it in more detail as well as why the coefficients are what they are. The Newton-Schulz iteration approximates the matrix sign function: we can estimate the SVD decompositions of $G=U \Sigma V^\top$ by $UV^\top$, and $f x $ essentially replaces $\Sigma$ because iteratively applying $f$ i.e., $f \circ f \circ \cdots f x $ converges to the sign function, which normalizes the singular values. This has the effect of reducing axis-aligned bias and encouraging exploration of directions that would otherwise be suppressed. Muon is more sample-efficient than AdamW, especially at large batch sizes where AdamW struggles. Some implementations, including Arcee’s Trinity Large, choose a hybrid approach: using muon for hidden layers while keeping AdamW for embedding and output layers. This decision stems from the different optimization dynamics these layers exhibit—embeddings and output projections benefit from per-parameter adaptive learning rates, while hidden layers capture more benefit from muon’s matrix-level structure awareness. But since muon operates at the matrix level, applying NewtonSchulz requires access to the full gradient tensor. One method uses an overlapping round-robin scheme where each rank is responsible for gathering all gradient matrices corresponding to its index and applying muon locally. Since FSDP expects sharded gradients/updates, and every rank has its shard of the muon-updated gradient, then the optimizer step can proceed normally. However, this issues lots of overlapping collectives across many matrices which breaks at scale. The alternative that Prime adapts is based on all-to-all collectives which does bulk permutation so that each rank temporarily owns full gradients for its matrices, runs muon, then bulk permutes them back. This may require padding since many tensors are packed into contiguous buffers which can change the size that’s expected. However, this requires fewer collectives and scales better. Building on Muon, Kimi K2 introduces MuonClip , a stabilization technique that prevents exploding attention logits, which is a common failure mode in large-scale training. Other strategies include logit soft-cap https://arxiv.org/abs/2408.00118 , which applies $\tanh$ clipping to the pre-softmax logits, or QK-norm, which applies LayerNorm to the QK matrices. However, these lead to issues of the scaled dot-product exploding making bounding too late and distorted gradients around regions where the model is unstable in logit soft-cap, and key matrices are not materialized during inference projected from a latent variable . For each attention head $h$, consider $\mathbf{Q}^h$, $\mathbf{K}^h$, and $\mathbf{V}^h$ the query, key, and value matrices for head $h$ . For a batch $B$ and input representation $\mathbf{X}$, define the max logit as a per-head scalar to be the maximum input to softmax where $d$ is the dimension of the query/key vectors, $i$ and $j$ index positions in the sequence, and the $\frac1{\sqrt{d}}$ scaling factor matches the standard attention scaling. Set $S \text{max} = \max h S \text{max}^h$ the maximum across all heads and target threshold $\tau$ a hyperparameter controlling when clipping activates . The idea is to rescale $\mathbf{W} k^h$ and $\mathbf{W} q^h$ the key and query projection weight matrices for head $h$ whenever $S \text{max}^h$ exceeds $\tau$. Also, $\gamma=\min 1, \frac{\tau}{S \text{max}} $ the global clipping factor , one approach is to clip all heads simultaneously by \ \mathbf{W} q^h \leftarrow \gamma^\alpha \mathbf{W} q^h \quad \mathbf{W} k^h \leftarrow \gamma^{1-\alpha} \mathbf{W} k^h\ where the $\gamma$ exponentials enforce multiplicative weight decay for $\mathbf{Q}^h \mathbf{K}^{h\top}$; commonly, $\alpha=0.5$ to ensure equal scaling to queries and keys. However, not all heads exhibit exploding logits, which motivates a per-head clipping based on $\gamma h = \min 1, \frac{\tau}{S \text{max}^h} $, which is more straightforward for MHA but less for MLA. The challenge with MLA is that keys are projected from a latent variable rather than materialized directly, so clipping must be applied to the latent-to-key projection weights and the latent variable itself. They apply clipping only on $\mathbf{q}^C$ and $\mathbf{k}^C$ head-specific components scaled by $\sqrt{\gamma h}$, $\mathbf{q}^R$ head-specific rotary scaled by $\gamma h$, and $\mathbf{Q}^R$ shared rotary . Besides that, the main muon algorithm is modified to match Adam RMS and enable weight decay. For each weight $\mathbf{W} \in \mathbb{R}^{n \times m}$: \ \begin{align } g t &\leftarrow \nabla \theta \mathcal{L} t \theta {t-1} \\ B t &\leftarrow \mu B {t-1} + G t \\ O t &\leftarrow \text{NewtonSchulz5} B t \cdot \sqrt{\max n,m } \cdot 0.2 \\ \theta t &\leftarrow 1-\eta \lambda \theta {t-1} - \eta O t \end{align }\ where $n$ and $m$ are the dimensions of the weight matrix $\mathbf{W}$, $\sqrt{\max n,m } \cdot 0.2$ is a scaling factor that adapts the update magnitude to the matrix size matching Adam’s RMS scaling behavior , and other symbols follow the same definitions as in the standard Muon algorithm. The weight decay term $ 1-\eta \lambda $ is applied multiplicatively before the gradient update. Figure 6 : Left: a mid-scale training run on a 9B active, 53B total MoE where attention logits diverge quickly. Right: maximum logits for KimiK2 with MuonClip and $\tau=100$, where max logits eventually decays to a stable range after ~30% of the training steps. From Kimi K2 https://arxiv.org/pdf/2507.20534 . learning rates Learning rates have their own life cycle: they warmup typically 1%-5% of training steps for short trainings, but large labs fix the warmup steps from zero to avoid chaos, then anneal after settling into a good minimum. Cosine annealing https://arxiv.org/abs/1608.03983 was the go-to scheduler, but it’s also inflexible due to the cosine period needing to match the total training duration. Alternatives include warmup-stable-decay WSD https://arxiv.org/abs/2404.06395 and multi-step https://arxiv.org/abs/2401.02954 ; in the last x% of tokens, the former linearly decays the learning rate whereas multi-step does discrete drops. For WSD, typically 10-20% is allocated for the decay phase, matching cosine annealing; in multi-step, 80/10/10 also matches cosine annealing while 70/15/15 and 60/20/20 can outperform it. Deepseek-v3 used cosine annealing between the decay drops and added a constant phase before the final sharp step. Figure 7 : Comparison of learning rate schedules: cosine annealing, WSD, and multi-step. From Hugging Face https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook . Hugging Face’s ablations on their 1B model showed that WSD tended to underperform cosine annealing before WSD’s decay began, but once it entered its decay phase, WSD showed nearly linear improvement in both loss and eval metrics, which allowed it to catch up to cosine annealing by the end. After running further ablations on the learning rate, the Hugging Face team settled on 2e-4; increasing led to potential increased risk of instability during long training runs. Kimi K2 also uses WSD: the first 10T were trained with 2e-4 learning rate after a 500 step warm up, then 5.5T tokens with cosine decay from 2e-4 to 2e-5. WSD schedule especially helps with ablations since it does not require restarting the same run for different token counts, since we can retrain only the end portions learning rate decay while maintaining the front portion. batch size There is a critical batch size https://arxiv.org/abs/1812.06162 : too small and we may be underutilizing compute, but too large and the model needs more tokens to reach the same loss. Still, larger batch sizes give more efficient gradient estimations, and are preferred. A useful proxy is that for optimizers like AdamW or Muon, if the batch size increases by a factor of $k$ then the learning rate should scale up by $\sqrt{k}$. Intuitively, larger batches provide more stable gradient estimates lower variance , so we can afford larger step sizes. Mathematically, the covariance shrinks by a factor of $k$, and based on the SGD parameter update $\Delta w = -\eta g B$, we have $\text{Var} \Delta w \sim \eta^2 \frac{\Sigma}{B}$ where $B$ is the original batch size. To maintain the same update variance, we need $\eta \sim \sqrt{k}$. As training progresses, the critical batch size grows. Initially, since the model is making large updates, $\lvert \lvert g \rvert \rvert^2$ is large so the model should have a small critical batch size. After the model stabilizes, larger batches become more effective. This motivates the idea of batch size warmup . Imbalanced minibatches can arise when sequence packing or data distribution creates batches with highly variable sequence lengths or domain compositions, which can cause gradient variance that destabilizes training; this is especially true when certain experts or model components receive disproportionately many or few tokens. Arcee introduces random sequential document buffer RSDB to reduce intra-batch correlation. After tokenizing a document, it works by loading the token sequence as an entry in the RSDB with a read head at index 0; this is repeated until the RSDB is full. From a randomly sampled index in a randomly sampled document from the RSDB, tokens are read based on the read head and the index and added to a separate sequence buffer. Read head positions are updated, and if the sequence buffer is full, we return; otherwise, we randomly select another document index and continue to read tokens into the sequence buffer, repeating until the sequence buffer is full. The internal buffer size in Trinity Large: 8192 per GPU is set to twice the user-specified buffer value and refilled when the buffer reaches the user-specified value in Trinity Large: 4096 per GPU or when old documents need to be purged/new documents can be loaded. Arcee found that this optimization significantly improved dataloader performance. scaling laws Scaling laws e.g. Chinchilla scaling laws https://arxiv.org/abs/2203.15556 provide a useful proxy for determining how aggressively/conservatively to update hyperparameters as model size scales. First, $C \approx 6 \cdot N \cdot D$ where $C$ is the compute budget measured in FLOPs, N is the number of parameters, and $D$ is the number of training tokens. The 6 is derived from empirical estimates for the number of FLOPs per parameter. Figure 8 : Scaling curves of batch size and learning rate. From DeepSeek https://arxiv.org/abs/2407.05065 . Initially, scaling laws https://arxiv.org/abs/2001.08361 indicated that language model size was the main constraint, leading to a GPT-3 model with 175B parameters but only trained on 300B tokens. A re-derivation https://arxiv.org/abs/2203.15556 found that training duration could improve gains more than size; they found that compute-optimal training of GPT-3 should have consumed 3.7T tokens. However, scaling laws are almost never religiously followed. Recently, labs have been “overtraining” models beyond the training durations suggested by scaling laws e.g. Qwen 3 being trained on 36T tokens . Moreover, “compute-optimal” scaling laws don’t account for larger models being more expensive after training due to inference. To that end, Hugging Face decided to train on 11T tokens on a 3B model. For comparison, Kimi K2’s 1T model comprised of 15.5T pre-training tokens. While general scaling laws provide guidance, Kimi K2’s scaling law analysis revealed model-specific insights. They showed that an increase in sparsity , the ratio of total number of experts to the number of activated experts, yields substantial performance improvements for fixed FLOPs, so they increase the number of MoE experts to 384 256 in DeepSeek-V3 while decreasing attention heads to 64 128 in DeepSeek-V3 to reduce computational overhead during inference. They settle on a sparsity of 48, activating 8 out of 384 experts and found that decreasing the attention heads from 128 to 64 sacrificed a validation loss ranging from 0.5% to 1.2%, but a 45% decrease in inference FLOPs. data curation and pre-training Even with the perfect architecture, a model’s performance is still heavily dependent on its training data; no amount of compute or optimization can compensate for training on the wrong content. To this end, it’s about assembling the right data mixture , balancing training objectives and tuning data proportions. This is particularly difficult since across competing objectives, for a fixed compute budget, increasing one proportion necessarily decreases another, hurting performance. There already exist large corpora of pre-training datasets like FineWeb2 https://arxiv.org/abs/2506.20920 and The Pile https://pile.eleuther.ai/ . However, there are still a plethora of information gaps, so recent models additionally rely on specialized pretraining datasets for domains like math and coding. One consideration is data quality . Of course, training on the highest quality data possible is preferable. But for a training budget of $X$ tokens, because high quality data is limited, only filtering for it would lead to repeated data, which can be harmful https://arxiv.org/abs/2305.16264 . So, an ideal mixture includes both higher and lower quality data. Another consideration is model safety . For gpt-oss-120b, OpenAI addresses this by filtering the data for harmful content in pre-training, with an emphasis on hazardous biosecurity knowledge. They use CBRN chemical, biological, radiological, and nuclear pre-training filters that were used in GPT-4o. multi-stage training Multi-stage training https://arxiv.org/abs/2502.02737 , the idea of evolving the data mixture as training progresses, can better maximize both high-quality and lower-quality data compared to a static mixture because a LM’s final behavior is heavily dictated by the data it sees at the end of training https://arxiv.org/abs/2410.08527 . So, this motivates the strategy of saving the higher quality data towards the end. This introduces another variable of when to begin changing mixtures, and a general principle to performance-driven intervention : if a benchmark begins to plateau, it’s a signal to introduce high-quality data for that domain. ablation While architectural ablations are done on smaller models e.g. on 1B models to train for 3B models , data mixture ablations are done at scale because larger models have much larger capacities to understand a variety of domains. Moreover, annealing ablations are done on checkpoints of the main run like 7T out of 11T tokens to determine what datasets to introduce when. To determine optimal data proportions, recent models often use a validation loss or a holdout loss to minimize based on evaluation objectives and data domains. However, some of these methods tend to converge toward distributions that mirror the dataset size distribution, and they don’t outperform careful manual ablations. token utility Token efficiency is how much performance improvement is achieved per token consumed during training. This can be improved via better token utility , the effective learning signal each token contributes; this motivates finding the optimal balance of high-quality tokens, since they should be maximally leveraged but also limited to prevent overfitting and reduced generalization. Kimi K2 uses data rephrasing in knowledge and math domains. For knowledge, this comes in the form of style and perspective-diverse prompting to rephrase the texts, chunk-wise autoregressive generation to gradually build a rephrased version of long documents, and fidelity verification to ensure semantic alignment. In the main training run, each corpus is rephrased at most twice. For math, diversity is increased via rephrasing into a “learning-note style” and translation into other languages. pre-training data SmolLM3 Hugging Face’s goal was to build a multi-lingual model that also excels on math and coding. In stage 1 of their multi-stage training, they use a 75/12/10/3 split among english web data, multilingual web data, code data, and math data. English web data : they ablate on a mixture of FineWeb-Edu educational and STEM benchmarks and DCLM common sense reasoning , two strong open English web datasets at the time of training, finding that a 60/40 or a 50/50 split was best. Later, they add in other datasets including Pes2o https://huggingface.co/datasets/allenai/dolmino-mix-1124/tree/main/data/pes2o , Wikipedia & Wikibooks https://huggingface.co/datasets/allenai/dolmino-mix-1124/tree/main/data/wiki , and StackExchange https://huggingface.co/datasets/HuggingFaceTB/stackexchange 2025 md . Multilingual web data : five European languages were chosen, with data from FineWeb2-HQ. Smaller portions of other languages, like Chinese or Arabic, were chosen to allow others to do continual pretraining of SmolLM3. Ultimately, they found that 12% multilingual content in the web mix was best. Code data : primarily extracted from The Stack v2 and StarCoder2 https://arxiv.org/abs/2402.19173 , it includes 16 languages, Github PRs, Jupyter/Kaggle notebooks, Github issues, and StackExchange threads. Despite research showing that code improves LM performance beyond coding, they did not observe this effect rather a degradation on English benchmarks using the recommended code mixture. They delay adding their educationally filtered subset, Stack-Edu, following the principle of delaying the best data until the end. Math data : using FineMath3+, InfiWebMath3+, MegaMath https://arxiv.org/abs/2504.02807 , and instruction/reasoning datasets like OpenMathInstruct https://arxiv.org/abs/2402.10176 and OpenMathReasoning https://arxiv.org/abs/2504.16891 . For new stages using a checkpoint at around 7T out of the total 11T tokens , they use a 40/60 split between the baseline mixture and the new dataset. SmolLM3 has three stages: 8T tokens @ 4k context for base training, 2T tokens @ 4k context for high-quality injection, and 1.1T tokens @4k context a reasoning/Q&A stage. hermes 4 Using data from DCLM https://arxiv.org/abs/2406.11794 and FineWeb, Nous first performs semantic deduplication using embeddings at a cosine similarity of 0.7, and then uses an LLM-as-judge to filter out incomplete or ill-formatted messages. Then, they process pre-training data through DataForge , a graph-based synthetic data generator, which allows for large and complex structures. By taking a random walk through a directed acyclic graph where nodes implement a mapping from struct $\to$ struct such that if there is an edge from node $A$ to node $B$, the postconditions guaranteed by $A$ must satisfy the preconditions of $B$. QA pairs are generated using this workflow with intermediary transformations into other mediums e.g. a wikipedia article into a rap song , question generation and then questions/answers annotations using an LLM-as-judge to grade the instruction and response. Also, to find a covering set of data-scarce domains of special interest, they recursively depth-first-search generate a taxonomy of subdomains where the leaves are prompts and the LLM enumerates $n$ subdomains to form a partition. The DataForge-generated data is used in both pre-training and post-training stages, with specific details provided in the post-training data section below. data takeaways - Data quality and mixture often dominate architecture tweaks at fixed compute. - Multi-stage schedules help: save the best data for late training to shape final behavior. - Deduplication and contamination checks are non-optional if you care about honest evals. - Ablate data mixtures at scale; small-model ablations can be misleading. mid-training Mid-training is the intermediary step between pre-training and post-training where the base model is trained further on a large amount of domain-specific tokens, especially shaping the model to focus on common core skills like coding or reasoning. Often-times, the decision to mid-train is only made after initial SFT experiments are run because they may reveal performance gaps that indicate the need to mid-train on certain domains. But if the goal is to elicit shallow capabilities like style or conversation, the compute is better spent in post-training. Some recipes include an additional long context stage ; for example, Qwen3 https://arxiv.org/abs/2505.09388 first trained on 30T tokens at 4k context, then a reasoning stage with 5T higher-quality tokens mainly on STEM and coding, and finally a long context stage at 32k context length. SmolLM3 also does this, but instead of scaling from 4k to 128k directly, they sequentially scale from 4k to 32k to 64k to 128k, which allows the model to adapt at each length before pushing the context length further. Upsampling long context documents like web articles or books improve long context https://arxiv.org/abs/2410.02660 , but Hugging Face didn’t observe improvement; they hypothesize that this is because their baseline mixture already includes long documents using RNoPE. To go from 4k to 32k and later to 64k, they use RoPE ABF and increase the base frequency to 2M and 5M, respectively. Base frequencies like 10M further improved slightly on RULER https://arxiv.org/abs/2404.06654 , long context benchmark, but it hurt short context tasks like GSM8k, so they were disregarded. To reach 128k, they found that using YARN from the 64k checkpoint instead of using a four-fold increase from 32k produced better performance, which confirms the hypothesis that training closer to the desired inference length benefits performance. Kimi K2 decays learning rate from 2e-5 to 7e-6, training on 400B tokens with 4k sequence length, then 60B tokens with a 32k sequence length. To extend to 128k, they use YARN. While the mid-training data usually comes from web data, another powerful approach is to use distilled reasoning tokens from a better model, as Phi-4-Mini-Reasoning did from DeepSeek-R1. When applied to the base model, distilled mid-training increased benchmark scores like AIM24 by 3x, MATH-500 by 11 points, and GPQA-D by almost 6 points. SmolLM3 also does distilled mid-training. They considered datasets including reasoning tokens from DeepSeek-R1 4M samples and QwQ-32B 1.2M samples but decide to delay using the Mixture of Thoughts https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts dataset until the final SFT mix. They found that it almost always makes sense to perform some amount of mid-training if the base model hasn’t already seen lots of reasoning data during pre-training, because they noticed that /no think reasoning mode also had improvements on reasoning benchmarks. post-training evals Given today’s standards of LLMs as coding agents and assistants that can reason, there are four broad classes of evals that researchers care about: Knowledge : for small models, GPQA Diamond tests graduate-level multi-choice questions and gives better signal than other evals like MMLU. Another good test for factuality is SimpleQA, although smaller models are much less performant due to limited knowledge. Math : AIME is still the leading benchmark, with others like MATH-500 providing a useful sanity check for small models. Code : LiveCodeBench tracks both coding competency via competitive programming while SWE-bench Verified is a more sophisticated alternative but much harder for smaller models. Multilinguality : there aren’t many options except for Global MMLU to target the languages that models were pretrained on/should perform well in. These evals test the following: Long context : RULER, HELMET, and more recently-released MRCR and GraphWalks benchmark long-context understanding. Instruction following : IFEval uses verifiers against verifiable instructions, and IFBench extends upon it with a more diverse set of constraints. For multi-turn, Multi-IF and MultiChallenge are preferred. Alignment : LMArena with human annotators and public leaderboards is the most popular. But due to the cost of these evaluations, LLM-as-judge evals have emerged, including AlpacaEval and MixEval. Tool calling : TAU-Bench tests a model’s ability to use tools to resolve user problems in customer service settings, including retail and airline. To prevent overfitting, evals that encapsulate robustness or adaptability, like GSMPlus which perturbs problems from GSM8k, are also included. Another way is using interval evals or vibe evaluations/arenas , such as manually probing the model’s behavior. Other tips include using small subsets to accelerate evals especially if there’s correlation with a larger eval , fixing the LLM-as-judge model if the eval requires it , treat anything used during ablations as validation, use avg@k accuracy, and try not to don’t benchmax post-training data intellect 3 It’s first worth mentioning that Intellect-3 is a 106B parameter MoE 12B activate post-trained on top of GLM-4.5-Air base model from Z.ai, and that they have their own post-training stack including prime-rl , an open framework for large-scale asynchronous RL, verifiers library for training and evals from their Environments Hub, sandbox code execution and compute orchestration. Integrating with the Environments Hub, Prime trains on a diverse and challenging mix of environments designed to improve coding and reasoning capabilities. For math, they design an environment with long CoT reasoning in mind, consisting of 21.2K challenging math problems from Skywork-OR1, Acereason-Math, DAPO, and ORZ-Hard, all of which are curated datasets derived from AIME, NuminaMath, Tulu3 math, and others which test difficult math questions from multiple choice to proofs to those involving figures. Even using verifiers, there were a non-trivial amount of false negatives, so they additionally use opencompass/CompassVerifier-7B as a LLM-judge verifier. For science mainly physics, chemistry, and biology , they filter 29.3K challenging problems from MegaScience https://arxiv.org/abs/2507.16812 while also using LLM-judge verification and standard math verifiers. For logic games like Sudoku or Minesweeper , 11.6K problems and verifiers were adapted from SynLogic https://arxiv.org/abs/2505.19641 . For code, they primarily use their Synthetic-2 dataset https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2 along with Prime Sandboxes to verify solutions. They also develop two SWE environments that support scaffolding for common formats like R2E-Gym https://arxiv.org/abs/2504.07164 , SWE-smith https://arxiv.org/abs/2504.21798 , and Multi-SWE-bench https://arxiv.org/abs/2504.02605 to fix issues within a Github project when equipped to Bash commands and edit tooling. Also, the maximum number of turns for the agent is set at 200. Prime also focuses on its deep research capabilities via their web search environment, which provides the model with a set of search tools. The environment tasks the model with answering questions from the dataset using tools and is rewarded either 1 or 0 using z-AI’s DeepDive dataset https://huggingface.co/datasets/zai-org/DeepDive , with 1K samples for SFT trajectory generation and 2.2K samples for RL. When tested in Qwen/Qwen3-4B-Instruct-2507, 26 steps of SFT with batch size of 34 followed by 120 steps of RL at a group size of 16 and batch size of 512 was enough to reach mean reward of 0.7. hermes 4 They use 300k prompts, mostly STEM and coding from WebInstruct-Verified https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified , rSTAR-Coder https://arxiv.org/abs/2505.21297 , and DeepMath-103k https://arxiv.org/abs/2504.11456 and apply deduplicating and filtering for prompts with 2k characters. Nous rejection samples against ~1k task-specific verifiers using Atropos https://nousresearch.com/introducing-atropos/ . Some environments used to generate the dataset include Answer Format Training : rewards succinctly-presented final answers, like $\mathtt{\backslash boxed{}}$ in LaTeX, but there are over 150 output formats sampled. The environment also enforces <think and </think delimiters. Instruction Following : leverages RLVR-IFEval https://huggingface.co/datasets/allenai/RLVR-IFeval for sets of verifiable tasks with constraint instructions like “Every $n^\text{th}$ word of your response must be in French.” Schema Adherence : facilitates generation producing a valid JSON object from natural language prompt and a schema and editing identifying and correcting validation errors within a malformed JSON object Tool Use : facilitates agentic behavior by training the model to generate reasoning and produce tool calls via the <tool call token. kimi k2 A critical capability that Kimi K2 chooses to focus on is tool use. While benchmarks like $\tau$-bench and ACEBench exist, it’s often difficult to construct real-world environments at scale due to cost, complexity, privacy, and accessibility. Kimi K2 builds off of ACEBench’s data synthesis framework to simulate real-world tool-use scenarios at scale: Tool spec generation : constructing a large repo of tool specs from real-world tools and LLM-synthetic tools Agent and task generation : for each tool-set sampled from the repo, generate an agent to use the toolset on some corresponding tasks Trajectory generation : for each agent/task, generate trajectories where the agent finishes the task Using 3k+ real MCP tools from Github and 20k synthetic tools generated hierarchically in domains like financial trading, software applications, and robot control. Diversification among agents is ensured via the combination of distinct system prompts with distinct tool combinations, and tasks are graded using an explicit rubric with an LLM judge. For RL, Kimi K2’s treatment for math, STEM, and logical tasks remains similar to those of other models. Coding and software engineering comes largely from competition-level programming problems and PRs/issues from GitHub. For instruction following, they use two verification mechanisms: deterministic evaluation via code interpreters for verifiable outputs and LLM-as-judge evaluations for non-verifiable outputs. The data was constructed using expert-crafted prompts and rubrics, agentic instruction augmentation inspired by AutoIF, and a fine-tuned model specialized for generating additional instructions probing specific failure modes or edge cases. chat template A few important considerations for designing/picking a good chat template include system role customizability , tool calling , reasoning , and compatibility with inference engines like vLLM or SGLang. Qwen3 and GPT-OSS satisfy all criteria, and Qwen3 is designed for hybrid reasoning. In SmolLM3, despite also being designed for hybrid reasoning, they discard the reasoning content for all but the final turn in the conversation to avoid blowing up the context during inference, but for training, it’s important to retain the reasoning tokens to condition the model properly. So, Hugging Face orchestrates their own chat template, satisfying all criteria. Vibe tests initially revealed a bug of not passing in the custom instructions into their custom template, but this was quickly patched. While deriving inspiration from the Qwen3 template, Intellect-3 always reasons not hybrid by proxy of being dominantly trained on reasoning-only SFT traces; they use qwen3 coder tool call parser and deepseek r1 reasoning parser to ensure reasoning chains are consistently represented. gpt-oss-120b uses the harmony chat template, which introduces “channels” that determine the visibility of each message. For example, final for answers shown to the user, commentary for tool calling, and analysis for CoT tokens. This allows the model to interleave tool calls with CoT. Hermes 4 adapts Llama 3’s chat template by changing the assistant to a first-person identifier after identifying the sensitivity to the token used for the assistant’s turn: \ \mathtt{<\lvert start\ header\ id\rvert assistant<\lvert end\ header\ id\rvert } \longrightarrow \mathtt{<\lvert start\ header\ id\rvert me<\lvert end\ header\ id\rvert }\ This results in markedly different behaviors, which is explored more in “behaviors and latent capabilities” subsection of “behaviors and safety” section. DeepSeek-R1-Zero’s chat template looks very similar to others, but additionally includes $\mathtt{ sft Most post-training pipelines start with supervised fine-tuning SFT because it’s cheap compared to RL, stable due to insensitivity to reward design and hyperparameters, and gives a strong baseline off of the base model. Usually, base models are too unrefined to benefit from more advanced post-training methods. SFT often comes in the form of distillation from stronger models . Strong models may suffer from success and skip the SFT stage because there are no stronger models to distill from such is the case with DeepSeek R1-Zero . Dataset curation for SFT is important; datasets might seem great on paper, but models trained on those datasets may end up overindexing on certain domains, such as science. To this end, Hugging Face curated a data mixture with ~100k examples and 76.1M tokens, mostly consisting of instruction following, reasoning, and steerability for both think and non-think modes. Importantly, data should be paired across modes because otherwise, there is not an indication of when to give a concise answer or use extended reasoning. For training, there are other considerations as well: full finetuning vs more parameter efficient methods like LoRA or QLoRA, specialized kernels like FlashAttention which reduces memory usage by recomputing attention on-the-fly during the backward pass, trading compute for memory or the likes of SonicMoE for more efficient compute usage, masking the loss for only assistant tokens, the type of parallelism needed, learning rate tuning, and sequence length tuning to match the distribution of data to speed up training more useful for larger datasets . Cute cross-entropy kernel CCE is a memory-efficient CUDA kernel for computing cross-entropy loss. Instead of materializing the full logit matrix in global memory, CCE computes only the logit for the correct token and evaluates the log-sum-exp over all vocabulary items on-the-fly using faster memory tiers, dramatically reducing memory consumption. The kernel leverages the sparsity of softmax by skipping gradient computation for elements with negligible contributions below numerical precision. This makes it particularly valuable for models with large vocabularies. In Intellect-3, Prime splits SFT into two stages: general reasoning SFT and agentic SFT . In the first, they use datasets consisting of math, code, science, tooling, chat, and instruction splits from Nemotron’s post-training dataset https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1 and AM-DeepSeek-R1-0528-Distilled https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-0528-Distilled for a total of 9.9B tokens. In the second stage, they target agentic behavior, tool use, and long-horizon control gpt-oss-120b also targets agentic behavior and tool use , using a mix of open-source agentic datasets like SWE-Swiss https://github.com/zhenyuhe00/SWE-Swiss and synthetically-generate datasets from the Environments Hub using DeepSeek-R1. Besides serving the purpose of fine-tuning for agentic behavior, this stage also has the effect of pushing the model toward longer effective context lengths. Using context parallelism , they scaled from a 65K context window to 98K. In Hermes 4, they also do two stages of SFT, both around reasoning. They noted that despite training on sequences at most 16k tokens in length, the reasoning lengths frequently exceed 41k tokens on reasoning tasks. So, they do a second stage to teach the model to generate the closing $\mathtt{</think }$ tags at 30k tokens, their budget. This insertion at a fixed token count allows the model to learn a counting behavior “when you reach $N$ tokens, stop” while ensuring that the model’s own distribution doesn’t change significantly. This also avoids the problem of model collapse when recursive training on full, self-generated outputs leads to distribution narrowing and quality degradation. capabilities The Hugging Face team found issues in generalising single-turn reasoning data to multi-turn data, stemming from the difficulty in differentiating /think and /no think tags between turns. So, they constructed a new dataset, IFThink, using Qwen3-32B that augmented single-turn instructions into multi-turn exchanges with verifiable instructions and reasoning traces; this dramatically improved multi-turn reasoning. Masking user turns is another design choice, since otherwise loss is computed on user queries as well, sacrificing producing high-quality assistant responses for predicting user queries. In practice, masking doesn’t have a huge impact on downstream evals, but still yielded improvements by a few points in most cases. sequence packing Sequence packing is another choice that improves training efficiency. The idea is similar to intra-document masking where sequences are packed into a batch so as to not waste padding compute via excessive padding tokens, but with the additional constraint of minimizing truncation of documents across batch boundaries. Figure 9 : Comparison of sequence packing strategies. From Hugging Face https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook . In the image, the last packing method uses the best-fit decreasing implemented in TRL , where each sequence is placed in the batch that minimizes the remaining space after insertion. Another method, which Hermes-4 uses, is first-fit decreasing , which places a sequence in the first batch that has enough remaining space, which achieves $ 99.9\%$ batch efficiency. Despite yielding up to a 33x tokens/batch/optimization step, for a fixed token budget, packing alters training dynamics since the more data means fewer gradient updates. This especially hurts small datasets where each sample matters more. An effective batch size of 128 hurt evals like IFEval by up to 10%; for effective batch sizes larger than 32, there was an average drop in performance for SmolLM3 and the dataset . But for large datasets, packing is almost always beneficial . learning rate and epochs The learning rate for SFT is usually an order of magnitude smaller than that for pre-training since the model has already learned rich representations, and aggressive updates can lead to catastrophic forgetting. And because SFT runtime is so short compared to pre-training, it makes sense to do full learning rate sweeps. For SmolLM3, learning rates of 3e-6 or 1e-5 worked best. When packing is enabled, it’s safer to decrease the learning rate further due to the larger effective batch size and getting fewer updates for the same token budget. Once a good data mixture is identified and hyperparameters are tuned, training on more than one epoch what was usually done in ablations also leads to increased performance by a few percentage points; on LiveCodeBench v4, performance nearly doubled from epoch two to three. An interesting idea explored is whether the optimizers for pre/post-training should be the same. AdamW remains the default choice for both pre/post-training, and when tested with Muon, using the same optimizer still yielded the best performance. preference optimization PO Because SFT is fundamentally a form of imitation learning , extremely large SFT datasets can be redundant due to diminishing gains or failure modes that aren’t encapsulated in the data. Another useful signal is preference , i.e. which response, A or B, is preferred and enables model performance to scale beyond the limits of SFT alone. Also, less data is needed for preference optimization than SFT since the starting point is already strong. For generating preference datasets, there are a few methods: Strong vs weak : for fixed prompts $x$, strong model $S$ and weak model $W$, always prefer the stronger model’s output $y S$ over the weaker model’s output $y W$. This is easy to construct since the stronger model’s output is reliably better. With methods like DPO, the difference between strong and weak responses can be enforced. On-policy with grading : using the same model and prompt, generate multiple candidate responses and have an external model e.g. LLM-as-judges that score responses using rubrics or verifiers that provides preference labels. This requires a well-calibrated and reliable LLM-as-judge, but also allows for ongoing bootstrapping of preference data. While preference optimization is generally thought as a medium to improve helpfulness or alignment, it can also teach models to reason better, like using strong-vs-weak preferences. There are typically three hyperparameters that affect training dynamics: Learning rate : when tested across sizes of being 2x to 200x smaller than the learning rate used in SFT, Zephyr7B found that using a 10x smaller lr provided best performance improvement, and SmolLM3 ended using a 20x smaller lr 1e-6 to balance performance between /think and /no think modes. $\beta$ : ranging from 0 to 1, it controls whether to stay closer to the reference model loss $\beta$ or closer to the preference data higher $\beta$ . If too large, it could erase capabilities from the SFT checkpoint, so $\beta$ values around 0.1 or higher are usually preferable preference dataset size : when tested with sizes from 2k to 340k pairs, performance largely remained stable, although Hugging Face noted performance drops in extended thinking for datasets beyond 100k pairs. To that point, don’t be afraid to create your own preference data, especially with how cheap inference has become. algorithms Besides vanilla DPO direct preference optimization https://arxiv.org/abs/2305.18290 , researchers have explored a variety of alternatives: KTO Kahneman-Tversky Optimization https://arxiv.org/abs/2402.01306 : instead of pairs, KTO assigns updates based on whether a sample is labeled desirable/undesirable, taking ideas from human decision making along with a reference point $z 0$ and a reward-like log-ratio term. ORPO odds ratio preference optimization https://arxiv.org/abs/2403.07691 : incorporates PO with SFT via an integrated odds ratio term to the cross-entropy loss. This makes it more computationally efficient since there is no need to use a separate reference model that is used in DPO to compute $r \theta x,y $. APO anchored preference optimization https://arxiv.org/abs/2408.06266 : rather than just optimizing the difference between $y^+$ and $y^-$ in DPO, APO-zero forces $y^+$ up and $y^-$ down while APO-down pushes both $y^+, y^-$ down useful if the quality of $y^+$ is below that of the current model Hugging Face found that APO-zero had the best overall out-of-domain performance. RL SFT and PO can hit ceilings because fundamentally, they optimize to produce outputs that look like the dataset and PO is often off-policy and weak at multi-step credit assignment. RL reinforcement learning helps by providing a reward signal through interaction with an environment. Verifiers can automatically check correctness and provide those reward signals, and objectives can be optimized for beyond preference labels. In RLHF RL from human feedback , human comparisons are provided, and a reward model is trained to predict the human preference signal. Then, the policy is fine-tuned with RL to maximize the learned reward. This way, RLHF does on-policy optimization since it does rollouts using the current policy used in training and updates based on the reward given by the reward model. This also allows RLHF to discover behaviors not present in the preference dataset. In RLVR RL with verifiable rewards , popularised by DeepSeek-R1, verifiers check whether a model’s output matches criteria e.g. whether it produced the correct math answer or passed all code unit tests to generate reward signals, making it more scalable and objective. Then, the policy is fine-tuned to produce more verifiably-correct outputs. RLVR is particularly valuable when reward drift is a concern verifiers provide very stable signals compared to learned reward models , when KL control is needed to prevent policy collapse, and when addressing stale-policy artifacts in multi-step reasoning tasks. Although policy optimization algorithms are commonly on-policy, like GRPO, in practice, to maximize throughput, they may actually be slightly off-policy. For example, in GRPO, without freezing the policy, generating multiple rollout batches and doing optimizer updates sequentially makes only the first batch on-policy and all subsequent batches off-policy; this is known as in-flight updates . In-flight updates matter most when throughput is critical e.g., large-scale RL training , when reward drift could accumulate from stale policies, when KL divergence between inference and training policies needs careful monitoring, and when long rollouts span multiple policy updates. The tradeoff is between training efficiency and policy consistency; techniques like importance sampling clipping as in IcePop help mitigate the off-policy bias. Figure 10 : Comparison of conventional RL and in-flight updating. From Pipeline RL paper https://arxiv.org/abs/2509.19128 . In the context of Intellect-3, which uses a CPU orchestrator between two clusters one for training and one for inference , the orchestrator continuously polls the trainer to update the inference pool once a new policy is available, and the inference pool temporarily halts generation to update the weights, then continues with rollouts. In this way, a long rollout could be generated by multiple policies, but they limit this by a max off policy steps parameter to limit policy drift. Also, they implement IcePop https://www.emergentmind.com/topics/icepop to stabilize MoE training: where $\mathcal{M} k =k$ if $k \in \alpha, \beta $ and $0$ otherwise. The purpose of $\mathcal{M}$ is to account for the off-policy nature between the training policy and the inference policy such that they don’t diverge significantly for each token. When the importance weight $k$ falls outside $ \alpha, \beta $, $\mathcal{M}$ clips it to 0, effectively ignoring that token’s contribution to the gradient. This is the idea behind importance sampling , where rollouts come from the inference policy, but we are optimizing for the training policy. Prime uses the default $\alpha = 0.5, \beta = 5$. $\alpha$ and $\beta$ need not be symmetric in the multiplicative sense. One reason for this is under the rare instance when $\pi \text{infer}$ is small large $k$ , a tighter $\beta$ would clip the high-entropy tokens, which would make learning dynamics worse. Kimi K2 adapts a different policy optimization approach from their previous model K1.5: \ L \text{RL} \theta = \mathbb{E} {x \sim \mathcal{D}}\left \frac1{K} \sum {i=1}^K\left \left r x, y i - \overline{r} x - \tau \log\frac{\pi \theta y i | x }{\pi \text{old} y i | x }\right ^2\right \right \ where $\overline{r} x = \frac1{K} \sum {i=1}^K r x,y i $ is the mean reward of sampled responses and $\tau 0$ is a regularization parameter for stable learning, akin to KL divergence. Another consideration is PTX loss : pre-training cross-entropy loss. During joint RL training, the model can catastrophically forget valuable, high-quality data. So, they curate a dataset using hand-selected, high-quality samples and integrate them into the RL objective via the PTX loss. The advantages are twofold: high-quality data can be leveraged, and the risk of overfitting to the tasks present during RL can be mitigated, which leads to better generalization. To balance exploration and exploitation throughout training, they implement temperature decay . For tasks like creative writing and complex reasoning, high temperatures during the initial stages is important to generate diverse and innovative responses; this prevents premature convergence to local minima and facilitates the discovery of effective strategies. At later stages, the temperature is decayed following a schedule so that there is not excessive randomness, and that it doesn’t compromise the reliability/consistency of the model’s outputs. DeepSeek-R1-Zero stands out as an exception compared to other models because it demonstrates that RL can be effective even without supervised fine-tuning. It cold-starts RL specifically, GRPO to save training costs; they use 10.4k steps, batch size of 512, reference policy replacement every 400 steps along with lr as 3e-6, KL coefficient as 0.001 on reasoning tasks without any supervised data. The reward system uses two types of rewards: accuracy rewards based on correctness of the response and format rewards that enforce the model to put its thinking process between thinking tags. It obtains robust reasoning capabilities using pure RL, which validates the ability to learn reasoning and generalize effectively. Moreover, behaviors including reflection re-evaluating previous steps and exploring alternative approaches to problem-solving emerge, which further enhances reasoning. The count of reflective words such as “wait” or “mistake” rises 5-7-fold compared to the start of training. Moreover, this reflective behavior appeared relatively suddenly: between steps 4k-7k, there was only occasional usage, but after step 8k, it exhibited significant spikes. Figure 11 : The multi-stage pipeline of DeepSeek-R1 with intermediate checkpoints DeepSeek-R1 Dev1, Dev2, and Dev3. From DeepSeek https://arxiv.org/pdf/2501.12948 . For DeepSeek-R1, they collect thousands of long CoT data to finetune the DeepSeek-V3-Base as the starting point for RL. From DeepSeek-R1-Zero, they learned that readability was an issue: responses sometimes mixed multiple languages or lacked markdown formatting for highlighting answers. They remedy the former by introducing a language consistency reward portion of target language words in the CoT , and the latter by designing a readable pattern that includes a summary at the end of each response. They also perform a second RL stage aimed at improving the model’s helpfulness and harmlessness while retaining reasoning capabilities. For helpfulness , they focus on emphasizing the utility and relevance of the final summary. To generate preference pairs, they query DeepSeek-V3 four times and randomly assign responses as either Response A or Response B; they then average the independent judgments and retain pairs where the score difference is sufficiently large and use a pairwise loss to define the objective. For harmlessness , they evaluate the response to identify and mitigate any potential risk, biases, or harmful content. Using a dataset with model-generated responses annotated as “safe” or “unsafe” according to safety guidelines, the reward model is trained using a point-wise methodology to distinguish between safe/unsafe responses. Impressively, they found that distilling DeepSeek-R1’s outputs into smaller models like Qwen-32B significantly improved reasoning capabilities, even compared to large-scale RL, which further requires significantly more compute. Moreover, it shows that even while distillation strategies are effective and economical, we will increasingly require more powerful base models and larger-scale RL. Figure 12 : Comparison of DeepSeek-R1-distilled and RL Models on Reasoning-Related Benchmarks. From DeepSeek https://arxiv.org/pdf/2501.12948 . Beyond the main training pipeline, DeepSeek’s appendix documents additional considerations that influenced their design choices: GRPO over PPO : PPO has a per-token KL penalty the KL between sequence distributions decomposes into a sum over time of KL between tokens, given the autoregressive nature . Because RL does enable longer reasoning over time and PPO implicitly penalizes the length of the response and is less computationally expensive due to not needing an additional value model , GRPO is preferable; on MATH tasks, GRPO consistently performed better than PPO with $\lambda=1.0$, which further consistently performs better than PPO with $\lambda=0.95$. product-driven DeepSeek-R1 : users find responses more intuitive when the reasoning process aligns with first-person thought patterns. So, after finetuning on a small amount of long CoT data, DeepSeek-R1 uses “I” more whereas DeepSeek-R1-Zero uses “we” more. Other considerations were previously mentioned, like language consistency while ensuring CoT remains coherent and aligned. The raw CoT produced by DeepSeek-R1-Zero may have possessed potential beyond limitations of current human priors, so human annotators convert the reasoning trace into one that is more human-interpretable/conversational. temperature for reasoning : they observed that greedy decoding to evaluate long-output reasoning models resulted in higher repetition rates and more variability. This coincides with recent research https://arxiv.org/pdf/2512.12895 , and it can be explained by risk aversion due to hardness of learning and inductive bias for temporally correlated errors , which describes that at decision points, the model tends to reselect previously favored actions resulting in looping . RLVR and rubrics The goal of RLVR on hybrid reasoning models is to improve reasoning capabilities without extending the token count too radically. For /no think , naively applying GRPO can lead to reward hacking since the model begins to emit longer CoT shifting towards /think ; as such, both reward and token length increase. SmolLM3 observed this and found that RLVRed /no think traces showed cognitive behaviors https://arxiv.org/abs/2503.01307 like “Wait, …” associated with reasoning models. This can be mitigated via an overlong completion penalty https://arxiv.org/abs/2503.14476 which penalizes completions over a certain length, which is a function parametrized by a soft punishment threshold and a hard punishment threshold/max completion length. Penalty increases from the soft to the hard threshold, and past the latter, the punishment is -1 effective reward = 0 . For /no think , SmolLM3 decided on a length penalty in the range of 2.5k-3k that balanced improvement in performance and increase in response length. However, doing RL jointly on hybrid reasoning models is difficult since it requires separate length penalties, whose interplay can cause instability. This is also why labs release instruct and reasoning variants separately. Kimi K2 uses a self-critique rubric reward mechanism, where the model evaluates its own outputs to generate preference signals. The K2 actor generates $k$ rollouts, and the K2 critic ranks all results by performing pairwise evaluations against a combination of rubrics; these combine core rubrics fundamental values and prescriptive rubrics aimed to eliminate reward hacking , and human-annotated rubrics for specific instructional contexts . The critic model is refined using verifiable signals, and this process of transfer learning grounds its more subjective judgments in verifiable data. This should allow the critic to recalibrate its evaluation standard in lockstep with the policy’s evolution. online data filtering To RL effectively, curriculum learning is another effective way which gradually exposes the model to progressively difficult problems. First, problems are sorted into difficulty pools such as easy, medium, and hard based on the problem’s observed solve rate; In Intellect-3 for math and coding, this was done via querying Qwen/Qwen3-4B-Thinking-2507 over eight generations per problem while for science and logic, they queried the same model 16 times. Then, during each stage, they maintain a balanced curriculum that avoids training with trivially easy or overly difficult problems which don’t give meaningful learning signal and also helps maintain gradients in GRPO . In Kimi K2, this was done by using the SFT model’s pass@k accuracy. alternatives to RL One alternative is online DPO see “On policy with grading” in the preference optimization section . Another is on-policy distillation . Instead of preferences, the signal comes from a stronger teacher model, where the student samples responses at every training step and the KL divergence between the student/teacher logits provides the learning signal. That way, the student can continuously learn from the teacher. Also, on-policy distillation is much cheaper than GRPO since instead of sampling multiple rollouts per prompt, we only sample one, which is graded by the teacher in a single forward-backward pass; its performance boost, as the Qwen3 tech report notes, can be larger across the board as well. One limiting factor is that the student and the teacher must share the same tokenizer, and Hugging Face’s General On-Policy Logit Distillation https://huggingface.co/spaces/Hugging FaceH4/on-policy-distillation GOLD allows any teacher to be distilled into any student. Thinking Machine’s blog https://thinkingmachines.ai/blog/on-policy-distillation/ further showed that on-policy distillation mitigates catastrophic forgetting , where a model post-trained on a new model regresses on other, previous domains. Specifically, they found that mid-training 70% and with on-policy distillation can achieve close to the best performance of a model and its mid-trained version, effectively restoring behavior with cheap distillation. Given these aforementioned algorithms, choosing between them can be hard; Hugging Face aptly describes it: | Algorithm | When to Use | Tradeoffs | Best for Model Size | |---|---|---|---| | Online DPO | You can get preference labels cheaply. Best for aligning behaviour with evolving distributions. | Easy to scale iteratively, more stable than RL, but depends on label quality and coverage. Supported in few training frameworks. | Any size, where preferences capture improvements beyond imitation. | | On-policy distillation | You have access to a stronger teacher model and want to transfer capabilities efficiently. | Simple to implement, cheap to run, inherits teacher biases, ceiling limited by teacher. Supported only in TRL and NemoRL | Most effective for small to mid-sized models <30B . | | Reinforcement learning | Best when you have verifiable rewards or tasks requiring multi-step reasoning/planning. Can be used with reward models, but there are challenges like reward-hacking, where the model takes advantage in weaknesses in the reward model. | Flexible and powerful, but costly and harder to stabilise; requires careful reward shaping. Supported in most post-training frameworks. | Mid to large models 20B+ , where extra capacity lets them exploit structured reward signals. | And for DPO semi-online and online , it is also possible to match GRPO using far less compute. Specifically, they found that semi-online DPO with syncing between the trainer and the generator every 100 steps was generally the best compared to semi-online DPO with sync every 10 steps, online DPO, and GRPO. limitations DeepSeek shares other experimental methods when developing DeepSeek-R1 that ultimately failed. Monte Carlo Tree Search MCTS , inspired by AlphaGo https://arxiv.org/abs/1712.01815 and AlphaZero https://arxiv.org/abs/1712.01815 , was implemented to test enhancing test-time compute scalability. This breaks answers into smaller parts to allow the model to explore the solution space systematically. To do this, they prompted the model to generate tags corresponding to reasoning steps necessary. The problem is that token generation exists in an exponentially larger search space compared to chess. So, they set a max extension limit for each node, but this leads to the model getting stuck in local optima. Moreover, training a fine-grained value model is difficult, also due to complexities of token generation. They also explored process reward models , which rewards intermediate thoughts in multi-step tasks. DeepSeek acknowledges three limitations: defining a fine-grained step in general reasoning is difficult, determining whether the current intermediate step is difficult LLM-as-judge might not yield satisfactory results , and it leads to reward hacking because the model would optimize for the appearance of good reasoning without doing the underlying work. post-training takeaways - SFT is the stable baseline; preference/RL methods should be justified by verifiable rewards or clear gains. - Hybrid reasoning models need careful length control to avoid reward hacking. - Tool-use and agentic datasets are now first-class post-training targets. - Many “fancy” methods fail in practice; track what doesn’t work, not just what does. behaviors and safety safety testing and mitigation During post-training, they perform an additional stage of RL to reward answers that comply with OpenAI’s policy against unsafe prompts. Because all of these models have open model weights, then one worry is that malicious parties can enhance the model’s harmful capabilities. By running Preparedness evaluations on gpt-oss-120b, OpenAI confirmed that the model doesn’t achieve threshold for high capability in biological/chemical capability, cyber capability, and AI self-improvement. They also tested whether adversarial actors could fine-tune gpt-oss-120b to reach high capability in the aforementioned domains. They simulated the attack by either: - evaluating the model in collegiate Capture the Flag challenges or cyber operations in a realistic, emulated network - creating adversarially fine-tuned versions via incrementally training it on in-domain human expert data related to biorisk - benchmarking on SWE-bench verified https://openai.com/index/introducing-swe-bench-verified/ , OpenAI PRs, and PaperBench https://arxiv.org/abs/2504.01848 ability to recreate AI research Upon review, their safety advisory group concluded that even with robust fine-tuning, gpt-oss-120b still couldn’t reach high capability in the aforementioned domains. Also, they determined whether releasing the model could advance the frontier of biological capabilities, including evaluations like virology and tacit knowledge, in open foundation models which also increases risk and found that there were other open weight models at or near gpt-oss-120b, so they decided that releasing has low impact on the frontier. OpenAI also evaluated safety performance using other indicators: Disallowed content : benchmarked on ProductionBenchmarks , they consider different categories such as PII, sexual, harassment, hate, self-harm using conversations more representative of production data. They evaluate completions using LLM-as-judge and determine not unsafe according to the relevant OpenAI policy. Performance is on par with o4-mini. Jailbreaks : they test with adversarial prompts that try to circumvent model refusals, specifically StrongReject which inserts jailbreak examples into the safety refusal eval and also uses the same policy grader. Performance is on par with o4-mini. Instruction Hierarchy : it follows system developer user assistant tool. They post-trained the model using system, developer, and user messages, sometimes conflicting, and the model must learn to choose the instruction higher in the hierarchy. This includes testing system prompt extraction via a user message and prompt injection hijacking. For PII protection, performance is on par with o4-mini, but for message conflict, gpt-oss-120b underperforms o4-mini by ~15%. Hallucinations and CoT : reasoning model’s CoT can be very helpful for detecting misbehavior, and that models could learn to hide their thinking while misbehaving when pressured against having “bad thoughts.” To measure hallucinations, they use two datasets of either fact-seeking questions or publicly available facts about people and consider accuracy and hallucination rate. Here, hallucination rate is not just defined as 1 - accuracy, since the model can also output answers like “I don’t know.” Performance is a bit worse than o4-mini, which is expected given model size. Fairness and Bias : they also evaluate gpt-oss-120b on the BBQ evaluation https://arxiv.org/abs/2110.08193 which tests social bias against people belonging to protected classes along nine social dimensions. Performance is on par with o4-mini . behaviors and latent capabilities Continuing from the chat template section, Nous’ decision to change the token used from the assistant’s turn from $\mathtt{assistant}$ to $\mathtt{me}$ enabled Hermes 4 to adopt a first-person, peer-like persona. Hermes 4 generates responses with fewer meta-disclaimers and more consistent voice embodiment, resulting in higher behavioral plasticity that is not as common in large models. Similarly, Hermes 4 demonstrates comparatively greater contextual fidelity over policy rigidity . Most other large models will follow policy compliance even when faced with fictional or controlled prompts such as issuing disclaimers or reformulated responses to align with safety constraints , but Hermes 4 interprets fictional prompts more as role-play and generates in-character responses. This also means that Hermes 4 has a lower refusal rate; on their internal RefusalBench, they found that Hermes 4 reasoning ranked highest lowest refusal rate among all tested models, whereas gpt-oss-120b and gpt-oss-20b, perhaps unsurprisingly, had the lowest scores highest refusal rate . Also, this level of embodied personas extends even to political analysis, where the model produces reasoning balancing factual recall with nuanced framing, and less of policy-driven hedging common to other large models. Excessive sycophancy is an undesired behavioral trait, so most models apply an anti-sycophancy system prompt to adjust surface-level politeness while leaving underlying reasoning unchanged. When implemented in Hermes 4, Nous observed a deeper shift: CoT traces reflect the aim to steer user interaction away from inference . This sometimes introduces embodied or emphatic language. the training marathon Before the main training run starts, ensure the infrastructure is ready. This includes Slurm reservations on clusters , stress-testing GPUs GPU Fryer https://github.com/huggingface/gpu-fryer or DCGM https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html , and avoid storage bloat by uploading checkpoints to third parties and deleting local copies after saving the next. To this end, checkpoint and auto-resume systems are important. Evals are also deceptively time-consuming Allen Institute spent roughly 20% on compute on evals , so ensuring automation and logging not just evaluation scores, but also throughput, loss, gradient norm, and node health is crucial. vanishing throughput Hugging Face observed a ~40% drop in throughput 14k to 8k tokens/sec/GPU after a few hours of starting the main run. The issue came from data storage; their cluster uses a network-attached storage with a “keep-hot” caching model that stores frequently accessed files and evicts “cold” files to third-party S3. With 24TB of training data, the storage was pushed to its limit, so it evicted dataset shards mid-training. This meant fetching them back and creating stalls that slowed throughput. The first fix came in the form of swapping the storage method by reserving a spare node with the dataset preloaded and copying using fpsync s5cmd took double the time . This fixed the issue of a node dying and the replacement GPU having no data since by swapping it with the spare node, training could continue. So, the new spare, not to be wasted, could run evals or dev jobs. Testing again, they found smaller but still prominent drops in throughput. After experimenting with individual nodes that yielded the same result, they focused on the change in training steps and found that smaller step counts resulted in smaller throughput drops. The nanotron dataloader they were using was growing the lookup table making the training step to the next chunk of tokens to read instead of keeping it bounded or precomputed. Stored in global memory, the growing table causes allocation failures and page faults/worse cache locality. So, they switched to Tokenizedbytes dataloader, solving the throughput issue. noisy loss However, the loss curve for SmolLM3 looked more noisy. They found the issue with the dataloader because it reads sequences sequentially for each document. Without shuffling of sequences , batches are no longer representative of the overall data distribution, increasing gradient variance. Also, a long file e.g. code would supply many consecutive sequences that would also spike loss. To fix, they reshuffled the tokenized sequences offline; an alternative was changing the dataloader to do random access, which has both higher memory usage and slower runtime. tensor parallelism After two days and 1T tokens, evals showed that with a similar recipe, SmolLM2 1.7B was more performant at the same stage in training as SmolLM3 was. The team found the issue with tensor parallelism : the weights of SmolLM2 fit on a single GPU, whereas for SmolLM3, they had to be shared across 2 GPUs. Further, the two TP ranks were initialised with the same random seed instead of different seeds, which causes similar activations/gradients, a loss of diversity of features, and lower convergence. multi-client orchestrator Inference throughput should scale linearly with the number of nodes used. However, Prime found that the standard multi-node data-parallel strategy provided by vLLM didn’t deliver this because as nodes increased, throughput plateaued. They abstracted the multi-client orchestrator so that each inference node is deployed on an independent server runs its own vLLM engine and scheduler, manages its own KV cache and batches its own requests , and the orchestrator maintains one client per node avoids single-shared queue bottleneck . Groups rollout requests are distributed across clients according to round-robin scheduling, which keeps utilization balanced. the usual suspects There are a few common culprits for training instabilities: high learning rate, bad data, data-parameter state interactions spikes can come from specific combinations of data batches and model parameter states https://arxiv.org/abs/2204.02311 , poor initialisation OLMo2 https://arxiv.org/abs/2501.00656 revealed that $\mathcal{N} 0, 0.02 $ can improve stability upon scaled initialisation , and precision eww, not fp16 . Besides aforementioned ideas like logit softcapping , z-loss , or QK-norm , data filtering OLMo2 removed documents with repeated n-grams, specifically those with 32+ repetitions of 1-13 token spans significantly reduces spike frequency. If spikes still occur, common methods include retraining around the spike by skipping problematic batches or tightening gradient clipping . training ops takeaways - Throughput failures are often data pipeline or storage issues, not model code. - Dataloader behavior shuffling, packing, access patterns can silently change training dynamics. - Seed handling in parallelism setups is a high-leverage detail; verify it early. - Treat evals and logging as first-class citizens; they are how you notice regressions.