Frontier Model Training Methodologies Seven open-weight frontier models, including Hugging Face's SmolLM3, DeepSeek-R1, and OpenAI's gpt-oss-120b, were analyzed to distill common training methodologies for multi-billion parameter models, with a focus on architecture, stability, data curation, and optimization techniques. The analysis reveals that frontier model training is primarily a systems engineering challenge, where data mixture, architecture choices, and stability measures outweigh algorithmic tweaks, and that most training failures stem from high learning rates, problematic data batches, or load imbalance in mixture-of-experts models. The findings provide a minimal training playbook emphasizing early evaluation locking, baseline architecture selection, and rigorous data pipeline construction to guide future large-scale model development. frontier model training methodologies How do labs train a frontier, multi-billion parameter model? We look towards seven open-weight frontier models: Hugging Face’s SmolLM3 https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook , Prime Intellect’s Intellect 3 https://arxiv.org/abs/2512.16144 , Nous Research’s Hermes 4 https://arxiv.org/abs/2508.18255 , OpenAI’s gpt-oss-120b https://arxiv.org/pdf/2508.10925 , Moonshot’s Kimi K2 https://arxiv.org/pdf/2507.20534 , DeepSeek’s DeepSeek-R1 https://arxiv.org/pdf/2501.12948 , and Arcee’s Trinity series https://github.com/arcee-ai/trinity-large-tech-report/blob/main/Arcee%20Trinity%20Large.pdf . This blog is an attempt at distilling the techniques, motivations, and considerations used to train their models with an emphasis on training methodology over infrastructure. These notes are largely structured based on Hugging Face’s SmolLM3 report https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook due to its extensiveness, and it is currently supplemented with notes from other reports including Intellect-3, gpt-oss-120b, Hermes 4, DeepSeek, and Kimi. 
While this blog explores some infrastructure-related ideas like in-flight weight updates and multi-client orchestrators, there are many other ideas mentioned throughout those posts/blogs like expert parallelism and quantization. Hugging Face writes more about gpt-oss-120b’s infrastructure here https://huggingface.co/blog/faster-transformers .

table of contents

- tl;dr
- architecture and set-up
- stability
- tokenizer
- optimizers and training hyperparameters
- data curation and pre-training
- mid-training
- post-training
- behaviors and safety
- the training marathon

tl;dr

- Frontier training is a systems problem: data mixture, architecture, and stability choices dominate most algorithmic tweaks.
- Start from a strong baseline and ablate fast and reliably; derisk changes and avoid multi-variable edits.
- For long context, document masking + RNoPE/YaRN-style scaling is a robust default; attention variants trade compute for reach.
- GQA with small groups (2/4/8) typically outperforms MHA and MQA in ablations at similar model scales; MLA cuts KV cache but raises implementation complexity.
- MoE is efficient when it is load-balanced; routing, auxiliary or bias balancing, and global stats are non-negotiable.
- Tokenizer design should mirror target data; vocab size trades embedding cost against token compression and KV cache.
- AdamW is still the default; Muon can help but needs careful infra (all-to-all, padding, scaling quirks).
- Scaling laws guide, but many frontier models overtrain; inference cost and sparsity tradeoffs often drive final choices.
- Data scheduling matters: multi-stage mixtures and late-stage high-quality injection shape final behavior.
- Mid-training and post-training (SFT + preference/RL/distillation) often determine reasoning and tool-use behavior.
- Training ops are frequent failure points: dataloader design, throughput, seeds in TP, and checkpointing.
- Most training failures stem from common causes: high learning rates, problematic data batches, load imbalance in MoE models, or storage/infrastructure issues (see “the usual suspects” section for details).

a minimal training playbook

- Define the product goal and lock evals early across knowledge, math, code, long-context, and instruction following.
- Pick a baseline architecture with known failure modes; default to dense + GQA + RoPE/RNoPE unless MoE is essential.
- Choose a tokenizer matched to your target languages and domains; freeze vocab and special tokens early.
- Build the data pipeline with deduplication, filtering, and contamination checks; measure data quality explicitly.
- Run small ablations for attention, positional encoding, optimizer, and learning rate schedule; change one variable at a time.
- Plan a multi-stage data mixture; delay the best data and reasoning-heavy data toward the end.
- Add stability guardrails: logit softcapping (preferred, per Gemma) or z-loss/QK-norm, gradient clipping, precision policy, loss spike alerts.
- Validate throughput on long runs and confirm dataloader behavior (packing, shuffling, random access).
- Run the main training with interval evals and consistent seeds, especially for tensor parallelism.
- Mid-train for domain gaps if SFT reveals them; extend context length gradually (4k → 32k → 64k → 128k).
- Post-train with SFT, then choose preference/RL/distillation based on verifiable rewards and tool-use goals.
- Re-evaluate, run safety checks, and lock a release checkpoint with full logs and configs.

general practices

- “Learn to identify what’s worth testing, not just how to run tests.
Perfect ablations on irrelevant choices waste as much compute as sloppy ablations on important ones.”
- Ablations need to be fast (faster iteration $\rightarrow$ more hypotheses tested) and reliable (they need strong discriminative power; otherwise, an observed difference may just be noise).
  - “The real value of a solid ablation setup goes beyond just building a good model. When things inevitably go wrong during our main training run (and they will, no matter how much we prepare), we want to be confident in every decision we made and quickly identify which components weren’t properly tested and could be causing the issues. This preparation saves debugging time and keeps our sanity intact. There’s nothing worse than staring at a mysterious training failure with no idea where the bug could be hiding.”
- Choose an established baseline with good architecture and training setup design. These take years of iteration, and people have discovered common failure modes and instabilities.
  - There are a plethora of modifiable components (attention mechanisms and positional encodings, to name a few), but follow the principle of derisking: “never change anything unless you’ve tested that it helps.”
- In evals, look for monotonicity (score improvement), low noise (e.g. score resistance to random seeds), above-random performance (random-level performance for extended time frames isn’t useful), and ranking consistency (ranking of approaches should remain stable throughout training).
- Prioritize evals: between pre-training and post-training, core evals should be preserved, and their implementation should be finished long before the base model is finished training.
- Balance exploration and execution. For methods, choose flexibility and stability over peak performance, and set a deadline for exploration.
architecture and set-up

Architecture decisions fundamentally determine a model’s efficiency, capabilities, and training dynamics. Model families like DeepSeek, gpt-oss-120b, Kimi, and SmolLM have vastly different architectures (dense vs MoE), attention mechanisms (MHA vs MLA vs GQA), and position encodings (RoPE, partial RoPE, NoPE), among many other differences. Not all information about the models is publicly available, so only a subset is listed:

| | Kimi-K2 | Trinity Large | gpt-oss-120b | OLMo 3 | SmolLM |
|---|---|---|---|---|---|
| Parameter Count | 1.06T | 400B | 116.83B | 32B | 3B |
| Attention | MLA | GQA (8 groups) | GQA (8 groups) | GQA (?) | GQA (4 groups) |
| Positional Embedding | RoPE (?) + YaRN | RoPE + YaRN | RoPE + YaRN | RoPE + YaRN | RNoPE + YaRN |
| Architecture | MoE | MoE | MoE | dense | dense |
| Tokenizer | | | o200k_harmony https://github.com/openai/tiktoken | cl100k https://github.com/openai/tiktoken | |

When choosing architecture, Hugging Face suggests following a decision tree such that if one of these is true, choose a dense architecture:

- memory-constrained (since MoEs must have all experts loaded)
- new to LLM training (focus on basics)
- tighter timeline (simpler training with well-documented recipes)

architecture decision heuristics

- If you are memory- or infra-constrained, default to a dense model with GQA and RoPE/RNoPE.
- If you need inference efficiency at scale and can manage routing complexity, consider MoE with strong load balancing.
- If long context is a core requirement, plan for document masking plus RoPE scaling (ABF/YaRN) or RNoPE variants.
- If you need simpler kernels and faster iteration, avoid novel attention variants unless you can ablate them cleanly.

attention

Multi-head attention (MHA) uses separate query, key, and value projections for each attention head, but this creates a large KV-cache that becomes an inference bottleneck and GPU memory hoarder.
To address this, researchers developed multi-query attention https://arxiv.org/abs/1911.02150 (MQA) and grouped query attention https://arxiv.org/abs/2305.13245 (GQA). In MQA, KV values are shared across all heads, but this comes at the cost of attention capacity because heads can’t store information specialized to each head’s role. GQA softens this issue by sharing KV values across a small group of heads (e.g. 4). Another alternative is multi-head latent attention (MLA), which stores a compressed latent variable that can be decompressed/projected into KV values at runtime. The latent variable is typically much smaller than the full KV cache (often achieving 4-8x compression), and this results in a KV-cache size more comparable to GQA while maintaining performance stronger than MQA.

When ablating variables that change the parameter count (such as changing MHA to GQA, where they occasionally adjust other hyperparameters to keep model sizes roughly the same), Hugging Face found that GQA with small groups (2/4/8) outperformed MHA in their ablations and that MHA outperformed MQA and GQA with 16 groups. Across benchmarks like HellaSwag, MMLU, and ARC, GQA with 2/4/8 groups performed best in their experiments.

gated attention

Gated attention applies an elementwise gating mechanism to the scaled dot-product attention output before the output projection. A gate vector $\mathbf{g}_t = \sigma(\mathbf{W}^G \mathbf{x}_t)$ is computed from the input, where $\mathbf{x}_t$ is the input at position $t$, $\sigma$ is the sigmoid function, and $\mathbf{W}^G$ is a learned gate projection matrix.
This gate is split across $h_q$ attention heads (where $h_q$ is the number of query heads), and each head’s attention output is elementwise multiplied by its corresponding gate segment: $\tilde{\mathbf{o}}_{t,i} = \mathbf{o}^{\text{sdpa}}_{t,i} \odot \mathbf{g}_{t,i}$ where $\mathbf{o}^{\text{sdpa}}_{t,i}$ represents the scaled dot-product attention output for head $i$ at position $t$, $\odot$ denotes elementwise multiplication, and $\mathbf{g}_{t,i}$ is the gate segment for head $i$. The gated outputs are then concatenated and projected through the output matrix $\mathbf{W}^O$ to produce the final output. Gated attention reduces attention sinks (tokens receiving disproportionately high attention), reduces large activations that destabilize training, and improves performance on evaluations and long-sequence generalization. Critically, it stabilizes training and reduces loss spikes, making it valuable for large-scale training.

document masking

When pre-training, a common consideration is fixed sequence lengths: training uses tensors of the form [batch, sequence length, hidden], so for batching and distributed training, GPUs are most happy when every example has the same sequence length. Because documents vary in length and padding wastes compute, packing shuffles and concatenates documents within the same sequence until the target sequence length is reached. Plain causal masking means that for unrelated files $A$ and $B$ packed into the same sequence, the tokens in $B$ can attend to the tokens in $A$, which degrades performance. With intra-document masking, the attention mask is modified so tokens can only attend to previous tokens within the same document. Many papers have found benefits relating to long-context extension https://arxiv.org/abs/2407.21783 and some short context benchmarks https://arxiv.org/abs/2410.02660 as well as shortening the average context length https://arxiv.org/abs/2503.15450 .
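As a minimal sketch of the intra-document mask described above, the snippet below builds a boolean attention mask from per-token document ids. The function name `intra_document_causal_mask` and the `doc_ids` layout are illustrative assumptions, not taken from any of the reports; real training stacks usually express this through their attention kernel's variable-length or block-mask interface instead of a dense matrix.

```python
import numpy as np

def intra_document_causal_mask(doc_ids: np.ndarray) -> np.ndarray:
    """Return mask[i, j] == True when query position i may attend to key position j.

    doc_ids holds one integer per token identifying which packed document it came from
    (a hypothetical layout chosen for this example).
    """
    seq_len = doc_ids.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # j <= i
    same_doc = doc_ids[:, None] == doc_ids[None, :]            # same packed document
    return causal & same_doc

# Two documents packed into one sequence: A has 3 tokens, B has 2 tokens.
doc_ids = np.array([0, 0, 0, 1, 1])
mask = intra_document_causal_mask(doc_ids)
print(mask.astype(int))
```

With plain causal masking the first token of document B (row 3) would see all three tokens of A; here that row only sees itself.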
Figure 1 : Comparison of causal masking (left) and intra-document masking (right). Causal masking allows tokens to attend to all preceding tokens regardless of document boundaries, while intra-document masking restricts attention to tokens within the same document. From @PMinervini https://x.com/PMinervini/status/1777596492351422866 .

When implementing document masking, Hugging Face saw small improvements on PIQA but otherwise no noticeable impact on short context tasks. But in line with aforementioned research, they observed that it became crucial for scaling from 4k to 64k tokens.

The decision of whether to use intra-document attention masking can depend on model scale. For smaller models, some implementations choose to omit intra-document masking, finding that the additional complexity and potential reduction in cross-document learning don’t justify the benefits at those scales. However, for larger models, intra-document masking becomes more critical as the model’s capacity to learn from cross-document attention patterns diminishes relative to the benefits of cleaner document boundaries.

embedding sharing

Input embeddings (token-to-vector lookup) and output embeddings (hidden states to vocab logits) are typically represented as separate matrices, so the total embedding parameters are $2 \times \text{vocab size} \times \text{hidden dim}$. In small language models, this can account for up to 20% of total parameters, as is the case with Llama 3.2 1B (in larger models, the embeddings represent a much smaller fraction of the parameter count, only 3% in Llama 3.1 70B). The issue with tying them is that input/output embeddings still represent different geometries, and frequent tokens like “the” can dominate representation learning due to getting gradients from both the input stream and the predicted output.

Figure 2 : Comparison of untied embeddings (separate input and output matrices) vs tied embeddings (shared matrix).
Tied embeddings reduce parameter count while maintaining comparable performance. From PyTorch Blog https://pytorch.org/blog/advancing-low-bit-operators-in-pytorch-and-executorch-dynamic-kernel-selection-kleidiai-and-quantized-tied-embeddings/ . Hugging Face found that on a 1.2B model, tied embeddings did comparably well despite having 18% fewer parameters down from 1.46B , and that compared to an untied model also with 1.2B parameters fewer layers , untied showed higher loss and lower downstream eval scores. positional encodings Without positional encoding, transformers have no sense of word order, akin to the bag of words idea. Initially, absolute position embeddings https://arxiv.org/abs/1706.03762 were used by learning a lookup table that mapped the position index to a vector added to token embeddings, but the maximum input sequence length was limited by the sequence length it was trained on. Relative position encodings followed since capturing distance between tokens matters more than capturing their absolute positions. The most commonly used technique is rotary position embedding RoPE https://arxiv.org/abs/2104.09864 , which encodes position information by rotating query and key vectors in 2D planes. RoPE encodes relative position as rotation angles: based on the dimensionality of the query/key vector, RoPE splits it into pairs since they rotate in 2D space and rotates depending on the absolute position of a token and a base frequency. During attention, the dot product between their rotated positions directly encodes their relative distance via the phase difference in their rotation angles, where tokens $x$ positions apart always maintain the same angular relationship. Figure 3 : RoPE splits query/key vectors into pairs and rotates each pair by an angle proportional to position. From Su et al., 2021 https://arxiv.org/abs/2104.09864 . 
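The relative-position property described above can be checked numerically. The sketch below is an illustration under assumptions: it rotates adjacent pairs `(x[2k], x[2k+1])` with angle `position * base**(-2k/dim)` and a base of 10000, which is one common convention (some implementations pair the first and second halves of the vector instead). The function name `rope` is hypothetical.

```python
import numpy as np

def rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each adjacent pair of x by an angle proportional to the token position."""
    dim = x.shape[-1]
    assert dim % 2 == 0, "RoPE needs an even dimension to form 2D pairs"
    pair_index = np.arange(dim // 2)
    theta = position * base ** (-2.0 * pair_index / dim)  # one angle per pair
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
query, key = rng.normal(size=8), rng.normal(size=8)

# Same relative distance (3) at two very different absolute positions.
score_near = rope(query, 5) @ rope(key, 2)
score_far = rope(query, 105) @ rope(key, 102)
print(np.isclose(score_near, score_far))
```

Because each pair is rotated by an angle linear in position, the dot product depends only on the difference of the two positions, which is what the comparison above verifies.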
During pre-training, models are trained on shorter context lengths (for reasons similar to document masking, and because quadratic attention is expensive) to learn short-range correlations between words. But as sequence length grows, the rotation angles grow via $\theta= \text{position} \times \frac1{\text{base}^{\frac{k}{\text{dim}/2}}}$, reaching values that were never seen during training. This can be fixed by increasing the base frequency as the sequence length increases using methods like ABF (Adaptive Base Frequency) https://arxiv.org/abs/2309.16039 or YaRN https://arxiv.org/abs/2309.00071 , which applies a more granular interpolation of frequencies on different components and includes other techniques like dynamic attention scaling and temperature adjustment. For extremely long contexts, YaRN does best, and in gpt-oss-120b, it was used to extend the context length of dense layers up to 131k tokens.

More recently, with the emphasis on long contexts, NoPE https://arxiv.org/abs/2305.19466 (no position embedding) and RNoPE https://arxiv.org/abs/2501.18795 , a hybrid method, have emerged. NoPE uses only causal masking and attention patterns, so it doesn’t bump into the issue of extrapolating beyond training lengths but shows weaker performance on short context reasoning and knowledge-based tasks. RNoPE alternates applying RoPE and NoPE on attention blocks, where RoPE handles local context and NoPE helps with longer-range information retrieval. Another idea is Partial RoPE, which mixes RoPE and NoPE within the same layer.

Hugging Face ran ablations using RoPE, RNoPE (removing positional encoding every 4th layer), and RNoPE with document masking. They found that all achieve similar performance on short-context tasks, so they adopt RNoPE + document masking because it provides the foundation for long-context handling.

attention for long contexts

Figure 4 : five common types of attention. From Hugging Face https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook .
Note: This section covers attention pattern modifications (which change which tokens can attend to which other tokens). These are distinct from positional encoding scaling methods like ABF/YaRN (discussed in the “positional encodings” section), which adjust how position information is encoded without changing attention patterns.

The following methods modify attention patterns to reduce computational cost:

- Chunked Attention: divides the sequence into fixed-sized chunks where tokens can only attend within their chunk. Llama 4 https://ai.meta.com/blog/llama-4-multimodal-intelligence/ pairs it with RNoPE (specifically on the RoPE layers), which also reduces the KV cache size per layer, but its performance on long context tasks degraded.
- Sliding Window Attention (SWA): every token can see up to $p$ positions back, creating a sliding window that maintains local context. Gemma 3 https://arxiv.org/abs/2503.19786 interleaves SWA layers with full-attention layers, using five local layers for every global layer.
- Dual Chunk Attention (DCA): $K$ tokens are chunked into $M$ groups. Within each group (like chunked attention), tokens attend normally. Between successive chunks, there is a local window to preserve locality, and more broadly, inter-chunk attention allows queries to attend to previous chunks with a capped relative position. Qwen-2.5 https://arxiv.org/pdf/2412.15115 used DCA to support context windows of up to 1 million tokens.

Interleaving local and global attention alternates between layers that use local attention (restricted to nearby tokens) and global attention (full sequence). This pattern balances computational efficiency with the ability to capture both local and long-range dependencies. Local layers reduce quadratic complexity while maintaining local context, and global layers ensure that distant relationships aren’t lost.
When training encounters instability or loss spikes, adjusting the ratio of global layers (for example, increasing their frequency) can result in quicker loss recovery, as the model regains access to long-range information that may be crucial for certain patterns. The interleaving strategy is particularly effective for long-context models where full global attention would be computationally prohibitive.

MoE

MoEs (mixture of experts), analogous to our brain activating different regions for different tasks, provide an alternative to dense models. At inference, only certain “experts” are activated based on the input, dramatically reducing compute compared to dense models where all parameters are active. The MoE works by replacing the feed forward layer with multiple MLPs (experts) and adding a learnable router before the MLPs to select the experts. The router typically uses top-k gating, selecting the $k$ experts with highest affinity scores for each token, where $k$ is usually much smaller than the total number of experts (e.g., 8 out of 384).

Figure 5 : Comparison of dense architecture and MoE architecture. From Sebastian Raschka https://sebastianraschka.com/ .

In general, for a fixed number and size of active experts, increasing the total number of experts improves loss, and high sparsity improves performance https://arxiv.org/abs/2507.20534 and benefits more from increasing compute https://arxiv.org/abs/2507.17702 . Recent models are much more sparse, with over 100 experts and around 10 active per token.

To determine how large each expert should be, a common metric is granularity, defined by $G = 2 \cdot \frac{d_{\text{model}}}{d_{\text{expert}}}$, where a higher granularity corresponds to more experts with a smaller dimension; this can be interpreted as a number proportional to the experts needed to match the dense MLP width. Recent models have granularity anywhere from 2 (gpt-oss-120b) to 8 (qwen3-next-80b-a3b).
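To make the top-k gating and granularity definitions above concrete, here is a small NumPy sketch. It is an illustration under assumptions rather than any model's actual router: the softmax affinity, the renormalization of gate weights over the selected experts, the helper names `top_k_route` and `granularity`, and all sizes are choices made for this example.

```python
import numpy as np

def top_k_route(hidden: np.ndarray, router_weight: np.ndarray, k: int):
    """Pick the k highest-affinity experts per token.

    hidden: (n_tokens, d_model); router_weight: (d_model, n_experts).
    Returns expert indices (n_tokens, k) and gate weights (n_tokens, k).
    """
    logits = hidden @ router_weight
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)     # softmax affinities
    top_idx = np.argsort(-probs, axis=-1)[:, :k]
    top_probs = np.take_along_axis(probs, top_idx, axis=-1)
    # Assumption: renormalize so the selected experts' weights sum to 1 per token.
    gates = top_probs / top_probs.sum(axis=-1, keepdims=True)
    return top_idx, gates

def granularity(d_model: int, d_expert: int) -> float:
    """G = 2 * d_model / d_expert, as defined in the text."""
    return 2 * d_model / d_expert

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 16, 32, 8, 2   # toy sizes, not from any report
hidden = rng.normal(size=(n_tokens, d_model))
router_weight = rng.normal(size=(d_model, n_experts)) * 0.1
top_idx, gates = top_k_route(hidden, router_weight, k)

# Tokens assigned to each expert; a skewed histogram is the load imbalance to watch for.
tokens_per_expert = np.bincount(top_idx.ravel(), minlength=n_experts)
print(tokens_per_expert)
print(granularity(d_model=4096, d_expert=1024))  # 8.0
```

The per-expert token counts are the raw quantity that load-balancing schemes try to keep even.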
Ant Group https://arxiv.org/pdf/2507.17702 showed that granularity doesn’t significantly change loss but does drive efficiency leverage (the ratio of FLOPs needed for an MoE to achieve the same loss as a dense model). Overall, MoEs present a good alternative to dense models in terms of compute for training and inference.

Shared experts are always-on experts, which absorb the basic, recurring patterns so that other experts can more aggressively specialize; one is often enough (DeepSeek-V2 https://arxiv.org/abs/2405.04434 uses two, which adds a bit of complexity).

Load balancing is crucial in that if it fails, not only do training and inference efficiency plummet, but so does effective learning capacity. The routing mechanism typically uses top-k gating: for each token, the router computes affinity scores (often via a learned linear projection followed by softmax), selects the top $k$ experts, and routes the token to those experts. Balanced expert utilization can be encouraged by adding a loss-based load balancer (LBL) given by $\mathcal{L} = \alpha \sum_{i=1}^{N_r} f_i P_i$ where $N_r$ is the total number of experts, $\alpha$ determines the strength of the balancing term, $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the probability mass (average routing probability) for expert $i$; so in perfect load balancing, $f_i=P_i=\frac1{N_r}$. Also, $\alpha$ should not be so large that routing uniformity overwhelms the primary training objective. These should be monitored using global statistics, not local statistics, which may suffer from a local batch being narrow and biasing the routing statistics.

DeepSeek-V3 https://arxiv.org/abs/2412.19437 does loss-free load balancing differently, by adding a per-expert bias term to the affinity scores used for top-k expert selection. Beyond bias-based approaches, several other routing and load balancing strategies have emerged.
Some implementations use learnable routing functions that adapt during training, while others incorporate expert capacity constraints that prevent any single expert from being overwhelmed. The key insight across these methods is that effective load balancing must operate using global statistics aggregated across multiple batches, as local batch statistics can be misleadingly narrow and bias routing decisions.

Sequence-wise auxiliary loss extends traditional auxiliary losses to promote balance within a sequence. One formulation consistent with the definitions below, following the form used in DeepSeek-V3 https://arxiv.org/abs/2412.19437 , is

\[ \mathcal{L}_{\text{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i, \qquad f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\left(s_{i,t} + b_i \in \text{TopK}\left(\{s_{j,t} + b_j\}_{j=1}^{N_r}, K_r\right)\right), \qquad \tilde{s}_{i,t} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}, \qquad P_i = \frac{1}{T} \sum_{t=1}^{T} \tilde{s}_{i,t} \]

Here, $T$ is the sequence length, $\alpha$ is a small coefficient, $\mathbb{1}(\cdot)$ is the indicator function (which is 1 if its argument is true and 0 otherwise), and $K_r$ is the number of active experts per token. For each token at each position $t$ in the sequence, each expert $i$ is assigned a routing score $s_{i,t}$, which is normalized so that $\tilde{s}_{i,t}$ captures the proportion of the routing probability assigned to expert $i$ at position $t$. Averaging this over the whole sequence gives $P_i$, which represents, on average, how often expert $i$ is considered for routing across the sequence. The $f_i$ term furthers this by reflecting the (scaled) fraction of times expert $i$ is actually selected (i.e., is among the top $K_r$ experts for a token, after bias terms $b_i$ are added). The loss $\mathcal{L}_{\text{Bal}}$ encourages the product $f_i P_i$ to be similar across different experts, pushing the model toward evenly distributing routing decisions and load; if any expert is used much more or less than others, the loss will increase, nudging the model back toward balanced expert activation.

Auxiliary-loss free load balancing methods avoid introducing interference gradients by maintaining a bias vector $\mathbf{b}=(b_1, \cdots, b_{N_r})$ which is updated in a decoupled fashion. Let $n_i$ be the number of tokens routed to expert $i$ in the current step and $\bar{n}=\frac1{N_r} \sum_{i=1}^{N_r} n_i$ the mean load across all experts.
In the sign-based form, $b_i$ is updated by

\[ b_i \leftarrow b_i + \gamma \cdot \text{sign}(\bar{n} - n_i) \]

where $\gamma$ is the bias update speed, a sort of learning rate. The version described here additionally recenters the expert bias updates.

Soft-clamped Momentum Expert Bias Updates (SMEBU) load balancing replaces the sign-based step with a soft-clamped, momentum-smoothed one. The normalized per-expert violation is calculated by $v_i=\frac{\bar{n}-n_i}{\bar{n}}$ and $\tilde{v}_i=\tanh(\kappa v_i)$, which makes the scale independent of sequence length and batch size. Then $b_i$ is updated from $\tilde{v}_i$ through a momentum buffer with momentum factor $\beta$. $\tanh$ applies the soft-clamping, with tunable scale $\kappa$ to control saturation speed; using $\tanh$ rather than $\text{sign}(\cdot)$ maintains the continuity and stability needed during training, whereas $\text{sign}$ forces every update to a fixed magnitude ($\pm \lambda$ for step size $\lambda$), making the update step oscillate. Momentum is also introduced as a form of noise dampening, analogous to momentum SGD reducing variance in noisy gradient updates.

hybrid models

Because transformers don’t deal efficiently with long context while RNNs can, one idea is to combine both to get the best of both worlds. By dropping the softmax from the output for token $t$:

\[ \mathbf{o}_t = \sum_{j=1}^t \frac{\exp\left(\mathbf{q}_t^\top \mathbf{k}_j\right) \mathbf{v}_j}{\sum_{l=1}^t \exp\left(\mathbf{q}_t^\top \mathbf{k}_l\right)} \Longrightarrow \mathbf{o}_t = \sum_{j=1}^t \left(\mathbf{q}_t^\top \mathbf{k}_j\right) \mathbf{v}_j = \left(\sum_{j=1}^t \mathbf{v}_j \mathbf{k}_j^\top\right) \mathbf{q}_t \]

where $\mathbf{q}_t$, $\mathbf{k}_j$, and $\mathbf{v}_j$ are the query, key, and value vectors at positions $t$ and $j$, respectively, and $\mathbf{o}_t$ is the output at position $t$.
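The reassociation in the softmax-free form can be verified numerically. The sketch below, with arbitrary toy shapes chosen for illustration, computes the output both as a causal sum over past positions and as a running matrix $\sum_{j \le t} \mathbf{v}_j \mathbf{k}_j^\top$ applied to the current query.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3            # toy sizes, chosen only for illustration
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

# Form 1: o_t = sum_{j <= t} (q_t . k_j) v_j, using a causal (lower-triangular) score matrix.
scores = np.tril(Q @ K.T)        # (T, T), entries with j > t zeroed out
out_quadratic = scores @ V       # (T, d_v)

# Form 2: o_t = (sum_{j <= t} v_j k_j^T) q_t, accumulating the d_v x d_k sum as we go.
state = np.zeros((d_v, d_k))
out_state = np.zeros((T, d_v))
for t in range(T):
    state += np.outer(V[t], K[t])
    out_state[t] = state @ Q[t]

print(np.allclose(out_quadratic, out_state))
```

The first form costs time quadratic in sequence length, while the second keeps a fixed-size matrix regardless of how many tokens have been seen.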
By defining $S_t :=\sum_{j=1}^t \mathbf{v}_j \mathbf{k}_j^\top$, we get a recurrent relation where $S_t$ summarizes all past $(\mathbf{k}_j, \mathbf{v}_j)$ pairs:

\[ S_t=S_{t-1}+\mathbf{v}_t \mathbf{k}_t^\top \Longrightarrow \mathbf{o}_t = S_t \mathbf{q}_t = S_{t-1}\mathbf{q}_t+\mathbf{v}_t\left(\mathbf{k}_t^\top \mathbf{q}_t\right) \]

where $S_{t-1}$ is the state from the previous time step. While this gets us closer to an RNN-esque structure, in practice, softmax stabilizes training, and the linear form can cause instability without normalization. With RNNs, it is sometimes helpful to forget the past, by introducing a gate $\mathbf{G}_t$ for the previous state

\[ \mathbf{S}_t=\mathbf{G}_t \odot \mathbf{S}_{t-1} + \mathbf{v}_t\mathbf{k}_t^\top \]

where $\odot$ denotes elementwise multiplication and $\mathbf{G}_t$ is a learned gating mechanism. Mamba-2 https://arxiv.org/abs/2405.21060 is among the most popular such designs, being used in hybrid models like Nemotron-H https://arxiv.org/abs/2504.03624 and Falcon H1 https://arxiv.org/abs/2507.22448 . Hybrid models are becoming increasingly popular, notably in Qwen3-Next https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list with a gated DeltaNet update and Kimi’s next model, likely using their “kimi delta attention.” https://github.com/fla-org/flash-linear-attention/pull/621

architecture takeaways

- Use a proven dense baseline unless you have strong reasons and infra to support MoE.
- GQA with small groups is a robust default; MQA is cheapest but tends to underperform.
- For long context, plan for RNoPE/YaRN plus document masking early in the recipe.
- Hybrid architectures are promising but still harder to reason about and operationalize.

stability

Training stability is crucial for successful large-scale model training. Several techniques help prevent training failures, including regularization methods, careful initialization, and architectural choices.
The following sections cover key stability mechanisms:

$z$-loss

$z$-loss is a regularization term added to the standard cross entropy loss that keeps logits from drifting to large magnitudes. The softmax denominator is $Z = \sum_{i=1}^V e^{z_i}$, and by adding $\mathcal{L} = \lambda \cdot \log^2(Z)$ to the loss, we penalize based on $\log(Z)$, which represents the overall logit scale.

On their 1B model, Hugging Face found that adding $z$-loss didn’t impact training loss or downstream performance, so they chose not to include it due to training overhead. For logit stabilization, logit softcapping (see below) is generally preferred in modern recipes, following the Gemma 2 https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf and Gemma 3 https://arxiv.org/abs/2503.19786 models.

logit softcapping

Logit softcapping prevents logits from growing excessively large by mapping them into a bounded range via a smooth, differentiable transformation. Unlike hard clipping (which has zero gradient at the boundaries and can cause training instability), softcapping uses $\tanh$ to compress values smoothly. The Gemma 2 report https://arxiv.org/abs/2408.00118 introduces the formulation used in production models: cap logits such that values stay within $(-\texttt{soft\_cap}, +\texttt{soft\_cap})$ using

\[ \text{logits} \leftarrow \texttt{soft\_cap} \cdot \tanh\left(\frac{\text{logits}}{\texttt{soft\_cap}}\right) \]

where $\texttt{soft\_cap}$ is the threshold hyperparameter controlling the output range. The division normalizes inputs before $\tanh$ and the multiplication by $\texttt{soft\_cap}$ rescales to the desired interval. Unlike $z$-loss (which adds a regularization term to the loss), softcapping operates directly on activations in the forward pass.

Gemma 2 https://arxiv.org/abs/2408.00118 applies softcapping to both attention logits (pre-softmax) and the final language modeling head. They set $\texttt{soft\_cap}=50.0$ for attention layers and $\texttt{soft\_cap}=30.0$ for the final layer.
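The two logit-control mechanisms above differ in where they act, which a short NumPy sketch can illustrate. The function names `soft_cap` and `z_loss`, the example logits, and the coefficient `1e-4` are assumptions made for this example; the cap of 30.0 mirrors the final-layer value quoted from Gemma 2.

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    """Forward-pass transform: smoothly squash logits into (-cap, +cap)."""
    return cap * np.tanh(logits / cap)

def z_loss(logits: np.ndarray, coeff: float = 1e-4) -> float:
    """Loss-side penalty coeff * log(Z)^2 with Z = sum_i exp(z_i), averaged over rows.

    coeff is an illustrative value; log Z is computed with the log-sum-exp trick.
    """
    row_max = logits.max(axis=-1, keepdims=True)
    log_z = row_max + np.log(np.exp(logits - row_max).sum(axis=-1, keepdims=True))
    return float(coeff * np.mean(log_z ** 2))

logits = np.array([[-200.0, -1.0, 0.0, 1.0, 200.0]])
capped = soft_cap(logits, cap=30.0)

print(capped)                          # extremes pulled toward +/-30, small values nearly unchanged
print(z_loss(logits), z_loss(capped))  # the penalty is far smaller once logits are bounded
```

Softcapping changes the activations themselves, so it needs no extra loss term, whereas $z$-loss leaves the forward pass untouched and only adds a gradient signal that discourages large $\log(Z)$.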
The technique traces back to Bello et al., 2016 https://arxiv.org/abs/1609.08144 in the context of neural machine translation. However, one caveat is that logit softcapping is incompatible with Flash Attention / SDPA during training because those fused kernels assume standard attention. The Hugging Face Gemma 2 blog https://huggingface.co/blog/gemma2 notes that for stable fine-tuning, you must use attn_implementation="eager"; inference can still use SDPA with minimal quality difference. This writeup https://danieldk.eu/Machine-Learning/Building-Blocks/Logit-Softcapping gives a concise technical overview.

weight decay and embeddings

Although weight decay is a regularization technique, removing it from embeddings can improve training stability. Weight decay causes the embedding norm to decrease, but this can lead to larger gradients in earlier layers, since the LayerNorm Jacobian has a $\frac1{\sigma}$ term coming from normalization, which is inversely proportional to the input norm $\sigma$. Hugging Face tested this using a weight decay baseline, a no weight decay baseline, and another run combining all previously adopted changes, and found no significant difference in loss or eval results, so they adopted no weight decay on embeddings.

QK-norm

Similar to $z$-loss, QK-norm helps prevent attention logits from becoming too large by applying LayerNorm to both the query and key vectors before computing attention. However, the same paper which proposed RNoPE https://arxiv.org/abs/2501.18795 found that it hurts long-context tasks, because the normalization de-emphasizes relevant tokens and emphasizes irrelevant tokens by stripping the query-key dot product of its magnitude.

RMSNorm

RMSNorm maintains comparable performance to LayerNorm while being computationally simpler, because it avoids the mean-centering computation.
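As a rough illustration of what "avoiding the mean-centering computation" means, the sketch below implements both normalizations for a single vector in plain Python. The function names and the `eps` value are assumptions for illustration; the LayerNorm here omits the learned scale and bias.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x; no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # learned per-feature gain, initialized to 1
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, eps=1e-6):
    """LayerNorm without affine parameters: center, then divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

centered = [1.0, -1.0, 3.0, -3.0]   # already zero-mean
shifted = [2.0, 4.0, 6.0, 8.0]      # non-zero mean

print(rms_norm(centered), layer_norm(centered))  # identical for zero-mean input
print(rms_norm(shifted), layer_norm(shifted))    # differ: RMSNorm keeps the offset
```

For zero-mean inputs the two coincide; otherwise RMSNorm only rescales the vector, which saves one reduction per call.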
A variant called depth-scaled sandwich norm applies normalization both before and after the attention/MLP blocks, with the normalization scale adjusted based on the layer depth: \[ \mathbf{y}_\ell = \mathbf{x}_\ell + \text{RMSNorm}_\ell^{(2)}\left(\mathcal{M}_\ell\left(\text{RMSNorm}_\ell^{(1)}(\mathbf{x}_\ell)\right)\right) \] where $\mathbf{x}_\ell$ and $\mathbf{y}_\ell$ are the input/output of layer $\ell$ and $\mathcal{M}_\ell$ is the sublayer module (like attention, FFN, or MoE). The RMSNorm gain, $\gamma$, is a multiplicative factor applied after the RMS normalization, given by $\bar{a}_i = \gamma\frac{a_i}{\text{RMS}(\mathbf{a})}$. In Arcee’s case, they initialize $\gamma\left(\text{RMSNorm}_\ell^{(1)}\right)=1$ and $\gamma\left(\text{RMSNorm}_\ell^{(2)}\right)=\frac1{\sqrt{L}}$, where $L$ is the number of layers. This depth-dependent scaling accounts for the fact that activations evolve differently across layers. The sandwich pattern (pre-norm and post-norm) provides additional stability, especially in very deep networks where gradient flow can be challenging. Arcee also applies RMSNorm before the language modeling head, which stabilizes the final hidden states and ensures consistent output activation scales before they are transformed into token probabilities.

other design considerations

Parameter initialization: either normal initialization with $\mu=0$ and clipping (as truncated normal initialization does, often at $\pm 2$ to $3\sigma$) or a scheme like $\mu\text{P}$ (maximal update parametrization) https://arxiv.org/abs/2011.14522 , which dictates how weights and learning rates should scale with width so that training dynamics stay comparable.
- The clipping prevents extreme initialization values that could destabilize training, which is particularly important for embedding layers where large initial activations can propagate through the network.
- Another heuristic is setting $\sigma=\frac{0.5}{\sqrt{d}}$ where $d$ is the model dimension, although the exact coefficient can vary.
- During the forward pass, the embedding layer’s activations are scaled by $\sqrt{d}$: $\mathbf{e}_t=\sqrt{d}\, E_{\text{tok}}(t)$.
This keeps embedding magnitudes in a stable range relative to the residual stream and is common in several transformer implementations. Notably, Grok-1 and Grok-2 checkpoints as well as Trinity Large and the first two generations of the Gemma models implement this.

Activation function: SwiGLU is what most modern LLMs use, not ReLU or GeLU; for example, gpt-oss-120b uses gated SwiGLU. Some exceptions are Gemma 2 using GeGLU and NVIDIA using $\text{relu}^2$.

Width vs height: deeper models tend to outperform equally sized wider ones on language modeling and compositional tasks. The effect is more pronounced in smaller models; larger models lean toward width for faster inference, since wider layers parallelize better.

stability takeaways
- Stabilization is mostly about sane defaults, not exotic tricks.
- Logit softcapping (Gemma-style) is the preferred method for attention/LM-head logit stabilization; $z$-loss and QK-norm are alternatives.
- QK-norm can hurt long-context tasks; don’t assume it’s “always good.”
- Initialization and normalization details matter more as depth grows.
- Track loss spikes early; many “mystery failures” are configuration or data issues.

tokenizer

There are a few considerations that typically guide tokenizer design:

domains: in domains like math and code, digits and other special characters require careful treatment. Most tokenizers split numbers into single digits, which helps the model learn arithmetic patterns and prevents memorization of whole numbers. Some tokenizers like Llama3 https://arxiv.org/abs/2407.21783 further encode numbers 1 to 999 as unique tokens.

supported languages: a tokenizer trained on English text would be extremely inefficient if it encountered another language, say Mandarin or Farsi.

target data mixture: when training a tokenizer from scratch, we should train on samples that mirror our final training mixture.
Larger vocabularies can compress text more efficiently, but they come at the cost of a larger embedding matrix, which, as mentioned in the embeddings section, can take up a sizable portion of the parameter count. For English-only models, 50k is often enough, while multilingual models need over 100k. An optimal size exists because compression gains from larger vocabularies decrease exponentially https://arxiv.org/abs/2402.01035 . Large models benefit from large vocabularies since the extra compression saves more on the forward pass (projection to QKV, attention, and MLP) than the additional embedding rows cost during the softmax. For memory, a larger vocab means fewer tokens, so a smaller KV cache.

BPE (byte-pair encoding) https://arxiv.org/abs/1508.07909 remains the de facto choice. Starting with tiny units (e.g. characters or bytes), the BPE algorithm repeatedly merges the most common adjacent pair into a new token. To evaluate a tokenizer’s performance, fertility is a common metric, measuring the average number of tokens needed to encode a word (alternatively, characters-to-tokens ratio or bytes-to-tokens ratio, but these have limitations due to word length variability and byte representations). Another is the proportion of continued words, describing what percentage of words get split into multiple pieces. For both, lower values indicate more efficient tokenizers.

There are many strong existing tokenizers, like GPT4’s tokenizer https://arxiv.org/abs/2303.08774 and Gemma3’s tokenizer. Often, using an existing tokenizer is enough; only when we want to train for low-resource languages or have a different data mixture should we train our own tokenizer.

optimizers and training hyperparameters

Choosing optimizers and tuning hyperparameters is notoriously time-consuming and significantly impacts convergence speed and training stability. While we may be tempted to copy those of larger labs’ models (a useful prior), they may not fit our use case.
adamW

Although Adam was introduced over 10 years ago, AdamW still stands the test of time. Adam (adaptive moment estimation) updates weights individually based on an exponentially weighted average of gradients $g_t$ and an exponentially weighted average of squared gradients $g_t^2$, along with weight decay (the “W”). The exponential moving averages provide adaptive learning rates per parameter: parameters with consistently large gradients get smaller effective learning rates (via the squared gradient term), while parameters with small or noisy gradients get larger effective learning rates. This adaptivity helps stabilize training and converge faster: \[ \begin{align*} \theta &\leftarrow (1-\alpha \lambda) \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \\ \hat{v}_t &= \frac{v_t}{1-\beta_2^t}, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \end{align*} \] where $\theta$ denotes the model parameters, $\alpha$ is the learning rate, $\lambda$ is the weight decay coefficient, $g_t$ is the gradient at step $t$, $m_t$ and $v_t$ are the first and second moment estimates (exponentially weighted averages), $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected versions, $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates, and $\epsilon$ is a small constant (typically $10^{-8}$) to prevent division by zero. Even for modern LLMs, the hyperparameters remain largely unchanged: weight decay factor $\lambda=0.1$ or $\lambda=0.01$, $\beta_1=0.9$, and $\beta_2=0.95$.

muon

Unlike AdamW, which updates per-parameter, muon treats the weight matrix as a single object and updates based on matrix-level operations. This approach reduces axis-aligned bias (where optimization favors certain coordinate directions) and encourages exploration of directions that would otherwise be suppressed.
By considering the entire weight matrix structure rather than individual parameters, muon can better capture correlations between parameters: \[ \begin{align*} g_t &\leftarrow \nabla_\theta \mathcal{L}_t(\theta_{t-1}) \\ B_t &\leftarrow \mu B_{t-1} + G_t \\ O_t &\leftarrow \text{NewtonSchulz5}(B_t) \\ \theta_t &\leftarrow \theta_{t-1} - \eta O_t \end{align*} \] where $\theta_t$ denotes the model parameters at step $t$, $\mathcal{L}_t$ is the loss function, $g_t$ is the gradient matrix, $G_t$ is the normalized gradient matrix (typically $G_t = g_t / \|g_t\|$), $B_t$ is a momentum buffer matrix with $B_0=0$, $\mu$ is the momentum coefficient, $\eta$ is the learning rate, and $\text{NewtonSchulz5}$ applies the odd function $f(x)=3.4445x-4.7750x^3+2.0315x^5$. This blog https://docs.modula.systems/algorithms/newton-schulz/ and this blog https://kellerjordan.github.io/posts/muon/ describe the algebra in more detail, as well as why the coefficients are what they are. The Newton-Schulz iteration approximates the matrix sign function: given the SVD $G=U \Sigma V^\top$, we want to estimate $UV^\top$, and $f(x)$ essentially replaces $\Sigma$ because iteratively applying $f$ (i.e., $f \circ f \circ \cdots \circ f(x)$) converges to the sign function, which normalizes the singular values. This is what reduces axis-aligned bias and encourages exploration of directions that would otherwise be suppressed. Muon is more sample-efficient than AdamW, especially at large batch sizes where AdamW struggles. Some implementations, including Arcee’s Trinity Large, choose a hybrid approach: using muon for hidden layers while keeping AdamW for embedding and output layers. This decision stems from the different optimization dynamics these layers exhibit: embeddings and output projections benefit from per-parameter adaptive learning rates, while hidden layers benefit more from muon’s matrix-level structure awareness.
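The orthogonalization step can be sketched in a few lines. The version below is a minimal pure-Python illustration using list-of-lists matrices and the quintic coefficients quoted above; the helper names, the five-step default, and the Frobenius pre-normalization are assumptions modeled on public Muon write-ups, not any lab's production kernel (which would run on GPU tensors, typically in low precision).

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz5(g, steps=5, eps=1e-7):
    """Approximately map G = U S V^T to U V^T by applying
    f(x) = a*x + b*x^3 + c*x^5 to the singular values `steps` times."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale so every singular value is at most 1 (Frobenius norm bounds them).
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        s = matmul(x, transpose(x))   # X X^T
        s2 = matmul(s, s)             # (X X^T)^2
        poly = [[b * s[i][j] + c * s2[i][j] for j in range(len(s))]
                for i in range(len(s))]
        px = matmul(poly, x)          # (b X X^T + c (X X^T)^2) X
        x = [[a * x[i][j] + px[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# Toy "gradient" whose singular values (3.0 and 0.5) differ by a factor of 6.
g = [[3.0, 0.0], [0.0, 0.5]]
o = newton_schulz5(g)
print(o)  # both diagonal entries end up in a band around 1
```

With these coefficients the singular values do not converge exactly to 1; they land in a band around it, which is the speed-for-precision trade-off the linked posts discuss.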
But since muon operates at the matrix level, applying NewtonSchulz requires access to the full gradient tensor. One method uses an overlapping round-robin scheme where each rank is responsible for gathering all gradient matrices corresponding to its index and applying muon locally. Since FSDP expects sharded gradients/updates, and every rank has its shard of the muon-updated gradient, the optimizer step can then proceed normally. However, this issues lots of overlapping collectives across many matrices, which breaks at scale. The alternative that Prime adopts is based on all-to-all collectives, which do a bulk permutation so that each rank temporarily owns full gradients for its matrices, runs muon, then bulk permutes them back. This may require padding, since many tensors are packed into contiguous buffers, which can change the expected size. However, it requires fewer collectives and scales better.

Building on Muon, Kimi K2 introduces MuonClip, a stabilization technique that prevents exploding attention logits, which is a common failure mode in large-scale training. Other strategies include logit soft-cap https://arxiv.org/abs/2408.00118 , which applies $\tanh$ clipping to the pre-softmax logits, or QK-norm, which applies LayerNorm to the QK matrices. However, both have drawbacks: with logit soft-cap, the scaled dot-product can still explode before the cap is applied (making the bounding too late) and gradients are distorted around regions where the model is unstable; with QK-norm, the key matrices in MLA are not materialized during inference (they are projected from a latent variable), so there is nothing to normalize directly. For each attention head $h$, consider $\mathbf{Q}^h$, $\mathbf{K}^h$, and $\mathbf{V}^h$ (the query, key, and value matrices for head $h$). For a batch $B$ and input representation $\mathbf{X}$, define the max logit as a per-head scalar to be the maximum input to softmax: \[ S_{\text{max}}^h = \frac{1}{\sqrt{d}} \max_{\mathbf{X} \in B} \max_{i,j} \mathbf{Q}_i^h \mathbf{K}_j^{h\top} \] where $d$ is the dimension of the query/key vectors, $i$ and $j$ index positions in the sequence, and the $\frac1{\sqrt{d}}$ scaling factor matches the standard attention scaling.
Set $S_{\text{max}} = \max_h S_{\text{max}}^h$ (the maximum across all heads) and a target threshold $\tau$ (a hyperparameter controlling when clipping activates). The idea is to rescale $\mathbf{W}_k^h$ and $\mathbf{W}_q^h$ (the key and query projection weight matrices for head $h$) whenever $S_{\text{max}}^h$ exceeds $\tau$. With $\gamma=\min\left(1, \frac{\tau}{S_{\text{max}}}\right)$ (the global clipping factor), one approach is to clip all heads simultaneously by \[ \mathbf{W}_q^h \leftarrow \gamma^\alpha \mathbf{W}_q^h \quad \mathbf{W}_k^h \leftarrow \gamma^{1-\alpha} \mathbf{W}_k^h \] where the $\gamma$ exponents enforce multiplicative weight decay for $\mathbf{Q}^h \mathbf{K}^{h\top}$; commonly, $\alpha=0.5$ to ensure equal scaling of queries and keys. However, not all heads exhibit exploding logits, which motivates per-head clipping based on $\gamma_h = \min\left(1, \frac{\tau}{S_{\text{max}}^h}\right)$, which is straightforward for MHA but less so for MLA. The challenge with MLA is that keys are projected from a latent variable rather than materialized directly, so clipping must be applied to the latent-to-key projection weights and the latent variable itself. They apply clipping only on $\mathbf{q}^C$ and $\mathbf{k}^C$ (head-specific components, each scaled by $\sqrt{\gamma_h}$) and $\mathbf{q}^R$ (head-specific rotary, scaled by $\gamma_h$), while the shared rotary component $\mathbf{k}^R$ is left untouched. Besides that, the main muon algorithm is modified to match Adam RMS and enable weight decay.
For each weight $\mathbf{W} \in \mathbb{R}^{n \times m}$: \[ \begin{align*} g_t &\leftarrow \nabla_\theta \mathcal{L}_t(\theta_{t-1}) \\ B_t &\leftarrow \mu B_{t-1} + G_t \\ O_t &\leftarrow \text{NewtonSchulz5}(B_t) \cdot \sqrt{\max(n,m)} \cdot 0.2 \\ \theta_t &\leftarrow (1-\eta \lambda) \theta_{t-1} - \eta O_t \end{align*} \] where $n$ and $m$ are the dimensions of the weight matrix $\mathbf{W}$, $\sqrt{\max(n,m)} \cdot 0.2$ is a scaling factor that adapts the update magnitude to the matrix size (matching Adam’s RMS scaling behavior), and other symbols follow the same definitions as in the standard Muon algorithm. The weight decay term $(1-\eta \lambda)$ is applied multiplicatively before the gradient update.

Figure 6: Left: a mid-scale training run on a 9B active, 53B total MoE where attention logits diverge quickly. Right: maximum logits for Kimi K2 with MuonClip and $\tau=100$, where max logits eventually decay to a stable range after ~30% of the training steps. From Kimi K2 https://arxiv.org/pdf/2507.20534 .

learning rates

Learning rates have their own life cycle: they warm up from zero to avoid chaos (typically over 1%-5% of training steps for short runs, though large labs fix the number of warmup steps), then anneal after settling into a good minimum. Cosine annealing https://arxiv.org/abs/1608.03983 was the go-to scheduler, but it is inflexible because the cosine period needs to match the total training duration. Alternatives include warmup-stable-decay (WSD) https://arxiv.org/abs/2404.06395 and multi-step https://arxiv.org/abs/2401.02954 ; in the last x% of tokens, the former linearly decays the learning rate whereas multi-step does discrete drops. For WSD, typically 10-20% is allocated for the decay phase, matching cosine annealing; in multi-step, 80/10/10 also matches cosine annealing while 70/15/15 and 60/20/20 can outperform it. DeepSeek-V3 used cosine annealing between the decay drops and added a constant phase before the final sharp step.
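A minimal sketch of the two continuous schedules discussed above, as plain functions of the step index. The linear warmup shape, the 10% decay fraction, and the example values (10,000 steps, peak 2e-4, floor 2e-5, 500 warmup steps) are illustrative assumptions rather than any model's actual configuration.

```python
import math

def warmup_lr(step, warmup_steps, peak_lr):
    """Linear warmup from (almost) zero up to peak_lr."""
    return peak_lr * (step + 1) / warmup_steps

def cosine_lr(step, total_steps, peak_lr, min_lr, warmup_steps):
    """Warmup, then one half-cosine whose period is tied to total_steps."""
    if step < warmup_steps:
        return warmup_lr(step, warmup_steps, peak_lr)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def wsd_lr(step, total_steps, peak_lr, min_lr, warmup_steps, decay_frac=0.1):
    """Warmup, constant plateau, then linear decay over the last decay_frac."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return warmup_lr(step, warmup_steps, peak_lr)
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

cfg = dict(total_steps=10_000, peak_lr=2e-4, min_lr=2e-5, warmup_steps=500)
for s in (0, 500, 5_000, 9_000, 9_500, 10_000):
    print(s, f"cosine={cosine_lr(s, **cfg):.2e}", f"wsd={wsd_lr(s, **cfg):.2e}")
```

Because the WSD plateau does not depend on `total_steps`, a checkpoint taken anywhere on the plateau can be decayed to a different token budget without rerunning the earlier portion, whereas changing `total_steps` for the cosine schedule changes the learning rate at every step after warmup.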
Figure 7: Comparison of learning rate schedules: cosine annealing, WSD, and multi-step. From Hugging Face https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook .

Hugging Face’s ablations on their 1B model showed that WSD tended to underperform cosine annealing before WSD’s decay began, but once it entered its decay phase, WSD showed nearly linear improvement in both loss and eval metrics, which allowed it to catch up to cosine annealing by the end. After running further ablations on the learning rate, the Hugging Face team settled on 2e-4; increasing it further raised the risk of instability during long training runs. Kimi K2 also uses WSD: the first 10T tokens were trained with a 2e-4 learning rate after a 500-step warmup, then 5.5T tokens with cosine decay from 2e-4 to 2e-5. The WSD schedule especially helps with ablations, since it does not require restarting the same run for different token counts: we can retrain only the end portion (the learning rate decay) while keeping the front portion.

batch size

There is a critical batch size https://arxiv.org/abs/1812.06162 : too small and we may be underutilizing compute, but too large and the model needs more tokens to reach the same loss. Still, larger batch sizes give more efficient gradient estimates and are preferred. A useful proxy is that for optimizers like AdamW or Muon, if the batch size increases by a factor of $k$, then the learning rate should scale up by $\sqrt{k}$. Intuitively, larger batches provide more stable gradient estimates (lower variance), so we can afford larger step sizes. Mathematically, the covariance shrinks by a factor of $k$, and based on the SGD parameter update $\Delta w = -\eta g_B$, we have $\text{Var}(\Delta w) \sim \eta^2 \frac{\Sigma}{B}$ where $B$ is the original batch size. To maintain the same update variance, we need $\eta \sim \sqrt{k}$. As training progresses, the critical batch size grows.
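The square-root rule above is easy to sanity-check numerically. In this small sketch, the helper name and the example batch sizes are assumptions; the check simply confirms that $\eta^2 / B$, the quantity the variance argument says should stay fixed, is unchanged after rescaling.

```python
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root scaling: multiply the learning rate by sqrt(k),
    where k = new_batch / base_batch."""
    k = new_batch / base_batch
    return base_lr * math.sqrt(k)

base_lr, base_batch = 2e-4, 1_000_000   # illustrative values (batch in tokens)
new_batch = 4 * base_batch               # k = 4
new_lr = scale_lr(base_lr, base_batch, new_batch)

# Update variance is proportional to lr^2 / batch; it should match before/after.
print(new_lr, base_lr**2 / base_batch, new_lr**2 / new_batch)
```

This is a heuristic starting point for a sweep, not a guarantee: past the critical batch size, a larger learning rate no longer recovers the lost sample efficiency.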
Initially, since the model is making large updates, $\lvert \lvert g \rvert \rvert^2$ is large, so the model should have a small critical batch size. After the model stabilizes, larger batches become more effective. This motivates the idea of batch size warmup.

Imbalanced minibatches can arise when sequence packing or data distribution creates batches with highly variable sequence lengths or domain compositions, which can cause gradient variance that destabilizes training; this is especially true when certain experts or model components receive disproportionately many or few tokens. Arcee introduces a random sequential document buffer (RSDB) to reduce intra-batch correlation. After tokenizing a document, it works by loading the token sequence as an entry in the RSDB with a read head at index 0; this is repeated until the RSDB is full. From a randomly sampled index in a randomly sampled document from the RSDB, tokens are read based on the read head and the index and added to a separate sequence buffer. Read head positions are updated, and if the sequence buffer is full, we return; otherwise, we randomly select another document index and continue to read tokens into the sequence buffer, repeating until the sequence buffer is full. The internal buffer size (in Trinity Large: 8192 per GPU) is set to twice the user-specified buffer value and refilled when the buffer reaches the user-specified value (in Trinity Large: 4096 per GPU) or when old documents need to be purged/new documents can be loaded. Arcee found that this optimization significantly improved dataloader performance.

scaling laws

Scaling laws (e.g. Chinchilla scaling laws https://arxiv.org/abs/2203.15556 ) provide a useful proxy for determining how aggressively/conservatively to update hyperparameters as model size scales. First, $C \approx 6 \cdot N \cdot D$ where $C$ is the compute budget measured in FLOPs, $N$ is the number of parameters, and $D$ is the number of training tokens.
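As a quick worked example of the $C \approx 6ND$ approximation, the sketch below plugs in two parameter/token pairs quoted in this post (a 3B model on 11T tokens, and GPT-3's 175B parameters on 300B tokens). The function names are illustrative, and the result is a rough estimate that ignores attention FLOPs and hardware utilization.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute C = 6 * N * D, in FLOPs."""
    return 6 * n_params * n_tokens

def tokens_for_budget(flops, n_params):
    """Invert C = 6 * N * D to get the token count a budget affords."""
    return flops / (6 * n_params)

small = training_flops(3e9, 11e12)      # 3B parameters, 11T tokens
gpt3 = training_flops(175e9, 300e9)     # 175B parameters, 300B tokens

print(f"3B @ 11T tokens:     {small:.2e} FLOPs")
print(f"175B @ 300B tokens:  {gpt3:.2e} FLOPs")
print(f"tokens a 3B model could see on the GPT-3 budget: "
      f"{tokens_for_budget(gpt3, 3e9):.2e}")
```

Under this approximation the two runs cost the same order of magnitude of compute, which illustrates how differently a fixed budget can be split between parameters and tokens.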
The 6 comes from a rough count of FLOPs per parameter per token: about 2 for the forward pass and 4 for the backward pass.

Figure 8: Scaling curves of batch size and learning rate. From DeepSeek https://arxiv.org/abs/2407.05065 .

Initially, scaling laws https://arxiv.org/abs/2001.08361 indicated that language model size was the main constraint, leading to a GPT-3 model with 175B parameters but only trained on 300B tokens. A re-derivation https://arxiv.org/abs/2203.15556 found that training duration could improve gains more than size; they found that compute-optimal training of GPT-3 should have consumed 3.7T tokens. However, scaling laws are almost never religiously followed. Recently, labs have been “overtraining” models beyond the training durations suggested by scaling laws (e.g. Qwen 3 being trained on 36T tokens). Moreover, “compute-optimal” scaling laws don’t account for larger models being more expensive after training due to inference. To that end, Hugging Face decided to train a 3B model on 11T tokens. For comparison, Kimi K2’s 1T model was trained on 15.5T pre-training tokens.

While general scaling laws provide guidance, Kimi K2’s scaling law analysis revealed model-specific insights. They showed that an increase in sparsity, the ratio of the total number of experts to the number of activated experts, yields substantial performance improvements for fixed FLOPs, so they increase the number of MoE experts to 384 (256 in DeepSeek-V3) while decreasing attention heads to 64 (128 in DeepSeek-V3) to reduce computational overhead during inference. They settle on a sparsity of 48, activating 8 out of 384 experts, and found that decreasing the attention heads from 128 to 64 cost 0.5% to 1.2% in validation loss but yielded a 45% decrease in inference FLOPs.

data curation and pre-training

Even with the perfect architecture, a model’s performance is still heavily dependent on its training data; no amount of compute or optimization can compensate for training on the wrong content.
To this end, it’s about assembling the right data mixture, balancing training objectives and tuning data proportions. This is particularly difficult since, across competing objectives and for a fixed compute budget, increasing one proportion necessarily decreases another, hurting performance there. There already exist large pre-training corpora like FineWeb2 https://arxiv.org/abs/2506.20920 and The Pile https://pile.eleuther.ai/ . However, there are still plenty of information gaps, so recent models additionally rely on specialized pretraining datasets for domains like math and coding.

One consideration is data quality. Of course, training on the highest quality data possible is preferable. But for a training budget of $X$ tokens, because high quality data is limited, filtering only for it would lead to repeated data, which can be harmful https://arxiv.org/abs/2305.16264 . So, an ideal mixture includes both higher and lower quality data.

Another consideration is model safety. For gpt-oss-120b, OpenAI addresses this by filtering the data for harmful content in pre-training, with an emphasis on hazardous biosecurity knowledge. They use the CBRN (chemical, biological, radiological, and nuclear) pre-training filters that were used in GPT-4o.

multi-stage training

Multi-stage training https://arxiv.org/abs/2502.02737 , the idea of evolving the data mixture as training progresses, can make better use of both high-quality and lower-quality data than a static mixture, because an LM’s final behavior is heavily dictated by the data it sees at the end of training https://arxiv.org/abs/2410.08527 . This motivates the strategy of saving the higher quality data for the end. It also introduces another variable, namely when to begin changing mixtures; a general principle is performance-driven intervention: if a benchmark begins to plateau, it’s a signal to introduce high-quality data for that domain.

ablation

While architectural ablations are done on smaller models (e.g. on 1B models to train for 3B models), data mixture ablations are done at scale because larger models have much larger capacities to understand a variety of domains. Moreover, annealing ablations are done on checkpoints of the main run (like 7T out of 11T tokens) to determine what datasets to introduce when. To determine optimal data proportions, recent models often minimize a validation loss or a holdout loss based on evaluation objectives and data domains. However, some of these methods tend to converge toward distributions that mirror the dataset size distribution, and they don’t outperform careful manual ablations.

token utility

Token efficiency is how much performance improvement is achieved per token consumed during training. This can be improved via better token utility, the effective learning signal each token contributes; this motivates finding the optimal balance of high-quality tokens, since they should be maximally leveraged but also limited to prevent overfitting and reduced generalization. Kimi K2 uses data rephrasing in knowledge and math domains. For knowledge, this comes in the form of style- and perspective-diverse prompting to rephrase the texts, chunk-wise autoregressive generation to gradually build a rephrased version of long documents, and fidelity verification to ensure semantic alignment. In the main training run, each corpus is rephrased at most twice. For math, diversity is increased via rephrasing into a “learning-note style” and translation into other languages.

pre-training data

SmolLM3

Hugging Face’s goal was to build a multilingual model that also excels at math and coding. In stage 1 of their multi-stage training, they use a 75/12/10/3 split among English web data, multilingual web data, code data, and math data.
English web data: they ablate on a mixture of FineWeb-Edu (educational and STEM benchmarks) and DCLM (common sense reasoning), two strong open English web datasets at the time of training, finding that a 60/40 or a 50/50 split was best. Later, they add in other datasets including Pes2o https://huggingface.co/datasets/allenai/dolmino-mix-1124/tree/main/data/pes2o , Wikipedia & Wikibooks https://huggingface.co/datasets/allenai/dolmino-mix-1124/tree/main/data/wiki , and StackExchange https://huggingface.co/datasets/HuggingFaceTB/stackexchange_2025_md .

Multilingual web data: five European languages were chosen, with data from FineWeb2-HQ. Smaller portions of other languages, like Chinese or Arabic, were included to allow others to do continual pretraining of SmolLM3. Ultimately, they found that 12% multilingual content in the web mix was best.

Code data: primarily extracted from The Stack v2 and StarCoder2 https://arxiv.org/abs/2402.19173 , it includes 16 languages, GitHub PRs, Jupyter/Kaggle notebooks, GitHub issues, and StackExchange threads. Despite research showing that code improves LM performance beyond coding, they did not observe this effect (rather, a degradation on English benchmarks) using the recommended code mixture. They delay adding their educationally filtered subset, Stack-Edu, following the principle of delaying the best data until the end.

Math data: FineMath3+, InfiWebMath3+, MegaMath https://arxiv.org/abs/2504.02807 , and instruction/reasoning datasets like OpenMathInstruct https://arxiv.org/abs/2402.10176 and OpenMathReasoning https://arxiv.org/abs/2504.16891 .

For new stages (using a checkpoint at around 7T out of the total 11T tokens), they use a 40/60 split between the baseline mixture and the new dataset. SmolLM3 has three stages: 8T tokens @ 4k context for base training, 2T tokens @ 4k context for high-quality injection, and 1.1T tokens @ 4k context for a reasoning/Q&A stage.
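A staged recipe like this is often easiest to reason about as a small config plus a lookup. The sketch below encodes the three SmolLM3 stage budgets and the stage-1 split quoted above; the data structure, key names, and helper functions are hypothetical, and the later-stage mixtures are deliberately left out because they are not given here.

```python
# Stage names and token budgets as quoted above (all at 4k context).
STAGES = [
    ("stage 1: base training", 8.0e12),
    ("stage 2: high-quality injection", 2.0e12),
    ("stage 3: reasoning/Q&A", 1.1e12),
]

# Stage-1 mixture (the 75/12/10/3 split); the key names are illustrative.
STAGE1_MIX = {
    "english_web": 0.75,
    "multilingual_web": 0.12,
    "code": 0.10,
    "math": 0.03,
}

def stage_at(tokens_seen):
    """Return the name of the stage that a given token count falls into."""
    boundary = 0.0
    for name, budget in STAGES:
        boundary += budget
        if tokens_seen < boundary:
            return name
    return STAGES[-1][0]  # past the end: report the final stage

def stage1_token_budget():
    """Tokens per source in stage 1 = mixture weight * stage budget."""
    budget = STAGES[0][1]
    return {source: weight * budget for source, weight in STAGE1_MIX.items()}

print(stage_at(7e12))            # the ~7T checkpoint used for annealing ablations
print(stage1_token_budget())     # e.g. english_web gets 0.75 * 8T tokens
```

Keeping the boundaries in data rather than code makes it cheap to rerun an annealing ablation from a checkpoint with a different late-stage mixture.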
hermes 4

Using data from DCLM https://arxiv.org/abs/2406.11794 and FineWeb, Nous first performs semantic deduplication using embeddings at a cosine similarity of 0.7, and then uses an LLM-as-judge to filter out incomplete or ill-formatted messages. Then, they process pre-training data through DataForge, a graph-based synthetic data generator, which allows for large and complex structures. Data is generated by taking a random walk through a directed acyclic graph whose nodes implement a mapping from struct $\to$ struct, such that if there is an edge from node $A$ to node $B$, the postconditions guaranteed by $A$ must satisfy the preconditions of $B$. QA pairs are generated using this workflow with intermediary transformations into other mediums (e.g. a Wikipedia article into a rap song), question generation, and then question/answer annotation using an LLM-as-judge to grade the instruction and response. Also, to find a covering set of data-scarce domains of special interest, they recursively (depth-first) generate a taxonomy of subdomains, where the LLM enumerates $n$ subdomains to form a partition and the leaves are prompts. The DataForge-generated data is used in both pre-training and post-training stages, with specific details provided in the post-training data section below.

data takeaways
- Data quality and mixture often dominate architecture tweaks at fixed compute.
- Multi-stage schedules help: save the best data for late training to shape final behavior.
- Deduplication and contamination checks are non-optional if you care about honest evals.
- Ablate data mixtures at scale; small-model ablations can be misleading.

mid-training

Mid-training is the intermediary step between pre-training and post-training where the base model is trained further on a large amount of domain-specific tokens, especially shaping the model to focus on common core skills like coding or reasoning.
Oftentimes, the decision to mid-train is only made after initial SFT experiments are run, because they may reveal performance gaps that indicate the need to mid-train on certain domains. But if the goal is to elicit shallow capabilities like style or conversation, the compute is better spent in post-training.

Some recipes include an additional long context stage; for example, Qwen3 https://arxiv.org/abs/2505.09388 first trained on 30T tokens at 4k context, then ran a reasoning stage with 5T higher-quality tokens mainly on STEM and coding, and finally a long context stage at 32k context length. SmolLM3 also does this, but instead of scaling from 4k to 128k directly, they sequentially scale from 4k to 32k to 64k to 128k, which allows the model to adapt at each length before pushing the context length further. Upsampling long context documents like web articles or books has been reported to improve long-context performance https://arxiv.org/abs/2410.02660 , but Hugging Face didn’t observe an improvement; they hypothesize that this is because their baseline mixture already includes long documents and uses RNoPE. To go from 4k to 32k and later to 64k, they use RoPE ABF and increase the base frequency to 2M and 5M, respectively. Base frequencies like 10M further improved slightly on RULER https://arxiv.org/abs/2404.06654 , a long context benchmark, but hurt short context tasks like GSM8k, so they were discarded. To reach 128k, they found that using YaRN from the 64k checkpoint (instead of a four-fold increase from 32k) produced better performance, which confirms the hypothesis that training closer to the desired inference length benefits performance.

Kimi K2 decays the learning rate from 2e-5 to 7e-6, training on 400B tokens with a 4k sequence length, then 60B tokens with a 32k sequence length. To extend to 128k, they use YaRN.

While the mid-training data usually comes from web data, another powerful approach is to use distilled reasoning tokens from a better model, as Phi-4-Mini-Reasoning did from DeepSeek-R1.
When applied to the base model, distilled mid-training increased benchmark scores like AIME24 by 3x, MATH-500 by 11 points, and GPQA-D by almost 6 points. SmolLM3 also does distilled mid-training. They considered datasets including reasoning tokens from DeepSeek-R1 (4M samples) and QwQ-32B (1.2M samples) but decided to delay using the Mixture of Thoughts https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts dataset until the final SFT mix. They found that it almost always makes sense to perform some amount of mid-training if the base model hasn’t already seen lots of reasoning data during pre-training, because they noticed that even the /no_think mode improved on reasoning benchmarks.

post-training evals

Given today’s standards of LLMs as coding agents and assistants that can reason, there are four broad classes of evals that researchers care about:

- Knowledge: for small models, GPQA Diamond tests graduate-level multiple-choice questions and gives better signal than other evals like MMLU. Another good test for factuality is SimpleQA, although smaller models perform much worse on it due to limited knowledge.
- Math: AIME is still the leading benchmark, with others like MATH-500 providing a useful sanity check for small models.
- Code: LiveCodeBench tracks coding competency via competitive programming, while SWE-bench Verified is a more sophisticated alternative that is much harder for smaller models.
- Multilinguality: there aren’t many options other than Global MMLU for targeting the languages that models were pretrained on or should perform well in.

Other evals test the following:

- Long context: RULER, HELMET, and the more recently released MRCR and GraphWalks benchmark long-context understanding.
- Instruction following: IFEval checks verifiable instructions with verifiers, and IFBench extends it with a more diverse set of constraints. For multi-turn, Multi-IF and MultiChallenge are preferred.
- Alignment: LMArena, with human annotators and public leaderboards, is the most popular. But due to the cost of these evaluations, LLM-as-judge evals have emerged, including AlpacaEval and MixEval.
- Tool calling: TAU-Bench tests a model’s ability to use tools to resolve user problems in customer service settings, including retail and airline.

To prevent overfitting, evals that capture robustness or adaptability, like GSMPlus, which perturbs problems from GSM8k, are also included. Another way is using internal evals or vibe evaluations/arenas, such as manually probing the model’s behavior. Other tips include using small subsets to accelerate evals (especially if there’s correlation with a larger eval), fixing the LLM-as-judge model (if the eval requires it), treating anything used during ablations as validation, using avg@k accuracy, and trying not to benchmax.

post-training data

intellect 3

It’s first worth mentioning that Intellect-3 is a 106B parameter MoE (12B active) post-trained on top of the GLM-4.5-Air base model from Z.ai, and that Prime Intellect has its own post-training stack, including prime-rl, an open framework for large-scale asynchronous RL; the verifiers library for training and evals from their Environments Hub; sandboxed code execution; and compute orchestration. Integrating with the Environments Hub, Prime trains on a diverse and challenging mix of environments designed to improve coding and reasoning capabilities. For math, they design an environment with long CoT reasoning in mind, consisting of 21.2K challenging math problems from Skywork-OR1, Acereason-Math, DAPO, and ORZ-Hard, all of which are curated datasets derived from AIME, NuminaMath, Tulu3 math, and others, and which test difficult math questions ranging from multiple choice to proofs to those involving figures. Even with verifiers, there was a non-trivial number of false negatives, so they additionally use opencompass/CompassVerifier-7B as an LLM-judge verifier. 
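The idea of backing up a strict verifier with an LLM judge to recover false negatives can be sketched as a two-stage reward function. This is a minimal illustration, not Prime Intellect's implementation and not the API of the verifiers library: `extract_boxed`, `normalize`, `reward`, and `toy_judge` are hypothetical names, and `toy_judge` is a stand-in for where a judge model such as CompassVerifier-7B would be queried.

```python
from fractions import Fraction
from typing import Callable, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def normalize(answer: str) -> str:
    """Cheap string normalization used by the strict check."""
    return answer.replace(" ", "").replace("$", "").rstrip(".").lower()


def reward(
    completion: str,
    reference: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Binary reward: strict match first, then an optional judge as a fallback."""
    predicted = extract_boxed(completion)
    if predicted is None:
        return 0.0
    if normalize(predicted) == normalize(reference):
        return 1.0
    # The strict check can reject equivalent answers (false negatives),
    # so a second opinion is requested only when it fails.
    if judge is not None and judge(predicted, reference):
        return 1.0
    return 0.0


def toy_judge(predicted: str, reference: str) -> bool:
    """Hypothetical stand-in for an LLM judge: compares answers as exact rationals."""
    try:
        return Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False


strict_only = reward("The answer is \\boxed{0.5}", "1/2")
with_judge = reward("The answer is \\boxed{0.5}", "1/2", judge=toy_judge)
print(strict_only, with_judge)
```

Calling the judge only after the strict check fails keeps the expensive path off the common case, at the cost of trusting the strict check never to produce false positives.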
For science (mainly physics, chemistry, and biology), they filter 29.3K challenging problems from MegaScience https://arxiv.org/abs/2507.16812 , using LLM-judge verification alongside standard math verifiers. For logic (games like Sudoku or Minesweeper), 11.6K problems and verifiers were adapted from SynLogic https://arxiv.org/abs/2505.19641 . For code, they primarily use their Synthetic-2 dataset https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2 along with Prime Sandboxes to verify solutions. They also develop two SWE environments that support scaffolding for common formats like R2E-Gym https://arxiv.org/abs/2504.07164 , SWE-smith https://arxiv.org/abs/2504.21798 , and Multi-SWE-bench https://arxiv.org/abs/2504.02605 , in which the agent fixes issues within a GitHub project when equipped with Bash commands and edit tooling. The maximum number of turns for the agent is set at 200.

Prime also focuses on deep research capabilities via their web search environment, which provides the model with a set of search tools. The environment tasks the model with answering questions from Z.ai’s DeepDive dataset https://huggingface.co/datasets/zai-org/DeepDive using the tools and assigns a reward of either 1 or 0, with 1K samples used for SFT trajectory generation and 2.2K samples for RL. When tested on Qwen/Qwen3-4B-Instruct-2507, 26 steps of SFT with a batch size of 34 followed by 120 steps of RL at a group size of 16 and a batch size of 512 were enough to reach a mean reward of 0.7.

hermes 4

Nous Research uses 300k prompts, mostly STEM and coding, from WebInstruct-Verified https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified , rSTAR-Coder https://arxiv.org/abs/2505.21297 , and DeepMath-103k https://arxiv.org/abs/2504.11456 , and applies deduplication and filtering for prompts with 2k characters. Nous rejection samples against ~1k task-specific verifiers using Atropos https://nousresearch.com/introducing-atropos/ . 
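Rejection sampling against a verifier can be sketched in a few lines: draw several candidate responses per prompt and keep only those the verifier accepts. This is a generic illustration under stated assumptions, not the Atropos API or Nous Research's pipeline; `rejection_sample`, `make_toy_generator`, and `toy_verify` are hypothetical names, and the toy generator is a deterministic stand-in for a model.

```python
import itertools
from typing import Callable, Dict, Iterable, List


def rejection_sample(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    n_samples: int = 4,
    keep_per_prompt: int = 1,
) -> List[Dict[str, str]]:
    """Keep up to keep_per_prompt verified responses out of n_samples draws per prompt."""
    kept = []
    for prompt in prompts:
        accepted = []
        for _ in range(n_samples):
            response = generate(prompt)
            if verify(prompt, response):
                accepted.append(response)
                if len(accepted) == keep_per_prompt:
                    break
        # Prompts with no accepted response contribute nothing to the dataset.
        kept.extend({"prompt": prompt, "response": r} for r in accepted)
    return kept


def make_toy_generator() -> Callable[[str], str]:
    """Hypothetical stand-in for a model: every other call is off by one."""
    calls = itertools.count()

    def generate(prompt: str) -> str:
        a, b = map(int, prompt.split("+"))
        error = 1 if next(calls) % 2 == 0 else 0
        return str(a + b + error)

    return generate


def toy_verify(prompt: str, response: str) -> bool:
    """Hypothetical task-specific verifier for 'a+b' prompts."""
    a, b = map(int, prompt.split("+"))
    return response.strip() == str(a + b)


dataset = rejection_sample(["2+3", "10+7"], make_toy_generator(), toy_verify)
print(dataset)
```

In a real pipeline each task type would supply its own `verify`, which is where a large pool of task-specific verifiers would plug in.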
Some environments used to generate the dataset include Answer Format Training : rewards succinctly-presented final answers, like $\mathtt{\backslash boxed{}}$ in LaTeX, but there are over 150 output formats sampled. The environment also enforces