Loss Exploded.

Meta's FAIR team documented a series of training failures in 2021 for their OPT-175B model, including repeated loss explosions and runs that failed to learn, which required extensive hyperparameter tuning and architecture swaps. In 2026, DeepSeek's new V4 model, a trillion-parameter MoE system with 1M tokens of context and near-frontier performance, also faced significant stability challenges during training on roughly 30 trillion tokens. The team acknowledged that their fixes (expert routing based on stale parameters, and clipping) are not yet understood theoretically, and they shared them openly to encourage further community research into the underlying mechanisms.

If you want to see what a very painful couple of months looks like for an ML research team, FAIR’s logbook of the OPT-175B pretraining from 2021 (https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B Logbook.pdf) should top your list. The first few runs are basically:

- Loss exploded.
- Doesn’t learn.
- Loss exploded.
- Etc.

At each point the team tweaks some of the hyperparameters (learning rates, weight decay, clipping and so on) and adjusts how the model is distributed across the various forms of parallelism. They swapped parts of the architecture (GELU to ReLU, for example), dealt with hardware failures, avoided bad data, and tried to debug what was going on.

That was 2021 though; by now, in 2026, we commonly train on tens of thousands of much more powerful GPUs. We have a broadly available body of knowledge (https://huggingface.co/spaces/nanotron/ultrascale-playbook and https://arxiv.org/pdf/2603.07685) on how to train massive models. The big labs are now full of serious people debating the moral meaning of perplexity with their in-house philosophers.

But the big labs don’t share those details. Of the frontier-ish labs, DeepSeek continue to be unusually open about their work. Their new one, DeepSeek V4 (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek V4.pdf), is pretty great: 1M tokens of context and close to frontier performance across multiple domains. Training, however, was… not smooth:

“Training trillion-parameter MoE models presents significant stability challenges, and DeepSeek-V4 series are no exception. We encountered notable instability challenges during training. While simple rollbacks could temporarily restore the training state, they proved inadequate as a long-term solution because they do not prevent the recurrence of loss spikes.”

One helpful dynamic of model training is that you can often validate what will happen in a big model by training a small one. But not always.
It mostly works for architectural tweaks and allows much more rapid experimentation and testing. But when it doesn’t work you can be in trouble: things that appear to smooth out problems at small scale can mask others at large scale, where models have the capacity to learn weirder things. Fixing them late in training, when you are already into the gigawatts, hurts.

The techniques DeepSeek used (expert routing based on stale params, and clipping) did seem to get them through training a massive model on 30-odd trillion tokens, which is an incredible accomplishment. But they are arguably bandaids, as the team readily call out:

“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community”

Which has echoes of Noam Shazeer’s similar observation (https://arxiv.org/pdf/2002.05202v1) for SwiGLU:

“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
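
For anyone who hasn't met the "clipping" that keeps coming up above, here is a minimal sketch of one common form, global gradient-norm clipping. This is an assumption on my part about what the word usually means in this context: the text above doesn't say which variant either team used, and this is not a reconstruction of DeepSeek's or FAIR's code. The function name and the toy gradients are made up for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one shared factor so that their
    combined L2 norm is at most max_norm. Returns the (possibly
    rescaled) gradients and the norm measured before clipping."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Small updates pass through untouched.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    # Same direction, shorter step.
    return [[g * scale for g in vec] for vec in grads], total_norm

# Hypothetical gradients for two parameter tensors, flattened to lists.
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

The point of the shared scale factor is that a single huge gradient step can't throw the weights somewhere unrecoverable, while the direction of the update is preserved. It limits the damage from a spike without explaining why the spike happened, which is why it reads as a bandaid. In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`.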