Feature Lottery? A Bifurcation Theory of Concept Emergence

Researchers have developed a bifurcation theory that detects, in real time during training, when neural networks acquire structured representations, without relying on retrospective or label-dependent metrics. The theory identifies a universal, label-free phase coordinate (a dynamic ratio computed entirely from hidden states) that predicts four distinct transition regimes, validated across diverse settings including language models, self-supervised learning, and grokking. The same coordinate serves as an early-warning indicator for training health, detecting the onset of usable structure and representational collapse epochs before downstream metrics react.

arXiv:2605.24057v1. Abstract: Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show that the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing $\beta_c$ that, compared to the network's current state $\beta$, yields a dynamic ratio $\beta(t)/\beta_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the "feature lottery" in SAE training: a feature's terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $\beta/\beta_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.
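
For readers unfamiliar with the term, the supercritical pitchfork bifurcation named in the abstract has the textbook normal form dx/dt = mu*x - x^3: for mu <= 0 the symmetric state x = 0 is stable, and for mu > 0 it becomes unstable and two symmetric branches x = ±sqrt(mu) appear. The sketch below is a toy illustration of that normal form only, not the paper's model. The function name `escape_time`, the initial perturbation `x0`, and the escape threshold are all hypothetical choices, and reading `mu` as the distance past the critical point (something like $\beta/\beta_c - 1$) is an assumption made here for intuition, not a mapping taken from the paper. It shows one way a visible symmetry break can trail the crossing of the critical point by a long time when the system sits only slightly past it.

```python
import math


def escape_time(mu, x0=1e-6, threshold_frac=0.5, dt=1e-2, t_max=1e5):
    """Forward-Euler integration of the pitchfork normal form dx/dt = mu*x - x**3.

    Returns the first time at which x reaches threshold_frac * sqrt(mu),
    i.e. a fixed fraction of the way to the symmetry-broken branch.
    All parameter defaults are illustrative assumptions.
    """
    if mu <= 0:
        # Below (or at) the critical point, x = 0 is stable: no escape.
        return math.inf
    target = threshold_frac * math.sqrt(mu)
    x, t = x0, 0.0
    while t < t_max:
        if x >= target:
            return t
        x += dt * (mu * x - x**3)  # one Euler step of the normal form
        t += dt
    return math.inf


if __name__ == "__main__":
    # Smaller mu means the system sits closer to the critical point.
    for mu in (1.0, 0.1, 0.01):
        print(f"mu={mu:<5} escape time ~ {escape_time(mu):.1f}")
```

Near x = 0 the dynamics are approximately linear, so the perturbation grows like x0 * exp(mu * t) and the escape time scales roughly as ln(1/x0) / mu. A tiny initial asymmetry combined with a small mu therefore produces a long plateau before the break becomes visible. The abstract attributes the delay in grokking to finite dissipation, which this toy does not model.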