[Day 7] Does Giving an AI More 'Thinking Time' Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX

The article investigates whether increasing an AI model's "thinking time" through additional recurrent loops during inference improves its performance, using a small-scale OpenMythos model trained on multi-digit addition. The author explains that OpenMythos is a theoretical, community-driven PyTorch reconstruction of Anthropic's unreleased Claude Mythos architecture, not the actual model. Experimental results show that accuracy peaks at exactly four loop iterations, with fewer loops causing underthinking and more loops leading to overthinking, suggesting that optimal performance depends on finding the right balance of recurrent depth.

## Intro

Reddit kept surfacing this new project called OpenMythos in my feed with "12 days to replicate frontier AI, ASI is near" headlines, and I got curious enough to dig in.

Tools used: my home AI machine (DGX Spark) + OpenMythos (a PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic multi-digit addition.

The question: does giving an AI more "thinking time" (more recurrent loops at inference) actually make it smarter?

## Today's setup

### The hype

On 2026-04-07, Anthropic announced Claude Mythos. Press coverage highlights zero-day discovery capabilities (reportedly 271 zero-days in Firefox and a 27-year-old bug in OpenBSD), but the model's architecture and weights remain unreleased. Anthropic kept Mythos itself behind a limited-access coalition (Project Glasswing: AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto, ~40 organizations) rather than releasing it publicly.

Twelve days later, Kye Gomez (Swarms) released OpenMythos, a PyTorch reconstruction of the suspected architecture. The repo is explicit upfront:

> "an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic"

So OpenMythos is not Mythos. It's a hypothesis-in-code: a Recurrent-Depth Transformer (RDT) with MoE FFNs and MLA/GQA attention, capable of being trained from scratch on standard text data. No leaked weights, no distillation. Reddit's "ASI is near" framing skips this critical distinction.

The interesting question, once you set the hype aside, is whether the architectural idea (recurrent depth) actually works.

Note for this article: OpenMythos is not Claude Mythos; it's a theoretical reconstruction inspired by looped-transformer research.
The experiments below are not "Claude Mythos capability tests" but rather "how does a looped / recurrent-depth structure behave on a small synthetic task."

### Three perspectives on looped transformers

Browsing the literature, I found three studies that give different pictures of how looped transformers behave:

| Source | Scale | Claim |
|---|---|---|
| Saunshi et al. 2025 (ICLR, research paper) | tens of M params, synthetic | Loops work: k layers looped L times approximately matches a kL-layer fixed-depth model on addition / p-hop induction / math |
| Geiping et al. 2025 (Huginn, research paper) | 3.5B params, 800B tokens | Task-dependent: at scale on natural-language benchmarks, gains can be marginal (T=4 → T=32 only +1.82 points on GSM8K), though effects vary by task and compute regime |
| Micheal Bee 2026-04 (Medium, independent experiment blog) | 17M params, 12 GPU-hours on RTX 5070 Ti | Loops plateau at T=2 in this small-scale setup: hidden state reaches a fixed-point that subsequent iterations cannot escape |

Theory, large-scale empirics, and an independent solo replication give different pictures. I wanted to add a fourth data point from my own DGX Spark on a clean, controlled task: multi-digit addition.

### What I'd hoped to see

- Does training-time accuracy phase-transition (grok) at some step? (Saunshi's 3-stage prediction)
- Does test-time loop count matter? At what point does it stop helping?
- Does the hidden state actually keep evolving across loops, or does it hit a fixed-point early? (the Bee question)

### Headline finding

- Loops help, but only within a narrow window centered on the training loop count. With training-time `max_loop_iters=4`, accuracy peaks at exactly T=4 (100% across all digit counts) and decays in both directions: fewer loops underthink, more loops overthink.
- Bee's "T=2 fixed-point" reproduced. Cosine similarity between consecutive hidden states jumps from ~0.72 to ~0.95 at T=2, then climbs slowly to ~0.99 by T=4 and stays flat through T=32.
- Striking per-seed grokking variance. Same hyperparameters, four seeds: seeds 1 and 3 solve 5-digit addition by step 4,000; seed 2 takes 10,000; seed 0 stalls at <10% until step 16,000, then jumps to 100%.
- No depth extrapolation in this setup. Saunshi's claim that training at T=4 should generalize to deeper T at inference does not reproduce here; instead, T > 4 hurts.

## 🌀 What is a "looped" transformer?

A standard transformer (GPT-4, Llama, most local LLMs) routes input tokens through a stack of distinct layers, each used exactly once per forward pass. To make it "think deeper," you stack more layers, which increases parameter count.

A looped transformer reuses the same parameters across multiple iterations. The model has a Prelude → Recurrent Block × T → Coda structure: a few standard layers up front, then one block iterated T times with input injection at every step, then a few more standard layers.

    Input tokens
        ↓
    Prelude (P): standard layers, run once
        ↓
    Recurrent Block (R): one block looped T times
        ↑ ↓   h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
        ↓
    Coda (C): standard layers, run once
        ↓
    Output logits

At each loop iteration t, the hidden state updates via the LTI injection rule, and the encoded input e (the Prelude output) is re-injected to keep the original signal alive across arbitrary depth. The injection parameters are constrained so that the spectral radius ρ(A) < 1, which prevents divergence over many loops (the Parcae stability framework).

The key claim: more loops at inference = deeper reasoning, without adding parameters. This is conceptually analogous to chain-of-thought scaling, except the "thinking" happens in continuous latent space rather than discrete token space.

## 🔧 Experimental setup

I trained a deliberately tiny OpenMythos variant on multi-digit addition. The model is small enough to run 4 seeds in parallel on a single GPU but large enough to exhibit the looped-transformer phenomena.
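To make the update rule h_{t+1} = A·h_t + B·e + Transformer(h_t, e) from the looped-transformer description above concrete, here is a minimal pure-Python sketch. It is a toy, not OpenMythos code: the function names, the scalar coefficients `a` and `b`, the zero initial state, and the `tanh` stand-in for the shared transformer block are all assumptions chosen for illustration. The only property carried over from the text is that the state-transition coefficient stays below 1 in magnitude. In this toy the per-loop change shrinks toward zero, which is the kind of early settling the fixed-point question is about; it says nothing about what a trained model does.

```python
import math


def looped_forward(e, n_loops, a=0.5, b=1.0):
    """Toy version of h_{t+1} = A*h_t + B*e + block(h_t, e).

    e        -- list of floats standing in for the Prelude output
    n_loops  -- number of recurrent iterations T
    a, b     -- hypothetical diagonal entries of A and B (requires |a| < 1)
    Returns the list of hidden states h_1 .. h_T.
    """
    if abs(a) >= 1.0:
        raise ValueError("need |a| < 1 so the linear part is stable")
    h = [0.0] * len(e)  # h_0: start from zeros (an assumption)
    states = []
    for _ in range(n_loops):
        # The same "parameters" (a, b, and the tanh stand-in) are reused
        # on every iteration, and e is re-injected at each step.
        h = [a * h_i + b * e_i + 0.1 * math.tanh(h_i + e_i)
             for h_i, e_i in zip(h, e)]
        states.append(h)
    return states


def step_size(prev, curr):
    """Largest absolute per-channel change between two consecutive states."""
    return max(abs(c - p) for p, c in zip(prev, curr))


states = looped_forward([1.0, -0.5], n_loops=32)
early = step_size(states[0], states[1])    # change made by loop 2
late = step_size(states[-2], states[-1])   # change made by loop 32
print(f"change at loop 2: {early:.4f}, change at loop 32: {late:.2e}")
```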
Pipeline overview:

    OpenMythos tiny (3.4M params)
        ↓
    Train 4 seeds in parallel, 30k steps each, fp32 on DGX Spark (GB10)
        ↓
    Experiment A: greedy autoregressive accuracy
        loops ∈ {1, 2, 4, 8, 16, 32} × digits ∈ {2, 3, 4, 5}
        ↓
    Experiment B: cosine similarity between consecutive hidden states
        ⇒ does the recurrent block reach a fixed-point?
        ↓
    Compare against Saunshi / Huginn / Bee

### Model config

    MythosConfig(
        vocab_size=16,          # digits 0-9 + '+', '=', pad, eos
        dim=256,
        n_heads=8,
        n_kv_heads=2,           # GQA
        max_seq_len=32,
        max_loop_iters=4,       # training depth; inference varies
        prelude_layers=1,
        coda_layers=1,
        attn_type="gqa",
        n_experts=4,            # MoE FFN inside recurrent block
        n_shared_experts=1,
        n_experts_per_tok=2,
        expert_dim=512,
        lora_rank=8,            # depth-wise LoRA per loop step
    )

Total parameters: 3,386,658 (~3.4M).

### Data

On-the-fly synthetic addition. Operands are uniformly sampled from [10^(d-1), 10^d - 1] for digit count d ∈ {2, 3, 4, 5}. Sequence format is `"A+B=R$"`, where `R = str(A+B)[::-1]` (reverse-order answer, following Saunshi's convention so left-to-right autoregressive generation can carry digits naturally). Loss is applied only at positions following the `=` token (i.e., on the answer tokens).

### Training

- Optimizer: AdamW, betas (0.9, 0.95), wd 0.1
- LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5
- Grad clip: 1.0
- Batch size: 128
- Max steps: 30000
- dtype: fp32

Initially I tried bf16 to use the GB10 efficiently, but OpenMythos stores RoPE frequencies as complex64 buffers, and `model.to(bfloat16)` silently drops the imaginary part, breaking attention. For a 3.4M-param model on 128 GB of unified memory, fp32 is fine: the bottleneck is not memory but parallel scheduling.

Four seeds {0, 1, 2, 3} run in parallel on the same GPU. Per-seed throughput drops to ~12K tok/s (vs ~50K solo), but wall-clock time for all four is approximately equivalent to one solo run.
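The data format described in the Data section above can be sketched in a few lines of plain Python. This is an illustration, not the repo's actual data script: the function names `make_example`, `sample_example`, and `answer_mask` are hypothetical, and treating `$` as a supervised end marker is an assumption.

```python
import random


def make_example(a, b):
    """Format one sample as "A+B=R$" with the sum written in reverse."""
    return f"{a}+{b}={str(a + b)[::-1]}$"


def sample_example(d, rng):
    """Draw both operands uniformly from [10^(d-1), 10^d - 1]."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    return make_example(rng.randint(lo, hi), rng.randint(lo, hi))


def answer_mask(text):
    """True for every character after '=' (the positions that get loss).

    Assumption: the trailing '$' end marker is supervised as well.
    """
    eq = text.index("=")
    return [i > eq for i in range(len(text))]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
for d in (2, 3, 4, 5):
    print(d, sample_example(d, rng))

example = make_example(57, 68)  # 57 + 68 = 125, stored reversed as "521"
print(example, answer_mask(example))
```

Reversing the answer means the first generated digit is the ones digit (7 + 8 = 15, write 5, carry 1), so each later digit depends only on operand digits and a carry that has already been resolved.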
## 📊 Results

### Experiment A: accuracy heatmap

Mean fully-correct rate across 4 seeds, 500 eval samples per condition:

| Inference loops | d=2 | d=3 | d=4 | d=5 |
|---|---|---|---|---|
| 1 | 0.38 ± 0.12 | 0.19 ± 0.09 | 0.09 ± 0.07 | 0.02 ± 0.02 |
| 2 | 0.53 ± 0.17 | 0.50 ± 0.12 | 0.16 ± 0.08 | 0.21 ± 0.16 |
| 4 (train) | 1.00 | 1.00 | 1.00 | 1.00 |
| 8 | 0.98 ± 0.01 | 0.98 ± 0.01 | 0.94 ± 0.03 | 0.86 ± 0.08 |
| 16 | 0.91 ± 0.04 | 0.91 ± 0.05 | 0.75 ± 0.10 | 0.56 ± 0.16 |
| 32 | 0.62 ± 0.12 | 0.65 ± 0.13 | 0.45 ± 0.13 | 0.26 ± 0.17 |

Observations:

- Peak is exactly at the training-time loop count (T=4), 100% across all digit counts.
- One step of inference-time extrapolation (T=8) is near-peak but already shows degradation at d=5 (86%).
- Beyond T=8, accuracy collapses monotonically. At T=32, even 2-digit addition drops to 62%.
- Under-looping (T=1, T=2) hurts more at higher digit counts, consistent with depth being needed to chain carries.

### Experiment B: fixed-point analysis

Mean cosine similarity between consecutive hidden states, `cos(h_t, h_{t-1})`, over answer positions, averaged across 4 seeds, 200 samples per digit:

| t | d=2 | d=3 | d=4 | d=5 |
|---|---|---|---|---|
| 1 | 0.711 | 0.726 | 0.745 | 0.744 |
| 2 | 0.961 | 0.967 | 0.957 | 0.946 |
| 3 | 0.985 | 0.986 | 0.977 | 0.971 |
| 4 | 0.993 | 0.992 | 0.986 | 0.983 |
| 8 | 0.999 | 0.999 | 0.998 | 0.996 |
| 16 | 0.9995 | 0.9996 | 0.9992 | 0.998 |
| 32 | 0.9995 | 0.9996 | 0.999 | 0.998 |

Bee's T=2 fixed-point claim is reproduced in spirit but not literally: cosine similarity jumps to ~0.95 at T=2 (vs. Bee's near-1.0), then asymptotes to ~0.99 by T=4 and stays flat through T=32.

The contrast with accuracy is telling: the hidden state is effectively static by cosine similarity from T=4 onwards, yet accuracy collapses at T=16-32. Two non-exclusive interpretations: (a) overthinking, where late loops drift away from a converged solution; (b) distribution shift, where training used T=4, so T > 4 is simply an out-of-distribution use of the model.
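For readers who want to see what the Experiment B metric computes, here is a minimal sketch of cosine similarity between consecutive states. It is not the evaluation script: `cosine` and `consecutive_cosines` are hypothetical helper names, the 2-D states are made up, and the averaging over answer positions, samples, and seeds is left out. The last two made-up states differ by a small step, which illustrates that a similarity very close to 1 can coexist with a state that is still moving slightly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def consecutive_cosines(states):
    """cos(h_t, h_{t-1}) for t = 1 .. len(states) - 1."""
    return [cosine(states[t], states[t - 1]) for t in range(1, len(states))]


# Made-up 2-D hidden states: a large turn, a smaller turn, then a tiny
# residual step that barely changes the direction of the vector.
states = [[1.0, 0.0], [0.7, 0.7], [0.6, 0.8], [0.6, 0.81]]
sims = consecutive_cosines(states)
print([round(s, 4) for s in sims])
```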
Worth noting that cosine similarity ≈ 1 doesn't prove the hidden state is doing nothing: small logit-relevant deltas may still accumulate.

Digit-count dependence on fixed-point timing is small (d=5 lags d=2 by ~0.01 in cosine sim). "Harder problems take more loops to converge" is not observed here: they converge at the same rate, but the converged state is just less accurate at higher digit counts.

### Bonus: training dynamics

The most striking thing in the training curves is seed-dependent grokking timing. Four runs of identical hyperparameters:

- seed 1: loss → 0 by step 3,000, all digits ≥88% by step 4,000
- seed 3: loss → 0 by step 4,000, all digits ≥87% by step 4,000
- seed 2: stuck at loss ~0.35 plateau until step 8,000, then collapses to 0 by step 10,000; d=4/5 jump from <10% to 99% in 2,000 steps
- seed 0: stuck at loss ~0.30 plateau until step 15,000, then collapses; d=4 groks at step 12,000-14,000, d=5 groks at step 16,000

This is textbook Saunshi-style three-stage grokking (memorization → in-distribution → systematic), with the third-stage trigger varying by a factor of 4x in step count purely on random init. The largest seed gap (seed 0 vs. seed 1) is ~12,000 steps, roughly 1 hour of wall-clock on this DGX.

If you trained a single seed and stopped early, you might conclude "OpenMythos can't generalize beyond d=3", which would be wrong. The architecture can solve all 4 digit buckets; some random seeds just need much longer to find the systematic-generalization solution.

## 💡 What this means for the three perspectives

### Where my data point lands

My single-DGX small-scale result lands somewhere between Bee and a partial refutation of Saunshi:

- Bee's fixed-point at small T is reproduced. Hidden state effectively stops evolving by T=4 (cosine sim ≈ 0.98-0.99) and certainly by T=8.
- Saunshi's depth-extrapolation does NOT reproduce. Inference at T greater than the training T does not improve accuracy; it harms it. T=8 is already at 86% on d=5 (vs. 100% at T=4), and T=32 collapses to 26%.
  The "train at depth k, infer at depth k·L" recipe assumes the recurrent block has learned to keep refining; in my setup it has not.
- Huginn's limited-gain finding is consistent at small scale, and more extreme here: extra inference loops give negative ROI rather than diminishing positive ROI.
- New observation: seed-dependent grokking with up to 12K-step variance. This is an under-emphasized variable in the public looped-transformer discourse: single-seed studies (Bee's solo replication, individual rows in Saunshi's tables) may be substantially under- or over-estimating typical behavior.

### Reconciliation attempt

Theory (Saunshi), large-scale empirics (Huginn), and independent replication (Bee) may not actually be in contradiction; they may be measuring different facets of the same phenomenon at different scales:

- Saunshi: shows loops can work on the right kind of problem (algorithmic, depth-bounded reasoning) at the right kind of scale (small synthetic).
- Huginn: shows that loops trained at 3.5B / 800B token scale on natural-language data give only marginal gains on a benchmark (GSM8K) that already favors CoT.
- Bee: shows that within a particular small-scale training recipe, the recurrent block's hidden state stops evolving very early in inference.

These three findings are compatible with a unified picture: loops carry compute, but only up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity. Beyond that depth, the hidden state stops moving meaningfully, and additional loops are computation without information.
### What I'd watch next

- Increase loop count during training (here I used 4) and see if the inference-time scaling extends further
- Try ACT halting more aggressively to see how the model self-regulates loop depth per token
- Add task heterogeneity (mix p-hop induction or parity) to test whether the fixed-point timing varies by problem class

## 🛠️ Technical details

### Reproducing this experiment

    git clone https://github.com/kyegomez/OpenMythos
    cd OpenMythos
    pip install -e .

    # Data, training, evaluation scripts (this Day 7 folder):
    python scripts/train.py --seed 0 --max_steps 30000
    python scripts/eval_accuracy.py --seeds 0 1 2 3
    python scripts/eval_fixedpoint.py --seeds 0 1 2 3
    python scripts/plot.py

The training and evaluation scripts are at https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts.

### What went wrong and was fixed

- bf16 broke complex RoPE buffer: switched to fp32; fine at 3.4M parameters
- Initial training-time `max_loop_iters` too small: kept at 4 per Saunshi's recipe; future experiments could vary this
- Greedy generation is slow at high loop counts: each batch repeats `n_loops` forward passes through the recurrent block; for loops=32 this is non-trivial

### Hyperparameter choices: why these

- `dim=256`, `expert_dim=512`, 1 prelude / 1 coda layer: smallest config that still exhibits looping behavior; matches Saunshi's scale
- `n_experts=4`: enough to demonstrate MoE routing without bloating params
- `lora_rank=8`: depth-wise LoRA lets each loop iteration adapt slightly without breaking weight-sharing
- `max_seq_len=32`: tight bound; d=5 addition fits in ~18 chars

## References

- OpenMythos GitHub (Kye Gomez): https://github.com/kyegomez/OpenMythos
- Claude Mythos Preview (Anthropic, 2026-04-07): https://red.anthropic.com/2026/mythos-preview/
- Project Glasswing: https://www.anthropic.com/glasswing
- Reasoning with Latent Thoughts (Saunshi et al., ICLR 2025): https://arxiv.org/abs/2502.17416
- Scaling up Test-Time Compute with Latent Reasoning (Geiping et al., Huginn): https://arxiv.org/abs/2502.05171
- Testing the OpenMythos Hypothesis (Micheal Bee): https://medium.com/@mbonsign/testing-the-openmythos-hypothesis-emergent-subspace-selectivity-in-looped-transformers-711f8ca0236c
- Parcae: Scaling Laws for Stable Looped Language Models: https://arxiv.org/abs/2604.12946
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers: https://arxiv.org/abs/2604.07822

## Tomorrow: Day 8

A follow-up to Day 7, pushing looped thinking one step further into something harder…