{"slug": "olmo-core-engram-graft-small-scale-debug-comparison", "title": "OLMo-core + Engram graft: small-scale debug comparison", "summary": "A debug comparison between a base OLMo3 600M model and an Engram memory variant showed the grafted model achieved lower training and evaluation cross-entropy loss and faster gradient norm stabilization, indicating successful integration and improved early learning behavior.", "body_md": "I ran a `200-step` debug comparison, with `global_batch_size=32`, between a base OLMo3 600M model and the same dense backbone with a DeepSeek-style Engram memory graft.\n\nThe goal was to check whether the custom module was wired correctly, whether FSDP/HSDP wrapping and optimizer handling were stable, and whether the training/eval curves looked coherent.\n\nBase model:\n\nEngram variant:\n\n~1.7B trainable parameters\n\nEngram injected into layers 1 and 5\n\nMost added parameters come from sparse/hash-memory capacity, so total parameter count is not an apples-to-apples proxy for dense active compute.\n\nBoth are trained with the [Dion](https://github.com/microsoft/dion/) optimizer.\n\nUnder the same short debug setup, the Engram variant showed:\n\nlower train CE loss\n\nlower eval CE loss / PPL\n\nslightly faster grad-norm stabilization\n\nThe early signal is encouraging: the Engram graft is training-shaped, stable, and appears to improve early learning behavior in this setup.\n\nCustom architecture work is not just “does the forward pass run?”\n\nFor this integration, the parameter hierarchy, wrapping policy, optimizer handling, memory profile, and training curves all had to line up. 
Earlier versions trained mathematically, but had poor memory behavior because the custom modules were not placed inside the wrapped block hierarchy.\n\nW&B logs: [Weights & Biases](https://wandb.ai/jenwei0312/olmo3-engram-experiments)", "url": "https://wpnews.pro/news/olmo-core-engram-graft-small-scale-debug-comparison", "canonical_source": "https://discuss.huggingface.co/t/olmo-core-engram-graft-small-scale-debug-comparison/177045#post_1", "published_at": "2026-06-21 19:13:50+00:00", "updated_at": "2026-06-21 19:41:57.594549+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-infrastructure"], "entities": ["OLMo", "Engram", "DeepSeek", "Dion", "Microsoft", "Weights & Biases"], "alternates": {"html": "https://wpnews.pro/news/olmo-core-engram-graft-small-scale-debug-comparison", "markdown": "https://wpnews.pro/news/olmo-core-engram-graft-small-scale-debug-comparison.md", "text": "https://wpnews.pro/news/olmo-core-engram-graft-small-scale-debug-comparison.txt", "jsonld": "https://wpnews.pro/news/olmo-core-engram-graft-small-scale-debug-comparison.jsonld"}}