{"slug": "16-days-4-7m-params-zero-black-boxes-building-a-white-box-chinese-cognition-from", "title": "16 Days, 4.7M Params, Zero Black Boxes: Building a White-box Chinese Cognition Engine from Scratch", "summary": "A developer built a white-box Chinese language engine from scratch in 16 days, designing each of its 4.7 million parameters to have a specific linguistic function and be fully traceable. The project overcame multiple training collapses including mode collapse, repetition collapse, and a gradient chain break that froze the gate for 240 epochs, using a dual-agent debugging system with DeepSeek and Qwen.", "body_md": "**Author: Wei Jinqi | June 16, 2026**\n\nEvery time I use a large language model, the same thought nags at me: *I have no idea what's happening inside.*\n\n95% accuracy? Great. But which weights fired? What linguistic features were extracted? Did it confuse \"bank\" (river) with \"bank\" (financial)? Nobody knows.\n\nSo I spent 16 days building a Chinese language engine where **every weight has a reason and every decision is traceable**.\n\nInstead of training a transformer on terabytes of text and hoping it learns Chinese, I designed each module to handle a **specific linguistic function**:\n\n| Module | Function | Params |\n|---|---|---|\n| P1 | Char → Word encoding | 96K (frozen) |\n| P3-L | Multi-dimensional attribute annotation | 0 (rule engine) |\n| P7 | Cross-sentence word routing | 226K |\n| Explore+Meta | Learned gating over decode dims | 101K |\n| P6 | Sentence → Word sequence decoding | 4.37M |\n\nThe modules are chained: **P1 encodes → P7 routes → Gate modulates → P6 decodes**. Every intermediate state can be inspected.\n\nDay 1 was smooth. P1 (char→word encoder) and P3 (attribute stack — a rule engine that tags words with person/syntax/semantic/emotion/direction attributes) came together quickly.\n\nDay 2 introduced P7, the cross-sentence router. And **everything broke**.\n\nI used standard multi-head cross-attention. 
Every position, regardless of input, routed to the same output word. The dreaded **Mode Collapse**.\n\nWhat followed was seven failed fixes.\n\nThe breakthrough came when I noticed Q/K were eye-initialized (identity matrices), meaning each head saw only 1 dimension with zero discrimination power.\n\n**v8 (final)**: Xavier init for Q/K, eye init for V. Added an Explore network (loss → GELU MLP → 64D control signal) and a Meta network (signal + state → per-word gate). Mode collapse solved.\n\nDay 3 built P3-L: 23 groups, 312 independent attention heads, each controlling one attribute dimension. It was trained jointly with P7 via a UnifiedExplore→UnifiedMeta gate.\n\nDay 4 introduced **P6: the sentence→word decoder**. It was supposed to take a 256D sentence vector and output 16 distinct word embeddings.\n\nIt output the same word 16 times. The **Repetition Collapse** had begun.\n\nSix versions over two days. The last, V6, used `h + pos_embed[i]` per head, and the simplest fix won. Each head receives the same `h` but adds a unique learned position embedding. No rep_pen. No residuals. No detach. Just position diversity.\n\nEpoch after epoch, the gate stayed frozen: all 256 dimensions had **std=0.0001**. Three bugs conspired:\n\n- `explore_mod.weight` zero-initialized → identical signal per dim\n- `p3l_act` zero-initialized → sigmoid(0)=0.5 for all dims\n- `bias init scale=0.1` too small → output stuck at 0.5\n\nThen I found an even worse bug: `gate.item()` was used in the loss computation, converting a tensor to a Python float and **severing the gradient chain**. The gate had been frozen for **240 epochs** without anyone noticing.\n\nFix: keep the gate as a tensor and let gradients flow back through explore and meta. Loss dropped from 0.56 to 0.28 in 3 epochs.\n\nI built a dual-agent debugging system: **DeepSeek (engineer)** proposes fixes, **Qwen (reviewer)** audits them. They debate until convergence.\n\nThe system diagnosed four major bugs, including the gradient chain break. 
It would have saved days if I'd built it earlier.\n\nKey improvements included the `ord(c) > 32` filter.\n\n| Bug | Symptom | Root Cause | Fix |\n|---|---|---|---|\n| Mode Collapse | All outputs = same word | Q/K eye-init, zero discrimination | Xavier init + diversity architecture |\n| Gate Symmetry Lock | All gate dims identical (std=0.0001) | Three zero-initializations | Proper random init for explore, act, bias |\n| Gradient Chain Break | Gate not learning for 240 epochs | `.item()` severed gradient | Keep as tensor |\n| Repetition Collapse | 16 heads → same word | Parallel heads share identical input | Position embedding V6 |\n| CUDA OOM | 25.76 GiB allocated | P1 full cross-attention | Batch encoding (50 words) |\n| Space Collapse | Model outputs spaces | HF data formatting | `ord(c) > 32` filter |\n| sent_vec Info Loss | Different sentences → similar vectors | Mean pooling | Learnable ±weighted sum |\n\n| Metric | Score |\n|---|---|\n| Word Accuracy | 92.4% |\n| Exact Match | 76.3% |\n| ROUGE-L F1 | 93.2 |\n| Per-word Cosine | 0.96 |\n| Speed | 14ms/sent (71 sent/s) |\n\nEpoch 1 (from scratch, no pretraining): **43.5%** word accuracy on a held-out exam set. Target: >95% after 1000 epochs.\n\nLLMs are powerful but opaque. When GPT makes a mistake, you can't trace which neurons fired wrong. With V19, you can.\n\nThis isn't about beating GPT. It's about building something **you can understand completely**.\n\n```\ngit clone https://github.com/Xuan-yi-yan/V18-cognitive-architecture\ncd V18-cognitive-architecture\npython download_public_data.py\npython train_v19_full.py --data public --epochs 1000 --display 10\n```\n\nFull model card and architecture docs on [Hugging Face](https://huggingface.co/).\n\n*16 days. 7 dead bugs. 4.7 million parameters. 
Zero black boxes.*\n\n*That's just how I like it.*", "url": "https://wpnews.pro/news/16-days-4-7m-params-zero-black-boxes-building-a-white-box-chinese-cognition-from", "canonical_source": "https://dev.to/xuanyiyan/16-days-47m-params-zero-black-boxes-building-a-white-box-chinese-cognition-engine-from-scratch-503m", "published_at": "2026-06-16 14:07:46+00:00", "updated_at": "2026-06-16 14:17:18.385033+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-research", "developer-tools"], "entities": ["Wei Jinqi", "DeepSeek", "Qwen"], "alternates": {"html": "https://wpnews.pro/news/16-days-4-7m-params-zero-black-boxes-building-a-white-box-chinese-cognition-from", "markdown": "https://wpnews.pro/news/16-days-4-7m-params-zero-black-boxes-building-a-white-box-chinese-cognition-from.md", "text": "https://wpnews.pro/news/16-days-4-7m-params-zero-black-boxes-building-a-white-box-chinese-cognition-from.txt", "jsonld": "https://wpnews.pro/news/16-days-4-7m-params-zero-black-boxes-building-a-white-box-chinese-cognition-from.jsonld"}}