16 Days, 4.7M Params, Zero Black Boxes: Building a White-box Chinese Cognition Engine from Scratch


Source: dev.to | Published: 2026-06-16T14:07Z


A developer built a white-box Chinese language engine from scratch in 16 days, designing each of its 4.7 million parameters to have a specific linguistic function and be fully traceable. The project overcame multiple training collapses including mode collapse, repetition collapse, and a gradient chain break that froze the gate for 240 epochs, using a dual-agent debugging system with DeepSeek and Qwen.


Author: Wei Jinqi | June 16, 2026

Every time I use a large language model, the same thought nags at me: I have no idea what's happening inside.

95% accuracy? Great. But which weights fired? What linguistic features were extracted? Did it confuse "bank" (river) with "bank" (financial)? Nobody knows.

So I spent 16 days building a Chinese language engine where every weight has a reason and every decision is traceable.

Instead of training a transformer on terabytes of text and hoping it learns Chinese, I designed each module to handle a specific linguistic function:

Module	Function	Params
P1	Char → Word encoding	96K (frozen)
P3-L	Multi-dimensional attribute annotation	0 (rule engine)
P7	Cross-sentence word routing	226K
Explore+Meta	Learned gating over decode dims	101K
P6	Sentence → Word sequence decoding	4.37M

The modules are chained: P1 encodes → P7 routes → Gate modulates → P6 decodes. Every intermediate state can be inspected.
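One way to picture the chain is as a list of named stages whose outputs are all kept for inspection. The sketch below is only an illustration: the stage bodies are hypothetical placeholders, not the project's modules, and only the stage order (P1, P7, Gate, P6) comes from the description above.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[object], object]]

def run_traced(stages: List[Stage], x: object) -> Tuple[object, Dict[str, object]]:
    """Run stages in order and keep every intermediate output by name."""
    trace: Dict[str, object] = {"input": x}
    for name, fn in stages:
        x = fn(x)
        trace[name] = x
    return x, trace

# Hypothetical stand-ins for the real modules; each just transforms a list.
stages: List[Stage] = [
    ("P1_encode", lambda chars: [ord(c) % 7 for c in chars]),
    ("P7_route", lambda codes: sorted(codes)),
    ("gate", lambda codes: [c * 0.5 for c in codes]),
    ("P6_decode", lambda vals: [v + 1.0 for v in vals]),
]

output, trace = run_traced(stages, "你好世界")
for name, value in trace.items():
    print(f"{name}: {value}")
```

Keeping the trace as a plain dictionary means any stage's output can be printed or compared after the fact, which is the property the chained design is meant to provide.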

Day 1 was smooth. P1 (char→word encoder) and P3 (attribute stack — a rule engine that tags words with person/syntax/semantic/emotion/direction attributes) came together quickly.

Day 2 introduced P7, the cross-sentence router. And everything broke.

I used standard multi-head cross-attention. Every position — regardless of input — routed to the same output word. The dreaded Mode Collapse.

What followed was seven failed fixes:

The breakthrough came when I noticed that Q/K were eye-initialized (set to the identity matrix), so each head saw only one dimension and had zero discrimination power.

v8 (final): Xavier init for Q/K, eye init for V. Added an Explore network (loss → GELU MLP → 64D control signal) and a Meta network (signal + state → per-word gate). Mode collapse solved.
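The difference between the two initializations can be seen without any framework. This is a minimal sketch, assuming a square projection split into heads by rows and hypothetical sizes chosen so that each head has a single row; the helper names are illustrative and not taken from the project.

```python
import math
import random

def eye(n):
    """Identity matrix as nested lists (rows = output dims, cols = input dims)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def xavier_uniform(fan_out, fan_in, rng):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]

def inputs_seen_by_head(weight, head, head_dim):
    """Input dimensions that carry a nonzero weight in this head's rows."""
    rows = weight[head * head_dim:(head + 1) * head_dim]
    return sorted({j for row in rows for j, w in enumerate(row) if w != 0.0})

# Hypothetical sizes chosen so that head_dim == 1.
d_model, n_heads = 8, 8
head_dim = d_model // n_heads
rng = random.Random(0)

w_eye = eye(d_model)
w_xavier = xavier_uniform(d_model, d_model, rng)

eye_view = [inputs_seen_by_head(w_eye, h, head_dim) for h in range(n_heads)]
xavier_view = [inputs_seen_by_head(w_xavier, h, head_dim) for h in range(n_heads)]

print("eye init, head 0 sees dims:", eye_view[0])
print("xavier init, head 0 sees dims:", xavier_view[0])
```

With the identity matrix, head `h` reads only input dimension `h`; with Xavier init, every head mixes all input dimensions from the first step.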

Day 3 built P3-L: 23 groups, 312 independent attention heads, each controlling one attribute dimension. It was trained jointly with P7 through a UnifiedExplore→UnifiedMeta gate.

Day 4 introduced P6: the sentence→word decoder. It was supposed to take a 256D sentence vector and output 16 distinct word embeddings.

It output the same word 16 times. The Repetition Collapse had begun.

Six versions over two days:

`h + pos_embed[i]` per head: the simplest fix won. Each head receives the same `h` but adds a unique learned position embedding. No rep_pen. No residuals. No detach. Just position diversity.
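The effect of the position embedding can be sketched in a few lines. This is a toy illustration, assuming for simplicity that all 16 slots apply the same projection to their input; the dimension of 8, the random values, and the function names are hypothetical, and in the real model `pos_embed` would be learned rather than sampled.

```python
import random

N_SLOTS, DIM = 16, 8   # 16 output words as in the article; DIM is a toy size

def linear(weight, x):
    """Matrix-vector product with weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def decode_slots(h, weight, pos_embed=None):
    """Produce one output vector per slot from the shared sentence vector h."""
    outputs = []
    for i in range(N_SLOTS):
        # Without pos_embed every slot sees exactly the same input.
        x = h if pos_embed is None else [a + b for a, b in zip(h, pos_embed[i])]
        outputs.append(tuple(linear(weight, x)))
    return outputs

rng = random.Random(0)
h = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
weight = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(DIM)]
pos_embed = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_SLOTS)]

collapsed = decode_slots(h, weight)
diverse = decode_slots(h, weight, pos_embed)

print("distinct outputs without pos_embed:", len(set(collapsed)))
print("distinct outputs with pos_embed:", len(set(diverse)))
```

Identical inputs through identical computation can only produce identical outputs, so the per-slot offset is what gives each position something of its own to decode.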

Epoch after epoch, the gate stayed frozen — all 256 dimensions had std=0.0001. Three bugs conspired:

- `explore_mod.weight` zero-initialized → identical signal per dim
- `p3l_act` zero-initialized → sigmoid(0)=0.5 for all dims
- bias init scale=0.1 too small → output stuck at 0.5

Then I found an even worse bug: `gate.item()` was used in loss computation, converting a tensor to a Python float and severing the gradient chain. The gate had been frozen for 240 epochs without anyone noticing.
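The symmetry lock from the zero initializations is easy to reproduce in isolation. The sketch below assumes the gate is a sigmoid over a linear map from the 64D control signal to the 256 gate dimensions; that layout, the random scale of 0.5, and the names are assumptions made for illustration.

```python
import math
import random
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(weight, bias, signal):
    """Per-dimension gate: sigmoid(W @ signal + b)."""
    return [sigmoid(sum(w * s for w, s in zip(row, signal)) + b)
            for row, b in zip(weight, bias)]

GATE_DIMS, SIGNAL_DIMS = 256, 64   # sizes mentioned in the article
rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(SIGNAL_DIMS)]

# Zero init: every dimension computes sigmoid(0) regardless of the signal.
zero_w = [[0.0] * SIGNAL_DIMS for _ in range(GATE_DIMS)]
zero_b = [0.0] * GATE_DIMS
locked = gate(zero_w, zero_b, signal)

# Random init (hypothetical scale): dimensions respond differently.
rand_w = [[rng.gauss(0.0, 0.5) for _ in range(SIGNAL_DIMS)] for _ in range(GATE_DIMS)]
rand_b = [rng.gauss(0.0, 0.5) for _ in range(GATE_DIMS)]
free = gate(rand_w, rand_b, signal)

print("zero init std:", statistics.pstdev(locked))
print("random init std:", statistics.pstdev(free))
```

Checking the standard deviation across gate dimensions, as the article does, is a cheap diagnostic: a value near zero means the dimensions are not yet distinguishable from one another.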

Fix: keep gate as tensor, let gradients flow back through explore and meta. Loss dropped from 0.56 to 0.28 in 3 epochs.
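In PyTorch, `Tensor.item()` returns a plain Python number, so nothing computed from it is tracked by autograd. The same failure can be shown with a tiny hand-written autodiff class; this is a sketch with hypothetical names and values, not the project's code or PyTorch itself.

```python
class Scalar:
    """Tiny reverse-mode autodiff value, a stand-in for a framework tensor."""

    def __init__(self, value, parents=()):
        self.value = float(value)
        self.grad = 0.0
        self._parents = parents  # tuple of (parent Scalar, local gradient)

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.value * other.value,
                      ((self, other.value), (other, self.value)))

    __rmul__ = __mul__

    def item(self):
        """Return a plain float: the result has no link back to the graph."""
        return self.value

    def backward(self):
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node._parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Reverse topological order: each node is processed before its parents.
        for node in reversed(order):
            for parent, local in node._parents:
                parent.grad += node.grad * local


def gate_grad(detach: bool) -> float:
    """Gradient of loss = error * gate with respect to the gate parameter w."""
    w = Scalar(0.5)        # hypothetical gate parameter
    gate = w * 2.0         # stand-in for the explore/meta computation
    error = Scalar(3.0)    # stand-in for the rest of the loss
    loss = error * (gate.item() if detach else gate)
    loss.backward()
    return w.grad


print("grad with .item():", gate_grad(detach=True))
print("grad with tensor kept:", gate_grad(detach=False))
```

The loss value is the same in both cases, which is why this kind of bug can go unnoticed: only the gradient reaching the gate parameter changes.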

I built a dual-agent debugging system: DeepSeek (engineer) proposes fixes, Qwen (reviewer) audits them. They debate until convergence.

The system diagnosed four major bugs, including the gradient chain break. It would have saved days if I'd built it earlier.
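The propose-and-review loop can be sketched as a small control structure. This is a minimal illustration with stub functions in place of the two models; no real API is assumed, and the function names, messages, and round limit are hypothetical.

```python
from typing import Callable, List, Tuple

def debate(propose: Callable[[str, List[str]], str],
           review: Callable[[str], Tuple[bool, str]],
           bug_report: str,
           max_rounds: int = 5) -> Tuple[str, int, bool]:
    """Alternate proposer and reviewer until the reviewer accepts or rounds run out."""
    objections: List[str] = []
    fix = ""
    for round_no in range(1, max_rounds + 1):
        fix = propose(bug_report, objections)
        accepted, objection = review(fix)
        if accepted:
            return fix, round_no, True
        objections.append(objection)
    return fix, max_rounds, False

# Stub agents standing in for real model calls.
def stub_engineer(bug_report: str, objections: List[str]) -> str:
    if objections:
        return "keep gate as a tensor in the loss"
    return "lower the learning rate"

def stub_reviewer(fix: str) -> Tuple[bool, str]:
    if "tensor" in fix:
        return True, ""
    return False, "does not explain why the gate gradient is zero"

fix, rounds, accepted = debate(stub_engineer, stub_reviewer,
                               "gate frozen for 240 epochs")
print(fix, rounds, accepted)
```

Passing the accumulated objections back to the proposer is what makes the exchange a debate instead of independent retries, and the round limit keeps a disagreement from looping forever.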

Key improvements:

- `ord(c) > 32` filter

| Bug | Symptom | Root Cause | Fix |
|---|---|---|---|
| Mode Collapse | All outputs = same word | Q/K eye-init, zero discrimination | Xavier init + diversity architecture |
| Gate Symmetry Lock | All gate dims identical (std=0.0001) | Three zero-initializations | Proper random init for explore, act, bias |
| Gradient Chain Break | Gate not learning for 240 epochs | `.item()` severed gradient | Keep as tensor |
| Repetition Collapse | 16 heads → same word | Parallel heads share identical input | Position embedding V6 |
| CUDA OOM | 25.76 GiB allocated | P1 full cross-attention | Batch encoding (50 words) |
| Space Collapse | Model outputs spaces | HF data formatting | `ord(c) > 32` filter |
| sent_vec Info Loss | Different sentences → similar vectors | Mean pooling | Learnable ±weighted sum |
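The `ord(c) > 32` filter named for the Space Collapse bug keeps only characters whose code point is above that of the ASCII space. A minimal sketch, assuming the filter is applied to raw text; the function name and sample string are hypothetical.

```python
def drop_space_and_controls(text: str) -> str:
    """Keep only characters whose code point is above 32 (ASCII space)."""
    return "".join(c for c in text if ord(c) > 32)

sample = "你 好\n世界\t!"
cleaned = drop_space_and_controls(sample)
print(cleaned)
```

The test removes ASCII spaces, tabs, newlines, and other control characters in one pass. It does not remove the full-width space (U+3000), whose code point is well above 32, so text containing it would need a separate check.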

Metric	Score
Word Accuracy	92.4%
Exact Match	76.3%
Rouge-L F1	93.2
Per-word Cosine	0.96
Speed	14ms/sent (71 sent/s)
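The article does not define how these metrics are computed, so the following is only one plausible reading: word accuracy as position-wise matches over reference words, and exact match as the share of sentences reproduced in full. The function names and toy sentences are hypothetical; the last helper just converts the reported latency into throughput.

```python
from typing import List

def word_accuracy(preds: List[List[str]], refs: List[List[str]]) -> float:
    """Fraction of reference word positions where the prediction matches."""
    correct = total = 0
    for pred, ref in zip(preds, refs):
        total += len(ref)
        correct += sum(1 for p, r in zip(pred, ref) if p == r)
    return correct / total if total else 0.0

def exact_match(preds: List[List[str]], refs: List[List[str]]) -> float:
    """Fraction of sentences whose whole word sequence matches the reference."""
    pairs = list(zip(preds, refs))
    return sum(1 for p, r in pairs if p == r) / len(pairs) if pairs else 0.0

def sentences_per_second(ms_per_sentence: float) -> float:
    """Convert per-sentence latency in milliseconds to throughput."""
    return 1000.0 / ms_per_sentence

# Toy data for illustration only, not results from the project.
refs = [["我", "喜欢", "猫"], ["今天", "下雨"]]
preds = [["我", "喜欢", "狗"], ["今天", "下雨"]]

print("word accuracy:", word_accuracy(preds, refs))
print("exact match:", exact_match(preds, refs))
print("throughput at 14 ms/sent:", int(sentences_per_second(14)))
```

Exact match is always the stricter of the two, since a single wrong word fails the whole sentence, which is consistent with the gap between the two scores in the table.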

Epoch 1 (from scratch, no pretraining): 43.5% word accuracy on held-out exam set. Target: >95% after 1000 epochs.

LLMs are powerful but opaque. When GPT makes a mistake, you can't trace which neurons fired wrong. With V19, you can:

This isn't about beating GPT. It's about building something you can understand completely.

git clone https://github.com/Xuan-yi-yan/V18-cognitive-architecture
cd V18-cognitive-architecture
python download_public_data.py
python train_v19_full.py --data public --epochs 1000 --display 10

Full model card and architecture docs on Hugging Face.

16 days. 7 dead bugs. 4.7 million parameters. Zero black boxes.

That's just how I like it.
