{"slug": "zero-weights-language-model-mse-glm", "title": "Zero Weights Language Model (MSE-GLM)", "summary": "Researchers introduced the Zero Weights Language Model (MSE-GLM), a graph-based architecture that represents language as a directed graph with no learned weights, gradients, or probability sampling. The model uses three compact matrices for deterministic, inspectable generation, targeting constrained domains like grammar-constrained generation and embedded AI where guarantees and reproducibility are critical.", "body_md": "## 1. Introduction\n\nMost language models are built around the same idea: train a neural network on enormous amounts of text, let it adjust billions of floating-point weights until it learns to predict the next word reasonably well, and then sample from a probability distribution at inference time. The model is powerful, but it is also a black box — you cannot point to the weight that caused a particular word to be chosen, and two runs with the same input can produce different output.\n\nThe **MSE Graph Language Model (MSE-GLM)** takes a different approach entirely.\nLanguage is represented as a directed graph: tokens are nodes, observed transitions are edges,\nand inference is graph traversal under a small set of explicit, inspectable rules.\nThere are no learned weights, no gradients, no probability sampling — and because of that,\nevery generation decision can be traced back to the exact rule and candidate set that produced it.\n\n**What this is not:** MSE-GLM is not a transformer competitor for open-domain generation or reasoning. It is an architecture for settings where *guarantees* matter more than fluency: grammar-constrained generation, embedded AI, audit-trail-required tooling, and any pipeline where reproducibility is non-negotiable.\n\n## 2. Design Philosophy\n\nThe core bet is that language, in many constrained domains, does not need to be modeled probabilistically. 
If the valid output space is a finite set of token transitions (all valid SQL clauses, all valid JSON keys for a schema, all valid assembly mnemonics), then a graph that memorizes exactly those transitions can generate correctly constrained output with no possibility of emitting a transition it never observed, and no need for a GPU.\n\nWhere the graph is genuinely ambiguous — two equally plausible next tokens given the same context — the architecture resolves that ambiguity using principled, inspectable rules rather than a probability sample. That is the core engineering problem this system solves, and the three-matrix design described below is how it does it.\n\n## 3. Architecture Overview\n\nTraining is a single O(N) pass over the corpus — no backpropagation, no epochs, no GPU. The trained model persists to a self-contained folder of JSON files (vocabulary, edges, bridges, relationships, metadata) that can be loaded and queried on any machine with Python.\n\n## 4. Tokenizer\n\nThe tokenizer is a from-scratch Byte Pair Encoding (BPE) implementation — the same approach used by GPT-2, but written from the ground up with no external dependencies. It converts raw text into integer token IDs through iterative character-pair merging.\n\nFour reserved special tokens anchor the system:\n\n| Token | ID | Role |\n|---|---|---|\n| `<PAD>` | 0 | Padding placeholder (reserved) |\n| `<UNK>` | 1 | Unknown character fallback |\n| `<BOS>` | 2 | Beginning of sequence; prepended to every prompt |\n| `<EOS>` | 3 | End of sequence; appended during training only |\n\nSentence boundaries (on `. ! ? \\n`) are preserved during training so the graph\nlearns where sequences legally end. Streaming training from a file is supported, so corpus\nsize is not bounded by available RAM.\n\n## 5. 
The Three Matrices\n\nThe trained model is three compact, array-backed, CSR-indexed structures.\nAll storage uses Python's `array.array('i')` (typically 4 bytes per integer),\nroughly 7× smaller than equivalent Python lists.\n\n### Edge Matrix (E)\n\nA deduplicated list of every adjacent token pair observed in the corpus, sorted by source token and indexed for O(1) successor lookup. This is the bigram graph — it answers \"what tokens have I seen follow token X?\"\n\n### Bridge Matrix (B)\n\nExtends E to three-token context: every observed `(source, bridge, target)` triple is stored once, giving the model trigram-equivalent context with no attention\nmechanism. The key structural innovation is a fourth column, `cluster_id`,\ndescribed in the next section.\n\n### Relationship Matrix (R)\n\nThe most novel structure. It stores only two columns:\n\n| Column | Type | Meaning |\n|---|---|---|\n| triple_id | int | Foreign key into B — which bridge triple this row annotates |\n| relationship_id | int | Which training sentence produced this triple |\n\nR contains **no copy of the triple's own content**. It is a pure many-to-many\nedge list: a triple shared across multiple training sentences carries one R row per sentence.\nThis is what enables lineage-aware tie-breaking at inference time (see §7) and\nO(1) batch audit of \"every triple belonging to sentence N.\"\n\n```\n# Example R matrix — corpus: \"the cat sat on the mat.\" + \"the dog sat on the carpet.\"\ntriple_id  |  relationship_id\n-----------+-----------------\n    1      |       1           ← triple (the→sat, bridge=cat) appeared in sentence 1\n    2      |       1\n    3      |       1           ← triple (sat→the, bridge=on) — shared\n    4      |       1\n    5      |       2\n    6      |       2\n    3      |       2           ← same triple_id 3, now also in sentence 2\n    7      |       2\n```\n\n## 6. 
Distributional Clustering\n\nThe `cluster_id` column in B is a weights-free, symbolic analogue of the\ndistributional hypothesis underlying word2vec and other embedding models: tokens that are\ninterchangeable in the same structural slot are functionally related, without any claim of\nshared meaning.\n\nClustering is assigned by a *dual-axis rule*:\n\n- **Bridge axis**: triples sharing the same `(source, target)` pair get a shared `cluster_id`. This groups tokens interchangeable as the middle token (the \"bridge\").\n- **Target axis**: triples sharing the same `(source, bridge)` pair, not already grouped on the bridge axis, get a shared `cluster_id`. This groups tokens interchangeable as the destination.\n- **cluster_id = 0**: a triple that matches neither axis with any other triple is left unclustered.\n\n**Worked example:** *the cat sat on the mat.* + *the dog sat on the carpet.*\n\n- Triples 1 and 5 share (source=the, target=sat) → bridge axis → cluster 1. Members: `cat` and `dog` are interchangeable bridges.\n- Triples 4 and 7 share (source=on, bridge=the) → target axis → cluster 2. Members: `mat` and `carpet` are interchangeable targets.\n- Triples 2, 3, 6 → cluster_id = 0 (no shared slot with another triple).\n\nA derived `T_index` maps each token to the non-zero cluster IDs it participates\nin. Token relatedness is then defined as `|T_index[a] ∩ T_index[b]|`, a set-intersection analogue of cosine similarity\nthat needs no embedding space.\n\n### infer_shared_role()\n\nThis enables a new type of query: give the model an unordered set of tokens and ask what they have in common structurally:\n\n```\nmodel.infer_shared_role([\"cat\", \"dog\"])\n# → [('sat', 'bridge_axis', {'cluster_id': 1, 'overlap': 2}), ...]\n# cat and dog both filled the bridge slot in (the → ___ → sat)\n```\n\n## 7. 
Inference Pipeline\n\nAt every generation step the engine knows the **current token**,\nthe **previous token** (if available), and a running set `active_rels`: the relationship IDs the path has been consistent with so far.\nIt resolves the next token through four ordered stages:\n\n#### Stage 1: Exact Bridge Match\n\nFind all triples where source=previous and bridge=current. If there are multiple matches, prefer those whose R lineage intersects `active_rels` (lineage tie-break). Fall back to storage order only if there is no lineage match.\n\n#### Stage 2: Bridge Voting\n\nEvery bridge triple from previous casts a vote for its target. A triple whose bridge equals current is weighted 2×. Highest vote wins.\n\n#### Stage 3: Bigram Voting\n\nFall back to the Edge Matrix. Every observed successor of current receives an equal vote. Highest wins.\n\n#### Stage 4: Termination\n\nNo candidate found at any stage. Emit `<EOS>` and stop. Generation also stops if `<EOS>` itself is selected at any earlier stage.\n\n### The lineage-narrowing rule (critical detail)\n\nWhen a Stage 1 step is resolved, `active_rels` is updated by\n**intersecting** it with the chosen triple's lineage, not by replacing it.\nThis is subtle but essential: a triple shared across several training sentences\n(e.g. a common clause like *sat on the*) must not erase the more specific\nlineage a prior step already established.\n\n```\n# BUG (old): replace active_rels entirely\nactive_rels = new_triple_rels  # ← wipes specificity at shared triples\n\n# FIX: narrow by intersection\nnarrowed = active_rels & new_triple_rels\nactive_rels = narrowed if narrowed else new_triple_rels  # reset only on divergence\n```\n\nWithout this fix, `the dog` would generate *the dog sat on the mat*\ninstead of the correct *the dog sat on the **carpet***, because the shared *sat → on → the* triple would widen `active_rels` back out to all lineages just before the branch point. This regression is covered\nby an automated test in the suite.\n\n## 8. 
Explainability\n\nEvery generation step is fully traceable. The `explain_step()` method and `/explain` chat command return the stage, rule, candidate set, and `active_rels` for any given step:\n\n```\nyou> /explain the | dog\nmodel> next='sat'  stage=1  rule=storage_order_fallback  active_rels={1}\n\nyou> /explain sat | on\nmodel> next='the'  stage=1  rule=single_match  active_rels={1}\n\nyou> /explain on | the\nmodel> next='carpet'  stage=1  rule=lineage_match  active_rels={1}\n```\n\nThe `analyse.py` CLI provides full step-by-step traces from the command line:\n\n```\npython3 analyse.py --model runs/demo trace \"the dog\" --max-tokens 12\n```\n\nThe output names the chosen token, stage, rule, and active lineages at every step — making the model fully auditable without any post-hoc interpretation.\n\n## 9. MSE-GLM vs. Transformers\n\n| Property | MSE-GLM | Transformer |\n|---|---|---|\n| Weights | None | Billions of floats |\n| Training cost | O(N), one pass, CPU | GPU weeks / months |\n| Inference | Deterministic, O(out-degree) | Stochastic, O(seq²) |\n| Explainability | Full — every step traceable | Post-hoc approximation only |\n| Hallucination | Impossible for unseen transitions | Can and does |\n| Context length | (previous, current) + lineage | Thousands of tokens |\n| Semantic understanding | None | Strong |\n| Generalisation | None beyond training data | Strong |\n| RAM / GPU requirement | CPU only, array-backed | GPU required at scale |\n\nThe two architectures optimize for different things. Transformers win on fluency, generalization, and long-range reasoning. MSE-GLM wins on guarantees, auditability, and resource efficiency. They are not competitors — they are tools for different problems.\n\n## 10. 
Use Cases\n\n- **Grammar-constrained generation**: SQL, JSON, config files, shell commands, or any domain where the valid output space is closed and can be observed from a corpus.\n- **LLM guardrails**: attach MSE-GLM as a structural validator on top of a transformer's output to catch illegal transitions before they reach users.\n- **Embedded / edge AI**: no GPU, no framework, pure Python + stdlib. Deploy on a Raspberry Pi or a microcontroller-class device.\n- **Compliance-sensitive tooling**: legal, finance, healthcare contexts where every output decision must be auditable by a human reviewer.\n- **Autocomplete engines**: deterministic, fast, constrained to observed patterns.\n- **Distributional similarity without embeddings**: `infer_shared_role()` identifies which tokens are interchangeable in a given structural slot, with no embedding model required.\n\n## 11. Track Record\n\n#### Core architecture designed and implemented\n\nTokenizer (BPE, streaming), Edge Matrix, Bridge Matrix (trigram context), four-stage deterministic inference engine, model orchestrator, corpus analyser. Full SDD written and verified against the implementation.\n\n#### Lineage-aware tie-breaking\n\nThe R matrix (triple_id, relationship_id only — no content duplication) added to resolve Stage 1 ties by sequence lineage rather than arbitrary storage order. Batch audit of any training sentence added at O(1). SDD v2.1 addendum written.\n\n#### Distributional clustering without embeddings\n\ncluster_id column added to B via bridge-axis + target-axis rules. T_index built alongside for O(1) cluster membership lookup. infer_shared_role() inference mode added — verified live that cat+dog → sat, mat+carpet → (EOS/the).\n\n#### Critical regression caught and fixed\n\n\"the dog\" was generating \"the dog sat on the mat\" instead of carpet. Root cause: active_rels was replaced rather than narrowed at shared triples, losing dog-specific lineage at the branch point. Fixed via intersection. 
Dedicated regression test added.\n\n#### train.py, chat.py, production save/load\n\nSelf-contained model folder persistence (vocabulary.json, edges.json, bridges.json, relationships.json, meta.json). CLI training pipeline supporting inline text or streamed files. Interactive chat REPL with /explain, /shared, /clusters, /stats commands.\n\n#### analyse.py — 12 CLI subcommands\n\ncorpus, stats, topology, clusters, cluster <id>, relationships, relationship <id>, token, similarity, shared, trace, report — all with optional JSON export. No external dependencies.\n\n#### 56 / 56 automated tests passing\n\nFull regression suite on a 12-sentence multi-lineage corpus (cats/dogs/boys/girls, birds/planes, fish/ducks) verifying every layer: tokenizer, graph construction, R schema, lineage narrowing, determinism, explain_step(), infer_shared_role(), analyser reports, and save/load round-trip.\n\n### System Design Document v2.1\n\nFull architecture specification: 20 sections covering every component, worked examples, complexity analysis, and implementation notes.\n\n[View on GitHub](https://github.com/fodokidza/mse_glm)\n\n[Download SDD v2.1 (PDF)](?dl=sdd)\n\n## 12. Download & Run\n\nThe full implementation — tokenizer, graphs, inference engine, training CLI, chat REPL, analysis CLI, and test suite — is on GitHub. No pip installs are required; the code uses only the Python standard library.\n\n```\ngit clone https://github.com/fodokidza/mse_glm.git\ncd mse_glm\n\n# train a model\npython3 train.py --text \"your corpus here.\" --out runs/model --vocab-size 500\n\n# chat with it\npython3 chat.py --model runs/model\n\n# analyse it\npython3 analyse.py --model runs/model report\n\n# run the tests\npython3 test.py\n```\n\n© 2026 Clifford Chivhanga. Source code licensed under AGPL-3.0. 
Commercial licensing available — contact the author.", "url": "https://wpnews.pro/news/zero-weights-language-model-mse-glm", "canonical_source": "https://aircityshops.com/index.php?url=city/mse_blog", "published_at": "2026-06-29 19:06:41+00:00", "updated_at": "2026-06-29 19:20:08.378598+00:00", "lang": "en", "topics": ["large-language-models", "ai-research"], "entities": ["MSE-GLM", "Byte Pair Encoding", "GPT-2"], "alternates": {"html": "https://wpnews.pro/news/zero-weights-language-model-mse-glm", "markdown": "https://wpnews.pro/news/zero-weights-language-model-mse-glm.md", "text": "https://wpnews.pro/news/zero-weights-language-model-mse-glm.txt", "jsonld": "https://wpnews.pro/news/zero-weights-language-model-mse-glm.jsonld"}}