Zero Weights Language Model (MSE-GLM)

wpnews.pro

1. Introduction #

Most language models are built around the same idea: train a neural network on enormous amounts of text, let it adjust billions of floating-point weights until it learns to predict the next word reasonably well, and then sample from a probability distribution at inference time. The model is powerful, but it is also a black box — you cannot point to the weight that caused a particular word to be chosen, and two runs with the same input can produce different output.

The MSE Graph Language Model (MSE-GLM) takes a different approach entirely. Language is represented as a directed graph: tokens are nodes, observed transitions are edges, and inference is graph traversal under a small set of explicit, inspectable rules. There are no learned weights, no gradients, no probability sampling — and because of that, every generation decision can be traced back to the exact rule and candidate set that produced it.

What this is not MSE-GLM is not a transformer competitor for open-domain generation or reasoning. It is an architecture for settings where

guaranteesmatter more than fluency — grammar-constrained generation, embedded AI, audit-trail-required tooling, and any pipeline where reproducibility is non-negotiable.

2. Design Philosophy #

The core bet is that language, in many constrained domains, does not need to be modeled probabilistically. If the valid output space is a finite set of token transitions — all valid SQL clauses, all valid JSON keys for a schema, all valid assembly mnemonics — then a graph that memorizes exactly those transitions can generate correctly constrained output with zero chance of emitting something it never observed, and zero need for a GPU.

Where the graph is genuinely ambiguous — two equally plausible next tokens given the same context — the architecture resolves that ambiguity using principled, inspectable rules rather than a probability sample. That is the core engineering problem this system solves, and the three-matrix design described below is how it does it.

3. Architecture Overview #

Training is a single O(N) pass over the corpus — no backpropagation, no epochs, no GPU. The trained model persists to a self-contained folder of JSON files (vocabulary, edges, bridges, relationships, metadata) that can be loaded and queried on any machine with Python.

4. Tokenizer #

The tokenizer is a from-scratch Byte Pair Encoding (BPE) implementation — the same approach used by GPT-2, but written from the ground up with no external dependencies. It converts raw text into integer token IDs through iterative character-pair merging.

Four reserved special tokens anchor the system:

Token	ID	Role
<PAD>	0	Padding placeholder (reserved)
<UNK>	1	Unknown character fallback
<BOS>	2	Beginning of sequence — prepended to every prompt
<EOS>	3	End of sequence — appended during training only

Sentence boundaries (on . ! ? \n

) are preserved during training so the graph learns where sequences legally end. Streaming training from a file is supported, so corpus size is not bounded by available RAM.

5. The Three Matrices #

The trained model is three compact, array-backed, CSR-indexed structures. All storage uses Python's array.array('i')

— 4 bytes per integer, roughly 7× smaller than equivalent Python lists.

Edge Matrix (E)

A deduplicated list of every adjacent token pair observed in the corpus, sorted by source token and indexed for O(1) successor lookup. This is the bigram graph — it answers "what tokens have I seen follow token X?"

Bridge Matrix (B)

Extends E to three-token context: every observed (source, bridge, target)

triple is stored once, giving the model trigram-equivalent context with no attention mechanism. The key structural innovation is a fourth column, cluster_id

, described in the next section.

Relationship Matrix (R)

The most novel structure. It stores only two columns:

Column	Type	Meaning
triple_id	int	Foreign key into B — which bridge triple this row annotates
relationship_id	int	Which training sentence produced this triple

R contains no copy of the triple's own content. It is a pure many-to-many edge list: a triple shared across multiple training sentences carries one R row per sentence. This is what enables lineage-aware tie-breaking at inference time (see §7) and O(1) batch audit of "every triple belonging to sentence N."

triple_id  |  relationship_id
-----------+-----------------
    1      |       1           ← triple (the→sat, bridge=cat) appeared in sentence 1
    2      |       1
    3      |       1           ← triple (sat→the, bridge=on) — shared
    4      |       1
    5      |       2
    6      |       2
    3      |       2           ← same triple_id 3, now also in sentence 2
    7      |       2

6. Distributional Clustering #

The cluster_id

column in B is a weights-free, symbolic analogue of the distributional hypothesis underlying word2vec and other embedding models: tokens that are interchangeable in the same structural slot are functionally related, without any claim of shared meaning.

Clustering is assigned by a dual-axis rule:

Bridge axis— triples sharing the same(source, target)

pair get a sharedcluster_id

. This groups tokens interchangeable as the middle token (the "bridge"). - Target axis— triples sharing the same(source, bridge)

pair, not already grouped on the bridge axis, get a sharedcluster_id

. This groups tokens interchangeable as the destination. - cluster_id = 0— a triple that matches neither axis with any other triple is left unclustered.

Worked example

the cat sat on the mat.+

the dog sat on the carpet.

• Triples 1 and 5 share (source=the, target=sat) → bridge axis → cluster 1

Members:

cat

and dog

are interchangeable bridges• Triples 4 and 7 share (source=on, bridge=the) → target axis → cluster 2

Members:

mat

and carpet

are interchangeable targets• Triples 2, 3, 6 → cluster_id = 0 (no shared slot with another triple)

A derived T_index

maps each token to the non-zero cluster IDs it participates in. Token relatedness is then defined as |T_index[a] ∩ T_index[b]|

— a set-intersection analogue of cosine similarity that needs no embedding space.

infer_shared_role()

This enables a new type of query: give the model an unordered set of tokens and ask what they have in common structurally:

model.infer_shared_role(["cat", "dog"])

7. Inference Pipeline #

At every generation step the engine knows the current token, the previous token (if available), and a running set active_rels

— the relationship IDs the path has been consistent with so far. It resolves the next token through four ordered stages:

Exact Bridge Match

Find all triples where source=previous and bridge=current. If multiple matches, prefer those whose R lineage intersects active_rels (lineage tie-break). Fall back to storage order only if no lineage match.

Bridge Voting

Every bridge triple from previous casts a vote for its target. A triple whose bridge equals current is weighted 2×. Highest vote wins.

Bigram Voting

Fall back to the Edge Matrix. Every observed successor of current receives an equal vote. Highest wins.

Termination

No candidate found at any stage. Emit <EOS> and stop. Generation also stops if <EOS> itself is selected at any earlier stage.

The lineage-narrowing rule (critical detail)

When a Stage 1 step is resolved, active_rels

is updated by intersecting it with the chosen triple's lineage — not replacing it. This is subtle but essential: a triple shared across several training sentences (e.g. a common clause like sat on the) must not erase the more specific lineage a prior step already established.

active_rels = new_triple_rels  # ← wipes specificity at shared triples

narrowed = active_rels & new_triple_rels
active_rels = narrowed if narrowed else new_triple_rels  # reset only on divergence

Without this fix, the dog

would generate the dog sat on the mat instead of the correct

the dog sat on the, because the shared

carpet* sat → on → the*triple would widen

active_rels

back out to all lineages just before the branch point. This regression is covered by an automated test in the suite.

8. Explainability #

Every generation step is fully traceable. The explain_step()

method and /explain

chat command return the stage, rule, candidate set, and active_rels

for any given step:

you> /explain the | dog
model> next='sat'  stage=1  rule=storage_order_fallback  active_rels={1}

you> /explain sat | on
model> next='the'  stage=1  rule=single_match  active_rels={1}

you> /explain on | the
model> next='carpet'  stage=1  rule=lineage_match  active_rels={1}

The analyse.py

CLI provides full step-by-step traces from the command line:

python3 analyse.py --model runs/demo trace "the dog" --max-tokens 12

The output names the chosen token, stage, rule, and active lineages at every step — making the model fully auditable without any post-hoc interpretation.

9. MSE-GLM vs. Transformers #

Property	MSE-GLM	Transformer
Weights	None	Billions of floats
Training cost	O(N), one pass, CPU	GPU weeks / months
Inference	Deterministic, O(out-degree)	Stochastic, O(seq²)
Explainability	Full — every step traceable	Post-hoc approximation only
Hallucination	Impossible for unseen transitions	Can and does
Context length	(previous, current) + lineage	Thousands of tokens
Semantic understanding	None	Strong
Generalisation	None beyond training data	Strong
RAM / GPU requirement	CPU only, array-backed	GPU required at scale

The two architectures optimize for different things. Transformers win on fluency, generalization, and long-range reasoning. MSE-GLM wins on guarantees, auditability, and resource efficiency. They are not competitors — they are tools for different problems.

10. Use Cases #

Grammar-constrained generation— SQL, JSON, config files, shell commands: any domain where the valid output space is closed and can be observed from a corpus.LLM guardrails— attach MSE-GLM as a structural validator on top of a transformer's output to catch illegal transitions before they reach users.Embedded / edge AI— no GPU, no framework, pure Python + stdlib. Deploy on a Raspberry Pi or a microcontroller-class device.** Compliance-sensitive tooling**— legal, finance, healthcare contexts where every output decision must be auditable by a human reviewer.** Autocomplete engines**— deterministic, fast, constrained to observed patterns.** Distributional similarity without embeddings**—infer_shared_role()

identifies which tokens are interchangeable in a given structural slot, with no embedding model required.

11. Track Record #

Core architecture designed and implemented

Tokenizer (BPE, streaming), Edge Matrix, Bridge Matrix (trigram context), four-stage deterministic inference engine, model orchestrator, corpus analyser. Full SDD written and verified against the implementation.

Lineage-aware tie-breaking

The R matrix (triple_id, relationship_id only — no content duplication) added to resolve Stage 1 ties by sequence lineage rather than arbitrary storage order. Batch audit of any training sentence added at O(1). SDD v2.1 addendum written.

Distributional clustering without embeddings

cluster_id column added to B via bridge-axis + target-axis rules. T_index built alongside for O(1) cluster membership lookup. infer_shared_role() inference mode added — verified live that cat+dog → sat, mat+carpet → (EOS/the).

Critical regression caught and fixed

"the dog" was generating "the dog sat on the mat" instead of carpet. Root cause: active_rels was replaced rather than narrowed at shared triples, losing dog-specific lineage at the branch point. Fixed via intersection. Dedicated regression test added.

regression: confirmed fixed#### train.py, chat.py, production save/load

Self-contained model folder persistence (vocabulary.json, edges.json, bridges.json, relationships.json, meta.json). CLI training pipeline supporting inline text or streamed files. Interactive chat REPL with /explain, /shared, /clusters, /stats commands.

analyse.py — 12 CLI subcommands

corpus, stats, topology, clusters, cluster <id>, relationships, relationship <id>, token, similarity, shared, trace, report — all with optional JSON export. No external dependencies.

56 / 56 automated tests passing

Full regression suite on a 12-sentence multi-lineage corpus (cats/dogs/boys/girls, birds/planes, fish/ducks) verifying every layer: tokenizer, graph construction, R schema, lineage narrowing, determinism, explain_step(), infer_shared_role(), analyser reports, and save/load round-trip.

✅ 56 / 56 passing### System Design Document v2.1

Full architecture specification — 20 sections covering every component,

worked examples, complexity analysis, and implementation notes.

View on GitHub

Download SDD v2.1 (PDF)

12. Download & Run #

The full implementation — tokenizer, graphs, inference engine, training CLI, chat REPL, analysis CLI, and test suite — is on GitHub. No pip installs required beyond the Python standard library.

git clone https://github.com/fodokidza/mse_glm.git
cd mse-glm

python3 train.py --text "your corpus here." --out runs/model --vocab-size 500

python3 chat.py --model runs/model

python3 analyse.py --model runs/model report

python3 test.py

source & further reading

aircityshops.com — original article