# Zero Weights Language Model (MSE-GLM)

> Source: <https://aircityshops.com/index.php?url=city/mse_blog>
> Published: 2026-06-29 19:06:41+00:00

## 1. Introduction

Most language models are built around the same idea: train a neural network on enormous amounts of text, let it adjust billions of floating-point weights until it learns to predict the next word reasonably well, and then sample from a probability distribution at inference time. The model is powerful, but it is also a black box — you cannot point to the weight that caused a particular word to be chosen, and two runs with the same input can produce different output.

The **MSE Graph Language Model (MSE-GLM)** takes a different approach entirely.
Language is represented as a directed graph: tokens are nodes, observed transitions are edges,
and inference is graph traversal under a small set of explicit, inspectable rules.
There are no learned weights, no gradients, no probability sampling — and because of that,
every generation decision can be traced back to the exact rule and candidate set that produced it.

**What this is not** MSE-GLM is not a transformer competitor for open-domain generation or reasoning. It is an architecture for settings where

*guarantees*matter more than fluency — grammar-constrained generation, embedded AI, audit-trail-required tooling, and any pipeline where reproducibility is non-negotiable.

## 2. Design Philosophy

The core bet is that language, in many constrained domains, does not need to be modeled probabilistically. If the valid output space is a finite set of token transitions — all valid SQL clauses, all valid JSON keys for a schema, all valid assembly mnemonics — then a graph that memorizes exactly those transitions can generate correctly constrained output with zero chance of emitting something it never observed, and zero need for a GPU.

Where the graph is genuinely ambiguous — two equally plausible next tokens given the same context — the architecture resolves that ambiguity using principled, inspectable rules rather than a probability sample. That is the core engineering problem this system solves, and the three-matrix design described below is how it does it.

## 3. Architecture Overview

Training is a single O(N) pass over the corpus — no backpropagation, no epochs, no GPU. The trained model persists to a self-contained folder of JSON files (vocabulary, edges, bridges, relationships, metadata) that can be loaded and queried on any machine with Python.

## 4. Tokenizer

The tokenizer is a from-scratch Byte Pair Encoding (BPE) implementation — the same approach used by GPT-2, but written from the ground up with no external dependencies. It converts raw text into integer token IDs through iterative character-pair merging.

Four reserved special tokens anchor the system:

| Token | ID | Role |
|---|---|---|
| <PAD> | 0 | Padding placeholder (reserved) |
| <UNK> | 1 | Unknown character fallback |
| <BOS> | 2 | Beginning of sequence — prepended to every prompt |
| <EOS> | 3 | End of sequence — appended during training only |

Sentence boundaries (on `. ! ? \n`

) are preserved during training so the graph
learns where sequences legally end. Streaming training from a file is supported, so corpus
size is not bounded by available RAM.

## 5. The Three Matrices

The trained model is three compact, array-backed, CSR-indexed structures.
All storage uses Python's `array.array('i')`

— 4 bytes per integer,
roughly 7× smaller than equivalent Python lists.

### Edge Matrix (E)

A deduplicated list of every adjacent token pair observed in the corpus, sorted by source token and indexed for O(1) successor lookup. This is the bigram graph — it answers "what tokens have I seen follow token X?"

### Bridge Matrix (B)

Extends E to three-token context: every observed `(source, bridge, target)`

triple is stored once, giving the model trigram-equivalent context with no attention
mechanism. The key structural innovation is a fourth column, `cluster_id`

,
described in the next section.

### Relationship Matrix (R)

The most novel structure. It stores only two columns:

| Column | Type | Meaning |
|---|---|---|
| triple_id | int | Foreign key into B — which bridge triple this row annotates |
| relationship_id | int | Which training sentence produced this triple |

R contains **no copy of the triple's own content**. It is a pure many-to-many
edge list: a triple shared across multiple training sentences carries one R row per sentence.
This is what enables lineage-aware tie-breaking at inference time (see §7) and
O(1) batch audit of "every triple belonging to sentence N."

```
# Example R matrix — corpus: "the cat sat on the mat." + "the dog sat on the carpet."
triple_id  |  relationship_id
-----------+-----------------
    1      |       1           ← triple (the→sat, bridge=cat) appeared in sentence 1
    2      |       1
    3      |       1           ← triple (sat→the, bridge=on) — shared
    4      |       1
    5      |       2
    6      |       2
    3      |       2           ← same triple_id 3, now also in sentence 2
    7      |       2
```

## 6. Distributional Clustering

The `cluster_id`

column in B is a weights-free, symbolic analogue of the
distributional hypothesis underlying word2vec and other embedding models: tokens that are
interchangeable in the same structural slot are functionally related, without any claim of
shared meaning.

Clustering is assigned by a *dual-axis rule*:

-
**Bridge axis**— triples sharing the same`(source, target)`

pair get a shared`cluster_id`

. This groups tokens interchangeable as the middle token (the "bridge"). -
**Target axis**— triples sharing the same`(source, bridge)`

pair, not already grouped on the bridge axis, get a shared`cluster_id`

. This groups tokens interchangeable as the destination. -
**cluster_id = 0**— a triple that matches neither axis with any other triple is left unclustered.

**Worked example**

*the cat sat on the mat.*+

*the dog sat on the carpet.*

• Triples 1 and 5 share (source=the, target=sat) → bridge axis → cluster 1

Members:

`cat`

and `dog`

are interchangeable bridges• Triples 4 and 7 share (source=on, bridge=the) → target axis → cluster 2

Members:

`mat`

and `carpet`

are interchangeable targets• Triples 2, 3, 6 → cluster_id = 0 (no shared slot with another triple)

A derived `T_index`

maps each token to the non-zero cluster IDs it participates
in. Token relatedness is then defined as
`|T_index[a] ∩ T_index[b]|`

— a set-intersection analogue of cosine similarity
that needs no embedding space.

### infer_shared_role()

This enables a new type of query: give the model an unordered set of tokens and ask what they have in common structurally:

```
model.infer_shared_role(["cat", "dog"])
# → [('sat', 'bridge_axis', {'cluster_id': 1, 'overlap': 2}), ...]
# cat and dog both filled the bridge slot in (the → ___ → sat)
```

## 7. Inference Pipeline

At every generation step the engine knows the **current token**,
the **previous token** (if available), and a running set
`active_rels`

— the relationship IDs the path has been consistent with so far.
It resolves the next token through four ordered stages:

#### Exact Bridge Match

Find all triples where source=previous and bridge=current. If multiple matches, prefer those whose R lineage intersects active_rels (lineage tie-break). Fall back to storage order only if no lineage match.

#### Bridge Voting

Every bridge triple from previous casts a vote for its target. A triple whose bridge equals current is weighted 2×. Highest vote wins.

#### Bigram Voting

Fall back to the Edge Matrix. Every observed successor of current receives an equal vote. Highest wins.

#### Termination

No candidate found at any stage. Emit <EOS> and stop. Generation also stops if <EOS> itself is selected at any earlier stage.

### The lineage-narrowing rule (critical detail)

When a Stage 1 step is resolved, `active_rels`

is updated by
**intersecting** it with the chosen triple's lineage — not replacing it.
This is subtle but essential: a triple shared across several training sentences
(e.g. a common clause like *sat on the*) must not erase the more specific
lineage a prior step already established.

```
# BUG (old): replace active_rels entirely
active_rels = new_triple_rels  # ← wipes specificity at shared triples

# FIX: narrow by intersection
narrowed = active_rels & new_triple_rels
active_rels = narrowed if narrowed else new_triple_rels  # reset only on divergence
```

Without this fix, `the dog`

would generate *the dog sat on the mat*
instead of the correct

*the dog sat on the*, because the shared

**carpet*** sat → on → the*triple would widen

`active_rels`

back out to all lineages just before the branch point. This regression is covered
by an automated test in the suite.
## 8. Explainability

Every generation step is fully traceable. The `explain_step()`

method and
`/explain`

chat command return the stage, rule, candidate set, and
`active_rels`

for any given step:

```
you> /explain the | dog
model> next='sat'  stage=1  rule=storage_order_fallback  active_rels={1}

you> /explain sat | on
model> next='the'  stage=1  rule=single_match  active_rels={1}

you> /explain on | the
model> next='carpet'  stage=1  rule=lineage_match  active_rels={1}
```

The `analyse.py`

CLI provides full step-by-step traces from the command line:

```
python3 analyse.py --model runs/demo trace "the dog" --max-tokens 12
```

The output names the chosen token, stage, rule, and active lineages at every step — making the model fully auditable without any post-hoc interpretation.

## 9. MSE-GLM vs. Transformers

| Property | MSE-GLM | Transformer |
|---|---|---|
| Weights | None | Billions of floats |
| Training cost | O(N), one pass, CPU | GPU weeks / months |
| Inference | Deterministic, O(out-degree) | Stochastic, O(seq²) |
| Explainability | Full — every step traceable | Post-hoc approximation only |
| Hallucination | Impossible for unseen transitions | Can and does |
| Context length | (previous, current) + lineage | Thousands of tokens |
| Semantic understanding | None | Strong |
| Generalisation | None beyond training data | Strong |
| RAM / GPU requirement | CPU only, array-backed | GPU required at scale |

The two architectures optimize for different things. Transformers win on fluency, generalization, and long-range reasoning. MSE-GLM wins on guarantees, auditability, and resource efficiency. They are not competitors — they are tools for different problems.

## 10. Use Cases

**Grammar-constrained generation**— SQL, JSON, config files, shell commands: any domain where the valid output space is closed and can be observed from a corpus.**LLM guardrails**— attach MSE-GLM as a structural validator on top of a transformer's output to catch illegal transitions before they reach users.**Embedded / edge AI**— no GPU, no framework, pure Python + stdlib. Deploy on a Raspberry Pi or a microcontroller-class device.** Compliance-sensitive tooling**— legal, finance, healthcare contexts where every output decision must be auditable by a human reviewer.** Autocomplete engines**— deterministic, fast, constrained to observed patterns.** Distributional similarity without embeddings**—`infer_shared_role()`

identifies which tokens are interchangeable in a given structural slot, with no embedding model required.

## 11. Track Record

#### Core architecture designed and implemented

Tokenizer (BPE, streaming), Edge Matrix, Bridge Matrix (trigram context), four-stage deterministic inference engine, model orchestrator, corpus analyser. Full SDD written and verified against the implementation.

#### Lineage-aware tie-breaking

The R matrix (triple_id, relationship_id only — no content duplication) added to resolve Stage 1 ties by sequence lineage rather than arbitrary storage order. Batch audit of any training sentence added at O(1). SDD v2.1 addendum written.

#### Distributional clustering without embeddings

cluster_id column added to B via bridge-axis + target-axis rules. T_index built alongside for O(1) cluster membership lookup. infer_shared_role() inference mode added — verified live that cat+dog → sat, mat+carpet → (EOS/the).

#### Critical regression caught and fixed

"the dog" was generating "the dog sat on the mat" instead of carpet. Root cause: active_rels was replaced rather than narrowed at shared triples, losing dog-specific lineage at the branch point. Fixed via intersection. Dedicated regression test added.

regression: confirmed fixed#### train.py, chat.py, production save/load

Self-contained model folder persistence (vocabulary.json, edges.json, bridges.json, relationships.json, meta.json). CLI training pipeline supporting inline text or streamed files. Interactive chat REPL with /explain, /shared, /clusters, /stats commands.

#### analyse.py — 12 CLI subcommands

corpus, stats, topology, clusters, cluster <id>, relationships, relationship <id>, token, similarity, shared, trace, report — all with optional JSON export. No external dependencies.

#### 56 / 56 automated tests passing

Full regression suite on a 12-sentence multi-lineage corpus (cats/dogs/boys/girls, birds/planes, fish/ducks) verifying every layer: tokenizer, graph construction, R schema, lineage narrowing, determinism, explain_step(), infer_shared_role(), analyser reports, and save/load round-trip.

✅ 56 / 56 passing### System Design Document v2.1

Full architecture specification — 20 sections covering every component,

worked examples, complexity analysis, and implementation notes.

[View on GitHub](https://github.com/fodokidza/mse_glm)

[Download SDD v2.1 (PDF)](?dl=sdd)

## 12. Download & Run

The full implementation — tokenizer, graphs, inference engine, training CLI, chat REPL, analysis CLI, and test suite — is on GitHub. No pip installs required beyond the Python standard library.

```
git clone https://github.com/fodokidza/mse_glm.git
cd mse-glm

# train a model
python3 train.py --text "your corpus here." --out runs/model --vocab-size 500

# chat with it
python3 chat.py --model runs/model

# analyse it
python3 analyse.py --model runs/model report

# run the tests
python3 test.py
```

© 2026 Clifford Chivhanga. Source code licensed under AGPL-3.0. Commercial licensing available — contact the author.
