Most language models today are built around the Transformer paradigm.
That makes sense.
Transformers work.
They scale.
They dominate modern NLP.
But I wanted to explore a different question:
What if language generation does not need to be modeled as attention over a context window?
What if a model could generate language by carrying an evolving latent state through a learned geometry?
That is the idea behind DRM Language Emitter.
Repository:
https://github.com/gnai-creator/drm-language-emitter
DRM Language Emitter is an experimental, geometry-first language model lab.
It is not a Transformer.
Inside the DRM model, it does not use:
nn.MultiheadAttention
Instead, it treats language generation as controlled motion through a learned relational manifold.
The basic flow is:
token
-> latent state z_t
-> active directions
-> learned relational metric
-> controlled latent motion
-> next latent state z_{t+1}
-> token logits
The model is still autoregressive.
But its memory is not attention over a token sequence.
Its memory is the evolving latent state.
The working hypothesis is:
Language generation can be modeled as motion through a learned relational state space.
That means the model does not simply ask:
Which previous tokens should I attend to?
It asks something closer to:
Where am I in latent space?
Which directions are active?
How expensive is movement under the learned metric?
How should the state move before emitting the next token?
This is why I call it a geometry-first language emitter.
The architecture can be summarized as:
input_ids
  |
TokenEmbedding
  |
for each time step:
  |
  z_t
  |
  DirectionField(z_t)
    -> directions V(z_t)
    -> gates a(z_t)
    -> effective active dimension dimD
  |
  RelationalMetric(z_t)
    -> diag + U U^T
  |
  DRMFlow(z_t, token_embedding, directions, gates)
    -> dz
  |
  Metric action g_z(dz, dz)
  |
  StateUpdater
    -> z_{t+1}
  |
  LanguageEmitter(z_{t+1})
    -> logits
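To make the metric step concrete: if the metric really has the form diag + U U^T, the action g_z(dz, dz) splits into a diagonal term plus the squared norm of U^T dz, so the full matrix never has to be built. The sketch below is a minimal illustration under that assumption. The function name, argument layout, and plain-list representation are hypothetical and are not taken from the repository.

```python
def metric_action(diag, U, dz):
    """Return dz^T (diag(diag) + U U^T) dz for plain Python lists.

    diag: length-d list of positive numbers (assumed diagonal part)
    U:    d x r nested list (assumed low-rank factor)
    dz:   length-d latent displacement
    """
    # Diagonal term: sum_i diag_i * dz_i^2
    diag_term = sum(d_i * v * v for d_i, v in zip(diag, dz))
    # Low-rank term: ||U^T dz||^2, computed one column of U at a time
    rank = len(U[0]) if U else 0
    proj = [sum(U[i][k] * dz[i] for i in range(len(dz))) for k in range(rank)]
    low_rank_term = sum(p * p for p in proj)
    return diag_term + low_rank_term


diag = [1.0, 2.0, 0.5]
U = [[1.0], [0.0], [2.0]]   # rank-1 factor, purely illustrative
dz = [1.0, 1.0, 1.0]
action = metric_action(diag, U, dz)
print(action)
```

With a positive diagonal, the action is zero only for a zero displacement, which is what lets it act as a cost of movement.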
A minimal conceptual version looks like this:
for token in sequence:
    embedding = token_embedding(token)
    directions, gates = direction_field(z)
    metric = relational_metric(z)
    dz = drm_flow(z, embedding, directions, gates)
    action = metric_action(metric, dz)
    z = state_updater(z, dz)
    logits = language_emitter(z)
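The conceptual loop can be turned into something runnable with toy stand-ins. The sketch below uses random, untrained parameters and pure Python lists. Every component body (the sigmoid gates, the gated sum of directions, the tanh update, the constant diagonal metric) is an assumption chosen for brevity, not the repository's implementation. Only the call order mirrors the loop above.

```python
import math
import random

random.seed(0)
DIM, VOCAB, K = 4, 5, 2  # latent size, vocabulary size, number of directions

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical stand-in parameters (random, untrained).
EMBED = rand_matrix(VOCAB, DIM)                    # token embedding table
W_DIR = [rand_matrix(DIM, DIM) for _ in range(K)]  # one linear map per direction
W_GATE = rand_matrix(K, DIM)                       # gate logits
W_OUT = rand_matrix(VOCAB, DIM)                    # emitter weights
DIAG = [1.0] * DIM                                 # constant diagonal metric, for brevity

def direction_field(z):
    directions = [matvec(W, z) for W in W_DIR]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(W_GATE, z)]
    return directions, gates

def drm_flow(z, embedding, directions, gates):
    # Toy flow: token embedding plus a gated sum of directions (z itself is unused here).
    dz = list(embedding)
    for a, v in zip(gates, directions):
        dz = [d + a * v_i for d, v_i in zip(dz, v)]
    return dz

def metric_action(diag, dz):
    return sum(g * d * d for g, d in zip(diag, dz))

def state_updater(z, dz, step=0.5):
    return [math.tanh(z_i + step * d) for z_i, d in zip(z, dz)]

def language_emitter(z):
    return matvec(W_OUT, z)

def run(tokens):
    z = [0.0] * DIM  # the only memory carried between tokens
    actions, logits = [], None
    for token in tokens:
        embedding = EMBED[token]
        directions, gates = direction_field(z)
        dz = drm_flow(z, embedding, directions, gates)
        actions.append(metric_action(DIAG, dz))
        z = state_updater(z, dz)
        logits = language_emitter(z)
    return z, logits, actions

z, logits, actions = run([0, 3, 1])
print(len(z), len(logits), actions)
```

The point of the toy is structural: the state `z` has a fixed size no matter how long the token sequence is, and each step yields a loggable scalar (the metric action) alongside the logits.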
The important part is that the model has an explicit internal geometry.
It can log and measure:
This makes the model interesting not only as a generator, but also as an object of study.
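One quantity of this kind is the effective active dimension derived from the gates. How the repository computes dimD is not stated here, so the sketch below swaps in a common choice, the participation ratio (sum a)^2 / sum a^2, purely as an assumption. The function name and the `eps` parameter are hypothetical.

```python
def effective_dimension(gates, eps=1e-12):
    """Participation ratio (sum a)^2 / sum a^2 of non-negative gate values.

    Assumed definition for illustration; not necessarily the repository's dimD.
    """
    total = sum(gates)
    squares = sum(a * a for a in gates)
    return (total * total) / (squares + eps)


uniform = effective_dimension([1.0, 1.0, 1.0, 1.0])  # four equally active directions
peaked = effective_dimension([1.0, 0.0, 0.0, 0.0])   # a single active direction
print(uniform, peaked)
```

Under this definition, equal gates give a value close to the number of directions, and a single dominant gate gives a value close to one, so the number can be tracked per token as a rough measure of how many directions the state is using.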
A Transformer is the correct baseline.
That is why the repository includes tiny Transformer comparisons.
But the goal of DRM is not to replace Transformers by declaration.
The goal is to test whether a different computational primitive can be useful in small regimes.
The Transformer primitive is attention.
The DRM primitive is controlled latent motion under a learned metric.
These are very different assumptions.
A Transformer builds context by looking backward.
DRM carries context by evolving state forward.
A Transformer computes token-token interactions.
DRM computes state-motion-emission dynamics.
Why geometry?
Because it gives us measurable structure.
If language is treated as a trajectory, we can ask questions like:
This opens the door to diagnostics that are harder to express in a standard black-box token predictor.
The goal is not mystical geometry.
The goal is measurable geometry.
The repository contains:
src/drm_language_emitter/   DRM model package
transformer/                tiny Transformer baseline
world_model/                tiny symbolic world-model baseline
scripts/                    training, generation, evaluation, sweeps, dashboards
configs/                    DRM and benchmark configs
docs/                       math, limitations, competition notes, benchmark artifacts
tests/                      smoke and invariant tests
The project is CPU-runnable.
CUDA is optional.
Install:
pip install -e .
Train a tiny DRM model:
python scripts/train_tiny.py \
--config configs/tiny.yaml \
--text data/tiny.txt
Generate text:
python scripts/generate.py \
--checkpoint runs/tiny/drm_tiny.pt \
--prompt "DRM "
Run geometry diagnostics:
python scripts/eval_geometry.py \
--checkpoint runs/tiny/drm_tiny.pt
python scripts/eval_geodesic_paths.py \
--checkpoint runs/tiny/drm_tiny.pt
The repository also includes a small symbolic benchmark.
This benchmark compares three model families: DRM, a tiny Transformer, and a tiny symbolic world model.
The task is a deterministic symbolic gridworld serialized as text.
The models need to predict symbolic transitions such as:
state + action -> next state + reward + done
This is not visual world modeling.
This is not a benchmark against large multimodal world models.
It is a tiny symbolic text-world designed to test whether models can learn discrete dynamics expressed as language.
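As an illustration of what such a task looks like, here is a minimal sketch of a deterministic gridworld transition serialized as one line of text. The serialization format, the goal cell, the reward scheme, and the wall behavior are all hypothetical. Only the 5x5 grid size matches the dataset command used in this post, and the repository's actual format may differ.

```python
# Hypothetical serialization; the repository's actual text format may differ.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action, goal=(4, 4), grid_size=5):
    """Deterministic transition: returns (next_state, reward, done)."""
    dx, dy = MOVES[action]
    # Clamp to the grid so that moving into a wall leaves the agent in place.
    x = min(max(state[0] + dx, 0), grid_size - 1)
    y = min(max(state[1] + dy, 0), grid_size - 1)
    next_state = (x, y)
    done = next_state == goal
    reward = 1 if done else 0
    return next_state, reward, done

def serialize(state, action, next_state, reward, done):
    return (f"state {state[0]},{state[1]} action {action} -> "
            f"next {next_state[0]},{next_state[1]} reward {reward} done {int(done)}")

s = (3, 4)
ns, r, d = step(s, "right")
line = serialize(s, "right", ns, r, d)
print(line)
```

A language model trained on lines like this sees only characters, but a correct prediction requires getting the whole right-hand side exactly right, which is what makes exact-match metrics meaningful here.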
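The benchmark reports next-state exact match, rollout exact match, best cross-entropy (CE), invalid state rate, and parameter count.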
This is important because low loss alone does not necessarily mean correct symbolic dynamics.
A model can learn token-level regularities while still failing to predict exact state transitions.
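The difference between token-level plausibility and exact correctness can be sketched with two simple scores over predicted next-state strings. The "x,y" state format, the validity rule, and the function names below are assumptions for illustration. The repository's own metric definitions are not reproduced here.

```python
import re

STATE_RE = re.compile(r"^\d,\d$")  # hypothetical "x,y" state format, single digits

def is_valid_state(text, grid_size=5):
    """A state is valid if it parses as "x,y" and lies inside the grid."""
    if not STATE_RE.match(text):
        return False
    x, y = (int(p) for p in text.split(","))
    return x < grid_size and y < grid_size

def score(predictions, targets, grid_size=5):
    """Return (next-state exact match, invalid state rate) over paired strings."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    exact = sum(pred == target for pred, target in zip(predictions, targets))
    invalid = sum(not is_valid_state(pred, grid_size) for pred in predictions)
    n = len(targets)
    return exact / n, invalid / n

preds = ["1,2", "4,4", "7,0", "3,"]
targets = ["1,2", "4,3", "2,0", "3,1"]
exact_rate, invalid_rate = score(preds, targets)
print(exact_rate, invalid_rate)
```

In this toy data the second prediction is a perfectly valid state that differs from the target by one character: it costs little in token-level loss but counts as a full miss for exact match.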
The completed benchmark produced:
runs: 72
aggregate rows: 24
Top results by next-state exact match:
| Model | Steps | Family | Next-state exact match | Rollout exact match | Best CE | Invalid state rate | Params |
|---|---|---|---|---|---|---|---|
| drm_tiny | 2000 | DRM | 0.0751 | 0.0058 | 0.5511 | 0.1328 | 92,710 |
| transformer_tiny_220k | 3000 | Transformer | 0.0563 | 0.0000 | 0.4008 | 0.0026 | 220,208 |
| transformer_tiny_93k | 2000 | Transformer | 0.0516 | 0.0000 | 0.4594 | 0.2969 | 93,872 |
| world_model_tiny | 2000 | World Model | 0.0476 | 0.0000 | 0.2573 | 0.4668 | 102,051 |
| world_model_tiny | 3000 | World Model | 0.0415 | 0.0000 | 0.2497 | 0.4668 | 102,051 |
The most interesting part is not that DRM “wins everything”.
It does not.
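The result is more nuanced.
Among the top rows, DRM has the highest next-state exact match (0.0751) and the only nonzero rollout exact match (0.0058), but also the highest cross-entropy (0.5511).
The world model reaches the lowest cross-entropy (0.2497) but the highest invalid state rate (0.4668).
The 220k-parameter Transformer has by far the lowest invalid state rate (0.0026).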
So the honest interpretation is:
DRM shows an early signal on symbolic next-state prediction, but the benchmark is still diagnostic, not decisive.
For me, the most important takeaway is:
Low token-level cross-entropy does not automatically imply correct symbolic transition modeling.
That matters for world-model-like tasks.
If a model is supposed to represent dynamics, then we should not only ask whether it predicts likely tokens.
We should also ask whether it predicts valid states, exact transitions, and coherent rollouts.
I am not claiming that DRM is better than Transformers in general.
I am not claiming that DRM is better than world models in general.
I am not claiming that this benchmark says anything about large multimodal world models.
I am not claiming robust long-horizon planning.
This is a small research scaffold.
The results are early.
The exact-match values are still low.
The model needs more work.
DRM Language Emitter is a functional non-Transformer language model prototype.
It has explicit, measurable geometry.
It can be compared against Transformer and symbolic world-model baselines.
And in a tiny symbolic text-world benchmark, it showed an interesting signal on next-state exact match.
That is enough to keep investigating.
Generate the dataset:
python scripts/make_tiny_world_dataset.py \
--output-root data/tiny_world \
--seed 1 \
--grid-size 5 \
--num-train 20000 \
--num-val 2000 \
--max-rollout-len 8
Run the sweep:
python scripts/sweep_world_model_competition.py \
--steps 1000 2000 3000 \
--seeds 1 2 3 \
--dataset-root data/tiny_world \
--output-root runs/world_model_competition
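The sweep arguments can be checked against the run counts reported earlier (72 runs, 24 aggregate rows). The arithmetic below assumes that every model configuration is run at every step count and seed, and that aggregation averages over seeds. Both are assumptions, and the implied number of configurations is inferred, not stated in this post.

```python
steps = [1000, 2000, 3000]   # from --steps
seeds = [1, 2, 3]            # from --seeds
total_runs = 72              # reported above
aggregate_rows = 24          # reported above

runs_per_config = len(steps) * len(seeds)      # runs per model configuration
configs = total_runs // runs_per_config        # implied number of configurations
rows_if_seed_averaged = configs * len(steps)   # one row per (configuration, steps)

print(runs_per_config, configs, rows_if_seed_averaged)
```

Under those assumptions the numbers are consistent: 9 runs per configuration implies 8 configurations, and 8 configurations times 3 step counts gives the 24 aggregate rows.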
Generate the dashboard:
python scripts/make_world_model_dashboard.py \
--root runs/world_model_competition \
--title "DRM vs Transformer vs Tiny Symbolic World Model"
The next things I want to improve are:
This project started from a simple intuition:
Maybe language generation can be treated as movement.
Not metaphorically.
Computationally.
A token enters.
A state moves.
A geometry shapes the motion.
A new token is emitted.
That is DRM Language Emitter.
Repository:
https://github.com/gnai-creator/drm-language-emitter
Feedback, criticism, reproduction attempts, and benchmark suggestions are welcome.