After looking into it a bit, this is how I’d read it:
Short version
I would frame this primarily as a custom decoding / probability-reweighting method, with an embedding-space semantic field as the guidance signal.
The strongest next step, in my opinion, would be to make the mechanism easier to inspect rather than trying to prove the whole system at once:
- add a minimal `LogitsProcessor`-compatible path,
- print the top boosted / nearest tokens per body,
- separate universe-only, local-only, and combined modes,
- add shuffled-centroid / random-cluster / no-IDF / uniform-mass ablations,
- compare against stronger decoding baselines than temperature alone,
- report repetition, drift, diversity, and latency together.
I would avoid claiming too early that “gravity replaces temperature.” A safer and more testable framing is:
this adds a semantic-field reweighting term to the next-token distribution; now test which part of that term is actually carrying the effect.
My high-level read
The core mechanism seems to be a probability reweighting rule over the model’s next-token distribution. In the repo description, the model first produces logits, then the probabilities are multiplied by something like a semantic force term and renormalized.
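Conceptually, if the method is doing something like:

$$p'(t) \propto p(t)\,\bigl(1 + \mathrm{force}(t)\bigr)$$

then a logit-side implementation can be written as:

$$\mathrm{scores}'(t) = \mathrm{scores}(t) + \log\bigl(1 + \mathrm{force}(t)\bigr)$$

After renormalization this matches the probability-side rule, as long as force > -1 and the term is added to the same scores that define p. If temperature or truncation is applied in between, the two can differ.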
That makes the method fit pretty naturally into the Hugging Face generation vocabulary: custom decoding, guided sampling, or a custom `LogitsProcessor`, rather than a new trained model.
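As a quick sanity check on the two forms above, here is a small pure-Python sketch with a made-up 4-token vocabulary. The function names `reweight_probs` and `reweight_logits` are mine, not from the repo, and the numbers are illustrative only. It just confirms that multiplying probabilities by `1 + force` and adding `log1p(force)` to the logits give the same distribution once both are normalized.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_probs(probs, force):
    """Probability-side rule: p' proportional to p * (1 + force)."""
    boosted = [p * (1.0 + f) for p, f in zip(probs, force)]
    total = sum(boosted)
    return [b / total for b in boosted]


def reweight_logits(scores, force):
    """Logit-side rule: scores' = scores + log(1 + force)."""
    return [s + math.log1p(f) for s, f in zip(scores, force)]


# Toy values, assumed for illustration only.
scores = [2.0, 1.0, 0.5, -1.0]
force = [0.0, 0.5, 2.0, 0.0]

via_probs = reweight_probs(softmax(scores), force)
via_logits = softmax(reweight_logits(scores, force))

# Largest disagreement between the two routes (floating-point noise only).
max_gap = max(abs(a - b) for a, b in zip(via_probs, via_logits))
print(max_gap)
```

The agreement only holds when the boost is applied to the same scores that define the base distribution, so the position of the processor relative to temperature is worth logging.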
The interesting part is not just the gravity metaphor. To me, the more important decomposition is:

| Component | What it may contribute | What should be tested separately |
| --- | --- | --- |
| base LM probability | keeps the model’s own distribution | whether steering overrides or gently modifies |
| semantic bodies | embedding-space clusters / centroids | whether real geometry matters |
| universe field | global vocabulary-level semantic structure | whether static bodies help by themselves |
| local bodies | prompt/generated-context bodies | whether local feedback helps or collapses |
| IDF / mass weighting | common-token suppression and body strength | whether IDF or mass is doing most of the work |
| AdaptiveG | feedback control of gravity strength | whether it stabilizes generation |
| persistence | memory-like reuse of bodies | useful, but probably a separate evaluation axis |

So I would split the claims. For example:
- “The HF integration path works.”
- “The reweighting changes generation.”
- “The semantic geometry matters.”
- “The method improves quality.”
- “AdaptiveG stabilizes generation.”
- “Persistence helps across sessions.”
Those are different claims, and they need different tests.
What I would test first
The main thing I would want to know is not only whether the outputs look better, but what part caused the change.
A compact first-pass ablation plan could be:

| Test | Purpose |
| --- | --- |
| real centroids vs shuffled centroids | checks whether semantic geometry matters |
| real centroids vs random clusters | checks whether cluster structure matters |
| IDF vs no-IDF | checks whether common-token suppression is doing most of the work |
| uniform mass vs size/IDF mass | checks whether the mass function matters |
| universe-only | isolates the global semantic field |
| local-only | checks local context feedback and collapse risk |
| universe + local | checks whether global bodies stabilize local bodies |
| fixed G vs AdaptiveG | checks whether feedback control helps |
| force-scale sweep | checks whether the result is brittle to one chosen G |
| latency per token | checks whether the method is practical |

If I had to pick only two ablations, I would start with:
- real centroids vs shuffled centroids, keeping the same mass/IDF setup;
- IDF vs no-IDF, keeping the same geometry.
Those two would already tell readers a lot about whether the semantic geometry is doing the work, or whether the improvement mostly comes from common-token filtering / rare-token boosting / generic perturbation.
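To make the shuffled-centroid control concrete, here is one possible reading of it as a pure-Python sketch. I am assuming bodies are defined by clusters of token ids and that a centroid is the mean of the member embeddings; the repo may define this differently. Under that assumption, the control reassigns token ids to clusters at random while keeping every cluster's size, so size-based mass stays matched and only the geometry is destroyed. `shuffled_clusters` and the toy embeddings are hypothetical.

```python
import random


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def shuffled_clusters(clusters, seed=0):
    """Reassign token ids to clusters at random, preserving each cluster's size."""
    rng = random.Random(seed)
    all_ids = [tok for members in clusters for tok in members]
    rng.shuffle(all_ids)
    out, start = [], 0
    for members in clusters:
        out.append(all_ids[start:start + len(members)])
        start += len(members)
    return out


# Toy data: six tokens with 2-d embeddings forming two clean groups.
embeddings = {
    0: [1.0, 0.0], 1: [0.9, 0.1], 2: [1.1, -0.1],
    3: [0.0, 1.0], 4: [0.1, 0.9], 5: [-0.1, 1.1],
}
real = [[0, 1, 2], [3, 4, 5]]
control = shuffled_clusters(real, seed=1)

real_centroids = [centroid([embeddings[t] for t in c]) for c in real]
control_centroids = [centroid([embeddings[t] for t in c]) for c in control]
```

If generation quality is about the same with `control_centroids` as with `real_centroids`, that would suggest the geometry is not the part carrying the effect.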
More detailed ablation matrix
A more complete matrix could look like this:

| Condition | What it isolates | Useful observation | Caution |
| --- | --- | --- | --- |
| temperature baseline | simple sampling baseline | basic output comparison | weak baseline alone |
| top-p / nucleus baseline | common practical sampling baseline | whether gravity beats ordinary truncation | tune fairly |
| typical sampling baseline | information-theoretic sampling baseline | repetition / typicality comparison | implementation details matter |
| universe-only, IDF mass | global semantic field | cleanest first semantic-field test | may still be mostly IDF |
| universe-only, no-IDF | geometry without common-token suppression | whether geometry survives without IDF | common tokens may dominate |
| universe-only, uniform mass | mass ablation | whether mass function matters | may weaken intended design |
| shuffled-centroid universe | geometry control | whether semantic location matters | fluent output can still happen |
| random-cluster universe | cluster control | whether clustering matters | match cluster count/size if possible |
| local-only | local feedback | whether context bodies help | can cause lock-in or repetition |
| universe + local | interaction | whether global field stabilizes local field | harder to attribute |
| fixed G | static strength | simpler baseline for gravity | may be brittle |
| AdaptiveG | feedback control | whether controller stabilizes behavior | log the G trajectory |
| deterministic universe mode | geometry-driven deterministic mode | whether geometry alone adds useful diversity | compare carefully with stochastic baselines |

For each condition I would save:
- prompt
- seed
- generated text
- selected tokens
- repeated n-gram rate
- distinct-n
- top boosted tokens
- force magnitude stats
- active body count
- before/after token ranks
- ms/token
- memory usage
- universe build time
The important point is that output samples alone are not enough. A fluent sample does not prove the semantic geometry is doing the work, and a bad sample does not disprove the full method.
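One lightweight way to keep the per-condition fields listed above together is a single record per (condition, prompt, seed), written as JSON lines. This is only a sketch: the `RunRecord` class, its field names, and all values below are placeholders I made up, not the project's format or measurements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    """One row per (condition, prompt, seed); field names are illustrative."""
    condition: str
    prompt: str
    seed: int
    generated_text: str
    selected_token_ids: list
    repeated_ngram_rate: float
    distinct_n: dict
    top_boosted_tokens: list
    force_stats: dict
    active_body_count: int
    rank_before_after: list
    ms_per_token: float
    peak_memory_mb: float
    universe_build_s: float


def to_jsonl_line(record: RunRecord) -> str:
    """Serialize a record as one JSON line so runs can be appended to a log file."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


# Placeholder values only; these are not measurements from the project.
example = RunRecord(
    condition="universe-only",
    prompt="Once upon a time",
    seed=0,
    generated_text="<generated text here>",
    selected_token_ids=[11, 22, 33],
    repeated_ngram_rate=0.0,
    distinct_n={"1": 1.0, "2": 1.0},
    top_boosted_tokens=["<token a>", "<token b>"],
    force_stats={"mean": 0.0, "max": 0.0},
    active_body_count=0,
    rank_before_after=[[5, 2], [1, 1]],
    ms_per_token=0.0,
    peak_memory_mb=0.0,
    universe_build_s=0.0,
)
line = to_jsonl_line(example)
```

Keeping text, diagnostics, and cost in the same row makes it harder to report a good sample without the context that produced it.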
Hugging Face integration path
For Hugging Face users, I think the lowest-friction entry point would be a minimal `LogitsProcessor` version.
It does not need to include the full universe/local/persistence system at first. A first version could just expose the reweighting rule:
- compute or load a `force_magnitudes` vector over the vocabulary,
- apply a logit-side update such as `scores += log1p(force_magnitudes)`,
- let `generate()` handle ordinary sampling controls like top-p / temperature,
- log the top boosted tokens and before/after scores.
The current Transformers docs describe `LogitsProcessor` as the mechanism for modifying generation scores, and the generation strategies guide describes decoding strategy as the way the model selects the next token. That seems like the most natural public interface for this idea.
One implementation detail I would include early: test with `renormalize_logits=True`. The text generation API docs note that some logits processors can break normalization assumptions, and custom processors are exactly the kind of thing where explicit renormalization can make debugging less ambiguous.
Minimal processor-shaped sketch
The smallest version could be something like:
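
```python
import torch
from transformers import LogitsProcessor


class SemanticForceProcessor(LogitsProcessor):
    """Adds log(1 + force) to the next-token scores."""

    def __init__(self, force_magnitudes: torch.Tensor):
        # One force value per vocabulary entry, shape (vocab_size,); must be > -1.
        self.force_magnitudes = force_magnitudes

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        force = self.force_magnitudes.to(scores.device, dtype=scores.dtype)
        # Broadcast the per-token boost across the batch dimension.
        return scores + torch.log1p(force).unsqueeze(0)
```
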
And the generation call could start with something like:

```python
# Assumes `model`, `inputs`, and `force_magnitudes` are already defined.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    logits_processor=[SemanticForceProcessor(force_magnitudes)],
    renormalize_logits=True,
    return_dict_in_generate=True,
    output_scores=True,
)
```

That would not prove the full method, but it would give forum readers a small reproducible path:
- same model,
- same prompt,
- same seed,
- same baseline sampling config,
- custom field on/off,
- saved outputs,
- saved top boosted tokens.
If exact ordering with top-p / temperature matters, I would pin the Transformers version and log both raw and processed values where available. The goal is to inspect what the field changed, not only the final text.
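For the "what did the field change" part, a before/after rank report is probably enough to start. The sketch below is pure Python on toy numbers and assumes the `log1p(force)` update described earlier; `boost_report` is a hypothetical helper, and in practice the raw scores would come from the model at one decoding step.

```python
import math


def ranks(scores):
    """Rank of each token id under the given scores (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = [0] * len(scores)
    for rank, token_id in enumerate(order):
        out[token_id] = rank
    return out


def boost_report(raw_scores, force, top_n=3):
    """For the top_n most-boosted tokens, report their rank before and after the boost."""
    processed = [s + math.log1p(f) for s, f in zip(raw_scores, force)]
    before, after = ranks(raw_scores), ranks(processed)
    by_boost = sorted(range(len(force)), key=lambda i: -force[i])[:top_n]
    return [
        {
            "token_id": i,
            "force": force[i],
            "rank_before": before[i],
            "rank_after": after[i],
        }
        for i in by_boost
    ]


# Toy step: five tokens, one strongly boosted, one mildly boosted.
raw_scores = [3.0, 2.0, 1.0, 0.0, -1.0]
force = [0.0, 0.0, 20.0, 0.5, 0.0]
report = boost_report(raw_scores, force, top_n=2)
for row in report:
    print(row)
```

A token can receive a large force and still not move in rank, which is exactly the kind of case where the final text alone would be misleading.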
Diagnostics I would add before quality claims
Before making strong quality claims, I would add diagnostics for the field itself.
The most useful one:
print the top nearest / top boosted tokens per body.
This catches a very common ambiguity. The abstract field may be intended to be semantic, but the actual boosted tokens might be common words, punctuation, whitespace-prefixed fragments, or tokenizer artifacts.
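A rough way to check for this is to bucket the top boosted token strings into a few categories. The sketch below is a heuristic only: it assumes GPT-2-style token strings where a leading space shows up as `Ġ` (or a literal space after decoding), the function-word list is a tiny made-up sample, and `categorize` will misfile some tokens, for example a capitalized word at the start of a text counts as a fragment.

```python
import string
from collections import Counter

# Tiny illustrative list; a real check would use a fuller stopword list.
FUNCTION_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "that"}


def categorize(token):
    """Heuristic category for a token string; 'Ġ' or ' ' marks a leading space."""
    if token.startswith("<|") and token.endswith("|>"):
        return "special"
    has_space = token[:1] in ("Ġ", " ")
    core = token[1:] if has_space else token
    if core == "":
        return "whitespace"
    if all(ch in string.punctuation for ch in core):
        return "punctuation"
    if core.lower() in FUNCTION_WORDS:
        return "function_word"
    if core.isalpha():
        # A leading space suggests the start of a word; no space suggests a continuation piece.
        return "word_start" if has_space else "fragment"
    return "other"


def category_counts(tokens):
    """Count how many tokens fall into each heuristic category."""
    return Counter(categorize(t) for t in tokens)


# Hypothetical top-boosted list for one body.
top_boosted = ["Ġthe", "Ġgalaxy", "ing", ",", "<|endoftext|>", "Ġ", "Ġ42"]
counts = category_counts(top_boosted)
print(dict(counts))
```

If most of a body's top boosts land in `function_word`, `punctuation`, or `fragment`, the body is probably not acting as a semantic attractor, whatever its centroid looks like.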
I would log:

| Diagnostic | Why it matters |
| --- | --- |
| top nearest tokens per body | checks whether the body is interpretable |
| top boosted tokens after IDF/mass | shows what actually affects generation |
| raw cosine similarity distribution | checks whether the field is flat or hub-like |
| force magnitude distribution | checks whether one body dominates |
| base probability before boost | distinguishes steering from overriding |
| probability/rank after boost | measures actual decoding effect |
| selected token before/after rank | shows whether sampled token was materially affected |
| token category counts | detects function words, punctuation, fragments |
| active body count | helps interpret behavior and latency |

This is important because cosine geometry in transformer embedding spaces can be noisy. That does not mean cosine distance is unusable. It only means the geometry contribution should be diagnosed rather than assumed. Relevant background includes work on transformer representation anisotropy, such as Anisotropy Is Inherent to Self-Attention in Transformers, and work on rogue dimensions affecting similarity measures in transformer LMs.
Geometry and tokenizer failure modes to check
Some concrete failure modes I would check:

| Failure mode | Symptom | Diagnostic |
| --- | --- | --- |
| common-token attraction | top boosts are “the”, “a”, “is”, punctuation, EOS | top boosted token dump, IDF/no-IDF comparison |
| tokenizer artifact attraction | boosted tokens are fragments rather than meaningful words | token string categories |
| whitespace-prefix effects | many top tokens are space-prefixed artifacts | category counts |
| anisotropy | many tokens have very similar cosine values | cosine distribution |
| hubness | one centroid attracts many unrelated tokens | nearest-neighbor concentration |
| over-strong force | repetition / topic lock-in | G sweep, repeated n-grams |
| local feedback loop | generated terms reinforce themselves | local-only long generation |
| base LM prior hiding the effect | shuffled control still looks fluent | shuffled/random controls |

This is where IDF and mass design may be genuinely important. But I would still separate:
- geometry,
- IDF,
- mass,
- force scale,
- local feedback.
Otherwise it is hard to know which part is responsible.
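For the anisotropy and hubness rows in particular, two cheap numbers go a long way: the mean pairwise cosine between token vectors, and the share of tokens whose nearest centroid is the single most popular one. The sketch below is pure Python on toy 2-d vectors; the helper names and data are made up, and a real run would use the model's embedding matrix with a vectorized library.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def nearest_body_counts(token_vecs, centroids):
    """Count how many tokens have each centroid as their cosine-nearest body."""
    counts = Counter()
    for vec in token_vecs:
        sims = [cosine(vec, c) for c in centroids]
        counts[sims.index(max(sims))] += 1
    return counts


def mean_pairwise_cosine(vecs):
    """Average cosine over all unordered pairs; values near 1 suggest a flat field."""
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)


# Toy vectors sharing one large common component, mimicking an anisotropic space.
token_vecs = [[10.0, 1.0], [10.0, -1.0], [10.0, 0.5], [10.0, -0.5]]
centroids = [[1.0, 0.0], [0.0, 1.0]]

counts = nearest_body_counts(token_vecs, centroids)
hub_share = max(counts.values()) / len(token_vecs)
avg_cos = mean_pairwise_cosine(token_vecs)
print(hub_share, avg_cos)
```

In this toy case every token is captured by the first centroid and all pairwise cosines are close to 1, which is the pattern I would want ruled out before attributing an effect to geometry.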
Evaluation and baselines
I would avoid a temperature-only comparison.
Temperature is a useful knob, but open-ended generation has several strong decoding baselines. I would include at least:
- temperature sampling,
- top-p / nucleus sampling,
- top-k sampling,
- typical sampling,
- possibly Mirostat or another adaptive baseline,
- possibly contrastive search / contrastive decoding if convenient.
The reason is that decoding can fail in different ways:

| Failure mode | Example |
| --- | --- |
| repetition | loops, repeated phrases, repeated n-grams |
| topic lock-in | staying too tightly in one semantic basin |
| drift | fluent but off-topic continuation |
| blandness | generic high-probability text |
| incoherence | diverse but unstable text |
| latency overhead | better text but too slow per token |

Papers like The Curious Case of Neural Text Degeneration, Locally Typical Sampling, Mirostat, and Contrastive Decoding are useful context here.
For metrics, I would not rely on distinct-n alone. It is useful, but diversity can increase while quality gets worse. I would combine:
- repeated n-gram rates,
- distinct-n,
- unique word/token ratio,
- prompt adherence or drift checks,
- saved samples,
- lightweight human inspection,
- possibly MAUVE,
- latency per token,
- memory and active body counts.
MAUVE can be useful as a distribution-level signal, but I would not make it the only evaluation. Automatic metrics and human judgments can disagree, so I would treat it as one piece of evidence rather than the final answer.
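The two cheapest metrics in that list can be computed in a few lines. This sketch uses one common definition of each: distinct-n as unique n-grams over total n-grams, and the repeated n-gram rate as one minus that ratio within a single sequence. Other papers aggregate differently (per sample vs. over a whole corpus), so I would state the definition next to any reported number.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n):
    """Unique n-grams divided by total n-grams (0.0 if there are none)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def repeated_ngram_rate(tokens, n):
    """Fraction of n-gram occurrences that duplicate an n-gram already in the sequence."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


# Toy sequence with an obvious loop.
tokens = "the cat sat on the cat sat".split()
d1 = distinct_n(tokens, 1)
d2 = distinct_n(tokens, 2)
rep2 = repeated_ngram_rate(tokens, 2)
print(d1, d2, rep2)
```

Because these two numbers are complementary within one sequence, the interesting comparisons are across conditions and across generation lengths, not between the two metrics themselves.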
Possible evaluation table
A practical benchmark table could have columns like:

| Column | Meaning |
| --- | --- |
| method | baseline / universe-only / local-only / etc. |
| model | GPT-2, GPT-2 medium, or another public model |
| seed | reproducibility |
| prompt group | story, technical, factual, etc. |
| max_new_tokens | generation length |
| top_p / temperature | sampling config |
| G values | gravity strengths |
| IDF mode | on/off |
| mass mode | uniform/size/IDF |
| active bodies | field complexity |
| repeated 2-gram / 3-gram | repetition |
| distinct-1 / distinct-2 | lexical diversity |
| drift score or notes | prompt adherence |
| ms/token | practical cost |
| sample output | human inspection |

I would also separate short and long generations. A method can look good for 50 tokens but collapse, drift, or become repetitive over 300-500 tokens.
Suggested roadmap
Here is the path I would take if the goal is to make this easier for other HF users to evaluate.
Path A: make it easier to run
- add a minimal reproducible smoke test,
- use a small public model first,
- pin seeds,
- save outputs and diagnostics,
- expose a minimal `LogitsProcessor` path.
Path B: make the mechanism inspectable
- print top boosted / nearest tokens,
- save force distributions,
- save before/after token ranks,
- add common-token / BPE-fragment diagnostics,
- log active body counts.
Path C: make the claim testable
- real vs shuffled centroids,
- random clusters,
- IDF vs no-IDF,
- uniform mass vs IDF/size mass,
- universe-only vs local-only vs combined,
- fixed G vs AdaptiveG,
- force-scale sweeps.
Path D: make the comparison fair
- compare against top-p and typical sampling, not only temperature,
- include repetition and drift metrics,
- include human spot checks,
- report latency and memory.
This gives readers several ways to engage. Someone interested in implementation can try the processor. Someone interested in research can look at the ablations. Someone interested in practical use can look at latency and failure modes.
Bottom line
I think this is a useful direction to explore, but I would make the next iteration less about proving the metaphor and more about exposing the mechanism.
The strongest compact package would be:
minimal `LogitsProcessor` demo + top-boosted-token diagnostics + real/shuffled centroid ablation + no-IDF/uniform-mass ablation + universe/local/AdaptiveG separation + stronger sampling baselines.
That would make it much easier for readers to tell whether the useful part is:
- semantic geometry,
- IDF/common-token suppression,
- mass design,
- adaptive control,
- local-context feedback,
- or just generic perturbation of the next-token distribution.
If that separation is clear, the idea will be much easier to discuss and build on.