There’s a quiet assumption in almost every AI discussion right now:
“If we scale compute and models, intelligence will keep improving.”
That assumption is starting to break.
Not loudly.
But structurally.
We’ve optimized for compute like it’s the main constraint.
GPUs. Clusters. Parallelism. Faster training runs.
But there’s a less visible constraint emerging:
We are running out of high-quality human data.
And worse:
We are replacing it with something fundamentally different.
Synthetic content generated by the very models we are training.
Early foundation models had something we are quietly losing:
A mostly human internet.
Not clean. Not structured. Not optimized.
But real.
This wasn’t “data”.
It was compressed human reasoning under constraint.
And it was chaotic in a useful way.
Fast forward to now.
A large and growing portion of the web is:
Individually, none of this looks dangerous.
Collectively, it creates something new:
A dataset increasingly shaped by model behavior, not human behavior.
This is the part most people underestimate:
We are entering a recursive training loop.
Human data → Model training → AI-generated content → New training data
Repeat.
Each cycle slightly reduces:
And increases:
This is not a hypothetical.
This is already happening.
There’s a subtle misconception in the field:
More compute = better intelligence
But compute doesn’t fix distribution collapse.
If your dataset slowly shifts toward: Then scaling just gives you:
faster convergence to the same middle-of-the-road answer Not deeper intelligence.
Just more confident imitation.
If you’ve used multiple LLMs recently, you’ve probably felt it: They are converging.
Not in capability.
In voice.
This isn’t coincidence.
It’s what happens when training distributions overlap and compress.
The system starts averaging itself.
This is why every major AI lab is quietly doing the same thing:
Because at this point:
High-quality human-generated data is no longer content. It is infrastructure.
And infrastructure determines ceilings.
Not model size.
People often ask:
“Will AI become too powerful?”
That’s the wrong failure mode.
A more realistic one is subtler:
AI systems becoming increasingly self-referential, trained on echoes of their own outputs.
Once that happens, you start losing:
And those are exactly the ingredients that produced breakthroughs in the first place.
We are likely splitting into two internet layers:
Expensive. Curated. Licensed. Hard to replicate.
Cheap. Scalable. Increasingly self-referential.
And the gap between these two will define model quality more than parameter count ever will.
We often say:
“AI is trained on the internet.”
That’s already outdated.
A more precise version might be:
“AI is now being trained on the internet after it has been shaped by earlier versions of AI.”
That single shift changes the entire system dynamics.
The internet didn’t just train AI.
It gave it structure, tone, and reasoning patterns.
Now AI is starting to feed back into that same system.
And the uncomfortable possibility is this:
We may be entering a phase where intelligence improvement is limited not by compute, but by how long we can preserve uncompressed human signal in a self-referential system.
Once that signal is gone, you don’t just lose data.
You lose variation.
And without variation, intelligence stops compounding.
If this resonates, I originally wrote the short-form version of this idea here: Would be interesting to hear other perspectives on this—especially from people building or training models today.