[ARTICLE · art-16884] src=dev.to pub= topic=artificial-intelligence verified=true sentiment=↓ negative

We Didn’t Just Train AI on the Internet. We Started Training It on Itself.

A growing portion of the internet is now generated by AI, creating a recursive training loop where models are increasingly trained on their own outputs rather than on original human data. This shift is causing a subtle but critical problem: as training distributions compress and overlap, AI systems are converging in voice and losing the variation and novelty that drove earlier breakthroughs. The preservation of high-quality human-generated data has become a form of infrastructure that may ultimately determine the ceiling on intelligence improvement, not compute or model size.

3 min read · Published May 28, 2026

There’s a quiet assumption in almost every AI discussion right now:

“If we scale compute and models, intelligence will keep improving.”

That assumption is starting to break.

Not loudly.

But structurally.

We’ve optimized for compute like it’s the main constraint.

GPUs. Clusters. Parallelism. Faster training runs.

But there’s a less visible constraint emerging:

We are running out of high-quality human data.

And worse:

We are replacing it with something fundamentally different.

Synthetic content generated by the very models we are training.

Early foundation models had something we are quietly losing:

A mostly human internet.

Not clean. Not structured. Not optimized.

But real.

This wasn’t “data”.

It was compressed human reasoning under constraint.

And it was chaotic in a useful way.

Fast forward to now.

A large and growing portion of the web is:

Individually, none of this looks dangerous.

Collectively, it creates something new:

A dataset increasingly shaped by model behavior, not human behavior.

This is the part most people underestimate:

We are entering a recursive training loop.

Human data → Model training → AI-generated content → New training data

Repeat.

Each cycle slightly reduces:

And increases:

This is not a hypothetical.

This is already happening.
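
The loop can be illustrated with a deliberately tiny simulation. This is a sketch under strong assumptions, not a model of any real training pipeline: the "model" is just a Gaussian fitted to the data, the "human data" is a handful of draws from a standard normal, and the function name and parameters (`simulate_recursive_training`, `sample_size`, `generations`) are made up for illustration. Each generation fits the model to the current data, then replaces the data entirely with samples from that fit.

```python
import random
import statistics


def simulate_recursive_training(generations=200, sample_size=10, seed=0):
    """Toy loop: fit a Gaussian to data, sample from the fit, refit, repeat.

    Returns the fitted standard deviation at each generation.
    """
    rng = random.Random(seed)
    # Generation 0 stands in for "human" data: draws from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    fitted_stdevs = []
    for _ in range(generations):
        # "Train": estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        fitted_stdevs.append(sigma)
        # "Generate": the next dataset is sampled only from the fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return fitted_stdevs


if __name__ == "__main__":
    stdevs = simulate_recursive_training()
    for gen in (0, 50, 100, 199):
        print(f"generation {gen:3d}: fitted stdev = {stdevs[gen]:.3g}")
```

In this toy setup the fitted spread tends to drift toward zero, because every refit sees only a finite sample of the previous model's output and the estimation error compounds. It says nothing about how fast, or whether, a real system degrades.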

There’s a subtle misconception in the field:

More compute = better intelligence

But compute doesn’t fix distribution collapse.

If your dataset slowly shifts toward:

Then scaling just gives you:

Faster convergence to the same middle-of-the-road answer.

Not deeper intelligence.

Just more confident imitation.

If you’ve used multiple LLMs recently, you’ve probably felt it:

They are converging.

Not in capability.

In voice.

This isn’t coincidence.

It’s what happens when training distributions overlap and compress.

The system starts averaging itself.
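
One way to picture what "compress" can mean here is a toy vocabulary experiment. It is a rough sketch, assuming a Zipf-like starting distribution as a stand-in for human text and a "model" that is nothing more than the token frequencies counted in the previous corpus; the names `resample_vocabulary` and `shannon_entropy_bits` are hypothetical. Each generation samples a finite corpus from the current frequencies, then re-estimates the frequencies from that corpus.

```python
import math
import random
from collections import Counter


def shannon_entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def resample_vocabulary(vocab_size=1000, corpus_size=5000, generations=30, seed=0):
    """Repeatedly re-estimate token frequencies from a finite sampled corpus.

    Returns a list of (distinct_tokens, entropy_bits), one per generation.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Assumed starting point: Zipf-like "human" frequencies with a long tail.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    history = []
    for _ in range(generations):
        corpus = rng.choices(tokens, weights=weights, k=corpus_size)
        counts = Counter(corpus)
        history.append((len(counts), shannon_entropy_bits(counts)))
        # The next "model" is just the empirical frequencies of this corpus,
        # so a token that was never sampled gets weight 0 and cannot return.
        weights = [counts[t] for t in tokens]
    return history


if __name__ == "__main__":
    for gen, (distinct, entropy) in enumerate(resample_vocabulary()):
        if gen % 5 == 0:
            print(f"generation {gen:2d}: {distinct} distinct tokens, {entropy:.2f} bits")
```

The rare tokens go first: once a tail token is missed in one generation, nothing in this loop can bring it back. The common tokens survive, which is the toy version of a system averaging itself.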

This is why every major AI lab is quietly doing the same thing:

Because at this point:

High-quality human-generated data is no longer content. It is infrastructure.

And infrastructure determines ceilings.

Not model size.
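
A small sketch of why fresh human data behaves like infrastructure, under the same kind of toy assumptions: a Gaussian "model", a standard normal standing in for the human source, and a hypothetical `human_fraction` parameter for the share of each generation's data that still comes from that source. Nothing here is measured from a real system.

```python
import random
import statistics


def final_spread(human_fraction, generations=500, sample_size=20, seed=0):
    """Refit a Gaussian each generation on a mix of fresh and synthetic samples.

    `human_fraction` is the share of each generation's data drawn from the
    original N(0, 1) source; the rest comes from the previous fitted model.
    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    n_human = round(human_fraction * sample_size)
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    for _ in range(generations):
        # Fit the current data, then build the next dataset from two sources.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_human)]
        data = fresh + synthetic
    return statistics.pstdev(data)


if __name__ == "__main__":
    for fraction in (0.0, 0.25):
        print(f"human_fraction={fraction}: final stdev = {final_spread(fraction):.3g}")
```

With no fresh data the spread in this toy tends to collapse. With even a modest share of fresh samples each generation it tends to stay on the order of the original spread, because the human source keeps re-anchoring the fit.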

People often ask:

“Will AI become too powerful?”

That’s the wrong failure mode.

A more realistic one is subtler:

AI systems becoming increasingly self-referential, trained on echoes of their own outputs.

Once that happens, you start losing:

And those are exactly the ingredients that produced breakthroughs in the first place.

We are likely splitting into two internet layers:

Human-generated: Expensive. Curated. Licensed. Hard to replicate.

Synthetic: Cheap. Scalable. Increasingly self-referential.

And the gap between these two will define model quality more than parameter count ever will.

We often say:

“AI is trained on the internet.”

That’s already outdated.

A more precise version might be:

“AI is now being trained on the internet after it has been shaped by earlier versions of AI.”

That single shift changes the entire system dynamics.

The internet didn’t just train AI.

It gave it structure, tone, and reasoning patterns.

Now AI is starting to feed back into that same system.

And the uncomfortable possibility is this:

We may be entering a phase where intelligence improvement is limited not by compute, but by how long we can preserve uncompressed human signal in a self-referential system.

Once that signal is gone, you don’t just lose data.

You lose variation.

And without variation, intelligence stops compounding.

If this resonates, I originally wrote a short-form version of this idea elsewhere.

It would be interesting to hear other perspectives on this, especially from people building or training models today.
