{"slug": "the-new-information-borders", "title": "The New Information Borders", "summary": "A developer discusses how AI crawlers and robots.txt policies are fragmenting the web's shared information environment. As websites selectively allow or block AI systems like ClaudeBot and GPTBot, retrieval access is diverging, potentially leading to different training corpora for AI models. This fragmentation arises from rational local decisions, not malicious intent, but could undermine the assumption of a common knowledge base.", "body_md": "Recently I came across a discussion about AI crawlers and `robots.txt` files. The conversation centered on a simple question:\n\n*Should website owners allow AI systems to access their content?*\n\nOne proposed configuration looked something like this:\n\n```\nUser-agent: ClaudeBot\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: ChatGPT-User\nDisallow: /\n\nUser-agent: PerplexityBot\nDisallow: /\n```\n\nAt first glance this is a reasonable policy decision.\n\nPerhaps a company has a commercial relationship with one AI vendor and not another. Perhaps it trusts one organization more than another. Perhaps it simply dislikes a particular company and would rather that company not benefit from its content.\n\nThese are all rational decisions. And worth remembering: `robots.txt` is a request, not a wall. It governs the crawlers that choose to honor it. The borders we are about to talk about form through compliance norms and licensing agreements, not through technical enforcement.\n\nThe interesting part is what happens when thousands of organizations make similar decisions at once.\n\nFor most of the modern Internet era, there was an implicit assumption that people were operating from a broadly shared information environment.\n\nSearch engines differed in quality. Ranking algorithms differed. Some sources were easier to discover than others. 
But in general, if two people searched for information on a topic, there was a good chance they were drawing from many of the same underlying sources.\n\nThe web functioned as a largely shared corpus of knowledge.\n\nThat assumption may not hold forever.\n\nWhen people discuss information fragmentation, they often jump straight to government censorship, national firewalls, or deliberate propaganda systems.\n\nThose are real examples. But fragmentation does not require any malicious intent.\n\nImagine the following:\n\nNone of these organizations is trying to create information silos. They are simply trying to protect their intellectual property or negotiate a survival-level licensing deal in an ecosystem that no longer sends them traffic. Each is making what looks like a reasonable local decision.\n\nCollectively, those decisions begin to produce different information environments. The divergence does not emerge from AI reasoning. It emerges from AI access.\n\nIt helps to separate two things that fragment differently.\n\nThe first is what a model was trained on. The second is what a model can reach at the moment you ask it a question.\n\nToday these overlap heavily. Most large models are built from many of the same underlying sources: the same crawled archives, the same bulk licensing deals, the same public web that has been scraped for years. At the training layer, the corpus is still mostly shared.\n\nRetrieval is where the divergence is already happening.\n\nWhen a model answers using live access to the web, the `robots.txt` rules, the licensing agreements, and the private deals all decide what it is permitted to pull in right then. One system can cite a source. Another is told it may not look. 
Same question, different evidence, and the difference has nothing to do with how either model reasons.\n\nSo the honest version of the claim is not that Claude and ChatGPT already see two different webs. It is narrower and more defensible:\n\nRetrieval access is fragmenting now. Training access could follow.\n\nThat second part is the one worth watching. If exclusive licensing becomes the norm rather than the exception, the divergence stops being a retrieval-time quirk and starts being baked into what each model knows at all. The shared corpus we have taken for granted would quietly stop being shared.\n\nWhen two AI systems produce different answers, we tend to assume the difference lies in how the models reason.\n\nSometimes that is true. Increasingly, though, the more important question may be a different one: what information was the model allowed to see?\n\nAn answer generated from complete evidence and an answer generated from partial evidence can both arrive with equal confidence. Only one of them may reflect the full record.\n\nThe distinction matters.\n\nA model cannot mourn the data it was never allowed to read. It simply synthesizes a flawless, highly confident answer out of the fragment it has, leaving the user entirely unaware of the missing horizon.\n\nOne reason I spend so much time thinking about provenance is that museums, archives, and historians have wrestled with these questions for decades.\n\nResearchers care not only about what artifacts exist. They care about what artifacts are missing. Absence affects interpretation. A collection missing half of its records tells a different story than a complete one, and a careful researcher never mistakes the surviving fragment for the whole.\n\nAI systems face the same challenge. A model can only reason from the evidence available to it. 
If the evidence becomes fragmented, the resulting interpretations may diverge even when the underlying reasoning processes remain sound.\n\nThe [Sovereign Systems Specification](https://kenwalger.github.io/sovereign-system-spec/?utm_source=devto&utm_medium=article&utm_campaign=new_information_borders) is built around a simple observation:\n\nInformation without provenance is just gossip.\n\nMost discussions of provenance focus on where information came from. The harder and more neglected question is what was left out.\n\nNot only:\n\nWhere did this information originate?\n\nBut also:\n\nWhat information was unavailable?\n\nWhat information was excluded?\n\nWhat information was never allowed into the system at all?\n\nAbsence is itself a provenance category. A record of what a system could not see is as much a part of its lineage as a record of what it could. Those questions become more important, not less, as AI systems become primary interfaces to knowledge.\n\nWhile commercial cloud models hide their data deficits behind a smooth conversational curtain, a Sovereign system must explicitly map its own borders—declaring exactly what lies within its registry, and where the boundary of its knowledge ends.\n\nI do not believe AI is creating separate realities. We are.\n\nNot through any coordinated effort. We are simply making thousands of local decisions about access, licensing, trust, governance, and control.\n\nThe cumulative effect may be the emergence of informational borders that are far less visible than national borders but no less consequential.\n\nSo here is the thing to watch for. The next time two AI systems hand you different answers, do not stop at asking which one reasoned better. Ask what each one was allowed to see. The gap between them may have nothing to do with intelligence and everything to do with access.\n\nThe web once assumed a largely shared corpus of knowledge. 
The next generation of knowledge systems may not.\n\nWhen two AI systems disagree, are we observing different reasoning? Or are we observing different worlds?", "url": "https://wpnews.pro/news/the-new-information-borders", "canonical_source": "https://dev.to/kenwalger/the-new-information-borders-1110", "published_at": "2026-06-29 13:30:00+00:00", "updated_at": "2026-06-29 13:49:00.894124+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-policy", "ai-ethics", "ai-agents"], "entities": ["ClaudeBot", "GPTBot", "ChatGPT-User", "PerplexityBot", "Claude", "ChatGPT"], "alternates": {"html": "https://wpnews.pro/news/the-new-information-borders", "markdown": "https://wpnews.pro/news/the-new-information-borders.md", "text": "https://wpnews.pro/news/the-new-information-borders.txt", "jsonld": "https://wpnews.pro/news/the-new-information-borders.jsonld"}}