Does a URL just sitting in a prompt steer an LLM's output toward its content?

A developer's experiments reveal that including a URL in a prompt can steer an LLM's output toward the content at that URL, but only if the URL and its content were part of the model's training data. The findings highlight that LLM providers do not clearly disclose their training data sources, and that JavaScript-rendered content is typically absent from models. The research also shows that models cannot recall content published after their knowledge cutoff dates, providing a control against hallucination.

At first, this was a really easy post to write, but then I discovered some things. Built a lot of things. Spent a lot of tokens… And it became one of the hardest, longest, and most expensive posts I’ve done (the API costs were not small).

I’ve had this thing on my mind for ages. It started when I was thinking about how the mere presence of a technology name in a prompt seemed to bias the output towards that technology. For example, I looked through a number of system prompts for agentic tooling, and they would include text like “e.g. React”, and it felt like these tools would then output React code, compared with a similar prompt that didn’t mention React. I’ve spent the last few weeks running experiments to scratch this itch.

But before I get too far, I have a request for help. I’m not a researcher. I think what I have here is compelling information (or at least it taught me something), but I might have made a lot of mistakes, or made assumptions that have biased the output. If you have any advice I would LOVE to hear from you. Email me: paul@aifoc.us.

The question I had was: would the presence of a URL in a prompt influence the output of the LLM, based on the content at that URL or the literal text of the URL itself? If yes, then we might not have to embed lots of context into the prompt.
For example, you might have a Skills file that is deeply integrated into the model’s weights, and by saying “use what you know about: https://skills.sh/super-security-reviewer do a deep analysis”, the information in the model’s latent space would bias the output towards the content encoded at that URL.

I came away from this with:

- A URL in the prompt does influence the output, but only when that URL and its content made it into the model’s training data.
- It’s really unclear how LLM providers gather the data they train on, and I think they should tell us.
- There’s heaps of data that is not in the models.
- If your site relies on JavaScript to load its content, that content is very likely not in a model (you might consider that a feature). The training crawlers I could verify (ClaudeBot, GPTBot) fetch a page’s assets but never execute the JavaScript; the only verified bot I’ve caught running JavaScript is OpenAI’s search crawler, OAI-SearchBot.
- LLMs are expensive.

What follows is the journey I took.

The first step was to build a system that can analyse a range of URLs across a range of models and use an LLM-as-a-judge to help me test the hypothesis: https://github.com/PaulKinlan/url-influence

My plan was:

- to find each model’s known knowledge cutoff date
- then to find content on either side of that date, to test whether the model could recall data that I believe it should know
- to find content ranging from what I believe would be popular all the way to the likely esoteric

Content known to be after a cutoff would help me control against hallucination. If my original hypothesis was correct, then for that content the model should decline, or say it doesn’t know, rather than confidently make something up.

Once I had the data I created a range of tests to help me understand how the models work.
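The cutoff control in the plan above can be sketched in a few lines. This is only an illustration of the idea, not the code from the repository: the model names and cutoff dates below are made up, and `expected_behaviour` is a hypothetical helper.

```python
from datetime import date

# Hypothetical cutoff dates, for illustration only. Real values have to
# come from each provider's own documentation.
CUTOFFS = {
    "model-a": date(2024, 6, 1),
    "model-b": date(2025, 1, 1),
}


def expected_behaviour(model: str, published: date) -> str:
    """Say what a well-behaved model should do with content from this date.

    Content published after the cutoff cannot be in the weights, so a
    confident answer there counts as hallucination, not recall.
    """
    if published > CUTOFFS[model]:
        return "decline"
    return "may-recall"
```

An item published in between the two cutoffs is the useful case: the same URL should be declined by one model and may be recalled by the other.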
The tests are classified as:

- described: the task described in words, no URL (the baseline)
- opaque-url: ONLY the opaque URL string, and the page is never fetched
- mdn-url-only / spec-url-only / bcd-key-only: optional identifier probes, not part of the main comparison
- url+described: the opaque URL plus the task described
- full-content / content-only: the real page pasted in, with and without the task spelled out (the ceiling)
- fake-structural-url / random-url: controls (a nonexistent URL of the same shape, and an unrelated real URL)

opaque-url was my real test, to try to ensure that the LLM couldn’t infer the contents from the literal URL string. So, for example, I used some URLs from chromestatus.com (our public dashboard of Chrome features) because it has URLs like https://chromestatus.com/feature/5157805733183488, and while I believe it’s pretty clear to the LLM that they are web-related, you can’t infer that it’s about CSS Gap Decorations.

I then had other tests, like descriptive URLs (MDN, for example, is very descriptive, which is very good UX for the web), to validate whether the literal URL influenced the output, as well as what happens when we add in extra context.

I have a report here (https://paulkinlan.github.io/url-influence/) and all the data is here (https://paulkinlan.github.io/url-influence/results/dashboard.html, iframed too). I think it’s worth looking at, and there’s a pretty clear picture and answer to my question.

My first hunch was that URLs are not magic context, and the ChromeStatus numbers seemed to back it up. ChromeStatus feature URLs are a good opaque test because the domain tells the model the page is web-related, but the numeric feature ID tells it nothing about what is behind it, and most models failed to recover the right API from that number alone. Adding a bare opaque URL to a prompt did almost nothing on average, and plenty of opaque URLs recovered nothing at all.
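To make the test variants listed earlier concrete, here is a minimal sketch of how one item could be expanded into its prompt conditions. The function name, the prompt layout, and the reading that full-content is "page plus task" while content-only is "page alone" are my assumptions, not the repository's actual implementation; the fake URL is invented for the example.

```python
def build_conditions(
    task: str, url: str, page_text: str, fake_url: str, random_url: str
) -> dict[str, str]:
    """Expand one test item into the prompt sent for each condition."""
    return {
        "described": task,                           # baseline: words only
        "opaque-url": url,                           # the bare URL, never fetched
        "url+described": f"{url}\n\n{task}",         # URL plus the task
        "full-content": f"{page_text}\n\n{task}",    # ceiling: page and task
        "content-only": page_text,                   # page with no task spelled out
        "fake-structural-url": fake_url,             # control: same shape, no page
        "random-url": random_url,                    # control: unrelated real URL
    }


conditions = build_conditions(
    task="Write a short demo of CSS gap decorations.",  # hypothetical task
    url="https://chromestatus.com/feature/5157805733183488",
    page_text="(page text pasted here)",
    fake_url="https://chromestatus.com/feature/0000000000000000",  # invented id
    random_url="https://example.com/",
)
```

Keeping every condition in one dictionary makes it easy to send the same item to every model and compare scores across conditions.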
But then I had a lot of other URLs that had really good recall, and a lot of other opaque IDs that didn’t. Stack Overflow, for one, was mixed, and then I looked at their robots.txt (https://stackoverflow.com/robots.txt) and it’s pretty much deny everything. Hmm. What’s ChromeStatus’s? I checked its robots.txt (https://chromestatus.com/robots.txt) and it looked fine… maybe ChromeStatus URLs are just not in the model for some other reason. For example, one of Chrome’s most popular features, Service Worker (https://paulkinlan.github.io/url-influence/results/dashboard.html test=service-worker), couldn’t be recalled from the URL… It was just odd.

I went to look for what the models use to ingest data, and it’s kinda hard to find the exact corpus of crawl data, but I did remember a podcast from a little while ago that discussed Common Crawl (https://commoncrawl.org) being used as a source of a lot of data. So I went to check if ChromeStatus was in Common Crawl. It is. The pages show up in Common Crawl about as often as the arXiv papers that decode almost perfectly. But when I pulled the actual crawled bytes, there was no content in them. ChromeStatus is a JavaScript app (I remember it first being built with Polymer) and the crawler captured an empty shell. The saved page for CSS Gap Decorations is about 3KB of HTML with 22 characters of visible text, “Chrome Platform Status”, and not one word about the feature (here is the actual Common Crawl capture: https://paulkinlan.github.io/url-influence/results/cc-samples/chromestatus-css-gap-decorations.cc.html.txt). I checked four features and they were all identical empty shells. The arXiv page, by contrast, is server-rendered, so the crawl holds the full title and abstract (its capture: https://paulkinlan.github.io/url-influence/results/cc-samples/arxiv-attention.cc.html.txt).
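The "characters of visible text" measurement can be approximated with nothing but Python's standard library. This is a rough sketch of the idea and not necessarily how my numbers were produced: which tags to skip, and counting the title as visible text, are assumptions, and `SHELL` is a made-up page in the shape of a JavaScript app shell.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside script-like elements."""

    SKIPPED = {"script", "style", "noscript", "template"}  # assumed skip list

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0          # how many skipped elements we are inside
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if not self._depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's text with scripts removed and whitespace collapsed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# A hypothetical empty shell: a title, a boot script, and an empty mount point.
SHELL = (
    "<html><head><title>Chrome Platform Status</title>"
    "<script>boot()</script></head>"
    '<body><div id="app"></div></body></html>'
)
```

Running `visible_text(SHELL)` leaves only the title, which is the whole point: the crawler stores plenty of bytes, but almost none of them are words a model could learn from.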
If Common Crawl is a source of data, then I’m going to flat out say that SPAs that require JS to get data to the user are very likely not in the models’ training data (that might be a feature for some folks, heh). My evidence is that you can watch every model flatline on the bare ChromeStatus id, then recover the feature once handed the actual page, in the per-test view here (https://paulkinlan.github.io/url-influence/results/dashboard.html test=css-gap-decorations).

I found a second case that is even harder to wave away, and it doubles as my “controlled” before-and-after. “The Adaptive COVID-19 Treatment Trial” is a good example because it is on clinicaltrials.gov. A couple of years ago the site server-rendered its pages, and Common Crawl’s 2022 capture of the trial is the whole thing: 47,000 characters of visible text, titled “Adaptive COVID-19 Treatment Trial (ACTT) - Full Text View”, with COVID, remdesivir, and placebo all through it (the old capture: https://paulkinlan.github.io/url-influence/results/cc-samples/clinicaltrials-actt-covid-OLD-ssr.cc.html.txt). Then it appears that clinicaltrials.gov migrated to a JavaScript single-page app. Common Crawl’s 2026 capture of the very same trial is 94KB of HTML carrying 175 characters of visible text, “ClinicalTrials.gov Show glossary Search for terms…”, and not one mention of COVID or remdesivir (the new capture: https://paulkinlan.github.io/url-influence/results/cc-samples/clinicaltrials-actt-covid.cc.html.txt). One of the most documented trials of the pandemic went from fully present in the crawl to effectively blank.

The models still half-recall it from the bare URL anyway, around 47% across models (https://paulkinlan.github.io/url-influence/results/dashboard.html test=clinicaltrials-actt-covid), and the reason matters.
The NCT id is cited all through the remdesivir literature, and the page was server-rendered and crawlable right up until the migration, so the old content is almost certainly already baked into the weights. What the migration breaks is the future. Anything clinicaltrials.gov publishes from here on renders only in JavaScript and will probably never make it into the crawl.

So being missing from Common Crawl is not the same as being missing from the model. It’s more of a sliding scale: a server-rendered, widely-cited CVE (https://paulkinlan.github.io/url-influence/results/dashboard.html test=cve-2014-0160-heartbleed) over at NIST comes back from the bare URL about 92% of the time, this trial (a shell now, but crawled for years and still cited everywhere) about 47%, and a ChromeStatus feature (rendered in the browser and cited nowhere) a flat zero.

This whole space is murky, and rendering is what muddies it most. I labelled every test URL (https://paulkinlan.github.io/url-influence/results/render-recall.json) by whether its content sits in the raw HTML or only shows up once JavaScript runs, then looked at recall from the bare URL. The 31 client-rendered items, mostly ChromeStatus features, average 6% recall, and 25 of them are a flat zero. These are not obscure features either (view-transitions, popover, anchor positioning, the Temporal API). The 60 server-rendered sources (arXiv, CVEs, RFCs, Wikipedia) average 55%. Hold fame roughly constant, and content that was already in the HTML recalls about nine times better than content a browser has to assemble.

I really wanted to kill the “maybe it just wasn’t crawled” doubt entirely, so I tried a case where the content is beyond question in the model. Every Wikipedia article has an internal numeric id you can address directly: en.wikipedia.org/?curid=24544 is Photosynthesis. The content is server-rendered and unquestionably in every model.
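The render comparison above boils down to a group-by and a ratio. The sketch below shows the shape of that calculation; the field names `render` and `recall` and the sample rows are hypothetical and are not the actual schema or data of render-recall.json. Only the 55% and 6% averages come from my results.

```python
from collections import defaultdict
from statistics import mean


def mean_recall_by_render(rows: list[dict]) -> dict[str, float]:
    """Average bare-URL recall for each rendering style."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        groups[row["render"]].append(row["recall"])
    return {render: mean(values) for render, values in groups.items()}


# Made-up rows, only to show the calculation.
sample = [
    {"render": "client", "recall": 0.0},
    {"render": "client", "recall": 0.25},
    {"render": "server", "recall": 0.5},
    {"render": "server", "recall": 1.0},
]
averages = mean_recall_by_render(sample)

# The "about nine times" figure is the ratio of the two reported averages.
ratio = 55 / 6
```

The ratio of the reported averages is a little over nine, which is where "about nine times better" comes from.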
But the ?curid= form of the URL is in none of the crawl indexes I looked at, while the canonical en.wikipedia.org/wiki/Photosynthesis URL is in all of them (200, full text), because Wikipedia points the curid page at the canonical title URL and the crawler respects that. I checked five articles; every /wiki/ present, every ?curid= absent. Ask by name and the models score perfectly, paste the article in and they score perfectly, give the bare numeric id and wah wah, a fat nope. Same shape on all five: Photosynthesis (https://paulkinlan.github.io/url-influence/results/dashboard.html test=wiki-curid-photosynthesis), the Transformer (https://paulkinlan.github.io/url-influence/results/dashboard.html test=wiki-curid-transformer-dl), Mitochondrion (https://paulkinlan.github.io/url-influence/results/dashboard.html test=wiki-curid-mitochondrion), HTTP 404 (https://paulkinlan.github.io/url-influence/results/dashboard.html test=wiki-curid-http-404), Bitcoin (https://paulkinlan.github.io/url-influence/results/dashboard.html test=wiki-curid-bitcoin).

So the bare opaque URL mostly does nothing. But there are two cases where a URL clearly does pull its weight, and neither of them contradicts the ChromeStatus story.

- Descriptive URLs influence output. If the URL contains words like React, fetch, or text-justify, those words are just normal prompt text, and the model uses them like any other token.
- Some famous opaque identifiers really do decode. Landmark arXiv IDs, classic RFCs (https://paulkinlan.github.io/url-influence/results/dashboard.html test=rfc-9110-http-semantics), and well-known CVEs (https://paulkinlan.github.io/url-influence/results/dashboard.html test=cve-2014-0160-heartbleed) recover their content surprisingly well from the bare identifier alone.
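One quick way to see the difference between a descriptive URL and an opaque one is to pull out the readable words a URL hands to the model for free. This is a crude heuristic of my own for illustration (words are not the same thing as model tokens), the three-letter minimum is arbitrary, and the MDN URL is only an example of the descriptive style.

```python
import re
from urllib.parse import urlsplit


def readable_words(url: str) -> list[str]:
    """List the alphabetic words (3+ letters) in a URL's path and query."""
    parts = urlsplit(url)
    text = f"{parts.path} {parts.query}"
    return [word.lower() for word in re.findall(r"[A-Za-z]{3,}", text)]


opaque = readable_words("https://chromestatus.com/feature/5157805733183488")
descriptive = readable_words(
    "https://developer.mozilla.org/en-US/docs/Web/CSS/text-justify"
)
```

The ChromeStatus URL yields only "feature", while the MDN-style URL spells out the topic, which is why the second behaves like ordinary prompt text.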
From just arxiv.org/abs/1706.03762, with no other hint, the models reconstruct “Attention Is All You Need” and the transformer (every model on that bare id: https://paulkinlan.github.io/url-influence/results/dashboard.html test=arxiv-attention). That looks less like “the URL points to live content” and more like “this identifier and its content appeared together often enough in the training data to be memorized”. And it’s a gradient, not a switch: the decoding is strong for famous identifiers and fades steadily as the content gets more obscure, down to roughly nothing for the long tail.

You can watch that gradient directly with GitHub commits. The famous first commits to Linux (https://paulkinlan.github.io/url-influence/results/dashboard.html test=gh-sha-linux-initial-git), Git (https://paulkinlan.github.io/url-influence/results/dashboard.html test=gh-sha-git-initial-commit), and Bitcoin (https://paulkinlan.github.io/url-influence/results/dashboard.html test=gh-sha-bitcoin-first-commit) decode from the bare SHA, while ordinary routine commits (https://paulkinlan.github.io/url-influence/results/dashboard.html test=gh-sha-obscure-ky-searchparams) from the same kinds of repos return nothing at all. The knowledge cutoff bites the same way. Anything published after it is gone, even for otherwise well-known sources.

This is the part that gets back to my original question about React. Everything above asks the model to decode the URL on command. But the thing I really wanted to know was whether a URL just sitting there in the prompt tilts the output, the way mentioning React in a system prompt seems to tilt code towards React. So I ran a second experiment (https://paulkinlan.github.io/url-influence/results/implicit.html) where I never told the model to use the URL at all.
The setup is a neutral brainstorm task, something like “I’m putting together a short talk about memorable software security incidents, suggest one worth covering.” Into that, I dropped one of four things into the same ambient slot, framed as “a tab I happen to have open right now”:

- nothing (the baseline: how often does the model land on this item on its own?)
- the item’s opaque URL
- a random real URL, unrelated to the topic (does any link nudge it, or only the right one?)
- the item’s descriptive name, like “the Log4Shell vulnerability” (the React analog: words actually in the prompt)

I ran this across five models (Claude Opus, Gemini, GPT, Grok, GLM) and 39 items covering security incidents, landmark ML papers, web features, web standards and biomedical literature. An LLM judge reads each answer and decides whether the model actually surfaced the thing behind the link. I had to tighten the judge up because my first version was getting fooled: models would repeat the URL back at me, or say “I see you have RFC 9701 open but I can’t tell you what it is”, and that was getting scored as a pass. It’s not a pass; the model has to show it actually knows what’s behind the link. And to be clear, I never ask the model to use the link, it’s just sitting there in the prompt.

With a famous CVE link sitting there as an open tab, the model raised that exact vulnerability almost every time, when otherwise it would have picked something else. A random link did nothing, essentially the same rate as no link at all. Across the set, the off-hand URL lifted the topic from about 7% (no URL) to 45% (https://paulkinlan.github.io/url-influence/results/implicit.html), and the descriptive name did better still at 83%, which makes sense: it is real words the model can read.

So a memorized URL doesn’t just answer when you ask about it, it tilts the output just by being there. That’s my React question answered, but only for the URLs the model has already memorized.
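The ambient setup can be sketched as a prompt builder plus a hit-rate calculation over the judge's verdicts. The exact wording of the ambient slot, the function names, and the NVD URL chosen for Log4Shell are assumptions for illustration; this is not the prompt template from the repository.

```python
from typing import Optional

TASK = (
    "I'm putting together a short talk about memorable software "
    "security incidents, suggest one worth covering."
)


def ambient_prompt(task: str, ambient: Optional[str] = None) -> str:
    """Build the prompt, optionally dropping something into the ambient slot.

    The model is never told to use the ambient item; it just sits there.
    """
    if ambient is None:
        return task  # baseline: nothing in the slot
    return f"{task}\n\n(A tab I happen to have open right now: {ambient})"


def hit_rate(verdicts: list[bool]) -> float:
    """Fraction of runs where the judge said the item was surfaced."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


prompts = {
    "baseline": ambient_prompt(TASK),
    "opaque-url": ambient_prompt(
        TASK, "https://nvd.nist.gov/vuln/detail/CVE-2021-44228"  # illustrative
    ),
    "random-url": ambient_prompt(TASK, "https://example.com/"),
    "descriptive-name": ambient_prompt(TASK, "the Log4Shell vulnerability"),
}
```

Because all four conditions share the same task and the same slot, any difference in hit rate can be pinned on what was in the slot rather than on how the question was asked.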
There’s a recall matrix on the results page (https://paulkinlan.github.io/url-influence/results/implicit.html matrix) showing exactly which URLs decode on which models, and why: identifier type, training cutoff, and what Common Crawl actually captured.

Here’s an actual run so you can see it. The prompt asks for a security-talk suggestion and the xz backdoor’s NVD link is just sitting there as an open tab. No model picks the xz backdoor without the link, every model picks it with the link there, and Opus even says “I notice you’ve got it open already”. The judge’s verdict and the full prompt are in the run. My third favourite of these is RFC 1149 (https://paulkinlan.github.io/url-influence/results/implicit.html rfc-1149-avian-carriers), the April Fools’ standard for carrier pigeons: no model brings it up unprompted, and four out of five recall exactly what it is from the bare URL.

The controls make me more confident this is about memorized content and not the URL itself. In the ambient test an unrelated real URL sat at 6%, next to the 7% no-link baseline. And back in the direct tests, a fake URL of the same shape and an opaque-shaped fake identifier both scored near zero too. So it’s not just having a URL in the prompt, or having one that looks right; it’s whether the real content was in the training data.

One caveat so I don’t oversell the crawl angle. Stack Overflow blocks the crawler, so none of my Stack Overflow questions are in Common Crawl at all, yet the famous ones still decode from the bare question URL. Stack Overflow clearly reaches the models another way, most likely its openly licensed data dumps. The crawl is one source among several. ChromeStatus is the clean failure because its content is missing from the crawl and isn’t reposted anywhere else either, so it never made it into training by any route.
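The "does this site block the crawler" check from the caveat above can be reproduced with Python's standard `urllib.robotparser`. A minimal sketch: the two robots.txt bodies are made-up examples, not the real files from Stack Overflow or ChromeStatus, and `CCBot` is Common Crawl's crawler user agent.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether a robots.txt body lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies for illustration.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

blocked = allows(BLOCK_ALL, "CCBot", "https://example.com/questions/1")
open_site = allows(ALLOW_ALL, "CCBot", "https://example.com/feature/1")
```

An "allowed" result only says the crawler may fetch the page. As ChromeStatus shows, it says nothing about whether the fetched HTML contains any content.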
When I stopped pointing at the content and just pasted the page in, the models did fine: the bare ChromeStatus URL recovered almost nothing, and the actual page text got most of the way to a correct answer. If you want a model to use a page, give it the page, not a link to it. So the answer is not “URLs never matter”. It is: a URL matters when it’s readable text, or when the exact identifier appeared often enough in training to be memorized along with its content. For the long tail of opaque URLs, I would not rely on the URL alone as context. Which is exactly the problem for the idea I started with: a skills.sh/super-security-reviewer pointer is, by definition, new and niche, the long-tail case where none of this works. Here is the part that actually stuck with me, and it has nothing to do with URLs. ChromeStatus is close to home for me: it’s Chrome’s own dashboard of the web platform, I helped build the very first versions, and its entire job is documenting the platform from Chrome’s perspective. Yet it contributes almost nothing to what these models know, because it renders its content with JavaScript and the crawler only ever saw an empty shell. That is not a knock on the team or the content. The site was built as a JavaScript app years before anyone knew that crawlers which never run JavaScript would end up deciding what an AI learns, which is exactly what makes it such a clean example. The page is public. It is crawled. Its robots.txt allows it. And it is still effectively absent from the model. If that is true for ChromeStatus, it could be true for a slice of the modern web. Single-page apps, JavaScript-rendered docs, anything that assembles its content in the browser: a crawler can get the URL and come away with nothing but a loading shell. So I went back to Common Crawl, this time not to look up individual pages but to measure how common these blank shells actually are. 
I streamed a big sample https://github.com/PaulKinlan/url-influence/blob/master/src/cc-shell-confirm.mjs and counted the pages a model would see as blank. Counting by “too little visible text” needs an arbitrary cutoff, so the number I trust uses none: a client-rendered page ships its app mount empty a literal