{"slug": "discovering-the-llm-s-curious-and-remarkable-world-knowledge-of-open-data-on-the", "title": "Discovering the LLM's curious and remarkable world knowledge of open data on the web.", "summary": "A software developer discovered that large language models possess detailed knowledge of obscure, primary-source data URLs on the web, including NOAA buoy water temperature data, Socrata 311 civic endpoints, and the Federal Reserve's FRED API. The finding emerged during the design of Plotly Studio, an LLM-powered analytics app, when the developer tested whether the model could locate datasets without user-provided files. This capability could enable direct data retrieval for analysis, bypassing traditional dataset uploads.", "body_md": "# Discovering the LLM's curious and remarkable world knowledge of open data on the web.\n\nMy partner's family is from San Diego and so we frequently drive up and down the coast from San Francisco. It takes about 8-10 hours but we've grown to favor splitting the drive in half on one side of the trip and staying overnight in the little towns stuck in time along the 1 like Pismo Beach and Cayucos.\n\nDuring one of these drives last year, I was in the middle of a particularly industrious\nperiod designing the next version of\n[Plotly Studio](https://plotly.com/plotly-studio) - an LLM-powered analytics\nand visualization app - and our road trip turned into a\n[rubber duck discussion](https://en.wikipedia.org/wiki/Rubber_duck_debugging)\nabout its underpinnings and possibilities.\n\nTalking out loud - especially to those who aren't so close to the problem - almost always brings up new ideas. In this case, it was my partner who brought up an idea - and stubbornly held on to it - that fooled me and later surprised me.\n\nWhat if you don't have a dataset? 
Why can't the app just find the data for you?\n\nThe first step in doing any data analytics or visualization is, of course, uploading or connecting to your dataset.\n\nThe idea that we could just skip this step seemed ridiculous to me at that point. The web is mostly full of unstructured data (documents, text) and in my experience the big open data providers like Kaggle are awash with fabricated datasets (you can tell because all of the data is uniformly distributed). Reliable websites like Wikipedia don't have much in the way of structured datasets and scientific journals are often paywalled or don't include data outside of small tables embedded in PDFs.\n\nSo I shrugged off the idea.\n\nBut then, weeks later, I started just asking open-ended questions while prototyping\nwithout providing any dataset of my own.\nAnd I found that Plotly Studio - through its LLM provider - had **a curious\nand specific knowledge of primary-source data sources on the web.**\n\nData sources with obscure URLs serving file formats of yesteryear. A data source, seemingly, to help you answer any question that you might have about the world.\n\nHere are a few examples that came alive for me in my personal life and interests:\n\n## Water Temperature Data - NOAA Buoys\n\nI do a fair amount of open water swimming off the coast of San Francisco and was surprised to find plentiful water temperature data courtesy of NOAA's buoys.\n\nThis data is available through these (previously undiscoverable, at least to me!)\nURLs that serve opaque data structures.\nLike all of the examples here, these URLs were not found through web search -\nthey were just in the LLM's world knowledge. 
Yes, that's right - the LLMs\njust know about these URLs that look like this:\n`https://www.ndbc.noaa.gov/data/realtime2/{station_id}.txt`\nand know\nthat the station I'm interested in is probably `46026`.\n\n## 311 Civic Data - Socrata Endpoints\n\n311 data - the city complaint hotline - is a treasure trove of data and is remarkably accessible and well known to LLMs.\n\nOne of my favorite queries is to look up recent graffiti complaints in the city as a little underground art tour (one citizen's graffiti complaint is another citizen's masterpiece!).\n\n311 data is available in most major cities. In preparing for a talk I gave in Boston, I plotted the trajectory of the season's snowstorm by tracking 311 complaints about snow.\n\nAt a recent SF meetup, we wondered how likely our cars parked on Valencia St would be to get a parking ticket.\n\n## Macroeconomic Data - Public FRED API\n\nThe Federal Reserve Bank of St. Louis runs FRED (Federal Reserve Economic Data), which publishes a ridiculous number of public economic data series about seemingly every subject.\n\nThe series names are logical but highly specific, and today's frontier models know\nalmost all of them (and for what they don't know, they are aware of FRED's excellent search API).\nCan you guess what series `PCETRIM6M680SFRBDAL` stands for?\n(\"Dallas Fed's 6-month annualized trimmed mean PCE inflation rate\")\n\nI've developed an odd hobby of asking Studio to fetch data to back up (or counter) headlines from newspapers like the Wall Street Journal. As data people we have a particular disdain for the editorial and a fantasy for finding \"the real answer\" in the data. It's invigorating!\n\nI also ask broad macroeconomic questions, like a real rent-vs-buy comparison between the houses that my parents bought 30 years ago and housing today.\n\n## An AI grounded in data\n\nThere is a lot of discourse about how LLMs \"average out\" the content on the web due to their very nature. 
The worry is that the web's small and funky corners, full of interesting takes and viewpoints, are at risk of collapsing as we interface solely through the everything apps.\n\nI don't disagree. But in this corner of the world of data, I've been delighted to find LLMs surfacing data sources through obscure APIs that I would never have found, let alone known existed in the first place.\n\nAnd the best part is that it's data. Cold, hard, often primary and hopefully trustworthy data. Data that you can examine and graph and interpret and draw your own conclusions from - without any LLM editorializing or smoothing over its point of view or doing the thinking and analysis for you.\n\nWhat an invigorating and refreshing way to interface with the world and these new machines.\n\nSo what other data sources are out there?\nWhat have you always wondered about but never had the data on hand?\n[Let me know](/about) and get in touch.", "url": "https://wpnews.pro/news/discovering-the-llm-s-curious-and-remarkable-world-knowledge-of-open-data-on-the", "canonical_source": "https://chris-parmer.com/discovering-the-llms-curious-and-remarkable-world-knowledge-of-open-data-on-the-web/", "published_at": "2026-05-31 00:00:00+00:00", "updated_at": "2026-06-03 20:17:41.482503+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-products", "ai-tools", "natural-language-processing"], "entities": ["Plotly Studio", "San Diego", "San Francisco", "Pismo Beach", "Cayucos", "Kaggle"], "alternates": {"html": "https://wpnews.pro/news/discovering-the-llm-s-curious-and-remarkable-world-knowledge-of-open-data-on-the", "markdown": "https://wpnews.pro/news/discovering-the-llm-s-curious-and-remarkable-world-knowledge-of-open-data-on-the.md", "text": "https://wpnews.pro/news/discovering-the-llm-s-curious-and-remarkable-world-knowledge-of-open-data-on-the.txt", "jsonld": "https://wpnews.pro/news/discovering-the-llm-s-curious-and-remarkable-world-knowledge-of-open-data-on-the.jsonld"}}