Auto-Generating JSON-LD: Page Signals, Type Heuristics, and a Careful Gemini Prompt

A developer built a JSON-LD generation pipeline that prevents AI hallucination by separating fact extraction from schema writing. The system uses deterministic DOM parsing to extract structured signals like author names, publication dates, and prices, then passes only verified facts to a Gemini model for formatting. The model is explicitly forbidden from inventing missing data, instead emitting placeholder tokens for any field without a confirmed value.

The naive version of this tool is one prompt: "Here is a URL, write the JSON-LD for it." We tried that mental model early and threw it out. An LLM handed a bare URL will produce schema that looks perfect and is quietly wrong. It guesses an author when the page has none. It invents a publication date. On a commerce page it will cheerfully write a price that appears nowhere in the markup. The output validates, parses, and ships, and then an AI engine reads a fabricated author name as a confirmed fact. For a tool whose output you paste straight into production, that is the worst possible failure, because it is invisible until something downstream cites the lie.

So we built the pipeline backwards from that risk. The model never sees a raw URL and never decides what the facts are. By the time Gemini runs, the page has already been read, the facts have already been extracted, and the page type has already been decided by deterministic code. The model's job is narrow: take known facts and a known shape, and emit well-formed JSON-LD. Everything that could be hallucinated is settled before the model is allowed to write a word.

Step one is fetching the page and pulling structured signals out of the DOM. No model here, just parsing. We extract a fixed set of things. For images, we take src and alt for each, since alt text is the only image content a schema block can carry. For the author, we check rel="author", the .author and .byline classes, and itemprop="author".
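
The extraction step described above can be sketched with nothing but Python's standard-library HTML parser. This is a minimal illustration, not the pipeline's actual code: the names SignalParser and extract_signals are hypothetical, the choice of html.parser is an assumption, and so is the rule that the first author candidate in document order wins. When no author signal exists, the sketch returns None rather than guessing, so a later stage can decide how to mark the gap.

```python
from html.parser import HTMLParser


def _is_author(attrs):
    """True if the element carries one of the author signals named in the text."""
    classes = attrs.get("class", "").split()
    return (
        "author" in attrs.get("rel", "").split()
        or "author" in attrs.get("itemprop", "").split()
        or "author" in classes
        or "byline" in classes
    )


class SignalParser(HTMLParser):
    """Hypothetical collector for image src/alt pairs and author candidates."""

    def __init__(self):
        super().__init__()
        self.images = []    # one {"src": ..., "alt": ...} dict per <img>
        self.authors = []   # candidate author strings, in document order
        self._tag = None    # tag name of the author element being captured
        self._depth = 0     # open elements of that same tag name
        self._buffer = []   # text fragments seen inside the author element

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "img":
            if a.get("src"):
                self.images.append({"src": a["src"], "alt": a.get("alt", "")})
            return
        if self._tag is not None:
            # Already inside an author element: only track same-name nesting.
            if tag == self._tag:
                self._depth += 1
            return
        if _is_author(a):
            if tag == "meta":
                # <meta itemprop="author" content="..."> has no text node.
                content = a.get("content", "").strip()
                if content:
                    self.authors.append(content)
                return
            self._tag, self._depth, self._buffer = tag, 1, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._tag is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so nested markup does not leave gaps.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.authors.append(text)
            self._tag = None


def extract_signals(html):
    """Return deterministic page signals; author is None when nothing matched."""
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    return {
        "images": parser.images,
        "author": parser.authors[0] if parser.authors else None,
    }
```

Calling extract_signals on the fetched HTML yields a plain dictionary that can be handed to the next stage. Keeping the absent-author case explicit, rather than substituting a default, is what lets the later prompt tell the model which fields it must not fill in.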