Auto-Generating JSON-LD: Page Signals, Type Heuristics, and a Careful Gemini Prompt

A developer built a JSON-LD generation pipeline that prevents AI hallucination by separating fact extraction from schema writing. The system uses deterministic DOM parsing to extract structured signals like author names, publication dates, and prices, then passes only verified facts to a Gemini model for formatting. The model is explicitly forbidden from inventing missing data, instead emitting placeholder tokens for any field without a confirmed value.

The naive version of this tool is one prompt: "Here is a URL, write the JSON-LD for it." We tried that mental model early and threw it out. An LLM handed a bare URL will produce schema that looks perfect and is quietly wrong. It guesses an author when the page has none. It invents a publication date. On a commerce page it will cheerfully write a price that appears nowhere in the markup. The output validates, parses, and ships, and then an AI engine reads a fabricated author name as a confirmed fact. For a tool whose output you paste straight into production, that is the worst possible failure, because it is invisible until something downstream cites the lie.

So we built the pipeline backwards from that risk. The model never sees a raw URL and never decides what the facts are. By the time Gemini runs, the page has already been read, the facts have already been extracted, and the page type has already been decided by deterministic code. The model's job is narrow: take known facts and a known shape, and emit well-formed JSON-LD. Everything that could be hallucinated is settled before the model is allowed to write a word.

Step one is fetching the page and pulling structured signals out of the DOM. No model here, just parsing. We extract a fixed set of things, including:

- Images: src and alt for each, since alt text is the only image content a schema block can carry.
- Author: rel="author", the .author and .byline classes, and itemprop="author".
- Dates: <time> elements and the article:published_time meta tag.
- Breadcrumbs: nav ol or any breadcrumb-named class.
- FAQ markers: faq and accordion classes, plus headings phrased as questions.

The output of this stage is a plain signal bundle.
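To make the extraction stage concrete, here is a minimal sketch using only Python's standard-library html.parser. It is an illustration under stated assumptions, not the tool's actual code: the class name SignalParser, the function extract_signals, and the choice to take the first non-empty text inside a matched element are all hypothetical, and the sketch covers only a subset of the signals described above.

```python
from html.parser import HTMLParser


class SignalParser(HTMLParser):
    """Collects a few page signals. Anything not found stays None or empty."""

    def __init__(self):
        super().__init__()
        self.signals = {
            "title": None,
            "author": None,
            "publishedTime": None,
            "headings": [],
            "images": [],
            "faqMarkers": False,
        }
        self._capture = None  # which text field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        rel_tokens = (a.get("rel") or "").split()
        if tag == "title":
            self._capture = "title"
        elif tag in ("h1", "h2", "h3"):
            self._capture = "heading"
        elif tag == "img":
            self.signals["images"].append({"src": a.get("src"), "alt": a.get("alt")})
        elif tag == "meta" and a.get("property") == "article:published_time":
            self.signals["publishedTime"] = a.get("content")
        elif tag == "time" and a.get("datetime") and not self.signals["publishedTime"]:
            self.signals["publishedTime"] = a["datetime"]
        # Author markers can sit on any element, so they are checked separately.
        is_author = (
            "author" in rel_tokens
            or a.get("itemprop") == "author"
            or bool(classes & {"author", "byline"})
        )
        if is_author and self.signals["author"] is None:
            self._capture = "author"
        if classes & {"faq", "accordion"}:
            self.signals["faqMarkers"] = True

    def handle_data(self, data):
        text = data.strip()
        if not text or self._capture is None:
            return
        if self._capture == "heading":
            self.signals["headings"].append(text)
            if text.endswith("?"):  # a heading phrased as a question
                self.signals["faqMarkers"] = True
        else:
            self.signals[self._capture] = text
        self._capture = None

    def handle_endtag(self, tag):
        # An element that closed without text must not capture later text.
        self._capture = None


def extract_signals(html):
    """Parse an HTML string and return the signals that were actually found."""
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    return parser.signals
```

The property that matters is the default: author and publishedTime start as None and only change when the markup supplies a value. The dictionary this returns is a cut-down version of the signal bundle.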
For a photography tutorial, the signal bundle might look like this:

    {
      "title": "Shooting in Manual Mode: A Beginner's Walkthrough",
      "metaDescription": "Learn aperture, shutter, and ISO in three steps.",
      "canonical": "https://example.com/blog/manual-mode-walkthrough",
      "headings": [
        "Step 1: Set your aperture",
        "Step 2: Pick a shutter speed",
        "Step 3: Dial in ISO"
      ],
      "images": [
        { "src": "/img/aperture.jpg", "alt": "Aperture ring on a lens" }
      ],
      "author": "Dana Okoye",
      "publishedTime": "2026-04-22",
      "breadcrumbs": ["Home", "Blog", "Photography"],
      "priceMarkers": false,
      "faqMarkers": false
    }

Notice what this bundle is: facts, not interpretation. Either author is a string we found in the DOM or it is null. We never fill it. That null is what protects the downstream steps from inventing one.

With the signals in hand, we decide the page type ourselves, in plain code, before any model call. The decision runs as an ordered ladder and stops at the first match.

Order matters because the conditions overlap. A product page can also have an author and a date; checking price first means it resolves to product rather than getting misclassified as a blog post three rungs down. The ladder reads top to bottom and the first hit wins.

Three reasons this happens before the model, not inside it:

Now the model runs. Gemini Flash receives the signal bundle plus the detected type, and returns typed JSON-LD: one block per applicable type, each with a confidence score and a short explanation of why that type fit. We run it in JSON mode, so the response is constrained to valid JSON and our parser never has to scrape a code fence out of prose or recover from a stray sentence the model added. Structured data goes in, structured output comes back.

The rule we press hardest in the prompt is on missing facts. The model is instructed that any field it was not handed a value for must be emitted as an explicit, clearly labeled placeholder. It is forbidden to substitute a plausible value of its own.
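A prompt rule like this can also be paired with a mechanical check on what comes back. The sketch below is an assumption layered on top of the description here, not something this article says the tool does: it walks a returned block and flags any string value that is neither a placeholder nor traceable to a string in the signal bundle. The names known_strings and unsupported_values are hypothetical, and the PLACEHOLDER prefix is assumed from the placeholder text used in this article's example. Only string values are checked, so a numeric price would need its own rule.

```python
PLACEHOLDER_PREFIX = "PLACEHOLDER"  # assumed token shape, not a confirmed format


def known_strings(bundle):
    """Flatten every string value in the signal bundle into a list."""
    found = []
    stack = [bundle]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            found.append(item)
        elif isinstance(item, dict):
            stack.extend(item.values())
        elif isinstance(item, list):
            stack.extend(item)
    return found


def unsupported_values(block, bundle):
    """Return (path, value) pairs that are neither extracted facts nor placeholders."""
    facts = known_strings(bundle)
    problems = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                if key.startswith("@"):
                    continue  # @context and @type are schema vocabulary, not page facts
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}[{index}]")
        elif isinstance(node, str):
            if node.startswith(PLACEHOLDER_PREFIX):
                return
            # Substring match, so a trimmed heading still counts as extracted.
            if not any(node in fact for fact in facts):
                problems.append((path, node))

    walk(block, "")
    return problems
```

An empty result means every string in the block came from the page or is a visible gap. Anything in the list is a candidate fabrication worth rejecting or replacing with a placeholder.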
If the signal bundle has author: null, the author field comes back as a placeholder token, never as a name the model decided sounded right. This is the whole safety property of the tool stated as a prompt constraint: a gap stays a visible gap, marked for a human to fill, instead of becoming a confident fabrication that reads as fact. We would rather hand someone a block with three placeholders to complete than one with three invented values to discover later.

The model call can fail. The API times out, refuses the request, or returns something we reject. When that happens, the tool does not show an error page and send the user away empty-handed. It still returns the type detection from the heuristic ladder and template JSON-LD blocks for that type, with the same placeholder structure the model would have produced. You lose the model's per-field confidence scoring and its explanation, and you keep a correctly typed skeleton you can fill in by hand. An empty, well-shaped suggestion beats a 500. The deterministic half of the pipeline carries the result on its own when the probabilistic half is unavailable.

For the photography tutorial above, with the type resolved to how-to and an author present, the generated block looks like this, abbreviated:

    {
      "@context": "https://schema.org",
      "@type": "HowTo",
      "name": "Shooting in Manual Mode: A Beginner's Walkthrough",
      "step": [
        { "@type": "HowToStep", "name": "Set your aperture", "text": "PLACEHOLDER STEP DETAIL" },
        { "@type": "HowToStep", "name": "Pick a shutter speed", "text": "PLACEHOLDER STEP DETAIL" },
        { "@type": "HowToStep", "name": "Dial in ISO", "text": "PLACEHOLDER STEP DETAIL" }
      ],
      "author": { "@type": "Person", "name": "Dana Okoye" }
    }

The step names came off the headings, the author came off the DOM, and the per-step detail text the page did not expose cleanly is left as a visible placeholder. Nothing in that block is a value the model wished into existence.
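The deterministic half described above, an ordered type ladder plus a template block used when the model call fails, can be sketched roughly as follows. This is a hedged illustration, not the tool's implementation. The only ordering taken from this article is that price is checked first, that blog post sits three rungs below it, and that the tutorial resolves to how-to despite having an author. The middle rungs, the "at least two step headings" test, the function names, and the call_model parameter are all assumptions.

```python
PLACEHOLDER = "PLACEHOLDER"  # assumed token prefix


def detect_type(signals):
    """Ordered ladder: return the schema type of the first rung that matches."""
    ladder = [
        # Price first, so a product page with a byline is not read as a post.
        ("Product", lambda s: bool(s.get("priceMarkers"))),
        # Hypothetical middle rungs; the real conditions are not listed here.
        ("FAQPage", lambda s: bool(s.get("faqMarkers"))),
        ("HowTo", lambda s: sum(
            h.lower().startswith("step") for h in s.get("headings", [])
        ) >= 2),
        ("BlogPosting", lambda s: bool(s.get("author") or s.get("publishedTime"))),
    ]
    for schema_type, matches in ladder:
        if matches(signals):
            return schema_type
    return "WebPage"  # nothing matched: fall through to the generic type


def template_block(schema_type, signals):
    """Skeleton JSON-LD: extracted facts where present, labeled placeholders elsewhere."""

    def fact(key, label):
        return signals.get(key) or f"{PLACEHOLDER} {label}"

    block = {
        "@context": "https://schema.org",
        "@type": schema_type,
        "name": fact("title", "NAME"),
    }
    if schema_type == "HowTo":
        block["step"] = [
            {"@type": "HowToStep", "name": h, "text": f"{PLACEHOLDER} STEP DETAIL"}
            for h in signals.get("headings", [])
        ]
    if schema_type in ("HowTo", "BlogPosting"):
        block["author"] = {"@type": "Person", "name": fact("author", "AUTHOR NAME")}
    return block


def generate(signals, call_model):
    """Use the model when it works; otherwise return the deterministic skeleton."""
    schema_type = detect_type(signals)
    try:
        return call_model(signals, schema_type)
    except Exception:
        # Timeout, refusal, or rejected output: the user still gets a typed block.
        return template_block(schema_type, signals)
```

Because the type is decided before call_model is invoked, the fallback needs nothing from the model: a failure changes how complete the block is, not whether one is returned.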
One caveat before you dismiss the HowTo type: Google retired HowTo rich results in 2023, so this markup earns no badge in the SERP anymore. We keep emitting it anyway, because the GEO use is different. The block still hands an AI engine an ordered procedure it can reproduce faithfully, step for step, without reconstructing the sequence from prose. That is the point here, not chasing a rich-result enhancement that no longer exists.

The full output ships as a ready-to-paste <script type="application/ld+json"> tag, across the type range the pipeline supports: Article and BlogPosting, Product, FAQPage, LocalBusiness, HowTo, BreadcrumbList, Organization, and WebPage or WebSite.

The Schema Generator (https://www.getgeology.com/tools/schema-generator) runs this whole pipeline on a URL you give it: extract, classify, generate, score. Then, after you fill the placeholders, run the result through the Schema Validator (https://www.getgeology.com/tools/schema-validator) to confirm your edits did not break the block against schema.org rules.

If you take one idea from how this is built, take the ordering. The instinct with a capable model is to hand it the whole problem and admire what comes back. The better discipline is to figure out which decisions must never be probabilistic (here, classification and fact-finding), settle those in code, and let the model do only the bounded part that remains. The pipeline is more reliable not because the model is weaker but because we gave it less room to be wrong.

Mehul Jain is an AI entrepreneur and product builder. He works on Geology, a GEO platform.