{"slug": "auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt", "title": "Auto-Generating JSON-LD: Page Signals, Type Heuristics, and a Careful Gemini Prompt", "summary": "A developer built a JSON-LD generation pipeline that prevents AI hallucination by separating fact extraction from schema writing. The system uses deterministic DOM parsing to extract structured signals like author names, publication dates, and prices, then passes only verified facts to a Gemini model for formatting. The model is explicitly forbidden from inventing missing data, instead emitting placeholder tokens for any field without a confirmed value.", "body_md": "The naive version of this tool is one prompt: \"Here is a URL, write the JSON-LD for it.\" We tried that mental model early and threw it out. An LLM handed a bare URL will produce schema that looks perfect and is quietly wrong. It guesses an author when the page has none. It invents a publication date. On a commerce page it will cheerfully write a price that appears nowhere in the markup. The output validates, parses, and ships, and then an AI engine reads a fabricated author name as a confirmed fact. For a tool you paste straight into production, that is the worst possible failure, because it is invisible until something downstream cites the lie.\n\nSo we built the pipeline backwards from that risk. The model never sees a raw URL and never decides what the facts are. By the time Gemini runs, the page has already been read, the facts have already been extracted, and the page type has already been decided by deterministic code. The model's job is narrow: take known facts and a known shape, and emit well-formed JSON-LD. Everything that could be hallucinated is settled before the model is allowed to write a word.\n\nStep one is fetching the page and pulling structured signals out of the DOM. No model here, just parsing. 
We extract a fixed set of things:\n\n- Images, with `src` and `alt` for each, since alt text is the only image content a schema block can carry.\n- Author, from `rel=\"author\"`, the `.author` and `.byline` classes, and `itemprop=\"author\"`.\n- Publication date, from `<time>` elements and the `article:published_time` meta tag.\n- Breadcrumbs, from `nav ol` or any breadcrumb-named class.\n- FAQ markers, from `faq` and `accordion` classes, plus headings phrased as questions.\n\nThe output of this stage is a plain signal bundle. For a photography tutorial it might look like this:\n\n```\n{\n  \"title\": \"Shooting in Manual Mode: A Beginner's Walkthrough\",\n  \"metaDescription\": \"Learn aperture, shutter, and ISO in three steps.\",\n  \"canonical\": \"https://example.com/blog/manual-mode-walkthrough\",\n  \"headings\": [\"Step 1: Set your aperture\", \"Step 2: Pick a shutter speed\", \"Step 3: Dial in ISO\"],\n  \"images\": [{ \"src\": \"/img/aperture.jpg\", \"alt\": \"Aperture ring on a lens\" }],\n  \"author\": \"Dana Okoye\",\n  \"publishedTime\": \"2026-04-22\",\n  \"breadcrumbs\": [\"Home\", \"Blog\", \"Photography\"],\n  \"priceMarkers\": false,\n  \"faqMarkers\": false\n}\n```\n\nNotice what this bundle is: facts, not interpretation. Either `author` is a string we found in the DOM or it is null. We never fill it. That null is what protects the downstream steps from inventing one.\n\nWith the signals in hand, we decide the page type ourselves, in plain code, before any model call. It runs as an ordered ladder and stops at the first match:\n\nOrder matters because the conditions overlap. A product page can also have an author and a date; checking price first means it resolves to product rather than getting misclassified as a blog post three rungs down. The ladder reads top to bottom and the first hit wins.\n\nThree reasons this happens before the model, not inside it:\n\nNow the model runs. 
Gemini Flash receives the signal bundle plus the detected type, and returns typed JSON-LD: one block per applicable type, each with a confidence score and a short explanation of why that type fit. We run it in JSON mode, so the response is constrained to valid JSON and our parser never has to scrape a code fence out of prose or recover from a stray sentence the model added. Structured output goes in, structured output comes back.\n\nThe rule we press hardest in the prompt is on missing facts. The model is instructed that any field it was not handed a value for must be emitted as an explicit, clearly labeled placeholder. It is forbidden to substitute a plausible value of its own. If the signal bundle has `author: null`, the author field comes back as a placeholder token, never as a name the model decided sounded right. This is the whole safety property of the tool stated as a prompt constraint: a gap stays a visible gap, marked for a human to fill, instead of becoming a confident fabrication that reads as fact. We would rather hand someone a block with three placeholders to complete than one with three invented values to discover later.\n\nThe model call can fail. The API times out, refuses the request, or returns something we reject. When that happens, the tool does not show an error page and send the user away empty-handed. It still returns the type detection from the heuristic ladder and template JSON-LD blocks for that type, with the same placeholder structure the model would have produced. You lose the model's per-field confidence scoring and its explanation, and you keep a correctly typed skeleton you can fill in by hand. An empty, well-shaped suggestion beats a 500. 
The deterministic half of the pipeline carries the result on its own when the probabilistic half is unavailable.\n\nFor the photography tutorial above, with the type resolved to `how_to` and an author present, the generated block looks like this, abbreviated:\n\n```\n{\n  \"@context\": \"https://schema.org\",\n  \"@type\": \"HowTo\",\n  \"name\": \"Shooting in Manual Mode: A Beginner's Walkthrough\",\n  \"step\": [\n    { \"@type\": \"HowToStep\", \"name\": \"Set your aperture\", \"text\": \"PLACEHOLDER_STEP_DETAIL\" },\n    { \"@type\": \"HowToStep\", \"name\": \"Pick a shutter speed\", \"text\": \"PLACEHOLDER_STEP_DETAIL\" },\n    { \"@type\": \"HowToStep\", \"name\": \"Dial in ISO\", \"text\": \"PLACEHOLDER_STEP_DETAIL\" }\n  ],\n  \"author\": { \"@type\": \"Person\", \"name\": \"Dana Okoye\" }\n}\n```\n\nThe step names came off the headings, the author came off the DOM, and the per-step detail text the page did not expose cleanly is left as a visible placeholder. Nothing in that block is a value the model wished into existence.\n\nOne caveat before you dismiss the HowTo type: Google retired HowTo rich results in 2023, so this markup earns no badge in the SERP anymore. We keep emitting it anyway, because the GEO use is different. The block still hands an AI engine an ordered procedure it can reproduce faithfully, step for step, without reconstructing the sequence from prose. That is the point here, not chasing a rich-result enhancement that no longer exists. The full output ships as a ready-to-paste `<script type=\"application/ld+json\">` tag, across the type range the pipeline supports: Article and BlogPosting, Product, FAQPage, LocalBusiness, HowTo, BreadcrumbList, Organization, and WebPage or WebSite.\n\nThe [Schema Generator](https://www.getgeology.com/tools/schema-generator) runs this whole pipeline on a URL you give it: extract, classify, generate, score. 
Then, after you fill the placeholders, run the result through the [Schema Validator](https://www.getgeology.com/tools/schema-validator) to confirm your edits did not break the block against schema.org rules.\n\nIf you take one idea from how this is built, take the ordering. The instinct with a capable model is to hand it the whole problem and admire what comes back. The better discipline is to figure out which decisions must never be probabilistic, classification and fact-finding here, settle those in code, and let the model do only the bounded part that remains. The pipeline is more reliable not because the model is weaker but because we gave it less room to be wrong.\n\n*Mehul Jain is an AI entrepreneur and product builder. He works on Geology, a GEO platform.*", "url": "https://wpnews.pro/news/auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt", "canonical_source": "https://dev.to/geology_ai/auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt-243a", "published_at": "2026-06-04 16:23:22+00:00", "updated_at": "2026-06-04 16:42:49.894650+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-tools", "ai-research", "ai-ethics"], "entities": ["Gemini", "JSON-LD", "LLM"], "alternates": {"html": "https://wpnews.pro/news/auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt", "markdown": "https://wpnews.pro/news/auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt.md", "text": "https://wpnews.pro/news/auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt.txt", "jsonld": "https://wpnews.pro/news/auto-generating-json-ld-page-signals-type-heuristics-and-a-careful-gemini-prompt.jsonld"}}