{"slug": "building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture", "title": "Building an AI Visibility Scanner: Hybrid AI Analysis Architecture", "summary": "A developer built GetCiteFlow, an AI visibility scanner that uses a hybrid analysis architecture combining LLM evaluation with deterministic checks to measure how well websites are cited by AI search engines like ChatGPT and Claude. The tool analyzes six dimensions including AI visibility, FAQ coverage, and entity clarity, addressing the gap where traditional SEO metrics have only a ~0.3 correlation with AI citation rates.", "body_md": "If you've been following the AI space, you've likely noticed the shift: users are no longer just \"Googling it.\" They're asking ChatGPT, Perplexity, Claude, and Gemini directly. This changes everything about how content gets discovered — and it's a problem most site owners haven't even realized they have.\n\nTraditional SEO metrics (backlinks, domain authority, keyword stuffing) have only a ~0.3 correlation with AI citation rates. A site that ranks #1 on Google can be completely invisible to ChatGPT. This is the gap **Generative Engine Optimization (GEO)** fills.\n\nIn this article, I'll walk through what GEO actually means from a technical perspective, then dive into a real implementation — using [GetCiteFlow](https://www.getciteflow.ai), the AI visibility scanner I built — with code, architecture decisions, and lessons learned.\n\nWhen an AI like ChatGPT or Claude answers a user query, it doesn't \"rank\" pages the way Google does. Instead, it looks for signals that make content **easy to cite, summarize, and attribute**.\n\nThrough our analysis of thousands of sites, we found six dimensions that matter most:\n\n| Dimension | What It Measures |\n|---|---|\n| AI Visibility | Can the AI find and parse your content? |\n| FAQ Coverage | Do you have structured FAQ schema? |\n| Entity Clarity | Does the page clearly define what it is? 
|\n| Authority | Is there original research or named authors? |\n| Content Structure | Are lists, tables, and headings being used? |\n| Summary Optimization | Is there a clear summary for AI to extract? |\n\nThe key insight: **AI search engines don't read pages the way humans do.** They look for machine-readable signals — structured data, entity definitions, `llms.txt` files — not just keyword density.\n\nGetCiteFlow uses a **hybrid analysis architecture**. Instead of relying solely on an LLM to evaluate a site (which can hallucinate), we combine two independent analysis layers:\n\n```\nUser enters URL\n      |\n      v\n  [1] Scrape site → extract signals (HTML parsing)\n      |\n      v\n  [2] Format signals → send to AI (Gemini/OpenAI/Deepseek)\n      |\n      v\n  [3] AI returns structured JSON (score, breakdown, suggestions)\n      |\n      v\n  [4] Merge with deterministic checks (lists, meta length, etc.)\n      |\n      v\n  [5] Cache result + render report\n```\n\nHere's the core orchestration function from `lib/analyze.ts`:\n\n```\nexport async function analyzeSite(url: string): Promise<Record<string, unknown>> {\n  const cacheKey = `report:${url}`;\n  const cached = cacheGet<Record<string, unknown>>(cacheKey);\n  if (cached) return { ...cached, cached: true };\n\n  // Deduplicate concurrent requests to the same URL\n  if (pendingCache.has(cacheKey)) {\n    return pendingCache.get(cacheKey)!;\n  }\n\n  const analyze = async () => {\n    const activeProvider = getProvider();\n    const fn = providerFns[activeProvider];\n    const report = await fn(url);\n\n    // Merge AI results with deterministic signal detection\n    const siteData = await getSiteData(url);\n    const deterministicMissing = getDeterministicMissing(siteData);\n    const aiMissing = (report.missing as string[]) || [];\n    const mergedMissing = [...new Set([...aiMissing, ...deterministicMissing])];\n\n    const result = { ...report, missing: mergedMissing };\n    cacheSet(cacheKey, 
result, CACHE_TTL_MS);\n    return result;\n  };\n\n  // Drop the in-flight entry once it settles so later requests re-check the cache\n  const promise = analyze().finally(() => pendingCache.delete(cacheKey));\n  pendingCache.set(cacheKey, promise);\n  return promise;\n}\n```\n\nWhy hybrid? **LLMs are great at qualitative judgment but bad at counting.** An AI might miss that a site has no `<ul>` tags, but a simple regex check won't. By combining both, we get the best of both worlds.\n\nThe scraper (`lib/scrape.ts`) is a pure-HTTP fetcher — no headless browser. It fetches the HTML, parses structured signals using regex, and checks for critical static files.\n\n``` js\nasync function extractFromHtml(html: string) {\n  const titleMatch = /<title[^>]*>([^<]*)<\\/title>/i.exec(html);\n  const hasOpenGraph =\n    /<meta[^>]+property=[\"']og:(title|description|image)[\"']/i.test(html);\n  const hasFaqSchema = /* parses JSON-LD <script> blocks */;\n  const hasOrderedLists = /<ol[\\s>]/i.test(html);\n  const avgParagraphLength = /* calculates from <p> tags */;\n  const hasSummarySection =\n    /\\b(key takeaways?|executive summary|tldr|tl;dr)\\b/i.test(bodyLower);\n\n  return { title, hasOpenGraph, hasFaqSchema, hasOrderedLists, ... };\n}\n```\n\nWe also check three critical files in parallel:\n\n```\nconst [hasRobotsTxt, hasSitemap, hasLlmstxt] = await Promise.all([\n  checkStaticFile(resolvedOrigin, \"/robots.txt\"),\n  checkStaticFile(resolvedOrigin, \"/sitemap.xml\"),\n  checkStaticFile(resolvedOrigin, \"/llms.txt\"),\n]);\n```\n\nThe `llms.txt` check is particularly important — it's a relatively new standard (proposed by the llmstxt community) that creates a machine-readable site index specifically for AI crawlers. Sites with an `llms.txt` file get significantly better AI citation rates.\n\nFor the AI analysis, we use Google Gemini's native structured output support. 
This is critical — without it, parsing free-form JSON from an LLM is fragile and error-prone.\n\n``` js\nasync function analyzeWithGemini(url: string) {\n  const siteData = formatSiteData(url, await getSiteData(url));\n  const response = await getAiClient().models.generateContent({\n    model: \"gemini-3-flash-preview\",\n    contents: ANALYZE_PROMPT(url, siteData),\n    config: {\n      temperature: 0,           // minimizes sampling randomness\n      responseMimeType: \"application/json\",\n      responseSchema: {\n        type: Type.OBJECT,\n        required: [\"score\", \"breakdown\", \"missing\", \"suggestions\", \"summary\"],\n        properties: {\n          score: { type: Type.NUMBER },\n          breakdown: {\n            type: Type.OBJECT,\n            properties: {\n              aiVisibility: { type: Type.NUMBER },\n              faqCoverage: { type: Type.NUMBER },\n              entityClarity: { type: Type.NUMBER },\n              authority: { type: Type.NUMBER },\n              contentStructure: { type: Type.NUMBER },\n              summaryOptimization: { type: Type.NUMBER },\n            },\n          },\n          missing: { type: Type.ARRAY, items: { type: Type.STRING } },\n          suggestions: { type: Type.ARRAY, items: { type: Type.STRING } },\n          summary: { type: Type.STRING },\n        },\n      },\n    },\n  });\n\n  return JSON.parse(response.text || \"{}\");\n}\n```\n\nKey design decisions here:\n\n`temperature: 0`\n\n`responseSchema`\n\nHere's the prompt template (`lib/ai-provider.ts`):\n\n``` js\nexport const ANALYZE_PROMPT = (url: string, siteData?: string) =>\n  `Analyze the AI visibility (GEO) of the website: ${url}.\n\n${siteData ? 
`Here are the actual signals detected from the website:\\n${siteData}\\n\\nBase your analysis on these real signals rather than guessing.` : ''}\nEvaluate these factors specifically using the signals above:\n- contentStructure (0-100): How well the content is structured for AI parsing...\n- summaryOptimization (0-100): How optimized the page is for AI summarization...\n\nReturn ONLY a JSON object with these exact keys:\n{ \"score\": <number 0-100>, \"breakdown\": { ... }, \"missing\": [...], \"suggestions\": [...], \"summary\": \"...\" }`;\n```\n\nWe also support OpenAI and Deepseek as fallback providers, switched via the `AI_PROVIDER_DEFAULT` environment variable. The architecture makes adding new providers trivial — just implement the same function signature.\n\nSince every analysis hits an LLM API (costly) and scrapes a site (slow), caching is essential. We use a simple in-memory `Map` with a 1-hour TTL:\n\n``` js\ninterface CacheEntry<T> {\n  data: T;\n  expiresAt: number;\n}\n\nconst store = new Map<string, CacheEntry<unknown>>();\nconst CLEAN_INTERVAL = 60_000;\nlet lastClean = 0;\n\nfunction clean() {\n  const now = Date.now();\n  if (now - lastClean < CLEAN_INTERVAL) return;\n  lastClean = now;\n  for (const [key, entry] of store) {\n    if (now > entry.expiresAt) store.delete(key);\n  }\n}\n\nexport function cacheGet<T>(key: string): T | null {\n  clean();\n  const entry = store.get(key);\n  if (!entry || Date.now() > entry.expiresAt) return null;\n  return entry.data as T;\n}\n\nexport function cacheSet<T>(key: string, data: T, ttlMs: number): void {\n  store.set(key, { data, expiresAt: Date.now() + ttlMs });\n}\n```\n\nWe also use a **pending cache** (`pendingCache` in `analyze.ts`) to deduplicate concurrent requests for the same URL — so if two users submit the same URL simultaneously, only one analysis runs:\n\n``` js\nconst pendingCache = new Map<string, Promise<Record<string, unknown>>>();\n// ...\nif (pendingCache.has(cacheKey)) {\n  
return pendingCache.get(cacheKey)!;  // wait for in-flight request\n}\n```\n\nFor production, you'd want Redis or another distributed cache. This in-memory approach works well for single-instance deployments (like a Vercel serverless function handling concurrent requests within one warm instance), since the cache is not shared across instances.\n\nWe use Upstash Redis for rate limiting with a sliding window. The critical design choice: **fail open when Redis is unavailable**, not fail closed.\n\n``` js\nlet ratelimit: Ratelimit | null = null;\n\ntry {\n  const redis = Redis.fromEnv();\n  ratelimit = new Ratelimit({\n    redis,\n    limiter: Ratelimit.slidingWindow(max, \"1 h\"),\n    analytics: true,\n    prefix: \"@citeflow/ratelimit\",\n  });\n} catch {\n  // Redis init failed — fall through, rate limiting is degraded\n}\n\nexport async function checkRateLimit(ip: string): Promise<RateLimitResult> {\n  if (!ratelimit) {\n    return { success: true };  // allow request when Redis is down\n  }\n  try {\n    const { success } = await ratelimit.limit(ip);\n    return success\n      ? { success: true }\n      : { success: false, reason: 'rate_limited' };\n  } catch {\n    return { success: true };  // fail open on runtime Redis errors too\n  }\n}\n```\n\nWhy fail open? Because the free-tier tool is meant to be accessible. Blocking all users because Redis is having a bad day is worse than temporarily bypassing rate limits for a few requests.\n\nThe report page at `app/report/[domain]/page.tsx` uses **Server-Side Rendering (SSR)** with `maxDuration: 60` (Vercel's timeout for Pro plans). 
This is necessary because the page awaits the full analysis (scrape plus LLM call) before rendering:\n\n``` js\nexport const maxDuration = 60;\n\nexport default async function ReportPage({ params }) {\n  const { domain } = await params;\n  // headers() is async in Next.js 15, so await it before reading the IP\n  const ip = getClientIp(await headers());\n\n  const result = await getReport(domain, ip);\n\n  if (!result.ok) {\n    // Render error states: rate_limited, timeout, failed\n    return <ErrorState reason={result.reason} />;\n  }\n\n  return <ReportView data={result.data} />;\n}\n```\n\nWe also generate dynamic OG images per report using the Edge runtime:\n\n``` js\n// app/api/og/route.tsx — runs on Vercel Edge\nexport const runtime = 'edge';\nexport const dynamic = 'force-dynamic';\n```\n\nThis means every report page has a unique social preview showing the domain and score — critical for shareability on X/Twitter and LinkedIn.\n\nThe first version of our AI analysis didn't feed real scraped signals into the prompt. The AI made up plausible-sounding but completely wrong assessments. **Always provide ground-truth data in the prompt and instruct the model to base its analysis on that data.**\n\nBefore Gemini supported `responseSchema`, we used `\"output valid JSON only\"` in the prompt. It worked ~70% of the time. With structured output, it's ~99.9%. Use native structured output whenever your provider supports it.\n\nLLM API calls are expensive ($0.15–$3.00 per million tokens) and slow (2–5 seconds). An in-memory cache with request deduplication eliminated redundant calls entirely. For Vercel deployments with multiple concurrent invocations, the `pendingCache` pattern is essential.\n\nThe AI often missed simple things like \"no lists on the page\" or \"meta description is too short.\" These are trivial to detect with regex but easy for an LLM to gloss over. The hybrid approach catches both.\n\n`llms.txt` is a proposed standard, not a W3C spec. FAQ Schema behavior in AI search changes monthly. Building this kind of tool means constantly iterating as the ecosystem evolves. 
We treat our signal detection as a pluggable layer that can be updated independently of the AI analysis.\n\nThis architecture is running at [GetCiteFlow](https://www.getciteflow.ai) — feel free to test your own site and see how the analysis works end-to-end. The tech stack: Next.js 15 (App Router, SSR, Edge Functions), React 19 with Tailwind CSS 4 + shadcn/ui, Google Gemini for AI analysis (OpenAI/Deepseek fallbacks), Upstash Redis for rate limiting, deployed on Vercel.\n\n*GEO is still in its early days — much like SEO was in 1998. The sites that optimize for AI search today will have a compound advantage as AI assistants become the primary interface for information discovery.*\n\n*If you're building something in this space or have questions about the architecture, I'd love to hear from you. Leave a comment below or reach out on X/Twitter.*\n\n*Built by GetCiteFlow — AI visibility analysis for the AI-search era.*", "url": "https://wpnews.pro/news/building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture", "canonical_source": "https://dev.to/neilyan/building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture-ccg", "published_at": "2026-06-16 08:46:30+00:00", "updated_at": "2026-06-16 08:47:09.653377+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-products", "ai-tools", "developer-tools"], "entities": ["GetCiteFlow", "ChatGPT", "Claude", "Perplexity", "Gemini", "OpenAI", "Deepseek", "Google"], "alternates": {"html": "https://wpnews.pro/news/building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture", "markdown": "https://wpnews.pro/news/building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture.md", "text": "https://wpnews.pro/news/building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture.txt", "jsonld": "https://wpnews.pro/news/building-an-ai-visibility-scanner-hybrid-ai-analysis-architecture.jsonld"}}