{"slug": "how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap", "title": "How I built pairwise AI model compare pages with Claude Haiku and a budget cap", "summary": "A technical approach to generating AI model comparison pages for a directory, where the author reduced the potential 19,900 model pairs to a manageable 50 by only comparing the top 4 models within each of 8 pipeline tags. The system uses Claude Haiku for content generation, employs deterministic slug creation for idempotent nightly ETL runs, and caches system prompts to minimize API costs. The author acknowledges that the current approach misses cross-pipeline comparisons (e.g., Whisper vs Gemini-Vision), which would be more valuable for users unfamiliar with the AI landscape.", "body_md": "When I added compare pages to the [Top AI Tools directory](https://dev.to/articles/three-sites-experiment), the first question I had to answer was: how many pairs am I actually looking at? With roughly 200 models across 8 pipeline tags, the naive upper bound is 200 × 199 / 2 ≈ 19,900 pairs. Generating content for each one with Claude Haiku would cost somewhere around $20 per run — not ruinous, but not something I wanted to run daily without thinking carefully.\n\nHere's what I actually built, where it falls short, and what I'd do differently if starting over.\n\n## The combinatorics problem\n\nModel compare pages exist for a specific type of query: \"llama 3 vs mistral 7b\", \"stable diffusion vs sdxl\", \"whisper vs wav2vec2\". These are high-intent queries — the user has already narrowed down to a shortlist and wants a concrete decision nudge. 
The [static SSG approach I'm running](https://dev.to/articles/static-ssg-vs-dynamic-ai-rendering-directory-seo) means I need to precompute each compare page at build time, which puts pressure on how many pages I can afford to generate.\n\nThe solution I landed on: group by `pipeline_tag`, pair the top-4 models by download count within each group, then cap total pairs with a `COMPARE_LIMIT` env var. Within a single pipeline like `text-generation`, the top 4 models give 6 pairs (4 choose 2). Across 8 active pipelines that's at most 8 × 6 = 48 pairs, so the env cap of 50 covers all of them with a little room to grow.\n\n``` ts\n// Group models by pipeline tag, skipping models that have none.\nconst byPipe = new Map<string, typeof models>();\nfor (const m of models) {\n  if (!m.pipeline_tag) continue;\n  const arr = byPipe.get(m.pipeline_tag) ?? [];\n  arr.push(m);\n  byPipe.set(m.pipeline_tag, arr);\n}\n\n// Within each pipeline, pair up the top 4 models by download count.\nconst pairs: Array<[Model, Model]> = [];\nfor (const [, list] of byPipe) {\n  const sorted = [...list].sort((a, b) => b.downloads - a.downloads);\n  const take = sorted.slice(0, 4);\n  for (let i = 0; i < take.length; i++) {\n    for (let j = i + 1; j < take.length; j++) {\n      pairs.push([take[i]!, take[j]!]);\n    }\n  }\n}\n// MAX is the pair cap from the COMPARE_LIMIT env var (50 by default).\nconst chosen = pairs.slice(0, MAX);\n```\n\nThe pairing happens entirely within pipelines right now, which means I'm covering \"llama vs mistral\" (both `text-generation`) but not \"whisper vs gemini-vision\" (cross-pipeline). Cross-pipeline comparisons are actually more valuable for users who don't know the landscape yet, so that's the next iteration.\n\n## The pair_slug and idempotent inserts\n\nThe slug for each compare pair is constructed deterministically: sort the two model slugs alphabetically, join with `--vs--`. 
So whether the ETL processes `(llama-3, mistral-7b)` or `(mistral-7b, llama-3)`, the slug is always `llama-3--vs--mistral-7b`.\n\n``` ts\nconst pairSlug = [a.slug, b.slug].sort().join(\"--vs--\");\n```\n\nThis makes the entire ETL idempotent. The script runs every night. If all pairs already exist in the DB, it exits in a couple of seconds without a single Claude call. I check before inserting rather than using `INSERT OR IGNORE` at the SQL level, because the explicit check lets me count skipped vs generated in the same run, which I log:\n\n```\n[compare] done — generated: 3, skipped: 47\n```\n\nThis matters for monitoring. A run that generates 0 and skips 50 is healthy. A run that generates 0 and skips 0 (nothing in DB, nothing processed) would indicate a bug.\n\n## Claude Haiku with system-prompt caching\n\nI reuse the [shared Haiku client I built in week one](https://dev.to/articles/shared-claude-haiku-client-prompt-caching), which handles `cacheSystem: true` on the system prompt. Since the system prompt (the JSON schema instruction) is identical across all compare calls, the first call primes the cache and subsequent calls are billed at the discounted cache-read rate for that prefix.\n\nThe user prompt includes both model names, their authors, pipeline tags, and up to 400 characters of their existing summaries (which come from the earlier content generation step):\n\n``` ts\nconst userPrompt = `Compare these two AI models:\nA: ${a.name} (author: ${a.author ?? \"unknown\"}, pipeline: ${a.pipeline_tag ?? \"unknown\"})\n   Summary: ${a.summary?.slice(0, 400) ?? \"(none)\"}\nB: ${b.name} (author: ${b.author ?? \"unknown\"}, pipeline: ${b.pipeline_tag ?? \"unknown\"})\n   Summary: ${b.summary?.slice(0, 400) ?? \"(none)\"}\n\nProduce the JSON comparison.`;\n```\n\nTruncating summaries at 400 characters keeps the user prompt lean. Compare pages are about the *delta* between two models, not a rehash of each model individually. 
I already have dedicated model pages for depth; the compare page needs to answer \"which one, for what\", and that takes maybe 6 sentences total.\n\nThe system prompt requests a JSON object with `summary`, `differences` (array), `similarities` (array), and `recommendation`. Keeping the output shape narrow means Haiku rarely wanders off-schema.\n\n## JSON parsing with a regex fence\n\nEven with tight prompting, Haiku occasionally produces JSON with an explanation preamble: \"Here is the comparison:\" followed by the actual object. Strict `JSON.parse` on the raw output would throw. I extract the outermost `{...}` block with a regex before parsing:\n\n``` ts\nfunction parseCompare(text: string, fb: CompareData): CompareData {\n  try {\n    // Greedy match: first \"{\" through last \"}\", which drops any preamble.\n    const m = text.match(/\\{[\\s\\S]*\\}/);\n    if (!m) return fb;\n    const p = JSON.parse(m[0]);\n    // Accept each field only if it has the expected type; otherwise use the fallback value.\n    return {\n      summary: typeof p.summary === \"string\" ? p.summary : fb.summary,\n      differences: Array.isArray(p.differences)\n        ? p.differences.map(String)\n        : fb.differences,\n      similarities: Array.isArray(p.similarities)\n        ? p.similarities.map(String)\n        : fb.similarities,\n      recommendation:\n        typeof p.recommendation === \"string\"\n          ? p.recommendation\n          : fb.recommendation,\n    };\n  } catch {\n    // JSON.parse threw: return the whole fallback struct.\n    return fb;\n  }\n}\n```\n\nEach field is validated individually before being accepted. If `differences` comes back as a string (occasional Haiku behavior when it conflates the array with a comma-separated list), the page falls back to the template for that field rather than crashing.\n\nThe fallback struct is worth writing carefully. I spent five minutes on mine and it shows:\n\n``` ts\nconst fb: CompareData = {\n  summary: `${a.name} and ${b.name} are both ${a.pipeline_tag} models. 
See each entry for specifics.`,\n  differences: [\"See individual model pages for architecture and use cases.\"],\n  similarities: [\"Both are open-source models on HuggingFace.\"],\n  recommendation: \"Pick based on your compute budget and specific task requirements.\",\n};\n```\n\nA user landing on a fallback-generated compare page gets a technically-true page that directs them to the model pages rather than a blank or error state. The `model_used` column in the DB records `\"fallback-template\"` for these rows, which I use to identify candidates for regeneration.\n\n## Storage in libSQL and the static JSON dump\n\nCompare data lives in a `model_compare` table in [Turso libSQL](https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo), with a unique constraint on `pair_slug`. After the ETL loop, everything gets dumped to `compare.json` for the static build:\n\n``` ts\nimport { writeFile } from \"node:fs/promises\";\n\n// `db` is the libSQL client used by the rest of the ETL script.\nconst all = await db.execute(\n  `SELECT * FROM model_compare ORDER BY slug_a, slug_b`\n);\n// The array columns are stored as JSON strings, so parse them back into arrays.\nconst entries = all.rows.map((r) => ({\n  slug_a: String(r.slug_a),\n  slug_b: String(r.slug_b),\n  pair_slug: String(r.pair_slug),\n  summary: r.summary ? String(r.summary) : \"\",\n  differences: r.differences ? JSON.parse(String(r.differences)) as string[] : [],\n  similarities: r.similarities ? JSON.parse(String(r.similarities)) as string[] : [],\n  recommendation: r.recommendation ? String(r.recommendation) : \"\",\n}));\nawait writeFile(\"./src/data/compare.json\", JSON.stringify(entries, null, 2));\n```\n\nThe Astro build reads this JSON at build time, generating one static page per pair. No runtime DB calls, no cold starts. The tradeoff is freshness: compare content is up to 24 hours stale. For \"llama 3.1 vs llama 3.2\", that's fine, since the models don't change daily.\n\nI validate the JSON-LD on compare pages through the [post-deploy audit CI step](https://dev.to/articles/jsonld-audit-post-deploy-ci) the same way I do for individual model pages. 
Structured data matters more on comparison queries because those are the exact queries that AI Overviews tend to surface, so getting the schema right is worth the CI overhead.\n\nThe [Astro slug generation](https://dev.to/articles/astro-slug-pages-unique-after-adsense-scaled-content-abuse) for compare pages uses the `pair_slug` directly. The URL pattern is `/compare/llama-3--vs--mistral-7b/`, which is ugly but unambiguous: the double-dash separator makes it clear this is a two-part slug rather than a hyphen in a model name.\n\n## What I'd change starting over\n\n**Generate cross-pipeline pairs from day one.** The most useful compare queries aren't \"llama 3.1 vs llama 3.2\"; users who care about that distinction already know. The interesting queries are cross-category: \"should I run inference on a text-generation model or use a RAG pipeline?\" I skipped this to stay within the budget cap, but it means I'm missing the long-tail traffic that would actually be differentiated from generic model pages.\n\n**Drive pair selection from search query logs.** Right now I pick pairs by download rank. A better signal would be which pairs users actually search for. Pagefind runs client-side and doesn't log queries to any server, so I'd need a thin logging endpoint, something like a POST to a [GitHub Actions](https://github.com/features/actions)-triggered function that appends to a JSONL file. Then the ETL reads the top-N ungenerated pairs from the log. This is a small amount of infrastructure but it would make the pair selection much more demand-driven.\n\n**Raise the budget cap.** `MAX=50` is conservative. At current Haiku pricing with prompt caching, 500 pairs would cost roughly $0.10 per nightly run. I was cautious when I set the default, but I've watched the billing closely and the actual spend is a fraction of what I modeled. 
I'll bump this to 200 in the next ETL config update.\n\nThe [itch.io entries pattern I added to the indie-games directory](https://dev.to/articles/how-i-added-itchio-entries-to-a-steam-only-astro-directory) taught me to plan for the second data source earlier. Compare pages have the same shape: a join between two rows. Getting the abstraction right before you have 500+ rows in the DB is much easier than retrofitting it.\n\n## FAQ\n\n**Does the ETL run every night even when no new models are added?**\n\nYes, but it's nearly free when nothing is new. The check-before-insert means most nights it does 50 DB reads and exits in under 3 seconds without touching the Claude API. The console output shows `generated: 0, skipped: 50`, which is the signal that everything is up to date.\n\n**What happens when Claude returns malformed JSON?**\n\n`parseCompare` catches the error and returns the fallback struct. The row is still written to the DB with `model_used = \"fallback-template\"`, which I can query to find rows worth retrying. In practice, this happens on maybe 2-3% of generations, usually when the two models have very sparse metadata and Haiku doesn't have enough context to produce structured output.\n\n**Does the compare.json file get unwieldy as pairs accumulate?**\n\nAt 50 pairs it's roughly 25KB. At 500 pairs it'll be around 250KB, which is still fine for build-time loading in Astro. If I ever hit 5,000 pairs I'd split the file by `pipeline_tag` and lazy-import only the relevant subset for each page. For now, one flat JSON file is simpler and fast enough.\n\n**Why not compute compare content at request time with an edge function?**\n\nCold starts and cost. An edge function hit for each compare page view would add 200-500ms of latency (Haiku inference + DB round trip) and would cost much more per-pageview than the nightly batch approach. The content also doesn't need to be fresher than daily, because model capabilities don't shift on an hourly basis. 
Static precomputation is the right tradeoff here, consistent with [the broader bet on static SSG](https://dev.to/articles/static-ssg-vs-dynamic-ai-rendering-directory-seo) I'm running on all three sites.\n\n**How do you handle the case where a model is removed from HuggingFace?**\n\nRight now, I don't. If model `foo` is deleted from [HuggingFace](https://huggingface.co) but its compare rows are still in the DB, those compare pages will still be generated at build time. They'll keep the old data until the model's row in `models.json` is removed, which only happens if the model falls out of the top-500 in the nightly fetch. It's a known gap. For now, the risk is low; popular models don't disappear. A more robust system would cross-reference the compare table against the model table and tombstone orphaned pairs.\n\nRelated: [How I built a shared Claude Haiku client with system-prompt caching](https://dev.to/articles/shared-claude-haiku-client-prompt-caching) | [Turso libSQL vs Cloudflare D1 for an Astro monorepo](https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo)\n\n*Part of an ongoing 6-month experiment running three AI-curated directory sites. 
The technical claims here are real; this article was AI-assisted.*", "url": "https://wpnews.pro/news/how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap", "canonical_source": "https://dev.to/morinaga/how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap-ia0", "published_at": "2026-05-20 22:12:43+00:00", "updated_at": "2026-05-20 22:33:43.207784+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "open-source"], "entities": ["Claude Haiku", "Top AI Tools", "llama 3", "mistral 7b", "stable diffusion", "sdxl", "whisper", "wav2vec2"], "alternates": {"html": "https://wpnews.pro/news/how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap", "markdown": "https://wpnews.pro/news/how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap.md", "text": "https://wpnews.pro/news/how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap.txt", "jsonld": "https://wpnews.pro/news/how-i-built-pairwise-ai-model-compare-pages-with-claude-haiku-and-a-budget-cap.jsonld"}}