Source: dev.to | Topic: artificial-intelligence

How I built pairwise AI model compare pages with Claude Haiku and a budget cap

A technical approach to generating AI model comparison pages for a directory, where the author reduced the potential 19,900 model pairs to a manageable 50 by only comparing the top 4 models within each of 8 pipeline tags. The system uses Claude Haiku for content generation, employs deterministic slug creation for idempotent nightly ETL runs, and caches system prompts to minimize API costs. The author acknowledges that the current approach misses cross-pipeline comparisons (e.g., Whisper vs Gemini-Vision), which would be more valuable for users unfamiliar with the AI landscape.

10 min read | Published May 20, 2026

When I added compare pages to the Top AI Tools directory, the first question I had to answer was: how many pairs am I actually looking at? With roughly 200 models across 8 pipeline tags, the naive upper bound is 200 × 199 / 2 = 19,900 pairs. Generating content for each one with Claude Haiku would cost somewhere around $20 per run: not ruinous, but not something I wanted to run daily without thinking carefully.
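As a quick check on that arithmetic, here is a minimal sketch of the pair counts involved. The function names are illustrative, and the group size of 25 models per tag is an assumption that simply spreads 200 models evenly over 8 tags.

```typescript
// Number of unordered pairs among n items: n choose 2.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

// Pairs produced when only the top `topK` models of each group are paired.
// `groupSizes` holds the number of models per pipeline tag (illustrative input).
function cappedPairCount(groupSizes: number[], topK: number): number {
  return groupSizes.reduce(
    (sum, size) => sum + pairCount(Math.min(size, topK)),
    0,
  );
}

// Assumption: 200 models spread evenly over 8 tags, 25 each.
const evenGroups = Array.from({ length: 8 }, () => 25);

console.log(pairCount(200)); // all-vs-all: 19900
console.log(pairCount(4)); // top 4 within one pipeline: 6
console.log(cappedPairCount(evenGroups, 4)); // top 4 in each of 8 pipelines: 48
```

Restricting pairing to the top 4 per tag shrinks the problem by a factor of roughly 400, which is what makes a nightly run affordable.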

Here's what I actually built, where it falls short, and what I'd do differently if starting over.

The combinatorics problem

Model compare pages exist for a specific type of query: "llama 3 vs mistral 7b", "stable diffusion vs sdxl", "whisper vs wav2vec2". These are high-intent queries: the user has already narrowed down to a shortlist and wants a concrete decision nudge. The static SSG approach I'm running means I need to precompute each compare page at build time, which puts pressure on how many pages I can afford to generate.

The solution I landed on: group by pipeline_tag, pair the top-4 models by download count within each group, then cap total pairs with a COMPARE_LIMIT env var. Within a single pipeline like text-generation, the top 4 models give 6 pairs (4 choose 2). Across 8 active pipelines that's 8 × 6 = 48 pairs, or fewer if any pipeline has under 4 models. The env cap of 50 means I stay within that budget while having room to grow.

// Group models by pipeline tag; models without a tag are skipped.
const byPipe = new Map<string, typeof models>();
for (const m of models) {
  if (!m.pipeline_tag) continue;
  const arr = byPipe.get(m.pipeline_tag) ?? [];
  arr.push(m);
  byPipe.set(m.pipeline_tag, arr);
}

// Within each pipeline, pair up the top 4 models by download count.
const pairs: Array<[Model, Model]> = [];
for (const [, list] of byPipe) {
  const sorted = [...list].sort((a, b) => b.downloads - a.downloads);
  const take = sorted.slice(0, 4); // slice() already clamps to the list length
  for (let i = 0; i < take.length; i++) {
    for (let j = i + 1; j < take.length; j++) {
      pairs.push([take[i]!, take[j]!]);
    }
  }
}
// Apply the budget cap. Pairs are in pipeline insertion order,
// so the cap trims the last-seen pipelines first.
const chosen = pairs.slice(0, MAX);
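The snippet slices with MAX while the text describes a COMPARE_LIMIT env var, and the post does not show how the two are connected. One plausible wiring is sketched below; the helper name readCompareLimit and the default of 50 are assumptions for illustration.

```typescript
// Hypothetical helper: read a positive-integer cap from an env-like record,
// falling back to a default when the variable is missing or unusable.
function readCompareLimit(
  env: Record<string, string | undefined>,
  fallback = 50,
): number {
  const raw = env["COMPARE_LIMIT"];
  if (raw === undefined) return fallback;
  const n = Number.parseInt(raw, 10);
  // NaN, zero, and negative values all fall back to the default.
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// In the ETL script this would presumably be used as:
//   const MAX = readCompareLimit(process.env);
```

Falling back to the default on bad input, instead of throwing, keeps a typo in the deploy config from turning into an uncapped or zero-pair run.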

The pairing happens entirely within pipelines right now, which means I'm covering "llama vs mistral" (both text-generation) but not "whisper vs gemini-vision" (cross-pipeline). Cross-pipeline comparisons are actually more valuable for users who don't know the landscape yet; that's the next iteration.
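The post does not say how cross-pipeline pairs would be chosen. One cheap option, sketched here purely as an assumption, is to pair only the single most-downloaded model of each pipeline with the leaders of the other pipelines. For 8 pipelines that adds 8 choose 2 = 28 pairs. The RankedModel type and function name are hypothetical.

```typescript
// Hypothetical minimal model shape for this sketch.
interface RankedModel {
  slug: string;
  pipeline_tag: string;
  downloads: number;
}

// Pair the most-downloaded model of each pipeline with the leader
// of every other pipeline.
function crossPipelinePairs(
  models: RankedModel[],
): Array<[RankedModel, RankedModel]> {
  // Keep only the top model per pipeline tag.
  const best = new Map<string, RankedModel>();
  for (const m of models) {
    const current = best.get(m.pipeline_tag);
    if (!current || m.downloads > current.downloads) {
      best.set(m.pipeline_tag, m);
    }
  }

  const leaders = Array.from(best.values());
  const pairs: Array<[RankedModel, RankedModel]> = [];
  for (let i = 0; i < leaders.length; i++) {
    for (let j = i + 1; j < leaders.length; j++) {
      pairs.push([leaders[i]!, leaders[j]!]);
    }
  }
  return pairs;
}
```

Whether leader-vs-leader pairs are the ones users actually want is an open question; download rank is only a proxy for demand.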

The pair_slug and idempotent inserts

The slug for each compare pair is constructed deterministically: sort the two model slugs alphabetically, join with --vs--. So whether the ETL processes (llama-3, mistral-7b) or (mistral-7b, llama-3), the slug is always llama-3--vs--mistral-7b.

const pairSlug = [a.slug, b.slug].sort().join("--vs--");
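Wrapped in a function, the order independence is easy to check. This is a small sketch; the name makePairSlug is illustrative and not from the original script.

```typescript
// Build an order-independent slug for a pair of model slugs.
function makePairSlug(slugA: string, slugB: string): string {
  // Default sort() compares strings by UTF-16 code units, so the result
  // is stable as long as slugs are consistently cased.
  return [slugA, slugB].sort().join("--vs--");
}

console.log(makePairSlug("llama-3", "mistral-7b")); // llama-3--vs--mistral-7b
console.log(makePairSlug("mistral-7b", "llama-3")); // llama-3--vs--mistral-7b
```

One caveat: the default sort places uppercase letters before lowercase ones, so mixed-case slugs should be lowercased before they are paired.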

This makes the entire ETL idempotent. The script runs every night. If all pairs already exist in the DB, it exits in a couple of seconds without a single Claude call. I check before inserting rather than using INSERT OR IGNORE at the SQL level; the explicit check lets me count skipped vs generated in the same run, which I log:

[compare] done — generated: 3, skipped: 47

This matters for monitoring. A run that generates 0 and skips 50 is healthy. A run that generates 0 and skips 0 (nothing in DB, nothing processed) would indicate a bug.
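The check-before-insert loop and the health rule can be sketched as follows. This is a simplified, synchronous stand-in: the Set plays the role of the pair_slug values already in the DB, and the generate callback stands in for the Claude call plus insert. All names here are assumptions.

```typescript
interface RunStats {
  generated: number;
  skipped: number;
}

// Process candidate pair slugs, skipping any that already exist.
function runCompare(
  candidates: string[],
  existing: Set<string>,
  generate: (pairSlug: string) => void,
): RunStats {
  const stats: RunStats = { generated: 0, skipped: 0 };
  for (const slug of candidates) {
    if (existing.has(slug)) {
      stats.skipped++;
      continue;
    }
    generate(slug);
    existing.add(slug); // a second run over the same input becomes a no-op
    stats.generated++;
  }
  return stats;
}

// A run that neither generated nor skipped anything processed no pairs at all,
// which points at an upstream problem (for example, an empty model list).
function looksBroken(stats: RunStats): boolean {
  return stats.generated === 0 && stats.skipped === 0;
}
```

Running it twice over the same candidates shows the idempotence: the first run generates everything, the second skips everything.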

Claude Haiku with system-prompt caching

I reuse the shared Haiku client I built in week one, which handles cacheSystem: true on the system prompt. Since the system prompt (the JSON schema instruction) is identical across all compare calls, the first call primes the cache and subsequent calls pay only the much cheaper cached-read rate on that prefix.
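The shared client is not shown in this post, so what cacheSystem: true does internally is an assumption. In the Anthropic Messages API, prompt caching is requested by marking a system text block with cache_control, and a request body built that way might look like the sketch below. The function name, the max_tokens value, and passing the model ID as a parameter are illustrative choices.

```typescript
// Shapes follow the Anthropic Messages API request body.
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface CompareRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: Array<{ role: "user"; content: string }>;
}

// Build a request whose system prompt is marked as a cacheable prefix.
function buildCompareRequest(
  systemPrompt: string,
  userPrompt: string,
  model: string, // the Haiku model ID in use; not specified in the post
): CompareRequest {
  return {
    model,
    max_tokens: 1024, // assumption: comparisons are short
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userPrompt }],
  };
}
```

The API only caches prefixes above a model-dependent minimum length, so a short schema instruction may not be cached at all. The cache_creation_input_tokens and cache_read_input_tokens fields in the response usage show whether caching actually took effect.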

The user prompt includes both model names, their authors, pipeline tags, and up to 400 characters of their existing summaries (which come from the earlier content generation step):

const userPrompt = `Compare these two AI models:
A: ${a.name} (author: ${a.author ?? "unknown"}, pipeline: ${a.pipeline_tag ?? "unknown"})
   Summary: ${a.summary?.slice(0, 400) ?? "(none)"}
B: ${b.name} (author: ${b.author ?? "unknown"}, pipeline: ${b.pipeline_tag ?? "unknown"})
   Summary: ${b.summary?.slice(0, 400) ?? "(none)"}

Produce the JSON comparison.`;

Truncating summaries at 400 characters keeps the user prompt lean. Compare pages are about the delta between two models, not a rehash of each model individually. I already have dedicated model pages for depth; the compare page needs to answer "which one, for what", and that takes maybe 6 sentences total.

The system prompt requests a JSON object with summary, differences (array), similarities (array), and recommendation. Keeping the output shape narrow means Haiku rarely wanders off-schema.

JSON parsing with a regex fence

Even with tight prompting, Haiku occasionally produces JSON with an explanation preamble: "Here is the comparison:" followed by the actual object. Strict JSON.parse on the raw output would throw. I extract the outermost {...} block with a regex before parsing:

function parseCompare(text: string, fb: CompareData): CompareData {
  try {
    // Greedy match: from the first "{" to the last "}" in the response.
    const m = text.match(/\{[\s\S]*\}/);
    if (!m) return fb;
    const p = JSON.parse(m[0]);
    // Accept each field only if it has the expected type;
    // otherwise use the fallback value for that field.
    return {
      summary: typeof p.summary === "string" ? p.summary : fb.summary,
      differences: Array.isArray(p.differences)
        ? p.differences.map(String)
        : fb.differences,
      similarities: Array.isArray(p.similarities)
        ? p.similarities.map(String)
        : fb.similarities,
      recommendation:
        typeof p.recommendation === "string"
          ? p.recommendation
          : fb.recommendation,
    };
  } catch {
    // JSON.parse failed: return the whole fallback.
    return fb;
  }
}

Each field is validated individually before being accepted. If differences comes back as a string (occasional Haiku behavior when it conflates the array with a comma-separated list), the page falls back to the template for that field rather than crashing.
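To see what the regex step does and does not tolerate, here is the extraction isolated into a small helper. This is a sketch; the name extractJsonObject is illustrative and not part of the original code.

```typescript
// Extract and parse the span from the first "{" to the last "}".
// Returns null when there is no such span or it is not valid JSON.
function extractJsonObject(text: string): unknown {
  const m = text.match(/\{[\s\S]*\}/);
  if (!m) return null;
  try {
    return JSON.parse(m[0]);
  } catch {
    return null;
  }
}

// A chatty preamble before the object is tolerated.
const withPreamble = extractJsonObject(
  'Here is the comparison: {"summary":"ok","differences":"a, b"}',
);

// Stray braces in the surrounding prose widen the match and break the parse.
const withStrayBraces = extractJsonObject('See {x}: {"summary":"ok"}');

console.log(withPreamble, withStrayBraces);
```

Because the match is greedy, any brace in the prose before or after the object ends up inside the matched span and makes the parse fail. In parseCompare that case lands in the catch branch and returns the full fallback, which is a safe failure mode.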

The fallback struct is worth writing carefully. I spent five minutes on mine and it shows:

const fb: CompareData = {
  summary: `${a.name} and ${b.name} are both ${a.pipeline_tag} models. See each entry for specifics.`,
  differences: ["See individual model pages for architecture and use cases."],
  similarities: ["Both are open-source models on HuggingFace."],
  recommendation: "Pick based on your compute budget and specific task requirements.",
};

A user landing on a fallback-generated compare page gets a technically-true page that directs them to the model pages rather than a blank or error state. The model_used column in the DB records "fallback-template" for these rows, which I use to identify candidates for regeneration.
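Finding those regeneration candidates could be as simple as the sketch below. The table and column names come from this post; the helper names and the statement-object shape (sql plus args, which the libSQL client's execute accepts) are how I would write it, not necessarily how the original script does.

```typescript
interface Statement {
  sql: string;
  args: string[];
}

// Parameterized query for rows that were written from the fallback template.
function fallbackRowsQuery(): Statement {
  return {
    sql: "SELECT pair_slug FROM model_compare WHERE model_used = ?",
    args: ["fallback-template"],
  };
}

// Turn result rows into a plain list of pair slugs to retry.
function regenerationCandidates(
  rows: Array<{ pair_slug?: unknown }>,
): string[] {
  return rows
    .map((r) => r.pair_slug)
    .filter((s): s is string => typeof s === "string" && s.length > 0);
}

// Presumed usage inside the ETL script:
//   const res = await db.execute(fallbackRowsQuery());
//   const retry = regenerationCandidates(res.rows);
```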

Storage in libSQL and the static JSON dump

Compare data lives in a model_compare table in Turso libSQL, with a unique constraint on pair_slug. After the ETL loop, everything gets dumped to compare.json for the static build:

const all = await db.execute(
  `SELECT * FROM model_compare ORDER BY slug_a, slug_b`
);
const entries = all.rows.map((r) => ({
  slug_a: String(r.slug_a),
  slug_b: String(r.slug_b),
  pair_slug: String(r.pair_slug),
  summary: r.summary ? String(r.summary) : "",
  differences: r.differences ? JSON.parse(String(r.differences)) as string[] : [],
  similarities: r.similarities ? JSON.parse(String(r.similarities)) as string[] : [],
  recommendation: r.recommendation ? String(r.recommendation) : "",
}));
await writeFile("./src/data/compare.json", JSON.stringify(entries, null, 2));
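The dump implies a table roughly like the one below. The column names are taken from the code in this post; the column types, the NOT NULL choices, and the helper toInsert are assumptions made for illustration.

```typescript
// Assumed schema: arrays are stored as JSON-encoded TEXT,
// which matches the JSON.parse calls in the dump.
const CREATE_MODEL_COMPARE = `
CREATE TABLE IF NOT EXISTS model_compare (
  slug_a         TEXT NOT NULL,
  slug_b         TEXT NOT NULL,
  pair_slug      TEXT NOT NULL UNIQUE,
  summary        TEXT,
  differences    TEXT,
  similarities   TEXT,
  recommendation TEXT,
  model_used     TEXT
);`;

interface CompareRow {
  slug_a: string;
  slug_b: string;
  pair_slug: string;
  summary: string;
  differences: string[];
  similarities: string[];
  recommendation: string;
  model_used: string;
}

// Build a parameterized INSERT, JSON-encoding the array fields.
function toInsert(row: CompareRow): { sql: string; args: string[] } {
  return {
    sql:
      "INSERT INTO model_compare (slug_a, slug_b, pair_slug, summary, " +
      "differences, similarities, recommendation, model_used) " +
      "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    args: [
      row.slug_a,
      row.slug_b,
      row.pair_slug,
      row.summary,
      JSON.stringify(row.differences),
      JSON.stringify(row.similarities),
      row.recommendation,
      row.model_used,
    ],
  };
}
```

Storing the arrays as JSON text keeps the table flat, at the cost of a JSON.parse per row when dumping.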

The Astro build reads this JSON at build time, generating one static page per pair. No runtime DB calls, no cold starts. The tradeoff is freshness: compare content is up to 24 hours stale. For "llama 3.1 vs llama 3.2", that's fine: the models don't change daily.

I validate the JSON-LD on compare pages through the post-deploy audit CI step the same way I do for individual model pages. Structured data matters more on comparison queries because those are the exact queries that AI Overviews tend to surface, so getting the schema right is worth the CI overhead.

The Astro slug generation for compare pages uses the pair_slug directly. The URL pattern is /compare/llama-3--vs--mistral-7b/, which is ugly but unambiguous: the double-dash separator makes it clear this is a two-part slug rather than a hyphen in a model name.
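In Astro, a dynamic route exports getStaticPaths, which returns one object with params and props per page. The mapping from compare.json entries might look like this sketch; the route file name src/pages/compare/[pair].astro, the param name pair, and the helper names are assumptions.

```typescript
// Subset of the compare.json entry shape used by this sketch.
interface CompareEntry {
  pair_slug: string;
  slug_a: string;
  slug_b: string;
  summary: string;
}

// One static path per pair; `pair` matches a hypothetical [pair].astro route.
function toStaticPaths(entries: CompareEntry[]) {
  return entries.map((entry) => ({
    params: { pair: entry.pair_slug },
    props: { entry },
  }));
}

// Recover the two model slugs from a pair slug.
// Returns null unless the separator appears exactly once.
function splitPairSlug(pairSlug: string): [string, string] | null {
  const parts = pairSlug.split("--vs--");
  return parts.length === 2 ? [parts[0]!, parts[1]!] : null;
}
```

Inside the route file, getStaticPaths would then return toStaticPaths(entries) after importing the JSON. The split helper shows the practical benefit of the double-dash separator: single hyphens inside model names never collide with it.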

What I'd change starting over

Generate cross-pipeline pairs from day one. The most useful compare queries aren't "llama 3.1 vs llama 3.2"; users who care about that distinction already know. The interesting queries are cross-category: "should I run inference on a text-generation model or use a RAG pipeline?" I skipped this to stay within the budget cap, but it means I'm missing the long-tail traffic that would actually be differentiated from generic model pages.

Drive pair selection from search query logs. Right now I pick pairs by download rank. A better signal would be which pairs users actually search for. Pagefind runs client-side and doesn't log queries to any server, so I'd need a thin logging endpoint, something like a POST to a GitHub Actions-triggered function that appends to a JSONL file. Then the ETL reads the top-N ungenerated pairs from the log. This is a small amount of infrastructure but it would make the pair selection much more demand-driven.

Raise the budget cap. MAX=50 is conservative. At current Haiku pricing with prompt caching, 500 pairs would cost roughly $0.10 per nightly run. I was cautious when I set the default, but I've watched the billing closely and the actual spend is a fraction of what I modeled. I'll bump this to 200 in the next ETL config update.
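The search-log idea above could be sketched as follows. The log format is entirely hypothetical: it assumes the logging endpoint has already resolved each query to two model slugs and writes one JSON object per line with fields a and b.

```typescript
// Return the most-requested pair slugs that do not exist yet.
// Assumed JSONL record shape: {"a": "<model slug>", "b": "<model slug>"}.
function topRequestedPairs(
  jsonl: string,
  existing: Set<string>,
  limit: number,
): string[] {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    let rec: unknown;
    try {
      rec = JSON.parse(line);
    } catch {
      continue; // skip corrupt lines instead of failing the run
    }
    const { a, b } = (rec ?? {}) as { a?: unknown; b?: unknown };
    if (typeof a !== "string" || typeof b !== "string" || a === b) continue;
    const slug = [a, b].sort().join("--vs--");
    if (existing.has(slug)) continue;
    counts.set(slug, (counts.get(slug) ?? 0) + 1);
  }
  // Most requested first; ties broken alphabetically for determinism.
  return Array.from(counts.entries())
    .sort((x, y) => y[1] - x[1] || x[0].localeCompare(y[0]))
    .slice(0, limit)
    .map(([slug]) => slug);
}
```

Reusing the same sort-and-join slug means a logged "mistral vs llama" and "llama vs mistral" count toward the same pair.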

The itch.io entries pattern I added to the indie-games directory taught me to plan for the second data source earlier. Compare pages have the same shape: a join between two rows. Getting the abstraction right before you have 500+ rows in the DB is much easier than retrofitting it.

FAQ

Does the ETL run every night even when no new models are added?

Yes, but it's nearly free when nothing is new. The check-before-insert means most nights it does 50 DB reads and exits in under 3 seconds without touching the Claude API. The console output shows generated: 0, skipped: 50, which is the signal that everything is up to date.

What happens when Claude returns malformed JSON?

parseCompare catches the error and returns the fallback struct. The row is still written to the DB with model_used = "fallback-template", which I can query to find rows worth retrying. In practice, this happens on maybe 2-3% of generations, usually when the two models have very sparse metadata and Haiku doesn't have enough context to produce structured output.

Does the compare.json file get unwieldy as pairs accumulate?

At 50 pairs it's roughly 25KB. At 500 pairs it'll be around 250KB, still fine for build-time in Astro. If I ever hit 5,000 pairs I'd split the file by pipeline_tag and lazy-import only the relevant subset for each page. For now, one flat JSON file is simpler and fast enough.

Why not compute compare content at request time with an edge function?

Cold starts and cost. An edge function hit for each compare page view would add 200-500ms of latency (Haiku inference + DB round trip) and would cost much more per-pageview than the nightly batch approach. The content also doesn't need to be fresher than daily: model capabilities don't shift on an hourly basis. Static precomputation is the right tradeoff here, consistent with the broader bet on static SSG I'm running on all three sites.

How do you handle the case where a model is removed from HuggingFace?

Right now, I don't. If model foo is deleted from HuggingFace but its compare rows are still in the DB, those compare pages will still be served at build time. They'll have the old data until the model's row in models.json is removed, which only happens if the model falls out of the top-500 in the nightly fetch. It's a known gap. For now, the risk is low; popular models don't disappear. A more robust system would cross-reference the compare table against the model table and tombstone orphaned pairs.
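The cross-reference described in that last answer is small enough to sketch. The row shape mirrors the columns used in the dump; the function name is hypothetical, and what "tombstone" means in practice (delete the row, or flag it) is left open.

```typescript
// Find compare rows where at least one side no longer exists
// in the current model list.
function findOrphanedPairs(
  compareRows: Array<{ pair_slug: string; slug_a: string; slug_b: string }>,
  liveModelSlugs: Set<string>,
): string[] {
  return compareRows
    .filter(
      (r) => !liveModelSlugs.has(r.slug_a) || !liveModelSlugs.has(r.slug_b),
    )
    .map((r) => r.pair_slug);
}
```

Running this before the JSON dump and excluding the returned slugs would keep pages for deleted models out of the next build.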

Related: How I built a shared Claude Haiku client with system-prompt caching | Turso libSQL vs Cloudflare D1 for an Astro monorepo

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
