How I built the OSS alternatives directory: GitHub ETL, Turso, and the UPSERT trap I hit

A developer built an open-source alternatives directory using a two-phase ETL pipeline with Turso libSQL and GitHub API. The project required a careful UPSERT strategy to avoid clobbering live star counts with seed data, and used Claude Haiku for generating editorial content with fallback templates.

When I launched three programmatic directory sites in April 2026 (https://dev.to/articles/three-sites-experiment), the open-source alternatives site had the most interesting data model. The AI tools directory indexes HuggingFace models: that's a pull from one API. The indie games directory reads Steam. But the OSS alternatives site has to answer a different question: for this SaaS product, which open-source repos actually cover the same use case, and how do they compare?

Getting that right required a two-phase ETL approach, a careful UPSERT strategy I initially got wrong, and some deliberate choices about where to use Claude Haiku and where to use a fallback template.

Three tables in Turso libSQL (https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo):

- `saas`: the SaaS tool being replaced (Datadog, Notion, Figma, etc.)
- `alternatives`: GitHub repos that serve the same use case, linked by `saas_slug`
- `saas_content`: Claude-generated per-entry text: an intro, comparison notes, and migration tips

The `alternatives` table stores everything the GitHub API returns that matters for a directory: `stars`, `forks`, `language`, `license`, `last_pushed`, `description`. The `saas_content` table stores only what Claude adds, the editorial layer that turns raw repo metadata into something useful.

The full export lives in a JSON file that Astro reads at build time. No database connection at build. The ETL pipeline and the Astro build are separate processes.

The first time the site runs on a new machine, there's no database. Rather than block a local build on a live GitHub API pass, I wrote a `seed.ts` script that bootstraps the database from a hand-curated `saas.json` file. The JSON contains: SaaS name, slug, homepage, category, and a list of `owner/repo` strings. Stars, forks, license, and `last_pushed` are deliberately omitted; they'll come from the live fetch. What I do include in JSON is pre-polished content for some entries where the Claude default output was weak.
```js
for (const e of entries) {
  // Seed the SaaS row; never overwrite one that already exists.
  await db.execute({
    sql: `INSERT INTO saas (slug, name, homepage, category, fetched_at)
          VALUES (?, ?, ?, ?, ?)
          ON CONFLICT(slug) DO NOTHING`,
    args: [e.slug, e.name, e.homepage, e.category, now],
  });

  for (const a of e.alternatives) {
    await db.execute({
      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, ...)
            VALUES (?, ?, ?, ?, ...)
            ON CONFLICT(saas_slug, repo) DO NOTHING`,
      args: [e.slug, a.repo, a.name, a.description, /* ... */],
    });
  }
}
```

`DO NOTHING` on conflict for `alternatives` is correct: once GitHub data is live, the seed shouldn't clobber fresh `stars` counts with the static values from the JSON. But for `saas_content`, I initially used the same `DO NOTHING`, and that was a mistake I'll get to below.

`fetch-alternatives.ts` calls the GitHub REST API for every `owner/repo` in the database and upserts the live fields. Unlike the seed, this is `DO UPDATE`: we want fresh data.

The sleep interval is 100ms between GitHub API calls (https://dev.to/articles/sleep-intervals-steam-github-huggingface-etl). GitHub's REST API allows 5000 requests per hour for authenticated users (https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api), which works out to one request every 720ms if sustained for a full hour. A 100ms sleep is shorter than that gap, so the pause alone doesn't enforce the limit; what keeps a run safe is that a full pass covers a few hundred repos, far fewer than the hourly budget. Unauthenticated would be 60 per hour, an average of one call every 60 seconds, which is completely impractical at scale. The monorepo authenticates with a secret in GitHub Actions.

Errors per-repo are caught and logged but don't abort the batch:

```js
for (const repoFull of s.alternatives) {
  const [owner, name] = repoFull.split("/");
  try {
    const r = await getRepo(owner, name);
    // Upsert: overwrite the live fields with whatever GitHub just returned.
    await db.execute({
      sql: `INSERT INTO alternatives
              (saas_slug, repo, name, description, stars, forks, language,
               license, last_pushed, url, fetched_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(saas_slug, repo) DO UPDATE SET
              description = excluded.description,
              stars = excluded.stars,
              forks = excluded.forks,
              language = excluded.language,
              license = excluded.license,
              last_pushed = excluded.last_pushed,
              fetched_at = excluded.fetched_at`,
      args: [
        s.slug, repoFull, r.name, r.description,
        r.stargazers_count, r.forks_count, r.language,
        r.license?.spdx_id ?? null,
        r.pushed_at, r.html_url, now,
      ],
    });
    await sleep(100);
  } catch (err) {
    // Log and move on so one bad repo doesn't abort the batch.
    console.error(`Failed ${repoFull}:`, err instanceof Error ? err.message : err);
  }
}
```

One field worth noting: `r.license?.spdx_id ?? null` stores null when GitHub reports no license object at all. When GitHub sees a license file but can't match it to an SPDX identifier, the API returns the placeholder `NOASSERTION` instead of a real identifier. That happens more than you'd expect with non-standard licenses. I render rows without a recognizable identifier as "see repo" instead of a badge so I'm not misleading visitors about the license type.

After the GitHub data is fresh, `generate-content.ts` queries for SaaS entries that either have no content row or whose `model_used` column is `'fallback-template'` or `'seeded-from-json'`. For each, it asks Claude Haiku for:

- `intro`: 2 sentences on what the SaaS is and why teams seek OSS alternatives
- `comparison_notes`: 2-3 sentences on actual tradeoffs (self-hosting overhead, feature gaps)
- `migration_tips`: a 2-4 item array of concrete migration steps

I use the shared Claude Haiku client with system-prompt caching (https://dev.to/articles/shared-claude-haiku-client-prompt-caching) here. The system prompt is identical for every call in a batch, so caching it saves input tokens on all subsequent calls. On a 50-entry pass, the cost difference is real.

The fallback template, which runs when `ANTHROPIC_API_KEY` is absent, generates deterministic placeholder text. This matters for CI: the Astro build needs a content row for every SaaS entry. Missing content produces a blank page, which would then trigger the noindex gate I use for thin programmatic pages (https://dev.to/articles/noindex-gate-programmatic-pages-without-404s).
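
To make the fallback path concrete, here is a minimal sketch of what a deterministic template could look like. The function name `fallbackContent`, the `SaasContent` shape, and the placeholder wording are all assumptions for illustration, not the actual template in `generate-content.ts`. Only the three content fields and the `'fallback-template'` status value come from the description above.

```typescript
// Hypothetical row shape: field names mirror the saas_content columns,
// but the TypeScript types are assumed.
interface SaasContent {
  saasSlug: string;
  intro: string;
  comparisonNotes: string;
  migrationTips: string[];
  generatedAt: string;
  modelUsed: string;
}

// Builds placeholder text from its arguments only: no randomness and no
// clock reads, so the same inputs always produce the same row.
function fallbackContent(
  saasSlug: string,
  saasName: string,
  repos: string[],
  generatedAt: string,
): SaasContent {
  const n = repos.length;
  const listed = n > 0 ? repos.slice(0, 3).join(", ") : "the listed repos";
  return {
    saasSlug,
    intro:
      `${saasName} is a hosted product that some teams want to replace with software they run themselves. ` +
      `This page lists ${n} open-source alternative${n === 1 ? "" : "s"}.`,
    comparisonNotes:
      `Self-hosting ${listed} means handling upgrades and backups yourself. ` +
      `Check each repo for feature gaps against ${saasName} before committing.`,
    migrationTips: [
      `Export your data from ${saasName}.`,
      `Trial one alternative with a small team first.`,
      `Keep ${saasName} read-only until the migration is verified.`,
    ],
    generatedAt,
    // Status value that marks this row as a candidate for later regeneration.
    modelUsed: "fallback-template",
  };
}

// Example call with made-up inputs.
const row = fallbackContent(
  "example-saas",
  "ExampleSaaS",
  ["example-org/example-repo"],
  "2026-04-01T00:00:00Z",
);
console.log(row.modelUsed, row.migrationTips.length);
```

Passing `generatedAt` in rather than reading the clock inside the function keeps the output reproducible across CI runs, and tagging the row with `modelUsed` is what lets a later pass with an API key find and upgrade it.
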
The three-tier content quality ladder (https://dev.to/articles/three-tier-content-quality-ladder-programmatic-etl) I described earlier puts these generated entries at the middle tier: better than the raw repo description, worse than hand-edited content.

Original `seed.ts` for `saas_content`:

```sql
INSERT INTO saas_content
  (saas_slug, intro, comparison_notes, migration_tips, generated_at, model_used)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(saas_slug) DO NOTHING
```

That looked safe. But the problem was subtle. When I seeded with `model_used = null` (the original JSON had no such field), `generate-content.ts` queried:

```sql
SELECT slug FROM saas s
LEFT JOIN saas_content c ON c.saas_slug = s.slug
WHERE c.saas_slug IS NULL
   OR c.model_used IN ('fallback-template', 'seeded-from-json')
```

Rows seeded with `model_used = null` didn't match either condition. The content row existed, so `c.saas_slug IS NULL` was false, and in SQL `NULL IN (...)` evaluates to NULL rather than true. So they got skipped by the generator. The seed's `DO NOTHING` also prevented the polished JSON content from landing, because a fallback-template row had already been written by an earlier run.

The fix was two parts:

- `DO UPDATE` for `saas_content`, not `DO NOTHING`. Polished JSON content always wins.
- `model_used` has to be set explicitly: `'seeded-from-json'` for automatic entries, `'claude-routine-polish'` for hand-checked ones. The generator's `WHERE` clause excludes both.

```sql
ON CONFLICT(saas_slug) DO UPDATE SET
  intro = excluded.intro,
  comparison_notes = excluded.comparison_notes,
  migration_tips = excluded.migration_tips,
  generated_at = excluded.generated_at,
  model_used = excluded.model_used
```

This pattern, using `model_used` as a status field to coordinate between ETL phases, also showed up in the AI tools directory's fallback entry upgrade work (https://dev.to/articles/upgrade-fallback-model-entries-deterministic-hash-pool). The lesson there was the same: never let an ETL pass silently skip a row because the status field was written inconsistently.

Each SaaS entry renders as a static page at `/alternatives/[saas]/`.
The renderer reads from `saas.json`, assembles a grid of alternatives sorted by stars, and inlines the Claude-generated comparison notes. Each entry shows a license badge, language indicator, and `last_pushed` date formatted as a relative time string.

The grid intentionally doesn't paginate at the SaaS level. I capped entries per SaaS at 8. More than that becomes noise: the directory's value is curation, not exhaustiveness. The E-E-A-T transparency pages (https://dev.to/articles/eeat-transparency-pages-programmatic-directory) include a methodology note explaining what that cap means for each category.

Store raw GitHub JSON alongside derived columns. Currently each ETL pass stores only derived fields: stars, forks, license, `last_pushed`. When I later wanted a "has recent releases" signal, I had to add a full new API call. If I'd kept the raw response in a JSON/TEXT column, any field I hadn't extracted up front would have been one `json_extract(raw, '$.has_wiki')`-style query away.

Add a `deprecated_at` field. When a repo gets deleted or renamed, the ETL call returns a 404 and the code just logs it. The row stays in the database with increasingly stale data. A `deprecated_at` timestamp would let the page renderer show a warning and let the content team decide whether to replace or remove the entry.

Parallelize `generate-content` with a rate-limit counter. The current sequential loop takes several minutes on a cold run with 100+ entries. Batching 10 concurrent Haiku calls with a shared counter that throttles at the API limit should be roughly 5-10x faster without touching cost.

Why Turso instead of a hosted Postgres?

Turso's edge replicas are in the same regions as Vercel's serverless functions, so read latency is low. The cost for my usage tier is also lower than a comparable Postgres instance. The full comparison is here: https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo

Do you need a paid GitHub plan to avoid rate limits?

No.
A free personal access token gives 5000 requests per hour, enough to fetch metadata for several hundred repos in a single daily cron run. The 60/hr unauthenticated limit would not work at any meaningful scale.

How do you prevent Claude costs from escalating?

System-prompt caching amortises the per-call cost across the batch. I also set `max_tokens: 1024` for each call, which caps output length. The biggest lever is the `model_used` status field: entries that already have good content don't get regenerated.

What happens if a GitHub repo is deleted?

Right now the row goes stale silently. The fetch fails, the error is logged, and the next build still renders the row with whatever data the last successful fetch stored. Adding a 404-specific handler that sets `deprecated_at` is on the backlog.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.