{"slug": "how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap", "title": "How I built the OSS alternatives directory: GitHub ETL, Turso, and the UPSERT trap I hit", "summary": "A developer built an open-source alternatives directory using a two-phase ETL pipeline with Turso libSQL and GitHub API. The project required a careful UPSERT strategy to avoid clobbering live star counts with seed data, and used Claude Haiku for generating editorial content with fallback templates.", "body_md": "When I launched [three programmatic directory sites in April 2026](https://dev.to/articles/three-sites-experiment), the open-source alternatives site had the most interesting data model. The AI tools directory indexes HuggingFace models, which is a pull from one API. The indie games directory reads Steam. But the OSS alternatives site has to answer a different question: for this SaaS product, which open-source repos actually cover the same use case, and how do they compare?\n\nGetting that right required a two-phase ETL approach, a careful UPSERT strategy I initially got wrong, and some deliberate choices about where to use Claude Haiku and where to use a fallback template.\n\nThree tables in [Turso libSQL](https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo):\n\n- `saas`: the SaaS tool being replaced (Datadog, Notion, Figma, etc.)\n- `alternatives`: GitHub repos that serve the same use case, linked by `saas_slug`\n- `saas_content`: Claude-generated per-entry text (an intro, comparison notes, and migration tips)\n\nThe `alternatives` table stores everything the GitHub API returns that matters for a directory: `stars`, `forks`, `language`, `license`, `last_pushed`, `description`. The `saas_content` table stores only what Claude adds: the editorial layer that turns raw repo metadata into something useful.\n\nThe full export lives in a JSON file that Astro reads at build time. No database connection at build. 
The ETL pipeline and the Astro build are separate processes.\n\nThe first time the site runs on a new machine, there's no database. Rather than block a local build on a live GitHub API pass, I wrote a `seed.ts` script that bootstraps the database from a hand-curated `saas.json` file.\n\nThe JSON contains: SaaS name, slug, homepage, category, and a list of `owner/repo` strings. Stars, forks, license, and last_pushed are deliberately omitted; they'll come from the live fetch. What I do include in JSON is pre-polished content for some entries where the Claude default output was weak.\n\n``` js\nfor (const e of entries) {\n  await db.execute({\n    sql: `INSERT INTO saas (slug, name, homepage, category, fetched_at)\n          VALUES (?, ?, ?, ?, ?)\n          ON CONFLICT(slug) DO NOTHING`,\n    args: [e.slug, e.name, e.homepage, e.category, now],\n  });\n\n  for (const a of e.alternatives) {\n    await db.execute({\n      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, ...)\n            VALUES (?, ?, ?, ?, ...)\n            ON CONFLICT(saas_slug, repo) DO NOTHING`,\n      args: [e.slug, a.repo, a.name, a.description, ...],\n    });\n  }\n}\n```\n\n`DO NOTHING` on conflict for `alternatives` is correct: once GitHub data is live, the seed shouldn't clobber fresh star counts with the static values from the JSON. But for `saas_content`, I initially used the same `DO NOTHING`, and that was a mistake I'll get to below.\n\n`fetch-alternatives.ts` calls the GitHub REST API for every `owner/repo` in the database and upserts the live fields. Unlike the seed, this is `DO UPDATE`, because we want fresh data.\n\nThe [sleep interval is 100ms between GitHub API calls](https://dev.to/articles/sleep-intervals-steam-github-huggingface-etl). 
For an authenticated token that pacing is not a sustained-rate guarantee: [GitHub's REST API allows 5000 requests per hour for authenticated users](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api), which works out to one request every 720ms, so a 100ms sleep on its own would permit far more than the limit. It is safe here because a full pass covers several hundred repos, well under the hourly budget. Unauthenticated would be 60 per hour, which is 60 seconds per call and completely impractical at scale. The monorepo authenticates with a secret in GitHub Actions.\n\nErrors per-repo are caught and logged but don't abort the batch:\n\n``` js\nfor (const repoFull of s.alternatives) {\n  const [owner, name] = repoFull.split(\"/\");\n  try {\n    const r = await getRepo(owner, name);\n    await db.execute({\n      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, stars,\n              forks, language, license, last_pushed, url, fetched_at)\n            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)\n            ON CONFLICT(saas_slug, repo) DO UPDATE SET\n              description = excluded.description,\n              stars = excluded.stars,\n              forks = excluded.forks,\n              language = excluded.language,\n              license = excluded.license,\n              last_pushed = excluded.last_pushed,\n              fetched_at = excluded.fetched_at`,\n      args: [\n        s.slug, repoFull, r.name, r.description,\n        r.stargazers_count, r.forks_count,\n        r.language, r.license?.spdx_id ?? null,\n        r.pushed_at, r.html_url, now,\n      ],\n    });\n    await sleep(100);\n  } catch (err) {\n    console.error(`  ! Failed ${repoFull}:`, err instanceof Error ? err.message : err);\n  }\n}\n```\n\nOne field worth noting: `r.license?.spdx_id ?? null` evaluates to `null` when GitHub returns no `license` object for the repo. When GitHub sees a license file but can't identify it, the API typically reports the SPDX identifier as `NOASSERTION` instead, so that value needs the same handling. That happens more than you'd expect with non-standard licenses. 
I render those rows with \"see repo\" instead of a badge so I'm not misleading visitors about the license type.\n\nAfter the GitHub data is fresh, `generate-content.ts` queries for SaaS entries that either have no content row or whose `model_used` column is `'fallback-template'` or `'seeded-from-json'`. For each, it asks Claude Haiku for:\n\n- `intro`: 2 sentences on what the SaaS is and why teams seek OSS alternatives\n- `comparison_notes`: 2-3 sentences on actual tradeoffs (self-hosting overhead, feature gaps)\n- `migration_tips`: a 2-4 item array of concrete migration steps\n\nI use the [shared Claude Haiku client with system-prompt caching](https://dev.to/articles/shared-claude-haiku-client-prompt-caching) here. The system prompt is identical for every call in a batch, so caching it saves input tokens on all subsequent calls. On a 50-entry pass, the cost difference is real.\n\nThe fallback template, which runs when `ANTHROPIC_API_KEY` is absent, generates deterministic placeholder text. This matters for CI: the Astro build needs a content row for every SaaS entry. Missing content produces a blank page, which would then trigger [the noindex gate I use for thin programmatic pages](https://dev.to/articles/noindex-gate-programmatic-pages-without-404s).\n\nThe [three-tier content quality ladder](https://dev.to/articles/three-tier-content-quality-ladder-programmatic-etl) I described earlier puts these generated entries at the middle tier: better than the raw repo description, worse than hand-edited content.\n\nOriginal `seed.ts` for `saas_content`:\n\n``` sql\nINSERT INTO saas_content (saas_slug, intro, comparison_notes, migration_tips, generated_at, model_used)\nVALUES (?, ?, ?, ?, ?, ?)\nON CONFLICT(saas_slug) DO NOTHING\n```\n\nThat looked safe. But the problem was subtle. 
When I seeded with `model_used = null` (the original JSON had no field), `generate-content.ts` queried:\n\n``` sql\nSELECT slug FROM saas s\nLEFT JOIN saas_content c ON c.saas_slug = s.slug\nWHERE c.saas_slug IS NULL\n   OR c.model_used IN ('fallback-template', 'seeded-from-json')\n```\n\nRows seeded with `model_used = null` didn't match either condition: the row existed, so `c.saas_slug` wasn't NULL, and in SQL a NULL `model_used` never satisfies an `IN (...)` test. So the generator skipped them. Separately, the seed's `DO NOTHING` kept the polished JSON content from landing wherever a fallback-template row had already been written by an earlier run.\n\nThe fix was two parts:\n\n- Use `DO UPDATE` for `saas_content`, not `DO NOTHING`. Polished JSON content always wins.\n- Require `model_used` to be set explicitly: `'seeded-from-json'` for automatic entries, `'claude-routine-polish'` for hand-checked ones. As the query above is written, the generator picks up `'seeded-from-json'` rows for regeneration and leaves `'claude-routine-polish'` rows alone.\n\n``` sql\nON CONFLICT(saas_slug) DO UPDATE SET\n  intro = excluded.intro,\n  comparison_notes = excluded.comparison_notes,\n  migration_tips = excluded.migration_tips,\n  generated_at = excluded.generated_at,\n  model_used = excluded.model_used\n```\n\nThis pattern (using `model_used` as a status field to coordinate between ETL phases) also showed up in the [AI tools directory's fallback entry upgrade work](https://dev.to/articles/upgrade-fallback-model-entries-deterministic-hash-pool). The lesson there was the same: never let an ETL pass silently skip a row because the status field was written inconsistently.\n\nEach SaaS entry renders as a static page at `/alternatives/[saas]/`. The renderer reads from `saas.json`, assembles a grid of alternatives sorted by stars, and inlines the Claude-generated comparison notes. Each entry shows a license badge, language indicator, and last_pushed date formatted as a relative time string.\n\nThe grid intentionally doesn't paginate at the SaaS level. I capped entries per SaaS at 8. 
More than that becomes noise; the directory's value is curation, not exhaustiveness. The [E-E-A-T transparency pages](https://dev.to/articles/eeat-transparency-pages-programmatic-directory) include a methodology note explaining what that cap means for each category.\n\n**Store raw GitHub JSON alongside derived columns.** Currently each ETL adds derived fields: stars, forks, license, last_pushed. When I later wanted a \"has_recent_releases\" signal, I had to add a full new API call. For any field already present in the repo payload, keeping the raw response in a JSONB/TEXT column would have avoided a re-fetch: `json_extract(raw, '$.has_wiki')` would have been enough.\n\n**Add a deprecated_at field.** When a repo gets deleted or renamed, the ETL call returns a 404 and the code just logs it. The row stays in the database with increasingly stale data. A `deprecated_at` timestamp would let the page renderer show a warning and let the content team decide whether to replace or remove the entry.\n\n**Parallelize generate-content with a rate-limit counter.** The current sequential loop takes a noticeable number of minutes on a cold run with 100+ entries. Batching 10 concurrent Haiku calls with a shared counter that throttles at the API limit would be 5-10x faster without touching cost.\n\n**Why Turso instead of a hosted Postgres?**\n\nTurso's edge replicas are in the same regions as Vercel's serverless functions, so read latency is low. The cost for my usage tier is also lower than a comparable Postgres instance. [The full comparison is here](https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo).\n\n**Do you need a paid GitHub plan to avoid rate limits?**\n\nNo. A free personal access token gives 5000 requests per hour, enough to fetch metadata for several hundred repos in a single daily cron run. The 60/hr unauthenticated limit would not work at any meaningful scale.\n\n**How do you prevent Claude costs from escalating?**\n\nSystem-prompt caching amortises the system prompt's input cost across the batch. 
I also set `max_tokens: 1024` for each call, which caps output length. The biggest lever is the `model_used` status field: entries that already have good content don't get regenerated.\n\n**What happens if a GitHub repo is deleted?**\n\nRight now the row goes stale silently. The fetch fails, the error is logged, and the next build still renders the row with whatever data the last successful fetch stored. Adding a 404-specific handler that sets `deprecated_at` is on the backlog.\n\n*Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.*", "url": "https://wpnews.pro/news/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap", "canonical_source": "https://dev.to/morinaga/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap-i-hit-11ie", "published_at": "2026-06-26 22:12:58+00:00", "updated_at": "2026-06-26 23:04:27.350279+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "ai-tools"], "entities": ["Turso", "libSQL", "GitHub", "Claude Haiku", "HuggingFace", "Steam", "Astro", "GitHub Actions"], "alternates": {"html": "https://wpnews.pro/news/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap", "markdown": "https://wpnews.pro/news/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap.md", "text": "https://wpnews.pro/news/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap.txt", "jsonld": "https://wpnews.pro/news/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap.jsonld"}}