# How I built the OSS alternatives directory: GitHub ETL, Turso, and the UPSERT trap I hit

> Source: <https://dev.to/morinaga/how-i-built-the-oss-alternatives-directory-github-etl-turso-and-the-upsert-trap-i-hit-11ie>
> Published: 2026-06-26 22:12:58+00:00

When I launched [three programmatic directory sites in April 2026](https://dev.to/articles/three-sites-experiment), the open-source alternatives site had the most interesting data model. The AI tools directory indexes HuggingFace models — that's a pull from one API. The indie games directory reads Steam. But the OSS alternatives site has to answer a different question: for this SaaS product, which open-source repos actually cover the same use case, and how do they compare?

Getting that right required a two-phase ETL approach, a careful UPSERT strategy I initially got wrong, and some deliberate choices about where to use Claude Haiku and where to use a fallback template.

Three tables in [Turso libSQL](https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo):

- `saas`: the SaaS tool being replaced (Datadog, Notion, Figma, etc.)
- `alternatives`: GitHub repos that serve the same use case, linked by `saas_slug`
- `saas_content`: Claude-generated per-entry text (an intro, comparison notes, and migration tips)

The `alternatives` table stores everything the GitHub API returns that matters for a directory: `stars`, `forks`, `language`, `license`, `last_pushed`, `description`. The `saas_content` table stores only what Claude adds: the editorial layer that turns raw repo metadata into something useful.

The full export lives in a JSON file that Astro reads at build time. No database connection at build. The ETL pipeline and the Astro build are separate processes.

The first time the site runs on a new machine, there's no database. Rather than block a local build on a live GitHub API pass, I wrote a `seed.ts` script that bootstraps the database from a hand-curated `saas.json` file.

The JSON contains: SaaS name, slug, homepage, category, and a list of `owner/repo` strings. Stars, forks, license, and last_pushed are deliberately omitted; they'll come from the live fetch. What I do include in JSON is pre-polished content for some entries where the Claude default output was weak.
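To make the seed file's shape concrete, here is a minimal sketch of one entry plus a sanity check a seed script could run before touching the database. The interface names, the optional `content` field, and `validateSeedEntry` are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical shape of one saas.json entry; field names are assumptions
// based on the description above, not the project's real schema.
interface SeedContent {
  intro: string;
  comparison_notes: string;
  migration_tips: string[];
}

interface SeedEntry {
  name: string;
  slug: string;
  homepage: string;
  category: string;
  alternatives: string[]; // "owner/repo" strings
  content?: SeedContent;  // optional pre-polished text
}

const OWNER_REPO = /^[\w.-]+\/[\w.-]+$/;

// Returns a list of problems so a seed script can fail fast on a bad file.
function validateSeedEntry(e: SeedEntry): string[] {
  const problems: string[] = [];
  if (!/^[a-z0-9-]+$/.test(e.slug)) problems.push(`bad slug: ${e.slug}`);
  for (const repo of e.alternatives) {
    if (!OWNER_REPO.test(repo)) problems.push(`bad repo: ${repo}`);
  }
  return problems;
}

// Placeholder values only; not a real entry from the directory.
const exampleEntry: SeedEntry = {
  name: "Example SaaS",
  slug: "example-saas",
  homepage: "https://example.com",
  category: "monitoring",
  alternatives: ["example-org/example-repo"],
};
```

Live fields such as stars and forks are absent from the type on purpose, which mirrors the choice to let the GitHub fetch own them.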

``` js
// Seed pass: insert static rows and never overwrite what is already stored.
for (const e of entries) {
  await db.execute({
    sql: `INSERT INTO saas (slug, name, homepage, category, fetched_at)
          VALUES (?, ?, ?, ?, ?)
          ON CONFLICT(slug) DO NOTHING`,
    args: [e.slug, e.name, e.homepage, e.category, now],
  });

  for (const a of e.alternatives) {
    await db.execute({
      // Remaining columns and their args are elided ("...").
      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, ...)
            VALUES (?, ?, ?, ?, ...)
            ON CONFLICT(saas_slug, repo) DO NOTHING`,
      args: [e.slug, a.repo, a.name, a.description, /* ... */],
    });
  }
}
```

`DO NOTHING` on conflict for `alternatives` is correct: once GitHub data is live, the seed shouldn't clobber fresh star counts with the static values from the JSON. But for `saas_content`, I initially used the same `DO NOTHING`, and that was a mistake I'll get to below.

`fetch-alternatives.ts` calls the GitHub REST API for every `owner/repo` in the database and upserts the live fields. Unlike the seed, this is `DO UPDATE`, because we want fresh data.

The [sleep interval is 100ms between GitHub API calls](https://dev.to/articles/sleep-intervals-steam-github-huggingface-etl). [GitHub's REST API allows 5000 requests per hour for authenticated users](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api), which works out to one request every 720ms if sustained for a full hour. A 100ms gap is shorter than that, so the sleep alone doesn't enforce the limit; it is safe because a single run only touches a few hundred repos, far below the hourly budget. Unauthenticated access is capped at 60 requests per hour, or one call every 60 seconds, which is completely impractical at scale. The monorepo authenticates with a secret in GitHub Actions.
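The rate-limit arithmetic is easy to check in a few lines. This is a small sketch; the function names and the 300-repo run size are illustrative assumptions, while the 5000 and 60 requests-per-hour figures come from the text above.

```typescript
// Minimum average gap between requests to stay under an hourly quota.
function minGapMs(requestsPerHour: number): number {
  return 3_600_000 / requestsPerHour;
}

// Total time spent sleeping in a run of `calls` requests.
function totalSleepMs(calls: number, sleepMs: number): number {
  return calls * sleepMs;
}

const authenticatedGapMs = minGapMs(5000); // 720 ms per request
const unauthenticatedGapMs = minGapMs(60); // 60000 ms per request

// Hypothetical run size: 300 repos with a 100 ms sleep after each call.
const runSleepMs = totalSleepMs(300, 100); // 30000 ms of sleep in total
```

A 300-call run uses 6% of the authenticated hourly quota, so the short sleep is about politeness, not about staying under the limit.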

Errors per-repo are caught and logged but don't abort the batch:

``` js
for (const repoFull of s.alternatives) {
  const [owner, name] = repoFull.split("/");
  try {
    const r = await getRepo(owner, name);
    await db.execute({
      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, stars,
              forks, language, license, last_pushed, url, fetched_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(saas_slug, repo) DO UPDATE SET
              description = excluded.description,
              stars = excluded.stars,
              forks = excluded.forks,
              language = excluded.language,
              license = excluded.license,
              last_pushed = excluded.last_pushed,
              fetched_at = excluded.fetched_at`,
      args: [
        s.slug, repoFull, r.name, r.description,
        r.stargazers_count, r.forks_count,
        r.language, r.license?.spdx_id ?? null,
        r.pushed_at, r.html_url, now,
      ],
    });
    await sleep(100);
  } catch (err) {
    console.error(`  ! Failed ${repoFull}:`, err instanceof Error ? err.message : err);
  }
}
```

One field worth noting: `r.license?.spdx_id` is `null` when GitHub reports no `license` object for the repo, and a license file that GitHub can't match to an SPDX identifier can come back as `NOASSERTION` instead of a real ID. Both happen more than you'd expect with non-standard licenses. I render rows without a recognizable license as "see repo" instead of a badge so I'm not misleading visitors about the license type.
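A minimal sketch of that rendering rule follows. The function name is hypothetical, and treating `NOASSERTION` as unknown is an assumption about how GitHub reports licenses it can't match; only the "see repo" label comes from the text.

```typescript
// Maps the SPDX identifier from the GitHub response to what the page shows.
// Assumption: "NOASSERTION" is GitHub's placeholder for an unmatched license,
// so it is treated the same as a missing value.
function licenseLabel(spdxId: string | null | undefined): string {
  if (!spdxId || spdxId === "NOASSERTION") return "see repo";
  return spdxId;
}
```

Keeping this in one function means the badge component never has to know why an identifier is missing.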

After the GitHub data is fresh, `generate-content.ts` queries for SaaS entries that either have no content row or whose `model_used` column is `'fallback-template'` or `'seeded-from-json'`. For each, it asks Claude Haiku for:

- `intro`: 2 sentences on what the SaaS is and why teams seek OSS alternatives
- `comparison_notes`: 2-3 sentences on actual tradeoffs (self-hosting overhead, feature gaps)
- `migration_tips`: a 2-4 item array of concrete migration steps

I use the [shared Claude Haiku client with system-prompt caching](https://dev.to/articles/shared-claude-haiku-client-prompt-caching) here. The system prompt is identical for every call in a batch, so caching it saves input tokens on all subsequent calls. On a 50-entry pass, the cost difference is real.

The fallback template, which runs when `ANTHROPIC_API_KEY` is absent, generates deterministic placeholder text. This matters for CI: the Astro build needs a content row for every SaaS entry. Missing content produces a blank page, which would then trigger [the noindex gate I use for thin programmatic pages](https://dev.to/articles/noindex-gate-programmatic-pages-without-404s).
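Here is a rough sketch of what a deterministic fallback could look like. The wording of the placeholder text, the `GeneratedContent` type, and the function signature are all assumptions; only the three field names and the `'fallback-template'` marker come from the article.

```typescript
interface GeneratedContent {
  intro: string;
  comparison_notes: string;
  migration_tips: string[];
  model_used: string;
}

// Deterministic placeholder: the same input always yields the same text,
// with no API call and no randomness. Wording is illustrative only.
function fallbackContent(saasName: string, altCount: number): GeneratedContent {
  return {
    intro: `${saasName} is a hosted product. Teams often look for open-source alternatives they can self-host.`,
    comparison_notes: `This page lists ${altCount} open-source projects that cover a similar use case.`,
    migration_tips: [
      `Export your data from ${saasName} before switching.`,
      "Trial the alternative on a small project first.",
    ],
    model_used: "fallback-template",
  };
}
```

Because the output is a pure function of its inputs, repeated CI runs produce identical rows, and the `model_used` marker tells a later pass that the row is still a placeholder.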

The [three-tier content quality ladder](https://dev.to/articles/three-tier-content-quality-ladder-programmatic-etl) I described earlier puts these generated entries at the middle tier — better than the raw repo description, worse than hand-edited content.

Original `seed.ts` for `saas_content`:

```
INSERT INTO saas_content (saas_slug, intro, comparison_notes, migration_tips, generated_at, model_used)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(saas_slug) DO NOTHING
```

That looked safe, but the problem was subtle. When I seeded with `model_used = null` (the original JSON had no such field), `generate-content.ts` queried:

```
SELECT slug FROM saas s
LEFT JOIN saas_content c ON c.saas_slug = s.slug
WHERE c.saas_slug IS NULL
   OR c.model_used IN ('fallback-template', 'seeded-from-json')
```

Rows seeded with `model_used = null` didn't match the `IN` list, because in SQL `NULL IN (...)` evaluates to NULL rather than true. The `c.saas_slug IS NULL` check didn't match either, since the row existed. So the generator skipped them. Meanwhile the seed's `DO NOTHING` also prevented the polished JSON content from landing, because a fallback-template row had already been written by an earlier run.
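The selection logic can be mirrored in plain TypeScript to show why those rows fell through. This is a simulation of the WHERE clause, not the project's code; the `ContentRow` type and `needsGeneration` name are made up for the example.

```typescript
// null models a LEFT JOIN miss (no saas_content row for the SaaS entry).
type ContentRow = { saas_slug: string; model_used: string | null } | null;

// Mirrors the WHERE clause: a missing row or a regenerable status qualifies.
// A NULL status matches neither branch, just as NULL IN (...) is not true in SQL.
function needsGeneration(row: ContentRow): boolean {
  if (row === null) return true; // c.saas_slug IS NULL
  return (
    row.model_used === "fallback-template" ||
    row.model_used === "seeded-from-json"
  );
}

const missingRow = needsGeneration(null);
const nullStatus = needsGeneration({ saas_slug: "notion", model_used: null });
const fallbackRow = needsGeneration({ saas_slug: "notion", model_used: "fallback-template" });
```

The `nullStatus` case is the trap: the row exists, so it is not "missing", and its status is unknown, so it is never selected for regeneration.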

The fix was two parts:

1. Use `DO UPDATE` for `saas_content`, not `DO NOTHING`. Polished JSON content always wins.
2. Require `model_used` to be set explicitly: `'seeded-from-json'` for automatic entries, `'claude-routine-polish'` for hand-checked ones. Per the query above, the generator regenerates the former and leaves the latter alone.

```
ON CONFLICT(saas_slug) DO UPDATE SET
  intro = excluded.intro,
  comparison_notes = excluded.comparison_notes,
  migration_tips = excluded.migration_tips,
  generated_at = excluded.generated_at,
  model_used = excluded.model_used
```

This pattern of using `model_used` as a status field to coordinate between ETL phases also showed up in the [AI tools directory's fallback entry upgrade work](https://dev.to/articles/upgrade-fallback-model-entries-deterministic-hash-pool). The lesson there was the same: never let an ETL pass silently skip a row because the status field was written inconsistently.

Each SaaS entry renders as a static page at `/alternatives/[saas]/`. The renderer reads from `saas.json`, assembles a grid of alternatives sorted by stars, and inlines the Claude-generated comparison notes. Each entry shows a license badge, language indicator, and last_pushed date formatted as a relative time string.

The grid intentionally doesn't paginate at the SaaS level. I capped entries per SaaS at 8. More than that becomes noise — the directory's value is curation, not exhaustiveness. The [E-E-A-T transparency pages](https://dev.to/articles/eeat-transparency-pages-programmatic-directory) include a methodology note explaining what that cap means for each category.
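The grid logic described above (sort by stars, cap at 8, relative dates) could be sketched roughly like this. The `Alternative` type, function names, and the exact relative-time wording are assumptions; the real renderer lives in Astro components that aren't shown here.

```typescript
interface Alternative {
  repo: string;
  stars: number;
  last_pushed: string; // ISO 8601 timestamp, e.g. GitHub's pushed_at
}

const MAX_PER_SAAS = 8; // the cap described above

// Highest-starred first, then truncate to the cap. Copies before sorting
// so the caller's array is not mutated.
function topAlternatives(alts: Alternative[]): Alternative[] {
  return alts.slice().sort((a, b) => b.stars - a.stars).slice(0, MAX_PER_SAAS);
}

// Formats a past timestamp as a coarse relative string such as "3 days ago".
// Timestamps less than a day old (or in the future) render as "today".
function relativeTime(iso: string, now: Date = new Date()): string {
  const days = Math.floor((now.getTime() - new Date(iso).getTime()) / 86_400_000);
  if (days < 1) return "today";
  if (days < 30) return `${days} day${days === 1 ? "" : "s"} ago`;
  if (days < 365) {
    const months = Math.floor(days / 30);
    return `${months} month${months === 1 ? "" : "s"} ago`;
  }
  const years = Math.floor(days / 365);
  return `${years} year${years === 1 ? "" : "s"} ago`;
}
```

Passing `now` explicitly keeps the formatter deterministic at build time, which matters for a static site where the "current" date is the build date.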

**Store raw GitHub JSON alongside derived columns.** Currently each ETL pass stores only derived fields: stars, forks, license, last_pushed. When I later wanted a "has_recent_releases" signal, I had to add a full new API call. If I'd kept the raw response in a JSONB/TEXT column, any field already in that payload would have been a `json_extract(raw, '$.has_wiki')` away.

**Add a deprecated_at field.** When a repo gets deleted or renamed, the ETL call returns a 404 and the code just logs it. The row stays in the database with increasingly stale data. A `deprecated_at` timestamp would let the page renderer show a warning and let the content team decide whether to replace or remove the entry.

**Parallelize generate-content with a rate-limit counter.** The current sequential loop takes a noticeable number of minutes on a cold run with 100+ entries. Batching 10 concurrent Haiku calls with a shared counter that throttles at the API limit would be 5-10x faster without touching cost.
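For the parallelization idea, one possible starting point is a small worker pool that caps the number of calls in flight. This sketch is not the shared-counter throttle described above: it only limits concurrency and does not track a provider's per-minute quota. The helper name and signature are assumptions.

```typescript
// Runs `fn` over `items` with at most `limit` calls in flight at once.
// Results keep the input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor: each worker claims the next unprocessed index

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.min(limit, items.length);
  const workers: Promise<void>[] = [];
  for (let w = 0; w < workerCount; w++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

A call such as `mapWithConcurrency(entries, 10, generateOne)`, where `generateOne` is a hypothetical per-entry function, would keep ten requests in flight. Per-entry error handling would still need to live inside the callback so that one failure doesn't reject the whole batch.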

**Why Turso instead of a hosted Postgres?**

Turso's edge replicas are in the same regions as Vercel's serverless functions, so read latency is low. The cost for my usage tier is also lower than a comparable Postgres instance. [The full comparison is here](https://dev.to/articles/turso-libsql-vs-cloudflare-d1-astro-monorepo).

**Do you need a paid GitHub plan to avoid rate limits?**

No. A free personal access token gives 5000 requests per hour — enough to fetch metadata for several hundred repos in a single daily cron run. The 60/hr unauthenticated limit would not work at any meaningful scale.

**How do you prevent Claude costs from escalating?**

System-prompt caching amortises the per-call cost across the batch. I also set `max_tokens: 1024` for each call, which caps output length. The biggest lever is the `model_used` status field: entries that already have good content don't get regenerated.

**What happens if a GitHub repo is deleted?**

Right now the row goes stale silently. The fetch fails, the error is logged, and the next build still renders the row with whatever data the last successful fetch stored. Adding a 404-specific handler that sets `deprecated_at` is on the backlog.
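A 404-specific handler could start from something like the sketch below. Everything here is hypothetical: the `FetchOutcome` type, the function name, and the SQL statement assume a `deprecated_at` column that does not exist yet.

```typescript
type FetchOutcome =
  | { kind: "ok" }
  | { kind: "deprecated"; deprecatedAt: string } // repo gone: mark the row
  | { kind: "retry-later"; status: number };     // transient: keep old data

// Classifies an HTTP status from the repo fetch. Only 404 marks the row as
// deprecated; other failures are treated as temporary.
function classifyRepoFetch(status: number, now: Date = new Date()): FetchOutcome {
  if (status >= 200 && status < 300) return { kind: "ok" };
  if (status === 404) return { kind: "deprecated", deprecatedAt: now.toISOString() };
  return { kind: "retry-later", status };
}

// Hypothetical statement the ETL could run for a "deprecated" outcome.
const MARK_DEPRECATED_SQL = `UPDATE alternatives
  SET deprecated_at = ?
  WHERE saas_slug = ? AND repo = ?`;
```

Separating 404 from other failures matters because a rate-limit or server error should leave the row untouched, while a missing repo should be flagged for review.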

*Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.*
