cd /news/large-language-models/show-hn-autosynth-generating-synthet… · home topics large-language-models article
[ARTICLE · art-47760] src=github.com ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

Show HN: Autosynth – generating synthetic data with strong/weak model filtering

Autosynth, a new open-source tool for generating synthetic datasets using an LLM loop that proposes, audits, solves, and judges its own work, has been released. Inspired by Meta FAIR's Autodata paper, it is domain-agnostic and uses a weak/strong solver filtering approach to ensure quality. The tool is in alpha and supports multiple LLM providers via LiteLLM.

read5 min views1 publishedJul 4, 2026
Show HN: Autosynth – generating synthetic data with strong/weak model filtering
Image: source

Generate synthetic datasets with an LLM loop that proposes, audits, solves, and judges its own work. Inspired by Meta FAIR's Autodata / Agentic Self-Instruct paper (blog post), but rewritten to be domain-agnostic: every domain-specific piece lives in a small Python plugin, and the runtime is the same regardless of whether you're generating math word problems, support-ticket triage data, or QA pairs from your own docs.

For each candidate datapoint, autosynth runs a weak solver and a strong solver, scores both against an LLM-generated rubric, and keeps the example only if the strong solver clearly beats the weak one on a quality-passing example. Failed rounds are reflected on and fed back into the next attempt.

Status:alpha. The API is still moving — pin a commit if you're depending on it.

uv pip install autosynth             # core
uv pip install "autosynth[hf]"       # + Hugging Face export

Python 3.10+. Plain pip install autosynth

works too. For a from-source / editable install for development, see CONTRIBUTING.md.

uv run autosynth run --config configs/mock_demo.yaml
uv run autosynth status outputs/mock-demo
uv run autosynth export --run outputs/mock-demo --format jsonl

The mock demo uses an in-process scripted "provider" and finishes in about a second. It writes outputs/mock-demo/run.db

plus a frozen config snapshot. The export

step is opt-in — the SQLite database is the source of truth.

LLM calls go through LiteLLM, so any provider it supports should work. Set the relevant key and reference the model in YAML:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
challenger:    { provider_model: anthropic/claude-haiku-4-5, temperature: 0.8 }
weak_solver:   { provider_model: openai/gpt-4o-mini }
strong_solver: { provider_model: openai/gpt-4o }
judge:         { provider_model: anthropic/claude-haiku-4-5, temperature: 0.0 }

You can mix providers across roles. The cheaper-vs-frontier split between the two solvers is the whole point — that's what produces the weak/strong gap that drives acceptance.

${VAR}

and ${VAR:default}

substitution works in any string field, so api_base: ${OLLAMA_HOST:http://localhost:11434}

does what you'd expect.

See configs/example_qa.yaml

and configs/example_math.yaml

for full real-provider configs.

For each source item, autosynth runs the same five-step loop until the candidate is accepted or loop.max_rounds

is exhausted:

Challenger proposes a candidate(input, reference_output, rubric)

.Quality audits the candidate for obvious problems.Weak andstrong solvers each take N attempts at the input.Judge scores every attempt against the rubric.Evaluator decides accept / reject. If reject,reflector writes feedback for the next round.

Three regimes decide whether a candidate is kept; pick per task with acceptance.mode

(or omit it to use the domain's default):

rubric(default) — the judge scores each rollout against the rubric; acceptance is a threshold-and-gap test. Best when quality is a matter of degree.verifiable— the domain checks answers programmatically (verify()

), the judge is skipped, and acceptance is a count gate:*weak must fail, strong must succeed.*Use for checkable answers (math, code, exact extraction). The bundledmath_word_problems

domain ships this way.judge— a loop-judge LLM reads the per-rollout weak/strong patterns and decides accept/improve each round. Use for open-ended tasks where no fixed threshold fits.

acceptance:
  mode: verifiable     # or: rubric | judge

The exact default thresholds live in AcceptanceConfig

(src/autosynth/config.py

); the mechanism, plus the loop.short_circuit_strong

cost-saver, is in docs/ARCHITECTURE.md.

A domain plugin is one class subclassing DomainAdapter

with six required methods. Scaffold one with:

uv run autosynth init-domain customer_support --out my_domain.py

Fill in load_grounding

, generation_prompt

, validate_candidate

, solver_prompt

, quality_prompt

, and judge_prompt

, then point your config at it. For a checkable-answer domain, also override verify()

and set default_acceptance_mode = "verifiable"

— the judge prompt is then unused.

domain:
  path: ./my_domain.py:CustomerSupport
  params:
    source_csv: ./tickets.csv

The two bundled domains (src/autosynth/domains/qa_from_documents.py

, math_word_problems.py

) are short and worth reading before you write your own.

autosynth metaopt --config CONFIG.yaml

runs the paper's secondary loop: evolve the orchestrator's prompts over generations, keeping a mutation only when it beats its parent on validation. Try it without keys:

uv run autosynth metaopt --config configs/metaopt_mock.yaml

The algorithm, the HarnessSpec

unit of evolution, and how to enable it for real are in docs/ARCHITECTURE.md.

autosynth run         --config CONFIG.yaml [--resume RUN_ID]   # generate a dataset
autosynth resume      RUN_DIR                                  # continue an interrupted run
autosynth status      RUN_DIR                                  # one-line progress
autosynth inspect-run RUN_DIR [--stuck]                        # detailed per-item table
autosynth export      --run RUN_DIR --format jsonl|hf          # write accepted records
autosynth metaopt     --config CONFIG.yaml                     # evolve the prompt harness
autosynth init-domain NAME --out my_domain.py                  # scaffold a domain plugin

Run autosynth <command> --help

for the full flag set. --stuck

filters to items that haven't reached a terminal state — what you want when something looks wrong.

Everything for a run lives under outputs/<run_id>/

: the run.db

SQLite database (the source of truth, safe to share), a config.snapshot.yaml

that resume reads back, and — only after autosynth export

accepted.jsonl

/ hf_export/

. Inspect the database directly with sqlite3 outputs/<run_id>/run.db .schema

; the table layout and accepted-record fields are documented in docs/ARCHITECTURE.md.

  • Every accepted datapoint carries an acceptance_rationale

and a serializedEvalReport

. There is no silent acceptance path. - The built-in PII filter ( safety.enabled: true

) is a conservative heuristic, not a real DLP. For anything regulated, plug your own module in viasafety.filter

. - Solvers are never toldthey're the weak or strong solver — the differential comes from the model/temperature choice. The paper flags adversarial prompting here as a gaming vector, so don't. - There is no diversity / near-duplicate check on accepted examples yet. If you need that, extend store.insert_accepted

with MinHash or embedding-based dedupe. - LLM-as-judge bias is what it is. The rubric weight cap and the positive-only rule from the paper help, but don't pretend they eliminate it.

The runtime is an event-sourced pipeline over SQLite: a pure step()

state machine, a dispatcher that fulfills LLM requests, and a durable store. Because step()

is pure, you can kill a run at any point and autosynth resume

picks up exactly where it left off. The full design — runtime, item state machine, and batch mode — is in docs/ARCHITECTURE.md.

uv run pytest          # runs offline against the in-process mock — no keys, no network

Setup, linting, commit conventions, and how to add a domain are in CONTRIBUTING.md.

MIT. See LICENSE

.

── more in #large-language-models 4 stories · sorted by recency
── more on @autosynth 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/show-hn-autosynth-ge…] indexed:0 read:5min 2026-07-04 ·