Show HN: Autosynth – generating synthetic data with strong/weak model filtering

Autosynth, a new open-source tool for generating synthetic datasets using an LLM loop that proposes, audits, solves, and judges its own work, has been released. Inspired by Meta FAIR's Autodata paper, it is domain-agnostic and uses a weak/strong solver filtering approach to ensure quality. The tool is in alpha and supports multiple LLM providers via LiteLLM.

Generate synthetic datasets with an LLM loop that proposes, audits, solves, and judges its own work. Inspired by Meta FAIR's Autodata / Agentic Self-Instruct paper (https://doi.org/10.48550/arXiv.2606.25996) and blog post (https://facebookresearch.github.io/RAM/blogs/autodata/), but rewritten to be domain-agnostic: every domain-specific piece lives in a small Python plugin, and the runtime is the same whether you're generating math word problems, support-ticket triage data, or QA pairs from your own docs.

For each candidate datapoint, autosynth runs a weak solver and a strong solver, scores both against an LLM-generated rubric, and keeps the example only if the strong solver clearly beats the weak one on a quality-passing example. Failed rounds are reflected on and fed back into the next attempt.

Status: alpha. The API is still moving, so pin a commit if you're depending on it.

    uv pip install autosynth          # core
    uv pip install "autosynth[hf]"    # + Hugging Face export

Python 3.10+. Plain pip install autosynth works too. For a from-source / editable install for development, see CONTRIBUTING.md (/Ahmad8864/autosynth/blob/main/CONTRIBUTING.md).

    uv run autosynth run --config configs/mock_demo.yaml
    uv run autosynth status outputs/mock-demo
    uv run autosynth export --run outputs/mock-demo --format jsonl

The mock demo uses an in-process scripted "provider" and finishes in about a second. It writes outputs/mock-demo/run.db plus a frozen config snapshot. The export step is opt-in; the SQLite database is the source of truth.

LLM calls go through LiteLLM (https://docs.litellm.ai/), so any provider it supports should work. Set the relevant key and reference the model in YAML:

    export OPENAI_API_KEY=...
    export ANTHROPIC_API_KEY=...
    challenger:    { provider_model: anthropic/claude-haiku-4-5, temperature: 0.8 }
    weak_solver:   { provider_model: openai/gpt-4o-mini }
    strong_solver: { provider_model: openai/gpt-4o }
    judge:         { provider_model: anthropic/claude-haiku-4-5, temperature: 0.0 }

You can mix providers across roles. The cheaper-vs-frontier split between the two solvers is the whole point: it produces the weak/strong gap that drives acceptance.

${VAR} and ${VAR:default} substitution works in any string field, so api_base: ${OLLAMA_HOST:http://localhost:11434} does what you'd expect. See configs/example_qa.yaml and configs/example_math.yaml for full real-provider configs.

For each source item, autosynth runs the same five-step loop until the candidate is accepted or loop.max_rounds is exhausted:

1. Challenger proposes a candidate (input, reference output, rubric).
2. Quality audits the candidate for obvious problems.
3. Weak and strong solvers each take N attempts at the input.
4. Judge scores every attempt against the rubric.
5. Evaluator decides accept / reject. If reject, reflector writes feedback for the next round.

Three regimes decide whether a candidate is kept; pick per task with acceptance.mode (or omit it to use the domain's default):

- rubric (default): the judge scores each rollout against the rubric; acceptance is a threshold-and-gap test. Best when quality is a matter of degree.
- verifiable: the domain checks answers programmatically (verify), the judge is skipped, and acceptance is a count gate: weak must fail, strong must succeed. Use for checkable answers (math, code, exact extraction). The bundled math_word_problems domain ships this way.
- judge: a loop-judge LLM reads the per-rollout weak/strong patterns and decides accept/improve each round. Use for open-ended tasks where no fixed threshold fits.
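The rubric and verifiable gates described above can be sketched in a few lines of Python. This is a minimal illustration, not autosynth's code: the function names, the threshold names and values (strong_min, weak_max, min_gap), the use of mean scores, and the count limits are all assumptions made for the example. The project's actual defaults are defined in its own configuration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class RubricThresholds:
    """Hypothetical thresholds; the values here are illustrative only."""
    strong_min: float = 0.7   # strong solver must score at least this on average
    weak_max: float = 0.5     # weak solver must score at most this on average
    min_gap: float = 0.2      # strong average must exceed weak average by this much


def accept_rubric(
    weak_scores: Sequence[float],
    strong_scores: Sequence[float],
    t: RubricThresholds = RubricThresholds(),
) -> bool:
    """Threshold-and-gap test over judge scores in [0, 1] (assumed scale)."""
    weak_avg, strong_avg = mean(weak_scores), mean(strong_scores)
    return (
        strong_avg >= t.strong_min
        and weak_avg <= t.weak_max
        and strong_avg - weak_avg >= t.min_gap
    )


def accept_verifiable(
    weak_passes: Sequence[bool],
    strong_passes: Sequence[bool],
    max_weak_passes: int = 0,
    min_strong_passes: int = 1,
) -> bool:
    """Count gate: the weak solver must fail and the strong solver must succeed."""
    return (
        sum(weak_passes) <= max_weak_passes
        and sum(strong_passes) >= min_strong_passes
    )


if __name__ == "__main__":
    # Strong clearly beats weak: kept.
    print(accept_rubric([0.2, 0.4], [0.8, 0.9]))
    # Weak solver already does well, so the example is too easy: rejected.
    print(accept_rubric([0.6, 0.7], [0.8, 0.9]))
    # Verifiable mode: weak fails every attempt, strong passes at least once.
    print(accept_verifiable([False, False], [True, False]))
```

Keeping each gate a pure function of the per-rollout results makes the decision easy to log alongside the accepted record, which fits the "no silent acceptance path" goal stated elsewhere in this README.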
    acceptance:
      mode: verifiable   # or: rubric | judge

The exact default thresholds live in AcceptanceConfig (src/autosynth/config.py); the mechanism, plus the loop.short_circuit_strong cost-saver, is in the "acceptance" section of docs/ARCHITECTURE.md (/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md).

A domain plugin is one class subclassing DomainAdapter with six required methods. Scaffold one with:

    uv run autosynth init-domain customer_support --out my_domain.py

Fill in load_grounding, generation_prompt, validate_candidate, solver_prompt, quality_prompt, and judge_prompt, then point your config at it. For a checkable-answer domain, also override verify and set default_acceptance_mode = "verifiable"; the judge prompt is then unused.

    domain:
      path: ./my_domain.py:CustomerSupport
      params:
        source_csv: ./tickets.csv

The two bundled domains (src/autosynth/domains/qa_from_documents.py, math_word_problems.py) are short and worth reading before you write your own.

autosynth metaopt --config CONFIG.yaml runs the paper's secondary loop: it evolves the orchestrator's prompts over generations, keeping a mutation only when it beats its parent on validation. Try it without keys:

    uv run autosynth metaopt --config configs/metaopt_mock.yaml

The algorithm, the HarnessSpec unit of evolution, and how to enable it for real are in the "meta-optimization" section of docs/ARCHITECTURE.md (/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md).

    autosynth run --config CONFIG.yaml --resume RUN_ID    # generate a dataset
    autosynth resume RUN_DIR                              # continue an interrupted run
    autosynth status RUN_DIR                              # one-line progress
    autosynth inspect-run RUN_DIR --stuck                 # detailed per-item table
    autosynth export --run RUN_DIR --format jsonl|hf      # write accepted records
    autosynth metaopt --config CONFIG.yaml                # evolve the prompt harness
    autosynth init-domain NAME --out my_domain.py         # scaffold a domain plugin

Run autosynth <command> --help for the full flag set. The --stuck flag filters to items that haven't reached a terminal state, which is what you want when something looks wrong.
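The selection rule behind autosynth metaopt (keep a mutation only when it beats its parent on validation) can be sketched as a greedy loop. This is a generic illustration under stated assumptions, not the project's algorithm: evolve, mutate, and score are hypothetical names, and the harness is modeled as an arbitrary value rather than the real HarnessSpec.

```python
import random
from typing import Callable, Tuple, TypeVar

H = TypeVar("H")  # stand-in for whatever is being evolved (e.g. a set of prompts)


def evolve(
    seed: H,
    mutate: Callable[[H, random.Random], H],
    score: Callable[[H], float],
    generations: int,
    rng: random.Random,
) -> Tuple[H, float]:
    """Greedy evolution: a child replaces its parent only if it scores strictly higher."""
    parent, parent_score = seed, score(seed)
    for _ in range(generations):
        child = mutate(parent, rng)
        child_score = score(child)  # assumed to be a validation-set metric
        if child_score > parent_score:
            parent, parent_score = child, child_score
        # Otherwise the mutation is discarded and the parent survives.
    return parent, parent_score


if __name__ == "__main__":
    # Toy problem: the "harness" is an integer and validation quality peaks at 3.
    best, best_score = evolve(
        seed=0,
        mutate=lambda h, rng: h + 1,   # deterministic toy mutation
        score=lambda h: -abs(h - 3),
        generations=10,
        rng=random.Random(0),
    )
    print(best, best_score)  # → 3 0
```

Requiring a strict improvement means the loop never drifts sideways on ties, at the cost of possibly stalling in a local optimum; a real system would also have to budget for the cost of each validation run.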
Everything for a run lives under outputs/<run_id>/: the run.db SQLite database (the source of truth, safe to share), a config.snapshot.yaml that resume reads back, and, only after autosynth export, accepted.jsonl / hf_export/. Inspect the database directly with sqlite3 outputs/<run_id>/run.db .schema; the table layout and accepted-record fields are documented in the "run-database" section of docs/ARCHITECTURE.md (/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md).

- Every accepted datapoint carries an acceptance rationale and a serialized EvalReport. There is no silent acceptance path.
- The built-in PII filter (safety.enabled: true) is a conservative heuristic, not a real DLP. For anything regulated, plug your own module in via safety.filter.
- Solvers are never told they're the weak or strong solver; the differential comes from the model/temperature choice. The paper flags adversarial prompting here as a gaming vector, so don't.
- There is no diversity / near-duplicate check on accepted examples yet. If you need that, extend store.insert_accepted with MinHash or embedding-based dedupe.
- LLM-as-judge bias is what it is. The rubric weight cap and the positive-only rule from the paper help, but don't pretend they eliminate it.

The runtime is an event-sourced pipeline over SQLite: a pure step state machine, a dispatcher that fulfills LLM requests, and a durable store. Because step is pure, you can kill a run at any point and autosynth resume picks up exactly where it left off. The full design (runtime, item state machine, and batch mode) is in docs/ARCHITECTURE.md (/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md).

    uv run pytest    # runs offline against the in-process mock; no keys, no network

Setup, linting, commit conventions, and how to add a domain are in CONTRIBUTING.md (/Ahmad8864/autosynth/blob/main/CONTRIBUTING.md).

Licensed under MIT. See LICENSE.