Generate synthetic datasets with an LLM loop that proposes, audits, solves, and judges its own work. Inspired by Meta FAIR's Autodata / Agentic Self-Instruct paper (blog post), but rewritten to be domain-agnostic: every domain-specific piece lives in a small Python plugin, and the runtime is the same regardless of whether you're generating math word problems, support-ticket triage data, or QA pairs from your own docs.
For each candidate datapoint, autosynth runs a weak solver and a strong solver, scores both against an LLM-generated rubric, and keeps the example only if the strong solver clearly beats the weak one on a quality-passing example. Failed rounds are reflected on and fed back into the next attempt.
Status: alpha. The API is still moving, so pin a commit if you're depending on it.
uv pip install autosynth # core
uv pip install "autosynth[hf]" # + Hugging Face export
Python 3.10+. Plain pip install autosynth works too. For a from-source / editable install for development, see CONTRIBUTING.md.
uv run autosynth run --config configs/mock_demo.yaml
uv run autosynth status outputs/mock-demo
uv run autosynth export --run outputs/mock-demo --format jsonl
The mock demo uses an in-process scripted "provider" and finishes in about a second. It writes outputs/mock-demo/run.db plus a frozen config snapshot. The export step is opt-in; the SQLite database is the source of truth.
LLM calls go through LiteLLM, so any provider it supports should work. Set the relevant key and reference the model in YAML:
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
challenger: { provider_model: anthropic/claude-haiku-4-5, temperature: 0.8 }
weak_solver: { provider_model: openai/gpt-4o-mini }
strong_solver: { provider_model: openai/gpt-4o }
judge: { provider_model: anthropic/claude-haiku-4-5, temperature: 0.0 }
You can mix providers across roles. The cheaper-vs-frontier split between the two solvers is the whole point — that's what produces the weak/strong gap that drives acceptance.
${VAR} and ${VAR:default} substitution works in any string field, so api_base: ${OLLAMA_HOST:http://localhost:11434} does what you'd expect.
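The substitution rule above can be sketched in a few lines of Python. This is an illustration of the syntax, not autosynth's own code: the helper name substitute_env is hypothetical, and raising on an unset variable with no default is an assumption about the behaviour.

```python
import os
import re

# Matches ${VAR} or ${VAR:default}. The default runs up to the closing brace,
# so it may itself contain colons (as in a URL).
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute_env(text, env=None):
    """Expand ${VAR} and ${VAR:default} in text (hypothetical helper)."""
    env = os.environ if env is None else env

    def expand(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: an unset variable without a default is an error.
        raise KeyError(f"environment variable {name} is not set")

    return _PATTERN.sub(expand, text)
```

Only the first colon separates the name from the default, which is why a URL default survives intact.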
See configs/example_qa.yaml and configs/example_math.yaml for full real-provider configs.
For each source item, autosynth runs the same five-step loop until the candidate is accepted or loop.max_rounds is exhausted:
1. Challenger proposes a candidate (input, reference_output, rubric).
2. Quality audits the candidate for obvious problems.
3. Weak and strong solvers each take N attempts at the input.
4. Judge scores every attempt against the rubric.
5. Evaluator decides accept / reject. If reject, reflector writes feedback for the next round.
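The five steps can be read as a plain loop. The sketch below is a simplified, synchronous stand-in, not the real runtime: Roles, run_item, and every callable signature are hypothetical, and the roles are ordinary functions rather than LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Roles:
    """Hypothetical bundle of callables standing in for the LLM-backed roles."""
    challenger: Callable      # (source, feedback) -> candidate dict
    quality: Callable         # candidate -> bool (passes the audit?)
    weak_solver: Callable     # input -> answer
    strong_solver: Callable   # input -> answer
    judge: Callable           # (candidate, answer) -> score
    evaluator: Callable       # (weak_scores, strong_scores) -> bool (accept?)
    reflector: Callable       # (candidate, weak_scores, strong_scores) -> feedback


def run_item(source, roles, max_rounds=3, attempts=2) -> Optional[dict]:
    feedback = []
    for _ in range(max_rounds):
        candidate = roles.challenger(source, feedback)            # 1. propose
        if not roles.quality(candidate):                          # 2. audit
            feedback.append("failed quality audit")
            continue
        task = candidate["input"]
        weak = [roles.weak_solver(task) for _ in range(attempts)]      # 3. solve
        strong = [roles.strong_solver(task) for _ in range(attempts)]
        weak_scores = [roles.judge(candidate, a) for a in weak]        # 4. judge
        strong_scores = [roles.judge(candidate, a) for a in strong]
        if roles.evaluator(weak_scores, strong_scores):                # 5. decide
            return candidate
        feedback.append(roles.reflector(candidate, weak_scores, strong_scores))
    return None  # max_rounds exhausted without an accepted candidate
```

Passing the accumulated feedback back into the challenger is what lets a rejected round shape the next attempt.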
Three regimes decide whether a candidate is kept; pick per task with acceptance.mode (or omit it to use the domain's default):
- rubric (default): the judge scores each rollout against the rubric; acceptance is a threshold-and-gap test. Best when quality is a matter of degree.
- verifiable: the domain checks answers programmatically (verify()), the judge is skipped, and acceptance is a count gate: *weak must fail, strong must succeed.* Use for checkable answers (math, code, exact extraction). The bundled math_word_problems domain ships this way.
- judge: a loop-judge LLM reads the per-rollout weak/strong patterns and decides accept/improve each round. Use for open-ended tasks where no fixed threshold fits.
acceptance:
mode: verifiable # or: rubric | judge
The exact default thresholds live in AcceptanceConfig (src/autosynth/config.py); the mechanism, plus the loop.short_circuit_strong cost-saver, is in docs/ARCHITECTURE.md.
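To make the rubric and verifiable gates concrete, here is a minimal sketch of each as a predicate. The function names, parameter names, and default numbers are placeholders chosen for illustration; they are not the values in AcceptanceConfig, and averaging the scores is an assumption about how rollouts are aggregated.

```python
from statistics import mean


def accept_rubric(weak_scores, strong_scores, min_strong=0.7, min_gap=0.2):
    """Threshold-and-gap test over judge scores (placeholder defaults)."""
    weak_avg, strong_avg = mean(weak_scores), mean(strong_scores)
    # Strong must clear an absolute bar AND beat weak by a margin.
    return strong_avg >= min_strong and (strong_avg - weak_avg) >= min_gap


def accept_verifiable(weak_passes, strong_passes, max_weak=0, min_strong_passes=1):
    """Count gate over per-attempt verify() booleans: weak fails, strong succeeds."""
    return sum(weak_passes) <= max_weak and sum(strong_passes) >= min_strong_passes
```

An example that both solvers ace fails the gap test, and one that neither can solve fails the threshold, so only examples that separate the two models get through.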
A domain plugin is one class subclassing DomainAdapter with six required methods. Scaffold one with:
uv run autosynth init-domain customer_support --out my_domain.py
Fill in load_grounding, generation_prompt, validate_candidate, solver_prompt, quality_prompt, and judge_prompt, then point your config at it. For a checkable-answer domain, also override verify() and set default_acceptance_mode = "verifiable"; the judge prompt is then unused.
domain:
path: ./my_domain.py:CustomerSupport
params:
source_csv: ./tickets.csv
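For orientation, a filled-in plugin might look roughly like the class below. It is a standalone stand-in that deliberately does not import autosynth: only the six method names, verify(), and default_acceptance_mode come from this README, while every signature, the constructor, the candidate dict keys, and the body column in the CSV are guesses. The scaffold produced by init-domain is the authority on the real interface.

```python
import csv


class CustomerSupport:  # in real use this would subclass DomainAdapter
    default_acceptance_mode = "verifiable"

    def __init__(self, params):
        self.params = params  # assumed: the params mapping from the YAML config

    def load_grounding(self):
        """Read source items, one dict per CSV row."""
        with open(self.params["source_csv"], newline="") as f:
            return list(csv.DictReader(f))

    def generation_prompt(self, item):
        # 'body' is a hypothetical column name.
        return f"Write a triage question about this ticket:\n{item['body']}"

    def validate_candidate(self, candidate):
        return all(candidate.get(k) for k in ("input", "reference_output"))

    def solver_prompt(self, candidate):
        return candidate["input"]

    def quality_prompt(self, candidate):
        return f"Is this task well-posed and unambiguous?\n{candidate['input']}"

    def judge_prompt(self, candidate, answer):
        return f"Rubric: {candidate.get('rubric')}\nAnswer: {answer}"

    def verify(self, candidate, answer):
        """Programmatic check: case- and whitespace-insensitive exact match."""
        expected = candidate["reference_output"].strip().lower()
        return answer.strip().lower() == expected
```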
The two bundled domains (src/autosynth/domains/qa_from_documents.py, math_word_problems.py) are short and worth reading before you write your own.
autosynth metaopt --config CONFIG.yaml runs the paper's secondary loop: evolve the orchestrator's prompts over generations, keeping a mutation only when it beats its parent on validation. Try it without keys:
uv run autosynth metaopt --config configs/metaopt_mock.yaml
The algorithm, the HarnessSpec unit of evolution, and how to enable it for real are in docs/ARCHITECTURE.md.
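The "keep a mutation only when it beats its parent" rule is a greedy hill climb. The sketch below shows that rule on a plain string; it is not the metaopt implementation, and evolve, mutate, and score are hypothetical names standing in for HarnessSpec mutation and validation scoring.

```python
import random


def evolve(seed_prompt, mutate, score, generations=5, rng=None):
    """Greedy prompt evolution: a child replaces its parent only if it scores higher."""
    rng = rng or random.Random(0)  # fixed seed keeps runs reproducible
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(generations):
        child = mutate(best, rng)
        child_score = score(child)
        if child_score > best_score:  # strict comparison: ties keep the parent
            best, best_score = child, child_score
    return best, best_score
```

Because rejected children are discarded, each generation always mutates the best prompt found so far.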
autosynth run --config CONFIG.yaml [--resume RUN_ID] # generate a dataset
autosynth resume RUN_DIR # continue an interrupted run
autosynth status RUN_DIR # one-line progress
autosynth inspect-run RUN_DIR [--stuck] # detailed per-item table
autosynth export --run RUN_DIR --format jsonl|hf # write accepted records
autosynth metaopt --config CONFIG.yaml # evolve the prompt harness
autosynth init-domain NAME --out my_domain.py # scaffold a domain plugin
Run autosynth <command> --help for the full flag set. --stuck filters to items that haven't reached a terminal state, which is what you want when something looks wrong.
Everything for a run lives under outputs/<run_id>/: the run.db SQLite database (the source of truth, safe to share), a config.snapshot.yaml that resume reads back, and, only after autosynth export, accepted.jsonl / hf_export/. Inspect the database directly with sqlite3 outputs/<run_id>/run.db .schema; the table layout and accepted-record fields are documented in docs/ARCHITECTURE.md.
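The same inspection works from Python with the standard-library sqlite3 module. The helper below makes no assumption about autosynth's schema: it lists whatever tables the file contains, with row counts. The name list_tables is made up for this example.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        # Names come straight from sqlite_master, so double-quoting them suffices.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Note that sqlite3.connect creates an empty database if the path does not exist, so check the path before pointing this at a run directory.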
- Every accepted datapoint carries an acceptance_rationale and a serialized EvalReport. There is no silent acceptance path.
- The built-in PII filter (safety.enabled: true) is a conservative heuristic, not a real DLP. For anything regulated, plug your own module in via safety.filter.
- Solvers are never told they're the weak or strong solver; the differential comes from the model/temperature choice. The paper flags adversarial prompting here as a gaming vector, so don't.
- There is no diversity / near-duplicate check on accepted examples yet. If you need that, extend store.insert_accepted with MinHash or embedding-based dedupe.
- LLM-as-judge bias is what it is. The rubric weight cap and the positive-only rule from the paper help, but don't pretend they eliminate it.
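As a starting point for the missing near-duplicate check, the sketch below uses exact Jaccard similarity over word shingles. This is a simpler stand-in for MinHash (which approximates the same measure at scale), and it is not wired into store.insert_accepted; the class name, the 0.8 threshold, and the three-word shingle size are arbitrary choices for illustration.

```python
def shingles(text, k=3):
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class NearDuplicateFilter:
    """Stores shingle sets of kept texts; each check is a linear scan."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self._seen = []

    def is_duplicate(self, text):
        candidate = shingles(text)
        return any(jaccard(candidate, s) >= self.threshold for s in self._seen)

    def add(self, text):
        self._seen.append(shingles(text))
```

The linear scan is fine for a few thousand examples; beyond that, MinHash with locality-sensitive hashing or an embedding index is the usual next step.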
The runtime is an event-sourced pipeline over SQLite: a pure step() state machine, a dispatcher that fulfills LLM requests, and a durable store. Because step() is pure, you can kill a run at any point and autosynth resume picks up exactly where it left off. The full design (runtime, item state machine, and batch mode) is in docs/ARCHITECTURE.md.
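Why purity makes resume safe can be shown with a toy version. The phases, event names, and ItemState fields below are invented for illustration and do not match autosynth's real state machine; the point is only that replaying a stored event log through a pure function always rebuilds the same state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ItemState:
    phase: str = "propose"  # hypothetical phase names
    round_no: int = 0


def step(state, event):
    """Pure transition: the same (state, event) always yields the same result."""
    if state.phase == "propose" and event == "candidate_ready":
        return replace(state, phase="solve")
    if state.phase == "solve" and event == "rollouts_done":
        return replace(state, phase="judge")
    if state.phase == "judge" and event == "accepted":
        return replace(state, phase="done")
    if state.phase == "judge" and event == "rejected":
        return ItemState(phase="propose", round_no=state.round_no + 1)
    return state  # unknown or out-of-order events leave the state unchanged


def replay(events, start=None):
    """Rebuild state by folding an event log through step()."""
    state = start if start is not None else ItemState()
    for event in events:
        state = step(state, event)
    return state
```

Killing the process between any two events loses nothing: the log is durable, and replaying it on restart lands on the state the run had reached.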
uv run pytest # runs offline against the in-process mock — no keys, no network
Setup, linting, commit conventions, and how to add a domain are in CONTRIBUTING.md.
MIT. See LICENSE.