{"slug": "show-hn-autosynth-generating-synthetic-data-with-strong-weak-model-filtering", "title": "Show HN: Autosynth – generating synthetic data with strong/weak model filtering", "summary": "Autosynth, a new open-source tool for generating synthetic datasets using an LLM loop that proposes, audits, solves, and judges its own work, has been released. Inspired by Meta FAIR's Autodata paper, it is domain-agnostic and uses a weak/strong solver filtering approach to ensure quality. The tool is in alpha and supports multiple LLM providers via LiteLLM.", "body_md": "Generate synthetic datasets with an LLM loop that proposes, audits, solves, and judges its own work. Inspired by Meta FAIR's [Autodata / Agentic Self-Instruct](https://doi.org/10.48550/arXiv.2606.25996) paper ([blog post](https://facebookresearch.github.io/RAM/blogs/autodata/)), but rewritten to be domain-agnostic: every domain-specific piece lives in a small Python plugin, and the runtime is the same regardless of whether you're generating math word problems, support-ticket triage data, or QA pairs from your own docs.\n\nFor each candidate datapoint, autosynth runs a *weak* solver and a *strong* solver, scores both against an LLM-generated rubric, and keeps the candidate only if it passes the quality audit and the strong solver clearly beats the weak one. Failed rounds are reflected on and fed back into the next attempt.\n\nStatus: alpha. The API is still moving, so pin a commit if you depend on it.\n\n```\nuv pip install autosynth             # core\nuv pip install \"autosynth[hf]\"       # + Hugging Face export\n```\n\nRequires Python 3.10+. Plain `pip install autosynth` works too. 
For a from-source (editable) development install, see [CONTRIBUTING.md](/Ahmad8864/autosynth/blob/main/CONTRIBUTING.md).\n\n```\nuv run autosynth run --config configs/mock_demo.yaml\nuv run autosynth status outputs/mock-demo\nuv run autosynth export --run outputs/mock-demo --format jsonl\n```\n\nThe mock demo uses an in-process scripted \"provider\" and finishes in about a second. It writes `outputs/mock-demo/run.db` plus a frozen config snapshot. The `export` step is opt-in: the SQLite database is the source of truth.\n\nLLM calls go through [LiteLLM](https://docs.litellm.ai/), so any provider it supports should work. Set the relevant key and reference the model in YAML:\n\n```\nexport OPENAI_API_KEY=...\nexport ANTHROPIC_API_KEY=...\n```\n\n```\nchallenger:    { provider_model: anthropic/claude-haiku-4-5, temperature: 0.8 }\nweak_solver:   { provider_model: openai/gpt-4o-mini }\nstrong_solver: { provider_model: openai/gpt-4o }\njudge:         { provider_model: anthropic/claude-haiku-4-5, temperature: 0.0 }\n```\n\nYou can mix providers across roles. The cheaper-vs-frontier split between the two solvers is the whole point: it produces the weak/strong gap that drives acceptance.\n\n`${VAR}` and `${VAR:default}` substitution works in any string field, so `api_base: ${OLLAMA_HOST:http://localhost:11434}` falls back to `http://localhost:11434` when `OLLAMA_HOST` is unset.\n\nSee `configs/example_qa.yaml` and `configs/example_math.yaml` for full real-provider configs.\n\nFor each source item, autosynth runs the same five-step loop until the candidate is accepted or `loop.max_rounds` is exhausted:\n\n1. **Challenger** proposes a candidate `(input, reference_output, rubric)`.\n2. **Quality** audits the candidate for obvious problems.\n3. **Weak** and **strong** solvers each take N attempts at the input.\n4. **Judge** scores every attempt against the rubric.\n5. **Evaluator** decides accept / reject. 
If the candidate is rejected, the **reflector** writes feedback for the next round.\n\nThree acceptance regimes decide whether a candidate is kept; pick one per task with `acceptance.mode` (or omit it to use the domain's default):\n\n- **rubric** (default): the judge scores each rollout against the rubric; acceptance is a threshold-and-gap test. Best when quality is a matter of degree.\n- **verifiable**: the domain checks answers programmatically (`verify()`), the judge is skipped, and acceptance is a count gate: *weak must fail, strong must succeed.* Use for checkable answers (math, code, exact extraction). The bundled `math_word_problems` domain ships this way.\n- **judge**: a loop-judge LLM reads the per-rollout weak/strong patterns and decides accept/improve each round. Use for open-ended tasks where no fixed threshold fits.\n\n```\nacceptance:\n  mode: verifiable     # or: rubric | judge\n```\n\nThe exact default thresholds live in `AcceptanceConfig` (`src/autosynth/config.py`); the mechanism, plus the `loop.short_circuit_strong` cost-saver, is in [docs/ARCHITECTURE.md](/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md#acceptance).\n\nA domain plugin is one class subclassing `DomainAdapter` with six required methods. Scaffold one with:\n\n```\nuv run autosynth init-domain customer_support --out my_domain.py\n```\n\nFill in `load_grounding`, `generation_prompt`, `validate_candidate`, `solver_prompt`, `quality_prompt`, and `judge_prompt`, then point your config at it. 
For a checkable-answer domain, also override `verify()` and set `default_acceptance_mode = \"verifiable\"`; the judge prompt is then unused.\n\n```\ndomain:\n  path: ./my_domain.py:CustomerSupport\n  params:\n    source_csv: ./tickets.csv\n```\n\nThe two bundled domains (`src/autosynth/domains/qa_from_documents.py`, `math_word_problems.py`) are short and worth reading before you write your own.\n\n`autosynth metaopt --config CONFIG.yaml` runs the paper's secondary loop: it evolves the orchestrator's *prompts* over generations, keeping a mutation only when it beats its parent on validation. Try it without API keys:\n\n```\nuv run autosynth metaopt --config configs/metaopt_mock.yaml\n```\n\nThe algorithm, the `HarnessSpec` unit of evolution, and how to enable it for real are in [docs/ARCHITECTURE.md](/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md#meta-optimization).\n\n```\nautosynth run         --config CONFIG.yaml [--resume RUN_ID]   # generate a dataset\nautosynth resume      RUN_DIR                                  # continue an interrupted run\nautosynth status      RUN_DIR                                  # one-line progress\nautosynth inspect-run RUN_DIR [--stuck]                        # detailed per-item table\nautosynth export      --run RUN_DIR --format jsonl|hf          # write accepted records\nautosynth metaopt     --config CONFIG.yaml                     # evolve the prompt harness\nautosynth init-domain NAME --out my_domain.py                  # scaffold a domain plugin\n```\n\nRun `autosynth <command> --help` for the full flag set. `--stuck` filters to items that haven't reached a terminal state, which is what you want when something looks wrong.\n\nEverything for a run lives under `outputs/<run_id>/`: the `run.db` SQLite database (the source of truth, safe to share), a `config.snapshot.yaml` that resume reads back, and, only after `autosynth export`, `accepted.jsonl` / `hf_export/`. 
Inspect the database directly with `sqlite3 outputs/<run_id>/run.db .schema`; the table layout and accepted-record fields are documented in [docs/ARCHITECTURE.md](/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md#run-database).\n\n- Every accepted datapoint carries an `acceptance_rationale` and a serialized `EvalReport`. There is no silent acceptance path.\n- The built-in PII filter (`safety.enabled: true`) is a conservative heuristic, not a real DLP. For anything regulated, plug your own module in via `safety.filter`.\n- Solvers are never *told* they're the weak or strong solver; the differential comes from the model/temperature choice. The paper flags adversarial prompting here as a gaming vector, so don't add it.\n- There is no diversity / near-duplicate check on accepted examples yet. If you need that, extend `store.insert_accepted` with MinHash or embedding-based dedupe.\n- LLM-as-judge bias remains a limitation. The rubric weight cap and the positive-only rule from the paper help, but they don't eliminate it.\n\nThe runtime is an event-sourced pipeline over SQLite: a pure `step()` state machine, a dispatcher that fulfills LLM requests, and a durable store. Because `step()` is pure, you can kill a run at any point and `autosynth resume` picks up exactly where it left off. The full design (runtime, item state machine, and batch mode) is in [docs/ARCHITECTURE.md](/Ahmad8864/autosynth/blob/main/docs/ARCHITECTURE.md).\n\n```\nuv run pytest          # runs offline against the in-process mock: no keys, no network\n```\n\nSetup, linting, commit conventions, and how to add a domain are in [CONTRIBUTING.md](/Ahmad8864/autosynth/blob/main/CONTRIBUTING.md).\n\nMIT. 
See `LICENSE`.", "url": "https://wpnews.pro/news/show-hn-autosynth-generating-synthetic-data-with-strong-weak-model-filtering", "canonical_source": "https://github.com/Ahmad8864/autosynth", "published_at": "2026-07-04 09:16:38+00:00", "updated_at": "2026-07-04 09:19:31.736642+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-tools", "ai-research"], "entities": ["Autosynth", "Meta FAIR", "LiteLLM", "OpenAI", "Anthropic", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/show-hn-autosynth-generating-synthetic-data-with-strong-weak-model-filtering", "markdown": "https://wpnews.pro/news/show-hn-autosynth-generating-synthetic-data-with-strong-weak-model-filtering.md", "text": "https://wpnews.pro/news/show-hn-autosynth-generating-synthetic-data-with-strong-weak-model-filtering.txt", "jsonld": "https://wpnews.pro/news/show-hn-autosynth-generating-synthetic-data-with-strong-weak-model-filtering.jsonld"}}