Show HN: FusionHarness – An Open-source Mixture-of-Agents compound-model server


Open-source Mixture-of-Agents compound-model server: a self-hostable alternative to OpenRouter's Fusion API.

Fan a prompt out to a panel of LLMs in parallel, let a judge extract the structure of their answers (consensus, contradictions, partial coverage, unique insights), then have a synthesizer write one final answer grounded in that analysis. The result beats any single panelist, and a panel of budget models can rival a frontier model at a fraction of the cost.

It speaks the OpenAI API, so it drops into any existing OpenAI client: point base_url at fusionHarness and use the model slug fusion.

          ┌─────────────┐
prompt ─► │   fan-out   │ ─► model A ─┐
          │   (panel)   │ ─► model B ─┤  (parallel, each tool-enabled)
          └─────────────┘ ─► model C ─┘
                                  │
                                  ▼
                            ┌───────────┐     ┌──────────────┐
                            │   judge   │ ──► │ synthesizer  │ ─► final answer
                            │ structure │     │  grounded    │    + cost / latency
                            └───────────┘     └──────────────┘
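The flow in the diagram can be sketched in a few lines of asyncio. This is a minimal illustration, not fusionHarness's engine: the `fuse` function, the prompt wording, and the stub models are all hypothetical, and real panelists would be OpenAI-compatible HTTP calls with tools, retries, and cost tracking.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List

# Hypothetical stand-in for an LLM call: any async function prompt -> text.
Model = Callable[[str], Awaitable[str]]


@dataclass
class FusionResult:
    answer: str
    drafts: List[str]
    analysis: str
    degraded: List[str] = field(default_factory=list)


async def fuse(prompt: str, panel: List[Model], judge: Model, synthesizer: Model) -> FusionResult:
    # 1. Fan out: every panelist drafts concurrently; failures are collected, not raised.
    results = await asyncio.gather(*(model(prompt) for model in panel), return_exceptions=True)
    drafts = [r for r in results if isinstance(r, str)]
    degraded = [f"panelist {i} failed" for i, r in enumerate(results) if isinstance(r, BaseException)]

    # 2. Judge: extract the structure of the drafts rather than picking a winner.
    numbered = "\n\n".join(f"[Draft {i + 1}]\n{d}" for i, d in enumerate(drafts))
    analysis = await judge(
        f"Question: {prompt}\n\n{numbered}\n\n"
        "List the consensus, contradictions, partial coverage, and unique insights."
    )

    # 3. Synthesize one answer grounded in both the drafts and the judge's analysis.
    answer = await synthesizer(
        f"Question: {prompt}\n\nDrafts:\n{numbered}\n\nAnalysis:\n{analysis}\n\n"
        "Write one final answer grounded in the analysis."
    )
    return FusionResult(answer, drafts, analysis, degraded)


# Demo with stub models (no network): one panelist fails with a simulated 429.
async def _demo() -> FusionResult:
    async def model_a(p: str) -> str:
        return "CRDTs converge without a central server."

    async def model_b(p: str) -> str:
        return "OT usually transforms operations through a central server."

    async def model_c(p: str) -> str:
        raise RuntimeError("429 Too Many Requests")

    async def judge(p: str) -> str:
        return "consensus: both handle concurrent edits"

    async def synth(p: str) -> str:
        return f"final answer built from {p.count('[Draft')} drafts"

    return await fuse("CRDTs vs OT?", [model_a, model_b, model_c], judge, synth)


result = asyncio.run(_demo())
print(result.answer)    # final answer built from 2 drafts
print(result.degraded)  # ['panelist 2 failed']
```

Using `return_exceptions=True` keeps one failed panelist from cancelling the whole fan-out; the surviving drafts still reach the judge.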

Why it works (OpenRouter's own ablation): ~¾ of the lift comes from synthesis, ~¼ from diversity.

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env       # then put your key in FUSION_API_KEY

fusion serve --config configs/budget.yaml

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"fusion","messages":[{"role":"user","content":"Compare CRDTs vs OT for collaborative editing."}]}'

From the OpenAI Python SDK (pip install openai):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="fusion",
    messages=[{"role": "user", "content": "..."}],
)
print(resp.choices[0].message.content)

Or straight from the terminal, no server:

export FUSION_API_KEY=sk-or-...
fusion ask "What are the trade-offs between gRPC and REST?" --config configs/budget.yaml

Because fusion speaks the OpenAI API, it drops into any agent harness, or you can use our own:

fusion chat --config configs/budget.yaml

pi install ./integrations/pi
bash integrations/pi/install.sh
pi --model fusion

Adapters for Pi, Claude Code, aider, Continue, LangChain, and the OpenAI SDK are in integrations/. Verify the whole stack end-to-end with no API key:

scripts/smoke.sh --fake     # boots a key-free fake backend + the real server

fusionHarness is also its own agentic coding harness, like Claude Code but with a brain that can convene the fusion panel. The agent reads, writes, and edits files, searches, and runs bash in a tool-use loop confined to a project root, and it can call council to escalate a hard sub-question to the full panel.

fusion code "add a /version endpoint and a test for it" --root .
fusion code                       # interactive agent session
fusion code "refactor X" --plan   # write a plan first, then act
fusion code "delete dead code" --approve   # confirm each file/bash action

Each step is printed as it happens; the agent calls finish when the task is done and verified. Tools are confined to --root (default: cwd). For a hard sub-problem the agent can call the council tool, which convenes the full fusion panel and returns a synthesized answer. The --approve flag gates every mutating tool (write/edit/bash), and --plan makes the agent write a numbered plan before acting.
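Path confinement can be as simple as resolving every requested path against the project root and rejecting anything that lands outside it. A minimal sketch, assuming a hypothetical `confine()` helper (not the project's agent_tools.py code). As the security note says, this stops path escapes only, not what a bash command does.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def confine(root: str | Path, requested: str) -> Path:
    """Resolve `requested` against `root` and refuse anything that lands outside it.

    resolve() collapses ".." and follows symlinks, so "../x", absolute paths,
    and symlinks pointing out of the root are all rejected.
    """
    root_path = Path(root).resolve()
    target = (root_path / requested).resolve()  # an absolute `requested` replaces root_path here
    if not target.is_relative_to(root_path):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"{requested!r} escapes project root {root_path}")
    return target


def is_blocked(root: str | Path, requested: str) -> bool:
    try:
        confine(root, requested)
    except PermissionError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    inside_ok = not is_blocked(tmp, "src/app.py")  # a file that doesn't exist yet is fine
    parent_blocked = is_blocked(tmp, "../outside.txt")
    absolute_blocked = is_blocked(tmp, "/etc/passwd")
    print(inside_ok, parent_blocked, absolute_blocked)  # True True True
```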

⚠️ Security: the agent runs bash and edits files. Confinement blocks path escapes, not arbitrary command effects, so run it only on projects you trust, or in a container.

A config picks the panel, judge, and synthesizer. Two presets ship in configs/:

| Preset | Panel | Use it for |
|---|---|---|
| `configs/budget.yaml` | Gemini 3 Flash · Kimi K2.6 · DeepSeek V4 Pro | frontier-ish quality at ~half the price |
| `configs/frontier.yaml` | Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro | beyond-frontier quality |

Custom panel:

name: fusion
panel:
  - anthropic/claude-opus-4.8
  - openai/gpt-5.5
  - model: deepseek/deepseek-v4-pro    # long form allows per-model overrides
    temperature: 0.3
    tools: [web_search]
judge: openai/gpt-5.5
synthesizer: anthropic/claude-opus-4.8
temperature: 0.7
max_tokens: 4096
tools_enabled: false
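Panel entries come in two shapes: a bare slug, or a mapping with a model key plus per-model overrides. A sketch of how a loader might normalize both, assuming a hypothetical `Panelist` dataclass and that an entry without its own temperature inherits the top-level one. Per-model tools still depend on the global tools_enabled switch covered in the tools section.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Panelist:
    model: str
    temperature: float | None = None  # None means "inherit the config-level temperature"
    tools: list[str] = field(default_factory=list)


def parse_panel(entries: list[Any]) -> list[Panelist]:
    """Accept short-form slugs and long-form {model: ..., <overrides>} mappings."""
    panel = []
    for entry in entries:
        if isinstance(entry, str):
            panel.append(Panelist(model=entry))
        elif isinstance(entry, dict) and "model" in entry:
            panel.append(Panelist(
                model=entry["model"],
                temperature=entry.get("temperature"),
                tools=list(entry.get("tools", [])),
            ))
        else:
            raise ValueError(f"unrecognized panel entry: {entry!r}")
    return panel


# The custom-panel YAML above, as a YAML loader would hand it over.
config = {
    "panel": [
        "anthropic/claude-opus-4.8",
        "openai/gpt-5.5",
        {"model": "deepseek/deepseek-v4-pro", "temperature": 0.3, "tools": ["web_search"]},
    ],
    "temperature": 0.7,
}
panel = parse_panel(config["panel"])
effective = [p.temperature if p.temperature is not None else config["temperature"] for p in panel]
print(effective)  # [0.7, 0.7, 0.3]
```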

Model slugs follow OpenRouter conventions (vendor/model). Point at a different backend with FUSION_BASE_URL (OpenAI, a local vLLM/Ollama server, Groq, Together, or anything else OpenAI-compatible). API keys come from the environment only (FUSION_API_KEY, OPENROUTER_API_KEY, or OPENAI_API_KEY), never from YAML.
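A sketch of environment-only backend resolution. The precedence (FUSION_API_KEY first, then OPENROUTER_API_KEY, then OPENAI_API_KEY) simply follows the order listed above and is an assumption, as is the OpenRouter default base URL; `resolve_backend` is a hypothetical name.

```python
from __future__ import annotations

import os
from typing import Mapping

# Checked in the order the README lists them; first non-empty wins (assumed precedence).
KEY_VARS = ("FUSION_API_KEY", "OPENROUTER_API_KEY", "OPENAI_API_KEY")
# Assumed default: OpenRouter's OpenAI-compatible endpoint, since slugs follow its conventions.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_backend(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (base_url, api_key) from the environment only; YAML is never consulted."""
    for var in KEY_VARS:
        if env.get(var):
            return env.get("FUSION_BASE_URL", DEFAULT_BASE_URL), env[var]
    raise RuntimeError(f"no API key found; set one of {', '.join(KEY_VARS)}")


# A local Ollama server exposes an OpenAI-compatible API under /v1.
base_url, api_key = resolve_backend({
    "FUSION_BASE_URL": "http://localhost:11434/v1",
    "OPENAI_API_KEY": "sk-local",
})
print(base_url, api_key)  # http://localhost:11434/v1 sk-local
```

Taking the environment as a `Mapping` argument keeps the function testable without touching the real process environment.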

All optional config fields (with defaults):

| Key | Default | What it does |
|---|---|---|
| `refine` | `false` | Run one extra self-critique pass over the synthesized answer (quality ↑, cost ↑). |
| `layers` | `1` | Multi-layer MoA: with `layers>1`, proposers see the previous layer's drafts and improve on them before the final synthesis. |
| `samples` | `1` | Self-consistency: sample each proposer `samples` times so the judge and synthesizer see more drafts. |
| `diversity` | `true` | Spread panelist temperatures so drafts differ (≈¼ of the lift). |
| `diversity_jitter` | `0.3` | How wide to spread temperatures (the MoA diversity↔quality trade-off; keep it modest). |
| `max_retries` | `2` | Retry transient upstream failures (429/5xx/timeout) so a flaky panelist doesn't shrink the panel. |
| `retry_backoff` | `0.5` | Base delay in seconds for exponential retry backoff. |
| `max_concurrency` | `0` | Cap on concurrent panelist calls (`0` = unlimited). |
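The table names the knobs but not the exact formulas. One plausible reading, sketched below under stated assumptions: diversity_jitter spreads panelist temperatures evenly around the configured temperature, and retries double the wait each time starting from retry_backoff. Both functions are hypothetical; a real engine would likely be async and might randomize the spread instead.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def spread_temperatures(base: float, n: int, jitter: float = 0.3) -> List[float]:
    """Spread n panelist temperatures evenly over [base - jitter, base + jitter], clamped to [0, 2]."""
    if n <= 1 or jitter == 0:
        return [base] * n
    step = 2 * jitter / (n - 1)
    return [round(min(2.0, max(0.0, base - jitter + i * step)), 4) for i in range(n)]


class TransientError(Exception):
    """Hypothetical stand-in for an upstream 429/5xx/timeout."""


def call_with_retries(fn: Callable[[], T], max_retries: int = 2,
                      retry_backoff: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Try once, then retry up to max_retries times, waiting retry_backoff * 2**attempt in between."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: this panelist counts as failed
            sleep(retry_backoff * 2 ** attempt)
    raise AssertionError("unreachable")


print(spread_temperatures(0.7, 3))  # [0.4, 0.7, 1.0]

attempts = {"n": 0}


def flaky_panelist() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("429")
    return "draft"


delays: List[float] = []
print(call_with_retries(flaky_panelist, sleep=delays.append), delays)  # draft [0.5, 1.0]
```

Injecting `sleep` lets the demo record the backoff schedule instead of actually waiting.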

If the judge fails, synthesis still runs from the raw responses; if the synthesizer fails, the best panelist's answer is returned. Anything that degraded is reported in the response's fusion.degraded list, never silently.
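The degradation rules above amount to two guarded steps that record what went wrong instead of raising. A minimal synchronous sketch with hypothetical names; how the engine chooses the "best" panelist isn't specified here, so the demo uses the longest draft as a placeholder.

```python
from typing import Any, Callable, Dict, List, Optional


def fuse_with_fallbacks(
    drafts: List[str],
    judge: Callable[[List[str]], str],
    synthesize: Callable[[List[str], Optional[str]], str],
    pick_best: Callable[[List[str]], str],
) -> Dict[str, Any]:
    degraded: List[str] = []
    try:
        analysis: Optional[str] = judge(drafts)
    except Exception as exc:
        analysis = None  # synthesis still runs, just from the raw drafts
        degraded.append(f"judge: {exc}")
    try:
        answer = synthesize(drafts, analysis)
    except Exception as exc:
        answer = pick_best(drafts)  # fall back to a single panelist's draft
        degraded.append(f"synthesizer: {exc}")
    return {"answer": answer, "fusion": {"degraded": degraded}}


def longest(ds: List[str]) -> str:
    # Placeholder for "best panelist"; the real selection criterion isn't documented.
    return max(ds, key=len)


def failing_judge(ds: List[str]) -> str:
    raise TimeoutError("judge timed out")


def synth(ds: List[str], analysis: Optional[str]) -> str:
    return f"synthesized {len(ds)} drafts (analysis: {analysis or 'none'})"


def failing_synth(ds: List[str], analysis: Optional[str]) -> str:
    raise RuntimeError("503 from upstream")


drafts = ["short draft", "a longer, more complete draft"]
r1 = fuse_with_fallbacks(drafts, failing_judge, synth, longest)
r2 = fuse_with_fallbacks(drafts, lambda ds: "consensus found", failing_synth, longest)
print(r1)  # synthesized without analysis; degraded == ['judge: judge timed out']
print(r2)  # longest draft returned; degraded == ['synthesizer: 503 from upstream']
```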

Panelists can call tools while drafting — useful for deep-research tasks. Tools are off by default. Enable globally and per-model:

tools_enabled: true
panel:
  - model: deepseek/deepseek-v4-pro
    tools: [web_search, bash]

- web_search: keyless DuckDuckGo Instant Answer by default; swap in a Tavily/Brave/SerpAPI backend via default_registry(search_fn=...).
- bash: runs in a sandboxed shell (timeout, stripped env, output truncation).

⚠️ Security: bash executes commands the model writes, and the sandbox is not a container. Run the server in a disposable VM or container before enabling bash with untrusted input. It is opt-in because it is dangerous.
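The three bash safeguards (timeout, stripped env, output truncation) map directly onto `subprocess.run`. A minimal sketch under that assumption, with a hypothetical `run_sandboxed` helper; like the real tool, it is not a container and offers no isolation from the filesystem or network. A model's command would be run as `["bash", "-c", command]`; the demo uses the current Python interpreter so it needs nothing beyond Python.

```python
import os
import subprocess
import sys
import tempfile
from typing import Any, Dict, List


def run_sandboxed(argv: List[str], cwd: str, timeout_s: float = 30.0,
                  max_output: int = 4000) -> Dict[str, Any]:
    """Run a command with a timeout, a stripped environment, and truncated output.

    This limits accidents, not attacks: the process still runs as the server's user
    with full filesystem and network access.
    """
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}  # no API keys or other secrets
    try:
        proc = subprocess.run(argv, cwd=cwd, env=env, capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:  # run() kills the child before raising
        return {"exit_code": None, "output": f"[timed out after {timeout_s}s]"}
    output = proc.stdout + proc.stderr
    if len(output) > max_output:
        output = output[:max_output] + f"\n[truncated {len(output) - max_output} chars]"
    return {"exit_code": proc.returncode, "output": output}


with tempfile.TemporaryDirectory() as root:
    noisy = run_sandboxed([sys.executable, "-c", "print('x' * 50)"], cwd=root, max_output=20)
    slow = run_sandboxed([sys.executable, "-c", "import time; time.sleep(5)"], cwd=root, timeout_s=0.5)
    leak = run_sandboxed([sys.executable, "-c", "import os; print('FUSION_API_KEY' in os.environ)"],
                         cwd=root)

print(noisy["output"])  # 20 x's, then "[truncated 31 chars]"
print(slow)             # {'exit_code': None, 'output': '[timed out after 0.5s]'}
print(leak["output"])   # False
```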

Every response reports its actual cost and timing. Non-streaming responses include a fusion block plus headers:

{
  "choices": [ ... ],
  "usage": { "prompt_tokens": 1234, "completion_tokens": 567, "total_tokens": 1801 },
  "fusion": {
    "config": "fusion",
    "panel_models": ["google/gemini-3-flash", "moonshotai/kimi-k2.6", "deepseek/deepseek-v4-pro"],
    "panel_succeeded": 3,
    "cost_usd": 0.0123,
    "cost_breakdown": [ { "model": "...", "role": "panel", "cost_usd": 0.004 }, ... ],
    "timing_s": { "panel": 2.1, "judge": 0.8, "synth": 3.4, "total": 6.3 }
  }
}

Headers: x-fusion-cost-usd, x-fusion-latency-s. When the backend reports an authoritative per-call cost, that value is used instead of the local price table (fusion/pricing.py).
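Per-call cost accounting can be sketched as: use the backend-reported cost when present, otherwise multiply token counts by per-million-token prices, then sum into a breakdown like the `cost_breakdown` list in the response. The model slugs and prices in `PRICE_TABLE` are made-up placeholders, not figures from fusion/pricing.py, and the per-million-token unit is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Placeholder slugs and (input, output) USD prices per million tokens; NOT real figures.
PRICE_TABLE = {
    "example/small-model": (0.10, 0.40),
    "example/large-model": (3.00, 15.00),
}


@dataclass
class Call:
    model: str
    role: str  # "panel", "judge", or "synth"
    prompt_tokens: int
    completion_tokens: int
    reported_cost_usd: float | None = None  # set when the backend returns an authoritative cost


def call_cost(call: Call) -> float:
    if call.reported_cost_usd is not None:
        return call.reported_cost_usd  # backend value wins over the local table
    in_price, out_price = PRICE_TABLE.get(call.model, (0.0, 0.0))  # unknown models count as free here
    return (call.prompt_tokens * in_price + call.completion_tokens * out_price) / 1_000_000


def cost_breakdown(calls: list[Call]) -> tuple[float, list[dict]]:
    breakdown = [{"model": c.model, "role": c.role, "cost_usd": round(call_cost(c), 6)} for c in calls]
    return round(sum(item["cost_usd"] for item in breakdown), 6), breakdown


calls = [
    Call("example/small-model", "panel", prompt_tokens=1000, completion_tokens=500),
    Call("example/large-model", "synth", prompt_tokens=2000, completion_tokens=1000),
    Call("example/large-model", "judge", prompt_tokens=1500, completion_tokens=300,
         reported_cost_usd=0.0042),
]
total_usd, breakdown = cost_breakdown(calls)
print(total_usd)  # 0.0255
```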

Reproduce the panel-vs-solo comparison on DRACO-style weighted tasks (negative criteria penalize wrong claims, so you can't bluff a high score):

fusion eval --dry-run

fusion eval --ab --dry-run

fusion eval --config configs/budget.yaml --tasks eval/tasks.sample.yaml
Panel vs solo — scored on 3 task(s)

panel+refine                     100.0%  ████████████████████  ★ (+28.6 vs best solo)
panel                             76.2%  ███████████████  ★ (+4.8 vs best solo)
google/gemini-3-flash (solo)      71.4%  ██████████████

--runs N repeats each task N times (self-consistency). Add your own tasks in eval/tasks.sample.yaml (id, prompt, weighted criteria). The dry-run numbers come from deterministic stubs; measuring real lift needs an API key, so run the live command above.
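To see why negative criteria stop bluffing, here is one possible weighted scoring rule: earned weight (negative weights subtract) divided by the maximum achievable positive weight, floored at zero. This is an assumed formula for illustration, not the harness's scorer or DRACO's exact definition, and whether each criterion is met (normally a grader's call) is passed in directly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    description: str
    weight: float  # > 0: points if satisfied; < 0: penalty if the answer makes this wrong claim


def weighted_score(criteria: List[Criterion], met: List[bool]) -> float:
    """Earned weight over the best achievable (positive) weight, floored at 0."""
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    earned = sum(c.weight for c, hit in zip(criteria, met) if hit)
    return max(0.0, earned / max_points) if max_points else 0.0


criteria = [
    Criterion("explains how CRDTs converge without coordination", 3),
    Criterion("notes that OT implementations usually rely on a central server", 2),
    Criterion("mentions CRDT metadata/tombstone overhead", 2),
    Criterion("claims OT cannot handle concurrent edits", -3),  # wrong claim, penalized
]
careful = weighted_score(criteria, [True, True, False, False])
bluffer = weighted_score(criteria, [True, True, True, True])  # covers more, but asserts the false claim
print(round(careful, 3), round(bluffer, 3))  # 0.714 0.571
```

The bluffer hits every positive criterion yet scores lower, because the wrong claim costs more than the extra coverage earns.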

| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/chat/completions` | OpenAI-compatible; `stream:true` supported. Model slug `fusion`. |
| `GET` | `/v1/models` | Lists `fusion` plus the configured panel models. |
| `GET` | `/health` | Liveness + active config + panel. |

Per-request overrides. Customize the panel for a single request (like OpenRouter Fusion's "pass your own participant models and synthesizer") via a fusion block in the request body. Only safe model-selection and flag keys are honored; the backend URL and API keys can never be set from the request:

{
  "model": "fusion",
  "messages": [{"role": "user", "content": "..."}],
  "fusion": {
    "panel": ["anthropic/claude-opus-4.8", "openai/gpt-5.5"],
    "synthesizer": "anthropic/claude-opus-4.8",
    "refine": true,
    "layers": 2
  }
}
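The "only safe keys" rule is an allow-list merge over the server's config. A sketch under assumptions: the exact key set in `ALLOWED_OVERRIDES` is a guess, and whether the real server ignores or rejects other keys isn't stated, so the hypothetical `apply_overrides` returns the ignored keys and leaves that choice to the caller.

```python
from typing import Any, Dict, List, Tuple

# Assumed allow-list: model selection plus quality flags. Anything that could redirect
# traffic or leak credentials (base_url, api_key, ...) is deliberately absent.
ALLOWED_OVERRIDES = frozenset({"panel", "judge", "synthesizer", "refine", "layers", "samples"})


def apply_overrides(base: Dict[str, Any], body: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    overrides = body.get("fusion") or {}
    if not isinstance(overrides, dict):
        raise ValueError("'fusion' must be a JSON object")
    merged = dict(base)  # never mutate the server's loaded config
    ignored = []
    for key, value in overrides.items():
        if key in ALLOWED_OVERRIDES:
            merged[key] = value
        else:
            ignored.append(key)
    return merged, ignored


server_config = {"panel": ["google/gemini-3-flash"], "synthesizer": "google/gemini-3-flash",
                 "refine": False, "layers": 1, "base_url": "https://openrouter.ai/api/v1"}
request_body = {
    "model": "fusion",
    "messages": [{"role": "user", "content": "..."}],
    "fusion": {"panel": ["anthropic/claude-opus-4.8", "openai/gpt-5.5"], "refine": True,
               "base_url": "http://attacker.example/v1", "api_key": "sk-attacker"},
}
effective, ignored = apply_overrides(server_config, request_body)
print(effective["panel"], effective["refine"], effective["base_url"])
print(ignored)  # ['base_url', 'api_key']
```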
fusion/        engine + server + harnesses
               ├─ providers · panel · judge · synthesize · fusion   (MoA engine)
               ├─ tools · server · streaming · schemas · pricing · config
               ├─ tui.py        (fusion chat — TUI harness)
               └─ agent.py · agent_tools.py · cli.py   (fusion code — agent harness)
eval/          DRACO-style evaluation harness (scorer, harness, tasks.sample.yaml)
configs/       panel presets (budget.yaml, frontier.yaml)
integrations/  harness adapters — Pi package, OpenAI-SDK example, adapter guide
scripts/       smoke.sh, verify_install.sh, verify_all.sh
docs/          architecture.md, parity.md
tests/         pytest suite — unit (providers mocked) + real-HTTP e2e (live servers), no API key
pip install -e ".[dev]"
pytest -q          # full suite, no network required

See docs/architecture.md for the design and docs/parity.md for the parity matrix & roadmap.

MIT — see LICENSE.
