Open-source Mixture-of-Agents compound-model server: a self-hostable alternative to OpenRouter's Fusion API.
Fan a prompt out to a panel of LLMs in parallel, let a judge extract the structure of their answers (consensus, contradictions, partial coverage, unique insights), then a synthesizer writes one final answer grounded in that analysis. The result beats any single panelist — and a panel of budget models can rival a frontier model at a fraction of the cost.
It speaks the OpenAI API, so it drops into any existing OpenAI client: point `base_url` at fusionHarness and use the model slug `fusion`.
            ┌─────────────┐
 prompt ─►  │   fan-out   │ ─► model A ─┐
            │   (panel)   │ ─► model B ─┤  (parallel, each tool-enabled)
            └─────────────┘ ─► model C ─┘
                  │
                  ▼
            ┌───────────┐     ┌──────────────┐
            │   judge   │ ──► │ synthesizer  │ ─► final answer
            │ structure │     │   grounded   │    + cost / latency
            └───────────┘     └──────────────┘
Why it works (OpenRouter's own ablation): ~¾ of the lift comes from synthesis, ~¼ from diversity.
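The fan-out, judge, and synthesize stages described above can be sketched in a few lines of asyncio. This is a minimal illustration, not the project's engine: `call_model`, `judge`, `synthesize`, and `fuse` are hypothetical names, and the model calls are stubs that never touch the network.

```python
import asyncio

# Hypothetical stand-in for a real model call; a real deployment would hit
# an OpenAI-compatible backend here instead.
async def call_model(model: str, prompt: str) -> str:
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{model}] draft answer to: {prompt}"

async def judge(drafts: list[str]) -> dict:
    # A real judge is an LLM call that extracts consensus, contradictions,
    # partial coverage, and unique insights; this stub only counts drafts.
    return {"draft_count": len(drafts), "consensus": [], "contradictions": []}

async def synthesize(prompt: str, drafts: list[str], analysis: dict) -> str:
    # A real synthesizer writes the final answer from the drafts + analysis.
    return f"final answer to {prompt!r} grounded in {analysis['draft_count']} drafts"

async def fuse(prompt: str, panel: list[str]) -> str:
    # Fan out: every panelist runs concurrently.
    results = await asyncio.gather(
        *(call_model(m, prompt) for m in panel), return_exceptions=True
    )
    drafts = [r for r in results if isinstance(r, str)]  # drop failed panelists
    analysis = await judge(drafts)
    return await synthesize(prompt, drafts, analysis)

answer = asyncio.run(fuse("Compare CRDTs vs OT", ["model-a", "model-b", "model-c"]))
print(answer)
```

Using `return_exceptions=True` lets one failing panelist shrink the panel instead of aborting the whole request.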
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env # then put your key in FUSION_API_KEY
fusion serve --config configs/budget.yaml
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"fusion","messages":[{"role":"user","content":"Compare CRDTs vs OT for collaborative editing."}]}'
From the OpenAI Python SDK (`pip install openai`):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="fusion",
    messages=[{"role": "user", "content": "..."}],
)
print(resp.choices[0].message.content)
Or straight from the terminal, no server:
export FUSION_API_KEY=sk-or-...
fusion ask "What are the trade-offs between gRPC and REST?" --config configs/budget.yaml
Because fusion speaks OpenAI, it drops into any agent harness — or use our own.
fusion chat --config configs/budget.yaml
pi install ./integrations/pi
bash integrations/pi/install.sh
pi --model fusion
Adapters for Pi, Claude Code, aider, Continue, LangChain, and the OpenAI SDK are in integrations/. Verify the whole stack end-to-end with no API key:
scripts/smoke.sh --fake # boots a key-free fake backend + the real server
fusionHarness is also its own agentic coding harness: like Claude Code, but the brain can convene the fusion panel. The agent reads, writes, and edits files, searches, and runs bash in a tool-use loop confined to a project root, and can call `council` to escalate a hard sub-question to the full panel.
fusion code "add a /version endpoint and a test for it" --root .
fusion code # interactive agent session
fusion code "refactor X" --plan # write a plan first, then act
fusion code "delete dead code" --approve # confirm each file/bash action
Each step is printed as it happens; the agent calls `finish` when the task is done and verified. Tools are confined to `--root` (default: cwd). For a hard sub-problem the agent can call the `council` tool, which convenes the full fusion panel and returns a synthesized answer. `--approve` gates every mutating tool (write/edit/bash); `--plan` makes it write a numbered plan before acting.
⚠️ Security: the agent runs bash and edits files. Confinement blocks path escapes, not arbitrary command effects, so run it only on projects you trust, or in a container.
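To make "confinement blocks path escapes" concrete, here is one common way to implement such a check with `pathlib`. It is a sketch under assumptions: `resolve_in_root` is a hypothetical helper, not the project's actual function, and it says nothing about what a bash command can do once it runs.

```python
from pathlib import Path

def resolve_in_root(root: Path, requested: str) -> Path:
    """Return the absolute path for `requested`, or raise if it escapes `root`."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the real target rather than the spelling the model used.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # Python 3.9+
        raise PermissionError(f"{requested!r} escapes project root {root}")
    return candidate
```

A file tool would call this on every path argument before reading or writing, so `src/app.py` passes while `../outside.txt` raises `PermissionError`.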
A config picks the panel, judge, and synthesizer. Two presets ship in `configs/`:
| Preset | Panel | Use it for |
|---|---|---|
| `configs/budget.yaml` | Gemini 3 Flash · Kimi K2.6 · DeepSeek V4 Pro | frontier-ish quality at ~half the price |
| `configs/frontier.yaml` | Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro | beyond-frontier quality |
Custom panel:
name: fusion
panel:
  - anthropic/claude-opus-4.8
  - openai/gpt-5.5
  - model: deepseek/deepseek-v4-pro   # long form allows per-model overrides
    temperature: 0.3
    tools: [web_search]
judge: openai/gpt-5.5
synthesizer: anthropic/claude-opus-4.8
temperature: 0.7
max_tokens: 4096
tools_enabled: false
Model slugs follow OpenRouter conventions (`vendor/model`). Point at a different backend with `FUSION_BASE_URL` (OpenAI, a local vLLM/Ollama server, Groq, Together, or anything else OpenAI-compatible). API keys come from the environment only (`FUSION_API_KEY`, `OPENROUTER_API_KEY`, or `OPENAI_API_KEY`), never from YAML.
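Environment-only key lookup can be sketched as follows. The variable names come from the text above; the precedence (first non-empty value in the order listed) and the helper name `resolve_api_key` are assumptions, not confirmed behavior of the server.

```python
import os
from typing import Mapping, Optional

# Order taken from the paragraph above; whether the real server applies this
# exact precedence is an assumption.
KEY_VARS = ("FUSION_API_KEY", "OPENROUTER_API_KEY", "OPENAI_API_KEY")

def resolve_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty key found in the environment, else None."""
    for name in KEY_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.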
All optional config fields (with defaults):
| Key | Default | What it does |
|---|---|---|
| `refine` | `false` | Run one extra self-critique pass over the synthesized answer (quality ↑, cost ↑). |
| `layers` | `1` | Multi-layer MoA: with `layers>1`, proposers see the previous layer's drafts and improve before the final synthesis. |
| `samples` | `1` | Self-consistency: sample each proposer K times so the judge/synthesizer see more drafts. |
| `diversity` | `true` | Spread panelist temperatures so drafts differ (≈¼ of the lift). |
| `diversity_jitter` | `0.3` | How wide to spread temperatures (the MoA diversity↔quality trade-off; keep it modest). |
| `max_retries` | `2` | Retry transient upstream failures (429/5xx/timeout) so a flaky panelist doesn't shrink the panel. |
| `retry_backoff` | `0.5` | Base seconds for exponential retry backoff. |
| `max_concurrency` | `0` | Cap concurrent panelist calls (0 = unlimited). |
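To give a feel for what `diversity_jitter`, `retry_backoff`, and `max_retries` control, the sketch below computes a temperature spread and a retry schedule from the defaults in the table. Both formulas are assumptions chosen for illustration (even spacing around the base temperature, and delay doubling per attempt); the real implementation may differ, for example by randomizing.

```python
def spread_temperatures(base: float, jitter: float, n: int) -> list[float]:
    """Evenly space n temperatures across [base - jitter, base + jitter]."""
    if n == 1:
        return [base]
    step = 2 * jitter / (n - 1)
    # Clamp at 0.0 so a large jitter never yields a negative temperature.
    return [round(max(0.0, base - jitter + i * step), 3) for i in range(n)]

def backoff_delays(base: float, max_retries: int) -> list[float]:
    """Seconds to wait before each retry: base * 2**attempt."""
    return [base * 2 ** attempt for attempt in range(max_retries)]

print(spread_temperatures(0.7, 0.3, 3))  # [0.4, 0.7, 1.0]
print(backoff_delays(0.5, 2))            # [0.5, 1.0]
```

With the defaults, a three-model panel at base temperature 0.7 would draft at 0.4, 0.7, and 1.0, and a failing call would be retried after 0.5 s and then 1.0 s.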
If the judge fails, synthesis still runs from the raw responses; if the synthesizer fails, the best panelist's answer is returned. Anything that degraded is reported in the response's `fusion.degraded` list, never silently.
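The fallback order described above can be sketched as a pair of guarded stages. Everything here is illustrative: `fuse_with_fallbacks` and its callback parameters are hypothetical, the labels placed in `degraded` are made up, and how the "best" panelist is chosen is left to the caller because the text does not specify it.

```python
def fuse_with_fallbacks(drafts, run_judge, run_synth, pick_best):
    """Return (answer, degraded) following the fallback order described above."""
    degraded = []
    try:
        analysis = run_judge(drafts)
    except Exception:
        analysis = None              # synthesis proceeds from the raw drafts
        degraded.append("judge")
    try:
        answer = run_synth(drafts, analysis)
    except Exception:
        answer = pick_best(drafts)   # fall back to one panelist's answer
        degraded.append("synthesizer")
    return answer, degraded
```

Returning the `degraded` list alongside the answer is what lets the caller surface every fallback instead of hiding it.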
Panelists can call tools while drafting — useful for deep-research tasks. Tools are off by default. Enable globally and per-model:
tools_enabled: true
panel:
  - model: deepseek/deepseek-v4-pro
    tools: [web_search, bash]
- `web_search`: keyless DuckDuckGo Instant Answer by default; swap in a Tavily/Brave/SerpAPI backend via `default_registry(search_fn=...)`.
- `bash`: runs in a sandboxed shell (timeout, stripped env, output truncation).
⚠️ Security: `bash` executes commands the model writes. The sandbox is not a container. Run the server in a disposable VM/container before enabling `bash` with untrusted input. It is opt-in because it is dangerous.
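The three guard rails named for the `bash` tool (timeout, stripped environment, output truncation) can be approximated with the standard library. This is a sketch, not the project's sandbox: `run_sandboxed`, its default limits, and the `PATH` value are assumptions, and it assumes `bash` is installed in `/usr/bin` or `/bin`. As the warning above says, none of this isolates the command from the host.

```python
import subprocess

def run_sandboxed(command: str, timeout_s: float = 10.0, max_output: int = 4000) -> str:
    """Run `command` in bash with a timeout, a minimal env, and truncated output."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            # Replace the inherited environment so API keys and other
            # secrets are not visible to the command.
            env={"PATH": "/usr/bin:/bin"},
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout_s}s]"
    output = proc.stdout + proc.stderr
    if len(output) > max_output:
        output = output[:max_output] + "\n[output truncated]"
    return output
```

The command can still read and modify any file the server process can, which is why a disposable VM or container remains the real boundary.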
Every response carries its real cost and latency numbers. Non-streaming responses include a `fusion` block plus headers:
{
  "choices": [ ... ],
  "usage": { "prompt_tokens": 1234, "completion_tokens": 567, "total_tokens": 1801 },
  "fusion": {
    "config": "fusion",
    "panel_models": ["google/gemini-3-flash", "moonshotai/kimi-k2.6", "deepseek/deepseek-v4-pro"],
    "panel_succeeded": 3,
    "cost_usd": 0.0123,
    "cost_breakdown": [ { "model": "...", "role": "panel", "cost_usd": 0.004 }, ... ],
    "timing_s": { "panel": 2.1, "judge": 0.8, "synth": 3.4, "total": 6.3 }
  }
}
Headers: `x-fusion-cost-usd`, `x-fusion-latency-s`. When the backend reports an authoritative per-call cost, that value is used instead of the local price table (`fusion/pricing.py`).
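A client can use the `fusion` block for simple cost and latency reporting. The sketch below parses a sample body shaped like the response above; the elided parts are filled with made-up values, the `judge` and `synth` role names in `cost_breakdown` are assumptions (the source only shows `panel`), and `summarize` is a hypothetical helper.

```python
import json

# Illustrative payload only: model names and costs are invented.
sample = json.loads("""
{
  "choices": [{"message": {"role": "assistant", "content": "..."}}],
  "fusion": {
    "panel_models": ["a", "b", "c"],
    "panel_succeeded": 3,
    "cost_breakdown": [
      {"model": "a", "role": "panel", "cost_usd": 0.004},
      {"model": "b", "role": "judge", "cost_usd": 0.002},
      {"model": "c", "role": "synth", "cost_usd": 0.006}
    ],
    "timing_s": {"panel": 2.1, "judge": 0.8, "synth": 3.4, "total": 6.3}
  }
}
""")

def summarize(body: dict) -> dict:
    """Aggregate cost per role and find the slowest pipeline stage."""
    fusion = body.get("fusion", {})
    cost_by_role: dict = {}
    for item in fusion.get("cost_breakdown", []):
        role = item["role"]
        cost_by_role[role] = cost_by_role.get(role, 0.0) + item["cost_usd"]
    stages = {k: v for k, v in fusion.get("timing_s", {}).items() if k != "total"}
    return {
        "full_panel": fusion.get("panel_succeeded") == len(fusion.get("panel_models", [])),
        "cost_by_role": cost_by_role,
        "slowest_stage": max(stages, key=stages.get) if stages else None,
    }

print(summarize(sample))
```

Because every lookup uses `.get` with a default, the helper also tolerates responses from a plain OpenAI-compatible backend that carry no `fusion` block.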
Reproduce the panel-vs-solo comparison on DRACO-style weighted tasks (negative criteria penalize wrong claims, so you can't bluff a high score):
fusion eval --dry-run
fusion eval --ab --dry-run
fusion eval --config configs/budget.yaml --tasks eval/tasks.sample.yaml
Panel vs solo — scored on 3 task(s)
panel+refine 100.0% ████████████████████ ★ (+28.6 vs best solo)
panel 76.2% ███████████████ ★ (+4.8 vs best solo)
google/gemini-3-flash (solo) 71.4% ██████████████
`--runs N` repeats each task N times (self-consistency). Add your own tasks in `eval/tasks.sample.yaml` (id, prompt, weighted criteria). The dry-run numbers come from deterministic stubs; measuring real lift needs a key, so run the live command above.
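To show why negative criteria make bluffing costly, here is one plausible way to score an answer against weighted criteria. The formula (earned weight over the maximum attainable positive weight, floored at zero) and the criterion ids are assumptions for illustration; the scorer in `eval/` may compute this differently.

```python
def score(criteria: list[dict], met: set[str]) -> float:
    """Percent score: earned weight over the maximum attainable positive weight."""
    # A "met" negative criterion means the answer made the penalized claim.
    earned = sum(c["weight"] for c in criteria if c["id"] in met)
    best = sum(c["weight"] for c in criteria if c["weight"] > 0)
    if best == 0:
        return 0.0
    return round(100 * max(0.0, earned) / best, 1)

# Hypothetical criteria for a single task.
criteria = [
    {"id": "states_key_tradeoff", "weight": 3},
    {"id": "gives_concrete_example", "weight": 2},
    {"id": "makes_false_claim", "weight": -4},  # negative: penalizes a wrong claim
]

print(score(criteria, {"states_key_tradeoff", "gives_concrete_example"}))  # 100.0
print(score(criteria, {"states_key_tradeoff", "gives_concrete_example", "makes_false_claim"}))  # 20.0
```

Under these weights, an answer that covers everything but also asserts something false drops from 100.0 to 20.0, so padding a response with confident guesses lowers the score.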
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/chat/completions` | OpenAI-compatible; `stream:true` supported. Model slug `fusion`. |
| `GET` | `/v1/models` | Lists `fusion` plus the configured panel models. |
| `GET` | `/health` | Liveness + active config + panel. |
Per-request overrides. Customize the panel for a single request (like OpenRouter Fusion's "pass your own participant models and synthesizer") via a `fusion` block in the body. Only safe model-selection/flag keys are honored; the backend URL and keys can never be set from the request:
{
  "model": "fusion",
  "messages": [{"role": "user", "content": "..."}],
  "fusion": {
    "panel": ["anthropic/claude-opus-4.8", "openai/gpt-5.5"],
    "synthesizer": "anthropic/claude-opus-4.8",
    "refine": true,
    "layers": 2
  }
}
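The "only safe keys are honored" rule is typically enforced with an allow-list. The sketch below shows the idea; `ALLOWED_OVERRIDES` contains the keys from the example request plus `judge`, which is a guess, and `safe_overrides` is a hypothetical helper rather than the server's actual code.

```python
# Hypothetical allow-list: the real server's accepted set may differ.
ALLOWED_OVERRIDES = {"panel", "judge", "synthesizer", "refine", "layers"}

def safe_overrides(body: dict) -> dict:
    """Keep only allow-listed keys from the request's `fusion` block."""
    requested = body.get("fusion") or {}
    return {k: v for k, v in requested.items() if k in ALLOWED_OVERRIDES}

request_body = {
    "model": "fusion",
    "messages": [{"role": "user", "content": "..."}],
    "fusion": {
        "panel": ["anthropic/claude-opus-4.8", "openai/gpt-5.5"],
        "refine": True,
        "base_url": "http://attacker.example",  # silently dropped
        "api_key": "sk-stolen",                 # silently dropped
    },
}
print(safe_overrides(request_body))
```

An allow-list fails closed: a new config field stays unreachable from requests until someone deliberately adds it.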
fusion/ engine + server + harnesses
├─ providers · panel · judge · synthesize · fusion (MoA engine)
├─ tools · server · streaming · schemas · pricing · config
├─ tui.py (fusion chat — TUI harness)
└─ agent.py · agent_tools.py · cli.py (fusion code — agent harness)
eval/ DRACO-style evaluation harness (scorer, harness, tasks.sample.yaml)
configs/ panel presets (budget.yaml, frontier.yaml)
integrations/ harness adapters — Pi package, OpenAI-SDK example, adapter guide
scripts/ smoke.sh, verify_install.sh, verify_all.sh
docs/ architecture.md, parity.md
tests/ pytest suite — unit (providers mocked) + real-HTTP e2e (live servers), no API key
pip install -e ".[dev]"
pytest -q # full suite, no network required
See docs/architecture.md for the design and docs/parity.md for the parity matrix & roadmap.
MIT — see LICENSE.