An open-source, drop-in compound-model proxy. Point any OpenAI-compatible tool at it,
set model: "openfusion"
, and your prompt is fanned out to a panel of LLMs in parallel β then a judge model reads every response (consensus, contradictions, blind spots) and streams back a single synthesized answer that aims to beat any one of them.
It's the open version of the mixture-of-agents idea behind OpenRouter's Fusion: better answers from models you already pay for, as a tunable, forkable recipe instead of a black box.
** Quick start** Β·
New here? You only need the first two to run it; the rest is for tuning and contributing.
| Path | What it is |
|---|---|
openfusion/ |
|
The proxy (FastAPI). Start with server.py ; see |
|
web/ |
|
The playground UI source (React + shadcn). Built assets ship in openfusion/static/ . |
|
examples/ |
|
| Copy-paste config recipes (preset, dev, panel, benchβ¦). You don't need a config to start. | |
bench/ |
|
Reproducible head-to-head harness; bench/FINDINGS.md is where fusion does and doesn't pay off. |
|
DESIGN.md Β· docs/ |
|
| Design rationale, architecture, and security notes. |
Beta β panel fan-out, judge synthesis, SSE streaming, web-tool fusion, an Auto Router, debate/ vote/ranked aggregators, production limits, and an interactive playground. See DESIGN.md and docs/ARCHITECTURE.md for architecture and security notes.
openfusion
has two front ends β an interactive terminal chat and a web playground. No clone, no config, no env vars needed to start.
uvx --from git+https://github.com/shahar-dagan/openfusion openfusion # ephemeral, needs uv
Bare openfusion
drops you into a Rich-rendered chat with the model panel β a banner, a live
panel-progress spinner, Markdown answers with syntax-highlighted code, and slash commands
(/preset
, /tokens
, /models
, /key
, /clear
). On first run it asks for your OpenRouter key and
saves it (~/.config/openfusion/credentials
), so later runs don't re-prompt; use /key
to
change it. Pipe for one-shots: echo "β¦" | openfusion
.
openfusion web # opens the playground in your browser
openfusion web
pops the playground open at http://localhost:8000
once the server is ready (pass
--no-open
, or it's skipped automatically in non-interactive/headless/Docker contexts). Paste your key (kept only in server memory) and fuse. With nothing configured it boots the Budget preset (a diverse panel + judge with web search) so the first run lands where fusion actually wins.
uv tool install . # from a clone β or: pipx install . && pipx ensurepath
For active development, pip install -e .
inside an activated venv (the command then works only
while that venv is active). A bare pip install -e .
does not put openfusion
on your global PATH β see Troubleshooting.
For a fixed recipe, write an openfusion.yaml
(start from examples/preset.yaml.example
β
preset: quality | budget
, or examples/default.yaml.example
for a fully spelled-out panel/judge). A preset expands to a diverse OpenRouter panel + judge with web tools on, mirroring OpenRouter Fusion's Quality/Budget switch:
| Preset | Panel | Judge | Tools |
|---|---|---|---|
quality |
|||
| Claude Sonnet 4 Β· Gemini 3 Pro Β· DeepSeek V4 Pro | Claude Sonnet 4 | web search + fetch | |
budget |
|||
| GPT-4o-mini Β· DeepSeek V4 Pro Β· Kimi K2.6 | DeepSeek V4 Pro | web search + fetch |
Use as a drop-in API from the OpenAI SDK (with openfusion web
running):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local-dev")
stream = client.chat.completions.create(
model="openfusion",
messages=[{"role": "user", "content": "Explain mixture-of-agents in one paragraph."}],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
Or straight from the terminal, no server needed:
openfusion ask "Compare Postgres and SQLite for a small SaaS." --max-tokens 800
ask
runs one fusion against your configured panel and streams the synthesized answer to stdout
(panel progress goes to stderr). --max-tokens
caps every call β lower is faster and cheaper.
Speed & length.Fusion runs N panel calls plus a judge, so it's slower than one model β the panel runs in parallel and the judge streams as soon as the panel finishes. The judge is prompted to stay concise, and you cap length with--max-tokens
(CLI),max_tokens
(API), the response- length control in the playground Settings, orcost_controls
in config.
Three knobs control whether and how a prompt is fused. All are optional and off/default.
Auto Router(router.enabled: true
) β a per-prompt gate that answers simple prompts with a single pass-through call and reserves the panel for prompts that look like they benefit (long, analytical, or containing code). Default is a cheap heuristic (no extra model call);mode: model
uses a small classifier model and falls back to the heuristic if it errors:
router:
enabled: true
mode: heuristic # heuristic | model | always | never
min_chars: 280 # prompts at/over this length fuse
Strategy(strategy:
) β how the panel is produced:self_fusion
(one model sampled N times),panel
(a fixed diverse panel), ordebate
(a diverse panel where each member revises after seeing the others' answers, then the judge synthesizes). Debate trades extra cost/latency for cross-examination:
strategy: debate
debate:
rounds: 1 # revision rounds before the judge
Aggregator(aggregator:
) β how answers become one:judge
(synthesis, default),vote
(majority vote, cheaper, best for verifiable short-answer tasks), orranked
(one short judge call picks the single best answer β cheaper than synthesis, uses model judgment unlike vote). -
Analysis transparency(analysis.emit: true
) β surface the judge's structured reasoning (consensus / contradictions / partial coverage / unique insights / blind spots) as a separate SSEevent: analysis
(and ananalysis
field on non-streaming responses), without polluting the answer body. -
Prompt caching(cache.enabled: true
) β mark the shared prefix so self-fusion's N samples reuse a cached prompt on providers that support it (a no-op elsewhere).
For public deployments, bound load and spend (both default to 0
= unlimited):
limits:
max_in_flight: 64 # cap concurrent requests; over-limit returns 503
rate_limit_per_minute: 60 # per gateway key (or per client when unauthenticated); over-limit returns 429
These are best-effort, single-process guards β pair them with provider-side budgets and, for multi-replica deployments, an edge rate limiter.
A request to model: "openfusion"
is fanned out to a panel of models in parallel (each optionally doing its own web research), then a judge model reads every answer and synthesizes one β streamed back over SSE, with the structured analysis and cost alongside.
flowchart LR
C["Client<br/>(Cursor Β· OpenAI SDK Β· anything)"] -->|"POST /v1/chat/completions<br/>model=openfusion"| R{"Router<br/><i>(optional)</i>"}
R -->|simple prompt| S["Single model"] --> OUT
R -->|worth fusing| P
subgraph P ["Panel Β· parallel fan-out"]
direction TB
A["Model A π"]
B["Model B π"]
D["Model C π"]
end
P --> J["Judge<br/>consensus Β· contradictions Β· blind spots"]
J --> OUT["Streamed answer (SSE)<br/>+ analysis + token/cost"]
C -.->|other model / client tools| S
classDef accent fill:#eef2ff,stroke:#4f46e5,color:#3730a3;
class J,R accent;
Drop-in. OpenAI-compatiblePOST /v1/chat/completions
+/v1/models
, real SSE streaming.No lock-in. Each panel member + judge is{base_url, api_key, model}
. OpenRouter is the default upstream; OpenAI, Together, local vLLM/Ollama all work.Config-driven. Panel, judge, strategy, aggregator, router, and limits live inopenfusion.yaml
β or a one-wordpreset
, or nothing at all (zero-config quick start).
openfusion is the open implementation of the same idea. The core mechanism is at parity; the differences are scale and a per-prompt router.
| OpenRouter Fusion | openfusion | |
|---|---|---|
| Parallel panel β judge synthesis | β | β |
| Synthesis dimensions | consensus Β· contradictions Β· partial coverage Β· unique insights Β· blind spots | same |
| Web search + fetch on the panel | β (default) | β
(on by default with preset: ) |
| Quality / Budget presets | β | β (`preset: quality |
| Override panel + judge | β (plugin fields) | β
(any {base_url, api_key, model} in YAML) |
| Per-call cost breakdown | β (Activity) | β
(SSE usage event + /metrics ) |
| Self-hostable / forkable | β closed API | β MIT, any OpenAI-compatible provider |
| Per-prompt Auto Router | β | β
heuristic or model classifier (router.enabled ) |
| Structured analysis surfaced | β | β
analysis.emit (SSE analysis event) |
| Multi-round debate | β | β
strategy: debate |
| Concurrency cap + rate limiting | β | β
limits (best-effort, single-process) |
| Interactive web playground | β | β
embedded at /playground (zero-build) |
| Headline benchmark | full DRACO (100 tasks) | DRACO subset (10 tasks) β see |
| Parameter | Applies to | Notes |
|---|---|---|
temperature (client) |
||
| Judge only indirectly via recipe | Self-fusion varies panel temps from config, not client | |
max_tokens , stop , response_format |
||
| Judge (visible output) | Panel members use recipe defaults | |
stream , stream_options |
||
| Judge path | Panel always runs non-streamed internally | |
tools / tool_calls |
||
| Fusion or pass-through | Server-executable web tools (openrouter:web_search /web_fetch ) are fused; client-side function tools and mid-conversation tool turns pass through |
| Variable | Purpose |
|---|---|
OPENROUTER_API_KEY |
|
Default upstream key (via ${OPENROUTER_API_KEY} in config) |
|
OPENFUSION_CONFIG |
|
Path to config file (default: openfusion.yaml ) |
|
OPENFUSION_API_KEYS |
|
| Comma-separated gateway allowlist (optional) | |
OPENFUSION_HOST / OPENFUSION_PORT |
|
| Server bind address |
cost_controls
in config caps max_tokens
for pass-through, panel, and judge calls. Missing
max_tokens
values are filled from the configured ceiling; over-limit pass-through and judge
requests return 400
, while internal panel calls clamp to their ceiling.
Run the opt-in live OpenRouter smoke test only when you intend to spend a small number of credits:
export OPENROUTER_API_KEY=your-key
python scripts/openrouter_smoke.py --config examples/dev.yaml.example --yes-spend-credits
Run the head-to-head benchmark (self-fusion vs solo model):
pip install -e ".[dev]"
python bench/run.py --config examples/default.yaml.example --tasks bench/tasks/sample.jsonl
Use --tasks bench/tasks/smoke.jsonl --max-tokens 32
before larger benchmark runs.
Each run reports accuracy plus the spend it took to get there β total_tokens
and
total_cost_usd
per mode β so you can weigh any accuracy change against the extra cost of fanning out to a panel.
The bundled bench/tasks/sample.jsonl
(20 short Q&A tasks) is saturated for a capable model β
the solo baseline already scores ~100%, so there is no headroom for fusion to add accuracy. On a
recent run with openai/gpt-4o-mini
(self-fusion N=2, max_tokens=32
):
| Mode | Accuracy | Avg latency | Tokens | Cost |
|---|---|---|---|---|
| Solo | 100% (20/20) | 0.55s | 536 | $0.0001 |
| Self-fusion | 95% (19/20) | 1.40s | 4,669 | $0.0008 |
So on easy tasks fusion does not beat a single call β it costs more (here ~9Γ the tokens) and can even regress, because the judge only has trivially-correct answers to choose between. This is expected: mixture-of-agents helps where a single model is unreliable, not where it is already right.
openfusion makes
no"beats frontier" claim. Demonstrating where fusion earns its cost needs a harder eval (one the solo baseline does not already ace) scored onquality per dollar, not accuracy alone. That eval is in progress; this table will be updated to show where fusion does and doesn't pay off. Claim only what your ownbench/run.py
run proves on your model and tasks.
The proxy exposes Prometheus metrics at GET /metrics
(no auth; scrape-only, bind accordingly):
openfusion_requests_total{route,outcome}
β client-facing requests (fusion
/pass_through
).openfusion_upstream_requests_total{phase,outcome}
β upstream calls bypanel
/judge
/pass_through
.openfusion_panel_members_total{outcome}
β per-member success vs. degraded failures.openfusion_tokens_total{phase,kind}
andopenfusion_cost_usd_total{phase}
β token and cost spend.openfusion_request_latency_ms
/openfusion_upstream_latency_ms
β latency summaries (_count
+_sum
).
Cost (usage.cost
, when the upstream reports it) is also rolled into the per-request SSE
event: usage
payload and the non-streaming usage
field, so a single fusion call shows what it
spent across the panel and judge. Per-call structured logs remain on the openfusion.upstream
logger.
The server hosts an interactive playground at GET /playground
(and GET /
redirects there). It's
a React + Tailwind + shadcn UI whose built assets ship in the package (no Node needed to run);
it talks only to the local /v1
API, so provider keys never reach the browser. You can:
- paste your OpenRouter API key on first run (held only in server memory; enabled by
allow_ui_api_key
, on for the zero-config quick start), - pick a Quality / Budget / Custom panel and a "Fuse with" judge model, - toggle web search, send a prompt, and watch the panel β synthesis progress, - read the streamed answer plus the judge's structured analysis(consensus / contradictions / blind spots) and the** token + cost**breakdown.
The model selectors are editable when the server sets allow_request_overrides: true
(on for the
quick start), which enables the per-request openfusion: { preset | panel | judge | tools }
field
(mirroring OpenRouter Fusion's analysis_models
/model
plugin fields). Overrides reuse the
server's upstream credentials β clients choose model ids, never keys β and stay bounded by gateway
auth, cost ceilings, and rate limits. Read GET /v1/config
for the active panel/judge and flags.
The UI source lives in web/
(Vite + React + TypeScript + Tailwind v4 + shadcn-style components):
cd web
npm install
npm run dev # dev server (proxy /v1 to a running openfusion on :8000)
npm run build # writes built assets into openfusion/static/playground/ (commit them)
** openfusion: command not found** β the console script lives in the environment you installed it into. Either install it as a tool so it's always on
PATH
(uv tool install .
or pipx install .
),
or activate the venv you used (source .venv/bin/activate
). A bare pip install -e .
does not put
openfusion
on your global PATH
.Playground says "Couldn't reach the server" β open the page at the URL the running server prints
(default http://localhost:8000
), not a dev-server port or a standalone file.
** No upstream API key** β set
OPENROUTER_API_KEY
, run openfusion setup
, or paste your key into the playground.Backend: Python 3.11+ / FastAPI / httpx / uvicorn. Frontend: React / Vite / Tailwind / shadcn.
Contributions are welcome β openfusion is meant to be forked and tuned. See CONTRIBUTING.md for dev setup and the PR checklist, and CODE_OF_CONDUCT.md. Please report security issues privately per SECURITY.md rather than as a public issue.
MIT.