Show HN: Find where multi-agent AI systems break before production

A new open-source tool, swarm-test, lets developers find failures in multi-agent AI systems before production by analyzing agent topology statically, without live LLM calls. It detects cascade failures, single points of failure, context leakage, and other issues, outputting a Swarm Score and interactive D3 dashboard. The tool supports CrewAI, LangGraph, AutoGen, and custom systems, and integrates with GitHub Actions for PR annotations.

Find where your multi-agent AI system breaks before production does.

Static reliability testing for CrewAI, LangGraph, AutoGen, and custom agent systems. No live LLM calls, no API cost.

Chain 14 agents at 95% reliability each and your system is ~49% reliable end-to-end (0.95^14). The failures aren't inside any single agent; they're in how the agents connect: silent cascade failures, hidden single points of failure, fragile dependencies. swarm-test finds them by analyzing your agent topology.

```bash
pip install swarm-test
swarm-test run my_crew.py --open
```

`--open` launches an interactive D3 dashboard in your browser the moment the run finishes: Swarm Score, a force-directed agent graph with single points of failure pulsing red, sortable health and redundancy tables, and every finding grouped by severity.

No real script handy? Build a synthetic topology straight from the CLI:

```bash
swarm-test run -a "Orchestrator,Worker1,Worker2" -e "Orchestrator Worker1,Orchestrator Worker2"
```

What it catches:

- One agent fails and silently takes down everything downstream (cascade failure)
- A single agent the whole system depends on; remove it and the swarm splits (blast radius / SPOF)
- Credentials, PII, or other sensitive data leaking across agent boundaries (context leakage)
- Agents drifting from their assigned role; prompt-injection-style goal hijacking (intent drift)
- A slow upstream with no timeout boundary blocking the whole pipeline (timeout resilience)
- Dense cliques, echo chambers, and cycles that bypass the orchestrator (collusion detection)
- Agents stuck in loops: runaway step counts and retry storms that burn tokens with no error thrown (trajectory analysis)
- Output schema mismatches across agent edges (contract violation; opt-in, provide a contracts YAML)

Features:

- 0–100 Swarm Score with a verdict line (EXCELLENT → CRITICAL) and one-line output for CI
- Agent role classification (orchestrator, aggregator, validator, gateway, worker, monitor, router) with confidence scores
- Role-adjusted severity: a validator leaking context is upgraded; an orchestrator's blast radius is downgraded
- Historical tracking: trend across runs, diff of new vs. resolved findings
- Interactive HTML report (`--open`): D3 force-directed graph, NxN heatmap, filterable findings
- GitHub Action with PR annotations and job-summary score
- Graph export to Mermaid, DOT, or PNG (SPOFs red, redundant green)
- Framework adapters: CrewAI, LangGraph, AutoGen, generic / static graph
- YAML config (`.swarmtest.yml`) and entry-point plugin system

In CI with GitHub Actions:

```yaml
# .github/workflows/swarm-test.yml
on: pull_request
jobs:
  swarm-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: surajkumar811/swarm-test@v0.3.0
        with:
          script: my_crew.py
          fail-on-severity: high
```

Findings appear inline on the PR as `::error::` / `::warning::` / `::notice::` annotations; the Swarm Score is posted to the workflow job summary.

From Python:

```python
from swarm_test import SwarmProbe

# Works with a CrewAI Crew, LangGraph CompiledGraph, or AutoGen GroupChatManager
probe = SwarmProbe(crew, swarm_name="my-crew")
report = probe.run_all()
report.print_summary()
report.to_html("report.html")
```

Install with pip, optionally with framework extras:

```bash
pip install swarm-test
# or with framework extras:
pip install "swarm-test[crewai]"
pip install "swarm-test[langgraph]"
pip install "swarm-test[autogen]"
pip install "swarm-test[png]"   # for PNG graph export
```

## How it works

swarm-test builds a NetworkX directed graph from your agent system: nodes are agents, edges are interactions extracted by each framework adapter. All tests are static graph analyses; no LLM calls are made, and results are deterministic given the same topology.

**Cascade failure**: simulates each agent failing in turn and measures downstream impact.

**Blast radius**: detects articulation points (graph-theoretic SPOFs) and scores every agent on a 0–100 redundancy scale composed of path redundancy (30%), role uniqueness (25%), tool coverage (20%), betweenness centrality (15%), and degree ratio (10%).
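The cascade-failure and blast-radius ideas described above can be illustrated with a small sketch. This is not swarm-test's implementation: it uses plain standard-library Python instead of NetworkX so it runs without dependencies, the `TOPOLOGY` graph is a made-up example, and the articulation-point check is a brute-force stand-in. The weights are the ones quoted above, but how swarm-test computes each component is not described in the source, so `redundancy_score` simply assumes each component is already normalized to [0, 1].

```python
from collections import deque

# Hypothetical topology: agent -> list of downstream agents
TOPOLOGY = {
    "Orchestrator": ["Researcher", "Coder"],
    "Researcher": ["Validator"],
    "Coder": ["Validator"],
    "Validator": ["Reporter"],
    "Reporter": [],
}


def downstream(graph, failed):
    """Agents reachable from `failed` along directed edges (its cascade set)."""
    seen, queue = set(), deque([failed])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen and nxt != failed:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_connected(nodes, undirected):
    """True if `nodes` form a single connected component in the undirected view."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in undirected[queue.popleft()]:
            if nxt in nodes and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == nodes


def single_points_of_failure(graph):
    """Brute-force articulation points: agents whose removal splits the rest."""
    undirected = {n: set() for n in graph}
    for src, targets in graph.items():
        for dst in targets:
            undirected[src].add(dst)
            undirected[dst].add(src)
    return {n for n in graph if not is_connected(set(graph) - {n}, undirected)}


# Weights quoted in the text; the component values passed in are assumptions.
WEIGHTS = {
    "path_redundancy": 0.30,
    "role_uniqueness": 0.25,
    "tool_coverage": 0.20,
    "betweenness": 0.15,
    "degree_ratio": 0.10,
}


def redundancy_score(components):
    """Weighted 0-100 score from per-component values in [0, 1]."""
    return 100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS)


print(sorted(downstream(TOPOLOGY, "Researcher")))  # ['Reporter', 'Validator']
print(single_points_of_failure(TOPOLOGY))          # {'Validator'}
```

In this example the Orchestrator is not an articulation point, because Researcher and Coder both feed the Validator and the rest of the graph stays connected without it. The Validator is the only agent whose removal isolates another agent (the Reporter).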
**Context leakage**: scans interaction payloads against a sensitive-data regex set (extensible from `.swarmtest.yml`).

**Intent drift**: flags agents whose observed behavior diverges from their declared role; includes prompt-injection heuristics.

**Collusion**: finds dense cliques, echo chambers, and cycles that bypass the declared orchestrator.

**Timeout resilience**: identifies long synchronous chains with no timeout boundary.

**Trajectory analysis**: flags self-loops, ping-pong pairs, multi-agent feedback cycles, unbounded loops with no exit, repeated parallel calls, and cycles deeper than `max_trajectory_depth` (default 5).

**Contract violation**: validates agent outputs against JSON schemas declared per edge (opt-in; pass `--contracts contracts.yml`).

Roles are classified from structural metrics (in/out degree, betweenness centrality) plus naming hints, each with a 0–100% confidence score. Severity is then role-adjusted: an orchestrator with high blast radius is expected and gets downgraded; a validator leaking context is a security incident and gets upgraded.

## Output modes & formats

| Flag | Output |
|---|---|
| `--quiet` / `-q` | Headline verdict only (one line). Ideal for `if` checks in CI scripts. |
| (default) | Headline + test results + critical/high findings + SPOFs. |
| `--verbose` / `-V` | Every finding, graph metrics, full health and redundancy tables. |

Output formats via `--output-format`: `console`, `json`, `markdown`, `html`. The same verbosity setting is configurable in `.swarmtest.yml`.

## Graph export

```bash
swarm-test graph my_crew.py --format mermaid
swarm-test graph my_crew.py --format dot --output topology.dot
swarm-test graph my_crew.py --format png --output topology.png   # needs the png extra
```

Mermaid renders inline on GitHub, so you can drop the output straight into a README or PR description. Colors: red = SPOF, orange = moderate redundancy, green = fully redundant.

## Historical tracking

Every run writes a small JSON snapshot to `.swarmtest-history/`. Subsequent runs print a trend line below the headline verdict:

```
Swarm Score: 72/100 — NEEDS IMPROVEMENT (3 critical findings)
Trend: ↑ +18 from last run (was 54) — improving
Recent: 54 → 61 → 58 → 72
✓ 3 findings resolved since last run
⚠ 1 new finding since last run
```

Browse with `swarm-test history show`. Disable per-run with `--no-history`, or globally via `history_enabled: false` in `.swarmtest.yml`. `.swarmtest-history/` is gitignored by default; commit it if you want the trend to survive across CI machines.

## Configuration

```yaml
# .swarmtest.yml
fail_on_severity: high        # critical | high | medium | low | info | none
max_blast_radius: 0.5         # 0.0 – 1.0
disabled_tests:
  - collusion
sensitive_patterns:
  - "INTERNAL-[A-Z0-9]+"
output_format: html
output_path: ./swarm.html
timeout_seconds: 30
strict: false                 # treat ANY finding as a failure
```

Auto-discovers `.swarmtest.yml`, `.swarmtest.yaml`, `swarmtest.yml`, or a `[tool.swarmtest]` table in `pyproject.toml`. CLI flags always override config-file values.

Exit codes from `run`: `0` (passed), `1` (findings exceed thresholds), `2` (config or runtime error).

## Plugin system

Ship custom tests as installable Python packages. Register under the `swarm_test.plugins` entry-point group; swarm-test auto-discovers and runs them alongside the built-in tests:

```toml
[project.entry-points."swarm_test.plugins"]
my_custom_test = "my_package.plugins:MyPlugin"
```

```bash
swarm-test plugins list
```

See [`examples/plugin_template/`](https://github.com/surajkumar811/swarm-test/blob/main/examples/plugin_template) for a runnable starter.
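For CI setups that don't use the GitHub Action, the exit codes listed under Configuration are enough to gate a pipeline. The following is a minimal sketch of such a wrapper, assuming `swarm-test` is on the PATH. The helper names `describe_exit` and `run_gate` are hypothetical and not part of swarm-test; only the `run` subcommand, the `--quiet` flag, and the three exit codes come from the text above.

```python
import subprocess

# Exit-code meanings as documented for `swarm-test run`
EXIT_MEANINGS = {
    0: "passed",
    1: "findings exceed thresholds",
    2: "config or runtime error",
}


def describe_exit(code):
    """Translate a swarm-test exit code into a short human-readable status."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(script):
    """Run `swarm-test run <script> --quiet`; return (exit code, description).

    Hypothetical helper: requires the swarm-test CLI to be installed.
    """
    result = subprocess.run(["swarm-test", "run", script, "--quiet"])
    return result.returncode, describe_exit(result.returncode)
```

A pipeline step could call `run_gate("my_crew.py")` and pass the returned code to `sys.exit`, so that threshold failures (1) and tool errors (2) stay distinguishable in the build log.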
## Framework examples (CrewAI, LangGraph, AutoGen, static)

```python
# CrewAI
from crewai import Crew
from swarm_test import SwarmProbe

SwarmProbe(crew, swarm_name="my-crew").run_all().print_summary()

# LangGraph
from langgraph.graph import StateGraph
from swarm_test import SwarmProbe

SwarmProbe(compiled_graph, swarm_name="my-langgraph").run_all().to_json("report.json")

# AutoGen
from autogen import GroupChatManager
from swarm_test import SwarmProbe

SwarmProbe(manager, swarm_name="my-autogen").run_all().print_summary()

# Static graph (no live framework)
from swarm_test import SwarmProbe, AgentNode, InteractionEvent, EventType

a = AgentNode(name="Fetcher", role="researcher")
b = AgentNode(name="Summarizer", role="writer")
SwarmProbe(
    swarm_name="my-swarm",
    agents=[a, b],
    events=[
        InteractionEvent(
            source_agent_id=a.id,
            target_agent_id=b.id,
            event_type=EventType.TASK_DELEGATE,
        ),
    ],
).run_all().print_summary()
```

- PyPI: https://pypi.org/project/swarm-test/ (`pip install swarm-test`)
- Issues: https://github.com/surajkumar811/swarm-test/issues
- License: MIT (free and open source)

If swarm-test catches a real bug for you, please star the repo (https://github.com/surajkumar811/swarm-test); it helps other teams find it.