Show HN: Find where multi-agent AI systems break before production


Find where your multi-agent AI system breaks — before production does.

Static reliability testing for CrewAI, LangGraph, AutoGen, and custom agent systems. No live LLM calls, no API cost.

Chain 14 agents at 95% reliability each and your system is ~49% reliable end-to-end (`0.95^14`). The failures aren't inside any single agent; they're in how the agents connect: silent cascade failures, hidden single points of failure, fragile dependencies. swarm-test finds them by analyzing your agent topology.
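
The arithmetic behind that figure is worth seeing once. This is a minimal sketch, assuming agents run strictly in series and fail independently; `chain_reliability` is an illustrative name, not part of swarm-test.

```python
def chain_reliability(per_agent: float, n_agents: int) -> float:
    """End-to-end success probability of n agents chained in series.

    Assumes failures are independent, so the per-agent probabilities multiply.
    """
    if not 0.0 <= per_agent <= 1.0:
        raise ValueError("per_agent must be a probability in [0, 1]")
    return per_agent ** n_agents


# Reliability decays quickly as the chain grows.
for n in (1, 5, 14):
    print(f"{n:>2} agents at 95% each: {chain_reliability(0.95, n):.1%}")
```

Real systems rarely fail independently, so treat the result as a rough estimate rather than a prediction.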

pip install swarm-test
swarm-test run my_crew.py --open

`--open` launches an interactive D3 dashboard in your browser the moment the run finishes: Swarm Score, a force-directed agent graph with single points of failure pulsing red, sortable health and redundancy tables, and every finding grouped by severity.

No real script handy? Build a synthetic topology straight from the CLI:

swarm-test run -a "Orchestrator,Worker1,Worker2" -e "Orchestrator>Worker1,Orchestrator>Worker2"

- Cascade failure: one agent fails and silently takes down everything downstream.
- Blast radius / SPOF: a single agent the whole system depends on; remove it and the swarm splits.
- Context leakage: credentials, PII, or other sensitive data leaking across agent boundaries.
- Intent drift: agents drifting from their assigned role; prompt-injection-style goal hijacking.
- Timeout resilience: a slow upstream with no timeout boundary blocking the whole pipeline.
- Collusion detection: dense cliques, echo chambers, and cycles that bypass the orchestrator.
- Trajectory analysis: agents stuck in loops; runaway step counts and retry storms that burn tokens with no error thrown.
- Contract violation (opt-in; provide a contracts YAML): output schema mismatches across agent edges.

- 0–100 Swarm Score with a verdict line (EXCELLENT → CRITICAL); one-line output for CI
- Agent role classification (orchestrator, aggregator, validator, gateway, worker, monitor, router) with confidence scores
- Role-adjusted severity: a validator leaking context is upgraded; an orchestrator's blast radius is downgraded
- Historical tracking: trend across runs, diffs of new vs. resolved findings
- Interactive HTML report (`--open`): D3 force-directed graph, NxN heatmap, filterable findings
- GitHub Action with PR annotations and job-summary score
- Graph export to Mermaid, DOT, or PNG (SPOFs red, redundant green)
- Framework adapters: CrewAI, LangGraph, AutoGen, generic / static graph
- YAML config (`.swarmtest.yml`) and entry-point plugin system

on: [pull_request]
jobs:
  swarm-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: surajkumar811/swarm-test@v0.3.0
        with:
          script: my_crew.py
          fail-on-severity: high

Findings appear inline on the PR as `::error::` / `::warning::` / `::notice::` annotations; the Swarm Score is posted to the workflow job summary.

from swarm_test import SwarmProbe

# `crew` is your existing agent system, e.g. a CrewAI Crew instance
probe = SwarmProbe(crew, swarm_name="my-crew")
report = probe.run_all()
report.print_summary()
report.to_html("report.html")

pip install swarm-test
pip install "swarm-test[crewai]"
pip install "swarm-test[langgraph]"
pip install "swarm-test[autogen]"
pip install "swarm-test[png]"        # for PNG graph export

How it works

swarm-test builds a NetworkX directed graph from your agent system — nodes are agents, edges are interactions extracted by each framework adapter. All tests are static graph analyses; no LLM calls are made, and results are deterministic given the same topology.
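
To make the graph model concrete, here is a dependency-free sketch that turns the comma-separated agent and edge strings accepted by `-a` and `-e` into an adjacency map. It uses plain dicts instead of NetworkX so it runs anywhere. `build_topology` and its parsing rules are assumptions for illustration, not swarm-test's actual parser.

```python
from typing import Dict, List


def build_topology(agents: str, edges: str) -> Dict[str, List[str]]:
    """Parse 'A,B,C' and 'A>B,A>C' into {agent: [downstream agents]}."""
    graph: Dict[str, List[str]] = {
        name.strip(): [] for name in agents.split(",") if name.strip()
    }
    for edge in edges.split(","):
        if not edge.strip():
            continue  # tolerate a trailing comma
        source, sep, target = edge.partition(">")
        source, target = source.strip(), target.strip()
        # Reject malformed edges and edges that mention undeclared agents.
        if not sep or source not in graph or target not in graph:
            raise ValueError(f"bad edge spec: {edge!r}")
        if target not in graph[source]:
            graph[source].append(target)
    return graph


topology = build_topology(
    "Orchestrator,Worker1,Worker2",
    "Orchestrator>Worker1,Orchestrator>Worker2",
)
print(topology)
```

Because the result depends only on the declared nodes and edges, any analysis built on it is repeatable, which is the property the determinism claim relies on.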

- Cascade failure: simulates each agent failing in turn and measures downstream impact.
- Blast radius: detects articulation points (graph-theoretic SPOFs) and scores every agent on a 0–100 redundancy scale composed of path redundancy (30%), role uniqueness (25%), tool coverage (20%), betweenness centrality (15%), and degree ratio (10%).
- Context leakage: scans interaction payloads against a sensitive-data regex set, extensible from `.swarmtest.yml`.
- Intent drift: flags agents whose observed behavior diverges from their declared role; includes prompt-injection heuristics.
- Collusion: finds dense cliques, echo chambers, and cycles that bypass the declared orchestrator.
- Timeout resilience: identifies long synchronous chains with no timeout boundary.
- Trajectory analysis: flags self-loops, ping-pong pairs, multi-agent feedback cycles, unbounded loops with no exit, repeated parallel calls, and cycles deeper than `max_trajectory_depth` (default 5).
- Contract violation: validates agent outputs against JSON schemas declared per edge (opt-in; pass `--contracts contracts.yml`).
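
The first two checks in the list can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not swarm-test's implementation: it uses a plain adjacency dict rather than NetworkX, finds articulation points by brute-force removal, and assumes the topology starts out connected with every agent present as a key. All function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # agent -> agents it hands work to


def downstream(graph: Graph, failed: str) -> Set[str]:
    """Agents reachable from `failed`, i.e. those affected if it goes down."""
    seen: Set[str] = set()
    queue = deque(graph.get(failed, []))
    while queue:
        node = queue.popleft()
        if node not in seen and node != failed:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen


def is_connected(graph: Graph, skip: str) -> bool:
    """True if the graph stays in one piece with `skip` removed (edges undirected)."""
    nodes = [n for n in graph if n != skip]
    if len(nodes) <= 1:
        return True
    undirected: Dict[str, Set[str]] = {n: set() for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            if src != skip and dst != skip and src != dst:
                undirected[src].add(dst)
                undirected[dst].add(src)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in undirected[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(nodes)


def single_points_of_failure(graph: Graph) -> List[str]:
    """Agents whose removal splits the graph (naive articulation-point check)."""
    return [n for n in graph if not is_connected(graph, skip=n)]


graph: Graph = {
    "Orchestrator": ["Researcher", "Fetcher"],
    "Researcher": ["Writer"],
    "Fetcher": ["Writer"],
    "Writer": ["Reviewer"],
    "Reviewer": [],
}
for agent in graph:
    hit = downstream(graph, agent)
    print(f"{agent}: {len(hit)}/{len(graph) - 1} agents downstream")
print("SPOFs:", single_points_of_failure(graph))
```

In this example the orchestrator has the largest downstream impact, yet only `Writer` is an articulation point: the two measures answer different questions, which is why the list treats them as separate tests.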

Roles are classified from structural metrics (in/out degree, betweenness centrality) plus naming hints, each with a 0–100% confidence score. Severity is then role-adjusted: an orchestrator with high blast radius is expected and gets downgraded; a validator leaking context is a security incident and gets upgraded.
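
Role adjustment can be pictured as a small lookup applied after a test assigns its raw severity. The sketch below encodes only the two rules named above; the one-step adjustment size, the test identifiers, and the function name are assumptions, not swarm-test internals.

```python
SEVERITY_LADDER = ["info", "low", "medium", "high", "critical"]

# (role, test) -> steps to move along the ladder.
# The two rules come from the text; the step size of 1 is an assumption.
ADJUSTMENTS = {
    ("orchestrator", "blast_radius"): -1,   # expected, so downgrade
    ("validator", "context_leakage"): +1,   # security incident, so upgrade
}


def adjust_severity(role: str, test: str, severity: str) -> str:
    """Shift a raw severity up or down according to the agent's role."""
    index = SEVERITY_LADDER.index(severity) + ADJUSTMENTS.get((role, test), 0)
    # Clamp so the result never falls off either end of the ladder.
    index = max(0, min(index, len(SEVERITY_LADDER) - 1))
    return SEVERITY_LADDER[index]


print(adjust_severity("orchestrator", "blast_radius", "high"))
print(adjust_severity("validator", "context_leakage", "high"))
print(adjust_severity("worker", "blast_radius", "high"))
```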

Output modes & formats

| Flag | Output |
| --- | --- |
| `--quiet` / `-q` | Headline verdict only (one line). Ideal for `if` checks in CI scripts. |
| (default) | Headline + test results + critical/high findings + SPOFs. |
| `--verbose` / `-V` | Every finding, graph metrics, full health and redundancy tables. |

Output formats via `--output-format`: `console`, `json`, `markdown`, `html`. The same verbosity setting is configurable in `.swarmtest.yml`.

Graph export

swarm-test graph my_crew.py --format mermaid
swarm-test graph my_crew.py --format dot --output topology.dot
swarm-test graph my_crew.py --format png --output topology.png   # needs the [png] extra

Mermaid renders inline on GitHub, so you can drop the output straight into a README or PR description. Colors: red = SPOF, orange = moderate redundancy, green = fully redundant.

Historical tracking

Every run writes a small JSON snapshot to `.swarmtest-history/`. Subsequent runs print a trend line below the headline verdict:

Swarm Score: 72/100 — NEEDS IMPROVEMENT (3 critical findings)
Trend: ↑ +18 from last run (was 54) — improving
Recent: 54 → 61 → 58 → 72
✓ 3 findings resolved since last run
⚠ 1 new finding since last run

Browse with `swarm-test history show`. Disable per-run with `--no-history`, or globally via `history_enabled: false` in `.swarmtest.yml`. `.swarmtest-history/` is gitignored by default; commit it if you want the trend to survive across CI machines.
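
The "resolved" and "new" counts amount to a set difference between two runs. A rough sketch follows, assuming each finding has a stable identifier; the `<test>:<agent>` ID format and the function name are made up for illustration, and the real snapshot schema may differ.

```python
from typing import Dict, List, Set


def diff_findings(previous: Set[str], current: Set[str]) -> Dict[str, List[str]]:
    """Compare two runs' finding IDs and report what appeared or went away."""
    return {
        "new": sorted(current - previous),
        "resolved": sorted(previous - current),
    }


# Hypothetical finding IDs in "<test>:<agent>" form.
last_run = {"cascade:Writer", "spof:Writer", "timeout:Fetcher", "leak:Reviewer"}
this_run = {"cascade:Writer", "loop:Planner"}

delta = diff_findings(last_run, this_run)
print(f"{len(delta['resolved'])} findings resolved since last run")
print(f"{len(delta['new'])} new finding(s) since last run")
```

This only works if the identifiers are stable between runs, which is one reason to commit the history directory when CI machines are ephemeral.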

Configuration (.swarmtest.yml)

fail_on_severity: high        # critical | high | medium | low | info | none
max_blast_radius: 0.5         # 0.0 – 1.0
disabled_tests:
  - collusion
sensitive_patterns:
  - "INTERNAL-[A-Z0-9]+"
output_format: html
output_path: ./swarm.html
timeout_seconds: 30
strict: false                 # treat ANY finding as a failure
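
The `sensitive_patterns` entries above are regular expressions. As a rough sketch of how such a list might be applied to an interaction payload, the snippet below merges user patterns with a default set and reports every match. The two default patterns and the function name are illustrative stand-ins, not swarm-test's real rule set.

```python
import re
from typing import Dict, List, Sequence

# Illustrative defaults only; the tool's built-in set is not documented here.
DEFAULT_PATTERNS = [
    r"AKIA[0-9A-Z]{16}",            # shape of an AWS access key ID
    r"[\w.+-]+@[\w-]+\.[\w.-]+",    # email address
]


def scan_payload(payload: str, extra_patterns: Sequence[str] = ()) -> Dict[str, List[str]]:
    """Return {pattern: [matched strings]} for every pattern that hits."""
    hits: Dict[str, List[str]] = {}
    for pattern in [*DEFAULT_PATTERNS, *extra_patterns]:
        # finditer + group(0) yields whole matches even if a pattern has groups.
        matches = [m.group(0) for m in re.finditer(pattern, payload)]
        if matches:
            hits[pattern] = matches
    return hits


hits = scan_payload(
    "Forward ticket INTERNAL-4F2A to ops@example.com",
    extra_patterns=["INTERNAL-[A-Z0-9]+"],
)
for pattern, matches in hits.items():
    print(pattern, "->", matches)
```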

Auto-discovers `.swarmtest.yml`, `.swarmtest.yaml`, `swarmtest.yml`, or a `[tool.swarmtest]` table in `pyproject.toml`. CLI flags always override config-file values. Exit codes from `run`: `0` (passed), `1` (findings exceed thresholds), `2` (config or runtime error).
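
One plausible reading of how `fail_on_severity` and `strict` map to those exit codes is sketched below. It assumes a run fails when any finding is at or above the threshold; that interpretation, and the function itself, are guesses for illustration rather than documented behavior.

```python
from typing import Iterable

SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


def exit_code(
    finding_severities: Iterable[str],
    fail_on_severity: str = "high",
    strict: bool = False,
) -> int:
    """Map a run's findings to 0 (passed), 1 (failed), or 2 (bad config)."""
    if fail_on_severity != "none" and fail_on_severity not in SEVERITY_ORDER:
        return 2  # unknown threshold counts as a config error
    findings = list(finding_severities)
    if strict:
        return 1 if findings else 0  # strict: any finding fails the run
    if fail_on_severity == "none":
        return 0
    threshold = SEVERITY_ORDER.index(fail_on_severity)
    failed = any(SEVERITY_ORDER.index(s) >= threshold for s in findings)
    return 1 if failed else 0


print(exit_code(["medium", "low"], fail_on_severity="high"))
print(exit_code(["critical"], fail_on_severity="high"))
print(exit_code(["info"], strict=True))
```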

Plugin system

Ship custom tests as installable Python packages. Register under the `swarm_test.plugins` entry-point group; swarm-test auto-discovers and runs them alongside the built-in tests:

[project.entry-points."swarm_test.plugins"]
my_custom_test = "my_package.plugins:MyPlugin"

swarm-test plugins list

See examples/plugin_template/ for a runnable starter.
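
For readers unfamiliar with entry points, this is roughly how a host application can discover packages registered under a group, using only the standard library. It shows the general mechanism, not swarm-test's own loader, and `discover_plugins` is a hypothetical helper.

```python
from importlib.metadata import entry_points
from typing import Dict


def discover_plugins(group: str = "swarm_test.plugins") -> Dict[str, object]:
    """Return {name: EntryPoint} for every installed plugin in `group`.

    Call .load() on a value to import the object it points at.
    """
    eps = entry_points()
    if hasattr(eps, "select"):       # Python 3.10+ selection API
        selected = eps.select(group=group)
    else:                            # Python 3.8 / 3.9 return a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


plugins = discover_plugins()
print(f"{len(plugins)} plugin(s) registered:", sorted(plugins))
```

Nothing is imported until `.load()` is called, so a broken plugin does not prevent the others from being listed.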

Framework examples (CrewAI, LangGraph, AutoGen, static)

# CrewAI: `crew` is your existing Crew instance
from crewai import Crew
from swarm_test import SwarmProbe
SwarmProbe(crew, swarm_name="my-crew").run_all().print_summary()

# LangGraph: `compiled_graph` is your compiled StateGraph
from langgraph.graph import StateGraph
from swarm_test import SwarmProbe
SwarmProbe(compiled_graph, swarm_name="my-langgraph").run_all().to_json("report.json")

# AutoGen: `manager` is your GroupChatManager
from autogen import GroupChatManager
from swarm_test import SwarmProbe
SwarmProbe(manager, swarm_name="my-autogen").run_all().print_summary()

# Static graph: declare agents and interactions by hand
from swarm_test import SwarmProbe, AgentNode, InteractionEvent, EventType
a = AgentNode(name="Fetcher", role="researcher")
b = AgentNode(name="Summarizer", role="writer")
SwarmProbe(
    swarm_name="my-swarm",
    agents=[a, b],
    events=[
        InteractionEvent(
            source_agent_id=a.id,
            target_agent_id=b.id,
            event_type=EventType.TASK_DELEGATE,
        )
    ],
).run_all().print_summary()

PyPI: https://pypi.org/project/swarm-test/ (`pip install swarm-test`)

Issues: https://github.com/surajkumar811/swarm-test/issues

License: MIT, free and open source

If swarm-test catches a real bug for you, please star the repo — it helps other teams find it.

