{"slug": "show-hn-find-where-multi-agent-ai-systems-break-before-production", "title": "Show HN: Find where multi-agent AI systems break before production", "summary": "A new open-source tool, swarm-test, lets developers find failures in multi-agent AI systems before production by analyzing agent topology statically, without live LLM calls. It detects cascade failures, single points of failure, context leakage, and other issues, outputting a Swarm Score and interactive D3 dashboard. The tool supports CrewAI, LangGraph, AutoGen, and custom systems, and integrates with GitHub Actions for PR annotations.", "body_md": "**Find where your multi-agent AI system breaks, before production does.**\n\nStatic reliability testing for CrewAI, LangGraph, AutoGen, and custom agent systems. No live LLM calls, no API cost.\n\nChain 14 agents at 95% reliability each and your system is ~49% reliable end-to-end (`0.95^14`). The failures aren't inside any single agent; they're in how the agents connect: silent cascade failures, hidden single points of failure, fragile dependencies. swarm-test finds them by analyzing your agent topology.\n\n```\npip install swarm-test\nswarm-test run my_crew.py --open\n```\n\n`--open` launches an interactive D3 dashboard in your browser the moment the run finishes: Swarm Score, force-directed agent graph with single points of failure pulsing red, sortable health and redundancy tables, and every finding grouped by severity.\n\nNo real script handy? 
Build a synthetic topology straight from the CLI:\n\n```\nswarm-test run -a \"Orchestrator,Worker1,Worker2\" -e \"Orchestrator>Worker1,Orchestrator>Worker2\"\n```\n\n- One agent fails and silently takes down everything downstream: *cascade failure*\n- A single agent the whole system depends on; remove it and the swarm splits: *blast radius / SPOF*\n- Credentials, PII, or other sensitive data leaking across agent boundaries: *context leakage*\n- Agents drifting from their assigned role; prompt-injection-style goal hijacking: *intent drift*\n- A slow upstream with no timeout boundary blocking the whole pipeline: *timeout resilience*\n- Dense cliques, echo chambers, and cycles that bypass the orchestrator: *collusion detection*\n- Agents stuck in loops, with runaway step counts and retry storms that burn tokens with no error thrown: *trajectory analysis*\n- Output schema mismatches across agent edges: *contract violation* (opt-in; provide a contracts YAML)\n\n- 0–100 Swarm Score with a verdict line (EXCELLENT → CRITICAL), plus one-line output for CI\n- Agent role classification (orchestrator, aggregator, validator, gateway, worker, monitor, router) with confidence scores\n- Role-adjusted severity: a validator leaking context is upgraded; an orchestrator's blast radius is downgraded\n- Historical tracking: trend across runs, diffs new vs. 
resolved findings\n- Interactive HTML report (`--open`): D3 force-directed graph, NxN heatmap, filterable findings\n- GitHub Action with PR annotations and job-summary score\n- Graph export to Mermaid, DOT, or PNG (SPOFs red, redundant green)\n- Framework adapters: CrewAI, LangGraph, AutoGen, generic / static graph\n- YAML config (`.swarmtest.yml`) and entry-point plugin system\n\n```\n# .github/workflows/swarm-test.yml\non: [pull_request]\njobs:\n  swarm-test:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: surajkumar811/swarm-test@v0.3.0\n        with:\n          script: my_crew.py\n          fail-on-severity: high\n```\n\nFindings appear inline on the PR as `::error::` / `::warning::` / `::notice::` annotations; the Swarm Score is posted to the workflow job summary.\n\n```python\nfrom swarm_test import SwarmProbe\n\n# `crew` is your existing CrewAI Crew; a LangGraph CompiledGraph or\n# AutoGen GroupChatManager works the same way.\nprobe = SwarmProbe(crew, swarm_name=\"my-crew\")\nreport = probe.run_all()\nreport.print_summary()\nreport.to_html(\"report.html\")\n```\n\n```\npip install swarm-test\n# or with framework extras:\npip install \"swarm-test[crewai]\"\npip install \"swarm-test[langgraph]\"\npip install \"swarm-test[autogen]\"\npip install \"swarm-test[png]\"        # for PNG graph export\n```\n\n**How it works**\n\nswarm-test builds a NetworkX directed graph from your agent system: nodes are agents, edges are interactions extracted by each framework adapter. 
All tests are static graph analyses; no LLM calls are made, and results are deterministic given the same topology.\n\n- **Cascade failure**: simulates each agent failing in turn and measures downstream impact.\n- **Blast radius**: detects articulation points (graph-theoretic SPOFs) and scores every agent on a 0–100 redundancy scale composed of path redundancy (30%), role uniqueness (25%), tool coverage (20%), betweenness centrality (15%), and degree ratio (10%).\n- **Context leakage**: scans interaction payloads against a sensitive-data regex set extensible from `.swarmtest.yml`.\n- **Intent drift**: flags agents whose observed behavior diverges from their declared role; includes prompt-injection heuristics.\n- **Collusion**: finds dense cliques, echo chambers, and cycles that bypass the declared orchestrator.\n- **Timeout resilience**: identifies long synchronous chains with no timeout boundary.\n- **Trajectory analysis**: flags self-loops, ping-pong pairs, multi-agent feedback cycles, unbounded loops with no exit, repeated parallel calls, and cycles deeper than `max_trajectory_depth` (default 5).\n- **Contract violation**: validates agent outputs against JSON schemas declared per edge (opt-in; pass `--contracts contracts.yml`).\n\nRoles are classified from structural metrics (in/out degree, betweenness centrality) plus naming hints, each with a 0–100% confidence score. Severity is then role-adjusted: an orchestrator with high blast radius is expected and gets downgraded; a validator leaking context is a security incident and gets upgraded.\n\n**Output modes & formats**\n\n| Flag | Output |\n|---|---|\n| `--quiet` / `-q` | Headline verdict only (one line). Ideal for `if` checks in CI scripts. |\n| (default) | Headline + test results + critical/high findings + SPOFs. |\n| `--verbose` / `-V` | Every finding, graph metrics, full health and redundancy tables. |\n\nOutput formats via `--output-format`: `console`, `json`, `markdown`, `html`. 
The same verbosity setting is configurable in `.swarmtest.yml`.\n\n**Graph export**\n\n```\nswarm-test graph my_crew.py --format mermaid\nswarm-test graph my_crew.py --format dot --output topology.dot\nswarm-test graph my_crew.py --format png --output topology.png   # needs the [png] extra\n```\n\nMermaid renders inline on GitHub, so you can drop the output straight into a README or PR description. Colors: red = SPOF, orange = moderate redundancy, green = fully redundant.\n\n**Historical tracking**\n\nEvery run writes a small JSON snapshot to `.swarmtest-history/`. Subsequent runs print a trend line below the headline verdict:\n\n```\nSwarm Score: 72/100 — NEEDS IMPROVEMENT (3 critical findings)\nTrend: ↑ +18 from last run (was 54) — improving\nRecent: 54 → 61 → 58 → 72\n✓ 3 findings resolved since last run\n⚠ 1 new finding since last run\n```\n\nBrowse with `swarm-test history show`. Disable per-run with `--no-history`, or globally via `history_enabled: false` in `.swarmtest.yml`. `.swarmtest-history/` is gitignored by default; commit it if you want the trend to survive across CI machines.\n\n**Configuration (.swarmtest.yml)**\n\n```\nfail_on_severity: high        # critical | high | medium | low | info | none\nmax_blast_radius: 0.5         # 0.0 – 1.0\ndisabled_tests:\n  - collusion\nsensitive_patterns:\n  - \"INTERNAL-[A-Z0-9]+\"\noutput_format: html\noutput_path: ./swarm.html\ntimeout_seconds: 30\nstrict: false                 # treat ANY finding as a failure\n```\n\nswarm-test auto-discovers `.swarmtest.yml`, `.swarmtest.yaml`, `swarmtest.yml`, or a `[tool.swarmtest]` table in `pyproject.toml`. CLI flags always override config-file values. Exit codes from `run`: `0` (passed), `1` (findings exceed thresholds), `2` (config or runtime error).\n\n**Plugin system**\n\nShip custom tests as installable Python packages. 
Register under the `swarm_test.plugins` entry-point group; swarm-test auto-discovers and runs them alongside the built-in tests:\n\n```\n[project.entry-points.\"swarm_test.plugins\"]\nmy_custom_test = \"my_package.plugins:MyPlugin\"\n```\n\nFrom the shell:\n\n```\nswarm-test plugins list\n```\n\nSee [examples/plugin_template/](https://github.com/surajkumar811/swarm-test/blob/main/examples/plugin_template) for a runnable starter.\n\n**Framework examples (CrewAI, LangGraph, AutoGen, static)**\n\n```python\n# CrewAI (`crew` is an existing Crew instance)\nfrom crewai import Crew\nfrom swarm_test import SwarmProbe\nSwarmProbe(crew, swarm_name=\"my-crew\").run_all().print_summary()\n\n# LangGraph (`compiled_graph` is the result of compiling a StateGraph)\nfrom langgraph.graph import StateGraph\nfrom swarm_test import SwarmProbe\nSwarmProbe(compiled_graph, swarm_name=\"my-langgraph\").run_all().to_json(\"report.json\")\n\n# AutoGen (`manager` is an existing GroupChatManager instance)\nfrom autogen import GroupChatManager\nfrom swarm_test import SwarmProbe\nSwarmProbe(manager, swarm_name=\"my-autogen\").run_all().print_summary()\n\n# Static graph (no live framework)\nfrom swarm_test import SwarmProbe, AgentNode, InteractionEvent, EventType\na = AgentNode(name=\"Fetcher\", role=\"researcher\")\nb = AgentNode(name=\"Summarizer\", role=\"writer\")\nSwarmProbe(\n    swarm_name=\"my-swarm\",\n    agents=[a, b],\n    events=[InteractionEvent(source_agent_id=a.id, target_agent_id=b.id, event_type=EventType.TASK_DELEGATE)],\n).run_all().print_summary()\n```\n\n**PyPI:** [https://pypi.org/project/swarm-test/](https://pypi.org/project/swarm-test/) (`pip install swarm-test`)\n\n**Issues:** [https://github.com/surajkumar811/swarm-test/issues](https://github.com/surajkumar811/swarm-test/issues)\n\n**License:** MIT, free and open source\n\nIf swarm-test catches a real bug for you, please [star the repo](https://github.com/surajkumar811/swarm-test); it helps other teams find it.", "url": "https://wpnews.pro/news/show-hn-find-where-multi-agent-ai-systems-break-before-production", "canonical_source": "https://github.com/surajkumar811/swarm-test", "published_at": "2026-06-25 
03:57:45+00:00", "updated_at": "2026-06-25 04:13:43.226857+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "ai-safety", "developer-tools"], "entities": ["swarm-test", "CrewAI", "LangGraph", "AutoGen", "NetworkX", "GitHub Actions", "D3"], "alternates": {"html": "https://wpnews.pro/news/show-hn-find-where-multi-agent-ai-systems-break-before-production", "markdown": "https://wpnews.pro/news/show-hn-find-where-multi-agent-ai-systems-break-before-production.md", "text": "https://wpnews.pro/news/show-hn-find-where-multi-agent-ai-systems-break-before-production.txt", "jsonld": "https://wpnews.pro/news/show-hn-find-where-multi-agent-ai-systems-break-before-production.jsonld"}}