Every testing tool for AI agents tests individual agents. But production failures don't happen inside agents — they happen between them.
I learned this the hard way.
I built a 14-agent document processing system using CrewAI. Each agent worked perfectly in isolation. In production, the system failed constantly — and I couldn't figure out why.
The problem wasn't any single agent. It was the interactions:
No existing tool could find these issues. Arize, Langfuse, Braintrust — they all monitor individual agents. None of them test the graph of agent interactions.
So I built one.
swarm-test builds a NetworkX interaction graph of your multi-agent system and runs 6 chaos engineering tests against it:
3-line API:
from swarm_test import SwarmProbe
probe = SwarmProbe(crew)
report = probe.run_all()
report.print_summary()
I ran swarm-test on my 14-agent system. The results were brutal:
54 total findings:
The worst agent: OrchestratorAgent scored 4 out of 100. It's a single point of failure with 92% blast radius — if it fails, 12 of 14 agents go down. And it had zero timeout handling.
The scariest finding: EvolutionAgent has 100% blast radius. If it fails, every other agent in the system is affected.
Three agents (OrchestratorAgent, FileOptimizerAgent, PrintOptimizerAgent) formed a collusion clique — communicating directly with each other and bypassing orchestrator oversight.
None of this was visible from testing individual agents. It only appeared when I tested the interaction graph.
After launching, I shipped one feature every day:
| Day | Feature | Impact |
|---|---|---|
| 0 | Launch — 5 chaos tests, GitHub + PyPI | First multi-agent testing tool on PyPI |
| 1 | Timeout resilience test | Found 22 new issues in my system |
| 2 | JSON export | Another developer integrated it into his runtime gate within hours |
| 3 | LangGraph adapter | Now supports CrewAI + LangGraph |
| 4 | Sensitive data detection (23 patterns) | Catches AWS keys, JWT tokens, credit cards crossing agent boundaries |
| 5 | Per-agent health scores (0-100) | Know exactly which agent to fix first |
| 6 | Before/after comparison | Measure if your refactor actually improved reliability |
| 7 | ASCII agent graph | See your agent topology right in the terminal |
94 tests passing. Two frameworks supported. And growing.
Within 48 hours of launch, another developer built an integration. He has a runtime action-gate that blocks dangerous agent actions before execution. He connected swarm-test's findings as "priors" — so when swarm-test flags an edge as high-risk, his gate becomes more cautious on that edge.
The result: the same run_sql
action went from "CONFIRM" (risk 62) to "HUMAN_REQUIRED" (risk 78) when swarm-test's cascade finding was attached.
Structural testing (swarm-test) + runtime enforcement (his gate) = the full reliability stack for multi-agent systems.
According to recent industry research:
Multi-agent systems are going to production faster than anyone can secure them. The tools exist for single-agent monitoring. Nothing existed for multi-agent interaction testing — until now.
pip install swarm-test
python
from swarm_test import SwarmProbe
probe = SwarmProbe(your_crew)
report = probe.run_all()
report.print_summary()
report.to_html("report.html") # Interactive D3 graph
report.to_json("report.json") # Machine-readable for CI/CD
GitHub: github.com/surajkumar811/swarm-test
Open source. MIT licensed. Solo founder building in public.
What reliability tests would YOU want for your multi-agent systems? Drop a comment — I'm shipping features based on real feedback.