Testing distributed systems with AI agents

A methodology for testing distributed and stateful systems with AI coding agents that execute two specialized skills: one designs a claim-driven test plan and the other executes it. The approach produces structured Markdown artifacts (a test plan and a findings report with 10-state verdicts and explicit blame classification), so a human reviewer can decide whether to ship without re-running any tests. The workflow enforces principles such as starting from product claims rather than test cases, requiring coverage adequacy arguments, and combining fault injection with abstract models and consistency checkers for rigorous validation.

Two skills for AI coding agents that design and run claim-driven tests for distributed and stateful systems. Together they produce a structured Markdown test plan and a findings report with 10-state verdicts and an explicit SUT / harness / checker / environment blame classification, where SUT is the system under test. A reviewer reads the two artifacts and decides whether to ship; nothing else has to be re-run.

Works with Claude Code, Codex, Copilot CLI, Cursor, Gemini, or any agent that reads Markdown and runs shell. The skills are plain SKILL.md files. The agent executes them; the plan and findings report are the output.

One skill designs the plan. The other runs it.

A plan starts from the product's claims, generates hypotheses tied to those claims, and writes scenarios named after the claim each tries to falsify. For consistency-critical scenarios, each scenario also binds an abstract model (register | queue | log | lock | lease | ledger | …) to an operation-history schema, a named checker, and a nemesis (the fault injector) with observable landing evidence. The plan ends with a coverage adequacy argument and a conservative confidence statement.
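To make the model / history / checker / nemesis binding concrete, here is a minimal Python sketch. It is an illustration only, not the format the skills emit: the `Scenario`, `Op`, and `CheckResult` types, their field names, and the `check_no_lost_ack` function are all hypothetical, and the checker shown is a deliberately simple "acknowledged appends must appear in the final read" rule for a log model. Timeouts and other indeterminate results are recorded as `unknown` and are allowed to appear or not appear.

```python
from dataclasses import dataclass
from typing import Callable, Literal

# "unknown" covers ambiguous outcomes: timeouts, indeterminate commits.
Outcome = Literal["ok", "fail", "unknown"]


@dataclass(frozen=True)
class Op:
    """One entry in the operation history (hypothetical schema)."""
    client: str
    kind: Literal["append", "read"]
    value: object          # appended element, or the list returned by a read
    outcome: Outcome


@dataclass(frozen=True)
class CheckResult:
    lost: frozenset        # acknowledged appends missing from the final read
    unexpected: frozenset  # elements read back that no client attempted to append

    @property
    def ok(self) -> bool:
        return not self.lost and not self.unexpected


def check_no_lost_ack(history: list[Op]) -> CheckResult:
    """Hypothetical checker: every acknowledged append must be readable."""
    appends = [op for op in history if op.kind == "append"]
    acked = {op.value for op in appends if op.outcome == "ok"}
    attempted = {op.value for op in appends}  # includes fail and unknown
    reads = [op for op in history if op.kind == "read" and op.outcome == "ok"]
    if not reads:
        # Without a successful read there is no oracle evidence at all.
        raise ValueError("history has no successful read; claim cannot be checked")
    observed = set(reads[-1].value)  # last successful read is treated as final
    return CheckResult(
        lost=frozenset(acked - observed),
        unexpected=frozenset(observed - attempted),
    )


@dataclass(frozen=True)
class Scenario:
    """Binds one claim to a model, a checker, and a nemesis (hypothetical)."""
    claim: str
    model: str
    checker: Callable[[list[Op]], CheckResult]
    nemesis: str
    landing_evidence: str  # signal proving the fault actually fired


scenario = Scenario(
    claim="acknowledged appends survive a leader partition",
    model="log",
    checker=check_no_lost_ack,
    nemesis="partition the leader from the majority",
    landing_evidence="follower logs show an election during the write window",
)

history = [
    Op("c1", "append", 1, "ok"),
    Op("c2", "append", 2, "unknown"),  # timed out: may or may not be present
    Op("c1", "append", 3, "ok"),
    Op("c3", "read", [1, 2], "ok"),
]

result = scenario.checker(history)
print(sorted(result.lost), sorted(result.unexpected), result.ok)
```

In this history the acknowledged append of `3` is missing from the final read, so the checker reports it as lost, while the timed-out append of `2` is tolerated either way. A real plan would pick a checker matched to the claim; this sketch only shows the shape of the binding.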
The default for testing distributed and stateful systems, writing a few integration tests and calling it done, finds only a small fraction of the bugs that actually break these systems in production: partial network partitions, non-deterministic concurrency, crash-recovery, upgrade/rollback, idempotency under replay, timing-sensitive ordering. These skills enforce an opinionated workflow that pulls from the field's hard-won knowledge:

- Claim-driven, not test-driven. Start from what the product promises. Every scenario falsifies one claim under one fault. A test named after its claim is harder to weaken than one named after its setup.
- Coverage adequacy is a deliverable. The plan ends with an argument that the chosen scenarios are enough to ship, plus an honest list of what stays unverified.
- Reuse the SUT's own toolbox. The execute skill discovers existing tests, runbooks, and fault-injection scaffolding before inventing anything new.
- Model + history + checker, not just chaos. For safety, durability, idempotency, isolation, ordering, or membership claims, every scenario declares an abstract model, an operation-history schema, a named checker (linearizability, serializability, session-consistency, no-lost-ack, exactly-once, …), and how it treats ambiguous outcomes (timeouts, unknown commits, retries). Chaos plus a model and a checker, not chaos alone.
- No silent passes. Every PASS cites oracle execution evidence and the signal proving the fault actually fired. Verdicts come from a 10-state set, so "the chaos script ran cleanly" can't be read as "the claim survived the fault." Every FAIL carries a SUT / harness / checker / environment blame tag so reproducers reach the right queue.

End-to-end, the two skills produce:

testing-plans/