I Tried to Design an Entire AI Software Testing Company. Here's the Architecture I'd Actually Build.

A developer designed TitanixAI, an autonomous AI software testing organization with 30 specialized agents, but concluded that the core architecture must prioritize a versioned artifact graph over agent count. The graph enables traceability and impact analysis when requirements change, with an approval lifecycle ensuring human oversight before any agent consumes artifacts.

Every few months a new idea arrives that sounds less like a product and more like a small company you could hire. Mine was called TitanixAI, and the pitch was simple enough to fit on a napkin and ambitious enough to keep me up at night:

What if uploading a requirements document was like hiring an entire software testing company?

Not a chatbot. Not a "generate test cases" button. Not another automation framework with an AI sticker on it. A complete, autonomous AI Software Quality Organization, with a Business Analyst that reads your SRS, a Product Owner that builds the roadmap, a Scrum Master that plans the sprint, a QA Manager that chooses the strategy, Manual Testers that write scenarios, Automation Engineers that generate runnable code, an Execution Agent that runs it all, and a Bug Agent that files the defects.

Thirty specialized agents. Every decision explainable. Every output reviewable. Every action traceable. Humans always in the loop.

It's a beautiful vision. It's also, as written, a five-year roadmap for a forty-person company described as a v1 spec.

This article is the story of how I'd take an idea that grand and turn it into something a small team could actually ship, and of the architectural decisions that matter far more than the agents everyone gets excited about.

When you sketch an AI organization, the instinct is to list the org chart. CEO Agent. Project Director. System Architect. Scrum Master. QA Manager. Test Lead. Performance Tester. Security Tester. Accessibility Tester. Root Cause Agent. Meeting Agent. Knowledge Agent. Release Manager. Customer Success Agent.

It feels like progress. It isn't.

Here is the uncomfortable truth I had to sit with: agents are cheap to describe and brutal to make reliable. Writing "Performance Tester Agent" in a spec takes four seconds. Making an agent that produces a correct, runnable, trustworthy artifact, and knows when it's unsure, is the entire engineering problem.
A list of thirty agents isn't an architecture. It's a wish list. And the single biggest risk to a project like this isn't technical difficulty; it's that you try to build all of it and ship none of it.

So the first real decision wasn't "which agents?" It was: what is the smallest version that delivers the genuine wow, and earns the right to expand?

If you remember one thing from this article, remember this: in a system like TitanixAI, the agents are the replaceable part. The durable, defensible core is something far less glamorous: the artifact graph.

Think about what software testing actually is as a data problem. A requirement gives rise to epics, which give rise to user stories, which give rise to test cases, which give rise to automation code, which produces test runs, which produce bugs. Every one of those is connected to the things above and below it.

Requirement → Epic → Story → Test Case → Automation → Test Run → Bug

Now ask the question that makes this valuable: when a requirement changes, what breaks?

If your system is just thirty agents passing JSON to each other, you have no answer. But if every artifact is a versioned node in a traceable graph, you can walk the edges: this requirement feeds these three stories, which feed these eleven test cases, which feed this automation suite. Mark them stale. Regenerate.

That impact analysis is a killer feature, and it's essentially free if you model the graph correctly from day one, and nearly impossible to bolt on later.

So the rule I set was: build the graph before you build a single agent.

And the graph carries something most "AI agent" demos quietly skip: an approval lifecycle on every node.

DRAFT → PENDING REVIEW → APPROVED | REJECTED | REVISION REQUESTED

With one ironclad constraint: no agent may consume an artifact that isn't APPROVED. The Automation Engineer never writes code from test cases a human hasn't signed off on. The Bug Agent never files defects from an unapproved run.
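
To make the graph-first rule concrete, here is a minimal Python sketch of a versioned artifact graph with the approval lifecycle attached to every node. Everything named here (`ArtifactGraph`, `consume`, `mark_changed`, the `stale` flag) is hypothetical and chosen for illustration; the article does not specify a schema, and the resubmission edge from REVISION REQUESTED back to PENDING REVIEW is an assumption. A real system would persist this in a database rather than in memory.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    REVISION_REQUESTED = "revision_requested"


# Allowed lifecycle moves, following DRAFT -> PENDING REVIEW -> outcome.
TRANSITIONS = {
    Status.DRAFT: {Status.PENDING_REVIEW},
    Status.PENDING_REVIEW: {
        Status.APPROVED,
        Status.REJECTED,
        Status.REVISION_REQUESTED,
    },
    # Assumption: a reworked artifact is resubmitted for review.
    Status.REVISION_REQUESTED: {Status.PENDING_REVIEW},
    Status.APPROVED: set(),
    Status.REJECTED: set(),
}


@dataclass
class Artifact:
    id: str
    kind: str  # e.g. "requirement", "story", "test_case"
    version: int = 1
    status: Status = Status.DRAFT
    stale: bool = False


class ArtifactGraph:
    """In-memory graph: nodes are artifacts, edges point downstream."""

    def __init__(self):
        self.nodes = {}
        self.children = {}

    def add(self, artifact, parent_id=None):
        self.nodes[artifact.id] = artifact
        self.children.setdefault(artifact.id, [])
        if parent_id is not None:
            self.children[parent_id].append(artifact.id)
        return artifact

    def transition(self, artifact_id, new_status):
        node = self.nodes[artifact_id]
        if new_status not in TRANSITIONS[node.status]:
            raise ValueError(
                f"{node.status.name} -> {new_status.name} is not allowed"
            )
        node.status = new_status

    def consume(self, artifact_id):
        """Gate for agents: only APPROVED artifacts may be read."""
        node = self.nodes[artifact_id]
        if node.status is not Status.APPROVED:
            raise PermissionError(f"{artifact_id} is {node.status.name}")
        return node

    def mark_changed(self, artifact_id):
        """Bump the version and flag every downstream artifact as stale."""
        self.nodes[artifact_id].version += 1
        stale, seen = [], set()
        queue = deque(self.children[artifact_id])
        while queue:
            child_id = queue.popleft()
            if child_id in seen:
                continue
            seen.add(child_id)
            self.nodes[child_id].stale = True
            stale.append(child_id)
            queue.extend(self.children[child_id])
        return stale
```

Calling `mark_changed` on a requirement walks the edges breadth-first and returns every downstream artifact that needs regenerating, while `consume` refuses anything that is not APPROVED.
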
That single rule is the difference between "an impressive demo" and "something an enterprise will actually trust with their quality process." Human-in-the-loop isn't a feature you sprinkle on top. It's a state machine you design first.

Here's where I had to disappoint my own ambition. The vision covers web, mobile, desktop, microservices, IoT, even games. But for v1, I'd test exactly one thing: APIs.

Not web UI. Not mobile. APIs.

Why pick the boring one? Because the whole thesis lives or dies on a chain of deterministic, verifiable steps, and API testing is the only domain where every link is clean.

Web UI testing is v2. Mobile is v3. Starting with APIs isn't lowering the bar. It's choosing the battlefield where you can actually win, then expanding from a position of strength.

Modern AI architecture has a quiet financial trap. "Hundreds of AI employees collaborating on one upload" sounds magical right up until you realize it might mean thousands of LLM calls, and one upload costs $40 and takes 90 minutes.

So the model layer needs to be smart about which brain handles which job. I'd build a model router where every agent declares a task class, and the router picks the model:

| Task class | Who needs it | Model choice | Why |
|---|---|---|---|
| Reasoning | Business Analyst, Root-Cause | Frontier Claude | Multi-step decomposition; quality compounds downstream |
| Code generation | Automation Engineer | Frontier Claude | Code that runs on the first try saves hours of debugging |
| Extraction | Spec parsing helpers | Local Qwen/DeepSeek | High-volume, structured, cheap |
| Bulk | Test data, boilerplate | Local Llama/Mistral | Low-risk, cost-sensitive |

The mix matters. Local open models are fantastic for privacy and cost on bulk work, but on hard, multi-step reasoning like decomposing a messy requirement into correct test cases, the quality gap with frontier models is real, and it shows up exactly where mistakes are most expensive.
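
A router like this can be small. The sketch below assumes each agent passes its task class along with the token count of a finished call; the model labels, the `ModelRouter` name, and especially the per-token prices are placeholders, not real identifiers or real pricing. It only shows the routing lookup and a per-call cost ledger, and leaves the actual model call out.

```python
from dataclasses import dataclass, field

# Task class -> model label, mirroring the table above.
# The labels are placeholders, not real model identifiers.
ROUTES = {
    "reasoning": "frontier-claude",
    "code_generation": "frontier-claude",
    "extraction": "local-qwen",
    "bulk": "local-llama",
}

# Hypothetical prices in USD per 1,000 tokens, for illustration only.
PRICE_PER_1K_TOKENS = {
    "frontier-claude": 0.01,
    "local-qwen": 0.0,
    "local-llama": 0.0,
}


@dataclass
class CallRecord:
    project: str
    agent: str
    model: str
    tokens: int
    cost_usd: float


@dataclass
class ModelRouter:
    ledger: list = field(default_factory=list)

    def route(self, task_class):
        """Pick the model for a declared task class."""
        if task_class not in ROUTES:
            raise ValueError(f"unknown task class: {task_class}")
        return ROUTES[task_class]

    def record_call(self, project, agent, task_class, tokens):
        """Log one finished call together with its token cost."""
        model = self.route(task_class)
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        record = CallRecord(project, agent, model, tokens, cost)
        self.ledger.append(record)
        return record

    def cost_per_project(self, project):
        """Total spend for one project, the number a dashboard would show."""
        return sum(r.cost_usd for r in self.ledger if r.project == project)
```

Keeping the ledger next to the routing decision means no call can be made through the router without leaving a cost record behind.
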
So: frontier brains for the hard thinking, local brains for the heavy lifting. And the non-negotiable: every single call logs its token cost. Cost-per-project should be a number on a dashboard from day one, not a surprise on your inference bill in month three.

It's tempting to write your own agent orchestration framework. Resist it. The job here is state management, checkpointing, and, most importantly, pausing for human approval and resuming days later. That's exactly what mature graph-based orchestration frameworks already do well, including native human-in-the-loop interrupts. Use one. Build a thin, domain-specific layer on top. Revisit a custom engine only if the framework genuinely blocks you.

The flow becomes a graph with human gates baked into the topology:

ingest → BusinessAnalyst → HUMAN: approve requirements → TestDesigner → HUMAN: approve test cases → AutomationEng → HUMAN: approve code → Execution → BugReporter → done

Each HUMAN is a real pause. The graph checkpoints its state, the UI surfaces the proposed artifacts, and nothing proceeds until someone clicks approve. A project can sit paused for a week and pick up exactly where it left off. That's enterprise-grade: not the number of agents, but the discipline of the gates.

This is the part that should keep you honest. We're building a quality company. So here's the question that has to be answered before you ship anything:

How do you know the AI's output is actually correct?

A confidently-wrong test case that "passes" is worse than no test at all: it manufactures false assurance, which is the exact opposite of what a QA organization exists to provide. Hallucinated tests don't just fail to help; they actively erode trust in the entire system.

So the system needs to be measured like any other quality-critical software.

Build this in week two, not month six. An AI quality product whose own quality is unmeasured is a contradiction.
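
What exactly to measure is a design decision, so treat the following as one possible starting point rather than the harness itself. It assumes API test cases can be reduced to a method, path, and expected status, and that a hand-reviewed golden set exists for a sample spec; the metric names and the `ApiCase` and `evaluate` helpers are hypothetical. It reports how many generated cases target operations that are not in the ingested spec (a crude hallucination signal) and how much of the golden set the generator recovered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCase:
    """A test case reduced to the fields this sketch compares."""
    method: str
    path: str
    expected_status: int


def evaluate(generated, golden, spec_operations):
    """Score generated cases against a golden set and the ingested spec.

    spec_operations is a set of (method, path) pairs taken from the spec.
    """
    gen = set(generated)
    gold = set(golden)
    # Cases whose operation actually exists in the spec.
    grounded = {c for c in gen if (c.method, c.path) in spec_operations}
    # Cases that exactly match a human-reviewed golden case.
    matched = gen & gold
    return {
        "hallucination_rate": 1 - len(grounded) / len(gen) if gen else 0.0,
        "golden_recall": len(matched) / len(gold) if gold else 0.0,
        "precision": len(matched) / len(gen) if gen else 0.0,
    }
```

Exact matching is deliberately strict here; a real harness would likely need a looser notion of equivalence between a generated case and a golden one.
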
When people imagine an AI agent platform, they picture the agents doing clever things autonomously. But for a tool that humans must trust with their software quality, the most important screen isn't the agent activity feed. It's the review queue.

A v1 needs only three screens.

Every other dashboard (executive KPIs, sprint burndowns, velocity charts) can wait. They're views over data the first five agents produce. Build the data first.

Here's how I'd sequence it, with each milestone proving exactly one thing:

| # | Milestone | What it proves |
|---|---|---|
| 0 | Dev environment up (Postgres, Redis, Ollama, storage) | The ground is solid |
| 1 | Artifact-graph schema + approval state machine | The moat exists |
| 2 | Model router with cost logging | Costs are under control |
| 3 | OpenAPI/Postman ingestion → requirement nodes | Input becomes graph |
| 4 | Business Analyst agent + review queue + traceability view | First full human-in-the-loop cycle |
| 5 | Test Designer agent → approved test cases | Real domain value |
| 6 | Evaluation harness + golden dataset | The AI can be trusted |
| 7 | Automation agent → runnable Pytest | Code generation quality holds |
| 8 | Sandboxed executor + run reports | Real results from real systems |
| 9 | Bug Reporter + end-to-end demo | The entire thesis, proven |

Ship after milestone nine. Then, and only then, start adding agents, project types, and integrations. Every item from the original grand vision becomes an expansion of a working core instead of a slide in a pitch deck.

I started wanting to build an AI company with thirty employees. I ended with a plan for five agents, one input type, and three screens, and I'm more confident in that than I ever was in the org chart.

The pattern generalizes far past testing tools. When you design with AI agents, the temptation is always to add more agents, because they're so easy to imagine. But the engineering reality keeps pointing the other way.

The grand vision isn't wrong.
It's the destination. But you don't get there by building the whole city at once. You build one street that works end to end, prove people want to walk down it, and earn the right to build the next one.

TitanixAI might still become a full autonomous AI quality organization someday. But it'll get there one approved artifact at a time.

If you're building with AI agents and wrestling with the same "how do I scope this down without killing the vision" tension, I'd genuinely like to hear how you're drawing the line. The comments are open.