Designing Software for Software Factories

A software engineer describes building an AI-driven "software factory" that automates the entire software development lifecycle from customer requests to deployment, requiring human intervention only for off-ramping and review. The system aims to measure cycle time, review volume, and off-ramp rates, with recommendations for AI-native startups to buy factory-as-a-service and enterprise companies to build dedicated AI-readiness platform teams.

Reflections on turning software engineering into agentic loops.

Since my 2024 post, AI-powered Software Engineering (https://blog.sshh.io/p/ai-powered-software-engineering), the Overton window has shifted from whether AI can aid software development at all to which parts of it remain human. At work, we've been building out what I call our "software factory", and in this post I want to walk through how it has come together, the hardest parts we've run into, and what actually works.

What is a software factory?

While it's becoming a bit of a buzzword with different definitions depending on who you ask, I define it as an AI-driven system, plus the organization that surrounds it, that solutions, designs, builds, tests, and deploys software products. If you've read The Transposed Organization (https://blog.sshh.io/p/the-transposed-organization), a "software factory" is just one of many "loops" a modern AI-native company must develop as a core part of EPD (eng, product, design) operations.

IMO a full software factory must:

- Be able to operate on the raw distribution of customer-generated RFEs and bug reports as input. It does not count if a PM needs to scope every ticket or an engineer needs to break the solution into smaller pieces.
- Only require humans for off-ramping (pressing the big red stop button) and review at certain stages.
It does not count if there's an explicit "pairing" step anywhere in the loop or if the system runs on an individual's laptop (https://background-agents.com/).
- Feed every review back into the system such that the review gate deprecates itself over time. It does not count, or work well, if reviews only apply to a single instance of software generation.
- Be able to run many requests through the loop concurrently, not one ticket at a time. It does not count if requests must be serialized because stages share mutable state (one test/staging env, one branch, one deploy slot) or if throughput is capped by a human-owned resource rather than by spend.

You typically measure the first-order [1] efficacy of a software factory via:

- Cycle time: the wall-clock time from customer request to deploy (or per stage).
- Review volume: how much feedback is given on AI-drafted outputs across stages (or per stage).
- Off-ramps: how often a request gives up and falls back to a human-in-the-loop development process (or per stage). This can also be reframed as the share of requests that are factory-applicable ("%-factory applicable").

Seeding a software factory

At the risk of being less useful, I'm going to focus this post on high-level tips, assuming you already have some semblance of a software factory set up. Exactly how one works, how end-to-end it goes, and whether it's homegrown or purchased will vary company by company. I've mostly seen two buckets:

AI-native startups

These typically have a lot more room to build an AI-friendly tech stack, and their contractual and compliance risks are typically lower. The downside is that they can't afford a dedicated AI DevOps team to build any sort of "software factory platform".

Recommendation: If Claude picked your stack, it's very likely you can just buy something like a factory-as-a-service. It's also critical to set the expectation up front: with no central platform team, every engineer is the software-factory architect for the systems they own.
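For whoever ends up owning the factory, the first-order metrics from earlier (cycle time, review volume, off-ramps) are cheap to compute from per-request records. Here is a minimal sketch, assuming a hypothetical `FactoryRequest` record; the field names, the `factory_metrics` helper, and the sample values are illustrative and not taken from any real system. It only covers the end-to-end view, not the per-stage breakdown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class FactoryRequest:
    """One customer request moving through the factory (hypothetical schema)."""
    requested_at: datetime
    deployed_at: Optional[datetime]  # None if the factory never deployed it
    review_comments: int             # feedback items left on AI-drafted outputs
    off_ramped: bool                 # True if it fell back to a human-led process


def factory_metrics(requests: list[FactoryRequest]) -> dict[str, float]:
    """Compute end-to-end cycle time, review volume, and off-ramp rate."""
    if not requests:
        raise ValueError("need at least one request")

    # Cycle time: wall-clock hours from request to deploy,
    # counting only requests the factory actually shipped.
    cycle_hours = [
        (r.deployed_at - r.requested_at) / timedelta(hours=1)
        for r in requests
        if r.deployed_at is not None and not r.off_ramped
    ]
    off_ramp_rate = sum(r.off_ramped for r in requests) / len(requests)
    return {
        "median_cycle_hours": median(cycle_hours) if cycle_hours else float("nan"),
        "review_comments_per_request": sum(r.review_comments for r in requests) / len(requests),
        "off_ramp_rate": off_ramp_rate,
        # The "%-factory applicable" reframing is the complement of the off-ramp rate.
        "factory_applicable": 1.0 - off_ramp_rate,
    }


# Made-up sample batch: three shipped requests and one off-ramp.
t0 = datetime(2025, 1, 1, 9, 0)
batch = [
    FactoryRequest(t0, t0 + timedelta(hours=4), review_comments=2, off_ramped=False),
    FactoryRequest(t0, t0 + timedelta(hours=8), review_comments=0, off_ramped=False),
    FactoryRequest(t0, None, review_comments=5, off_ramped=True),
    FactoryRequest(t0, t0 + timedelta(hours=6), review_comments=1, off_ramped=False),
]
metrics = factory_metrics(batch)
print(metrics)
```

Tracking the median rather than the mean keeps one stuck request from dominating cycle time. A per-stage version would need one timestamp pair and one review count per stage rather than per request.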
Enterprise software companies (me irl)

Moving to an AI-friendly stack becomes quite a large migration, and the appetite for risk, often around service stability, is low to none. Unlike startups, they can afford software-factory-platform-like teams and have started to build them.

Recommendation: Today at least, it's likely easier to build than buy, and to do this via one or more dedicated AI-readiness platform teams. A common failure mode is buying something with a low ceiling that can't get you to the level of compliance and testing needed to ship a "real" feature. Another option, for those bold enough, is to fork all new products onto an AI-native stack, but this only really pays off if the new stack is isolated enough not to share the same compliance and stability risks.

Don't start a product or platform as a software factory

Another, perhaps counterintuitive, observation is that software factories work best when there are patterns, contracts, and scaffolding to match against. Pure greenfield projects don't have this, and prematurely factorifying them has led to code bloat and a reduced ability to understand the project as it evolves. Another risk is the lack of data on how the project has evolved and will evolve: a factory works best when the system itself is "roadmap aware".

Brownfield projects, on the flip side, are often complex and unintuitive, with significant undocumented behavior (at least from a coding agent's POV). With or without AI, a well-designed brownfield project is actually the ideal place to apply a factory.

If you are just starting a project (greenfield), I'd build the first 3-10k LoC by pairing with a coding agent, and do the same for the first few E2E features. At modern development time scales, this phase should take around 1-3 weeks. Product-prototype loops likely take longer, and I consider those out of scope; for the most part they are just vibe-coded demos that become an input to the actual production project.
If you are working with a mega-complex enterprise service (very brownfield), I'd apply these strategies:

(1) Make all changes testable by an agent. To the deepest extent physically possible, an AI, with no human in the loop, should be able to tell you whether a given PR breaks service functionality.
(2) In a manual loop, prompt a coding agent to build out an upcoming feature (don't steer; just let it design, implement, and test e2e). Adjust the code organization and/or sprinkle in markdown files until this works.

If (1) or (2) is infeasible at scale, you should just rebuild the service (or decompose it and rebuild a subcomponent) to make both true. I realize that's non-trivial, but I'll make the claim that often:

(eng hours to do this) - (eng hours saved by having a factory) << (eng hours to continue building features with engs in the loop)

Understand and maintain factories with contracts

You can absolutely build and understand tens of thousands of lines of code shipped a day, provided you have the right "contracts" for how those projects should be developed. Contracts can take a variety of forms (schemas, typing, etc.), but my focus is typically on markdown ones (often AGENTS.md or .md files co-located with, and referenced from, the code).

A good markdown contract:

- Makes predictions about how the project will evolve (what I mean by "roadmap aware").
- Makes it clear how to validate changes and what can't be validated.
- Stays unchanged about 90% of the time from feature to feature (e.g. the % of future PRs that modify this file is very low).
- Establishes clear complexity and risk boundaries (e.g. x are examples of things that are easy, y are examples of things that are hard or risky).
- Is forward-tested on roadmap tasks (e.g. throwaway vibe code, with no steering, the next 3 items on the roadmap; if the results are way off, the contract is bad).
- Defines key context an agent won't see, e.g.
often product and audience context: roughly a cache of "what should the agent know that's not in the codebase".

The "I wrote a markdown file and my agent still sucks" (https://arxiv.org/abs/2602.11988) crowd often includes 0 of these 6, because the easiest (but wrong) thing to do is to use the markdown as a summary of the project or as storage for one-off tech specs.

Here are some snippets from utility skills I have used for writing these [2]:

- Principles are the choices about what this service is and isn't, what it optimizes for, and what trade-offs it accepts. They are the decisions that produced the current code shape, not summaries of that shape. A good test: if you replaced every file in the directory with a different implementation that respected the principle, would the principle still read true?
- Where does stuff live? The high-level layout a reader needs to know to find things: what kinds of files exist, what naming conventions disambiguate them, where shared helpers live, what subfolders mean. This should stay true across normal feature work; you should NOT have to edit it on every code change. Do not enumerate files; for the live file list, point the reader at a command e.g., ls