{"slug": "measuring-reliability-in-the-age-of-ai", "title": "Measuring Reliability in the Age of AI", "summary": "AI-generated code is increasing production failure rates and recovery times, according to industry evidence. To manage this risk, businesses must track metrics like change failure rate and mean time to recovery to distinguish between AI-introduced defects and integration failures.", "body_md": "AI has enabled businesses to generate code at an elevated rate. There is anecdotal industry evidence that production reliability has decreased. But how can you know this accurately for your own business?\n\nCollect and act on metrics.\n\n## Why metrics matter\n\nMetrics matter because they turn an unpredictable system into a manageable one. They give you reliable sight of what is really happening.\n\nOrganizational risk is increasing due to AI-driven delivery, and this can impact customer experience and your personal accountability.\n\nMetrics support defensible governance during a period of rapid industry change.\n\n## Failure rate metrics\n\nThere is no industry-wide dataset that directly measures change failure rate before and after the adoption of AI for a delivery process that leads to production change.\n\nTo decide if your use of AI is making your delivery better or worse, you need to measure your own situation to define a baseline.\n\nTo do this, you need to collect metrics that reflect the changing health of your delivery process and its effects on your production environment.\n\n## Four metrics\n\nTo surface a picture of your delivery and production health, you need to capture:\n\n- Change failure rate (CFR)\n- Mean time to recovery (MTTR)\n- Mean time to detection (MTTD)\n- Incident volume and severity\n\nIn the same way you have observability in place, the above metrics capture\nthe relationship between *how* you produce solutions and *what* effect they\nhave on your production system.\n\n### Change failure rate\n\nThis tells you how often your changes break production.\n\nProduction reliability problems ultimately 
begin with one of two things:\n\n- A change that should not have been deployed\n- A change that was deployed correctly but behaved incorrectly\n\nThe first is a decision failure: the change was wrong before it reached production because the code was incorrect, incomplete or logically flawed.\n\nThe second is a system-interaction failure. The change was valid in isolation (all quality checks showed the change was good to go) but, once deployed, the change interacted with production in an unexpected way that was not picked up earlier.\n\nWhen using AI, generated code can be:\n\n- plausible but wrong\n- incomplete\n- inconsistent\n- missing edge cases\n- violating invariants\n\nSuch code will increase the number of changes that should not have been deployed.\n\nAI use also increases change volume, and this will affect more parts of your system. This increases:\n\n- integration risk\n- emergent behaviour\n- subtle regressions\n- interactions with legacy code\n\nThese increase the number of changes that behave incorrectly only after deployment.\n\n#### Interpreting CFR\n\nIf your change failure rate (CFR) rises, you need to know which of the two happened: did your use of AI generate more incorrect changes, or did AI accelerate delivery and expose more integration failures?\n\nWithout separating the two, you cannot attribute the cause.\n\n### Mean time to recovery\n\nMTTR tells you how long you stay broken. 
It is the single best measure of operational resilience. Resilience is about customer impact: if you are broken for 24 hours but your customers can still use your systems effectively, then you have a degree of resilience to system flaws, as they do not negatively impact your customers.\n\nEven if CFR stays constant, MTTR can become worse because using AI can introduce more subtle defects (which are likely to be missed in a large volume of generated code), and your system becomes harder to correct.\n\n#### Subtle defects\n\nImagine your AI rewrites a database query for readability:\n\n```\nSELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)\n```\n\nYou test the above and everything passes. The test uses 50 rows for users and 10 for sessions.\n\nBut in production, users contains 10,000,000 rows and sessions contains 50,000,000.\n\nGiven this different data environment, the first thing that happens is that the subquery (in parentheses) becomes unbounded. It will return every user_id in sessions.\n\nThe database will create an in-memory version of sessions to check for value\nmembership. This is because the SQL query tests for this using `IN`.\n*Every* value from sessions must be read.\n\nEven if we assume that the in-memory version of sessions is built only once (and a membership test costs one unit of time) and that users is indexed, every row in users must still be scanned. And sequential scans are inherently expensive.\n\nThe performance outcome in this case is that a sequential scan of users is expensive, but when the `IN` list is huge (50 million values), every alternative database query plan is even more expensive, so the database optimiser chooses the scan as the least costly option. A scan will be faster than the alternatives, but such a scan is still costly in terms of input/output and CPU use.\n\nA large sessions table leads to a huge `IN` list, which means any index on\nusers is of less value. Because of this, the database query optimiser scans the whole\nof users. 
And in production, users contains 10,000,000 rows.\n\nIn short, the size of sessions triggers the database to choose a query plan that scans 10,000,000 rows.\n\nThis is a subtle issue to catch in testing because the test used a tiny dataset.\n\nBut the key here is that the AI has no awareness of your production table sizes.\n\nThe AI has written a theoretically correct query that breaks down when exposed to the realities of production.\n\nThere is more to writing code than just the text. A full awareness of the environment in which that text is running is required. And the AI does not have that awareness.\n\nYour business becomes dependent on generated code that is not fully understood.\n\n#### The interaction of two table sizes on performance\n\nEngineers have an appreciation of the matrix shown in the table below.\n\nThe same query will perform differently from one run to the next as the sizes of users and sessions vary. If both are large, worst-case performance may occur.\n\nWhat counts as large depends on the database you are using and the hardware environment it is running within.\n\nThe eventual performance of your code is dependent on factors outside of the\ncode. This is why it is crucial to check the behaviour of your code in a test\nenvironment that is an *accurate* reflection of your production environment.\nYour engineers and QA staff are aware of this. 
Your AI is not.\n\n| Table size | Small Sessions | Large Sessions |\n|---|---|---|\n| Small Users | • Fast query plans • Index use likely | • Subquery grows large but outer scan still cheap • Hash table from sessions is large but, overall, still manageable |\n| Large Users | • Index use leads to good performance • Optimiser avoids full scan | • This is the worst case • A huge IN list and a full scan of a large users table • The optimiser is forced into a sequential scan on users • Result: a high query time |\n\n#### Interpreting MTTR\n\nIf MTTR rises, users experience longer outages. If MTTR falls, reliability is improving even if the change failure rate is unchanged.\n\n### Mean time to detection\n\nMTTD tells you how long you remain unaware that you are broken.\n\nAI can affect this in two ways:\n\n- more subtle regressions, so a broken production system is harder to detect\n- more automated monitoring, so a broken production system is easier to detect\n\n#### More subtle regressions\n\nAI can generate code that hides defects. And the defect may be subtle, as we have seen:\n\n- it only appears under real production data, not engineer-run tests\n- it only appears under real concurrency, not local runs\n- it only appears under real load, not pre-production quality staging\n- logs may not show anything suspicious\n- the behaviour may be intermittent, so alerts do not fire\n\n#### Interpreting MTTD\n\nThe system is broken, but nobody realises for longer. That is an increase in MTTD.\n\nIf MTTD increases, you are blind for longer. If MTTD decreases, you catch issues earlier.\n\n#### Using MTTD to interpret MTTR\n\nWithout MTTD, you cannot interpret MTTR correctly. This is because MTTD is a component of MTTR.\n\nMTTD is the time between a failure occurring and the failure being detected. This is the blindness window.\n\nMTTR after failure detection is the time from failure detection to recovery. This is the repair window.\n\nMTTR can refer to the sum of both these windows of time. 
But the two windows behave differently, and they are influenced by AI in different ways.\n\nSeparating the two is important because if you only look at MTTR, you cannot tell whether:\n\n- detection slowed down\n- recovery slowed down\n- both slowed down\n- one improved while the other got worse\n\nTwo organisations with the same MTTR can have two totally different operational realities.\n\n#### Affecting MTTD with two types of MTTR\n\nThere are two types of MTTR:\n\n- detection-inclusive MTTR: includes the time you were unaware of the issue\n- post-detection MTTR: how fast you fix things once you are aware of the issue\n\nThe first shows how long users were affected. The second describes how fast engineering can recover.\n\n#### AI's effect on MTTD\n\nAI can both increase and decrease MTTD, depending on how you use it.\n\nAI can introduce subtle defects (as above with the users and sessions database tables) that:\n\n- pass tests\n- look plausible and so pass code review\n- only appear under real load or real data\n- do not trigger alerts immediately\n\nThese will increase your blindness window.\n\nAI can improve:\n\n- anomaly detection\n- log analysis\n- metric correlation (not causation)\n- alert generation\n\nThese improvements reduce the blindness window.\n\nAI may increase or decrease post-detection MTTR, depending on whether:\n\n- your use of AI helps engineers debug faster\n- your use of AI produces code that is harder to reason about when run in production\n\nAnd you cannot know which effect dominates unless you measure the components separately.\n\n### Incident Volume and Severity\n\nThis tells you how often and how badly things go wrong.\n\nEven if your change failure rate and your mean time to recovery look stable, incident volume can rise because:\n\n- deployment frequency increases\n- AI accelerates code generation\n- system complexity increases\n- more third‑party dependencies fail\n\nIncident volume is the only metric that captures the 
total operational load on the organisation.\n\n## How has your use of AI affected your business?\n\nIf you are using AI operationally, you can measure what effect it is having by considering these metrics. Each metric must be normalised.\n\n| Metric | Before AI | After AI | Normalisation Explanation |\n|---|---|---|---|\n| Deployments/week | X | Y | Adjusted for team size and release cadence |\n| Change Failure Rate | A% | B% | Calculated as failures per change, not absolute counts |\n| MTTR | M | N | Split into detection and recovery components |\n| P1/P2 incidents/month | U | V | Adjusted for deployment volume and service footprint |\n| Lines of code changed | L1 | L2 | Normalised per engineer to remove team size effects |\n| AI‑generated code (%) | 0% | K% | Expressed as a proportion of total code changes |\n\nNormalisation is required to show that metrics before and after the use of AI reflect real changes in performance and are not due to unrelated factors such as:\n\n- team size\n- deployment frequency\n- service footprint\n- code volume\n- organisational growth\n\nFor example, without normalisation, the metrics might be misleading:\n\n- if deployments double, incident count may rise even if quality improves\n- if the team grows, code volume increases even without AI\n- if the system footprint expands, MTTR may rise simply because more services exist\n\nConsider this: you double deployments, and your number of incidents doubles. Your quality has remained the same. But if you double the number of deployments, and the number of incidents increases by 50%, your quality has improved. For instance, 10 incidents across 100 deployments is a rate of 10%, while 15 incidents across 200 deployments is a rate of 7.5%. Normalisation is essential to interpret metrics within the context of overall values.\n\n## Why collect metrics?\n\nRegularly collecting and publishing metrics is essential because it replaces subjective, individual experience with objective, organisation‑wide evidence. 
Occasional anecdotes such as \"my use of AI has made it better for me\" cannot explain what is happening across your entire delivery system.\n\n## Overall, reliability might be decreasing\n\nTrends in commercial deployed systems have led to:\n\n- system complexity growth — more services, more dependencies, more failure modes\n- higher change velocity — more deployments equates to more opportunities for defects\n- increased integration surface — more APIs, more third‑party systems\n- cloud‑native architectures — distributed systems fail in subtle ways\n- operational load on teams — alert fatigue, burnout, staffing constraints\n- legacy systems under strain — older systems interacting with new ones\n- rising customer expectations — smaller incidents now count as \"outages\"", "url": "https://wpnews.pro/news/measuring-reliability-in-the-age-of-ai", "canonical_source": "https://phroneses.com/articles/leadership/notes/measuring-reliability-in-the-age-of-ai.html", "published_at": "2026-06-24 00:00:00+00:00", "updated_at": "2026-06-24 10:27:35.032018+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-safety", "ai-products", "developer-tools"], "entities": [], "alternates": {"html": 
"https://wpnews.pro/news/measuring-reliability-in-the-age-of-ai", "markdown": "https://wpnews.pro/news/measuring-reliability-in-the-age-of-ai.md", "text": "https://wpnews.pro/news/measuring-reliability-in-the-age-of-ai.txt", "jsonld": "https://wpnews.pro/news/measuring-reliability-in-the-age-of-ai.jsonld"}}