[ARTICLE · art-37582] src=phroneses.com ↗ pub= topic=artificial-intelligence verified=true sentiment=↓ negative

Measuring Reliability in the Age of AI

AI-generated code may be increasing production failure rates and recovery times, according to anecdotal industry evidence. To manage this risk, businesses must track metrics like change failure rate and mean time to recovery to distinguish between AI-introduced defects and integration failures.

11 min read · Published Jun 24, 2026

AI has enabled businesses to generate code at an elevated rate. There is anecdotal industry evidence that production reliability has decreased. But how can you know this accurately for your business?

Collect and act on metrics.

Why metrics matter #

Metrics matter because they turn an unpredictable system into a manageable one. They give you reliable sight of what is really happening.

Organizational risk is increasing due to AI-driven delivery, and this can impact customer experience and your personal accountability.

Metrics support defensible governance during a period of rapid industry change.

Failure rate metrics #

There is no industry-wide dataset that directly measures change failure rate before and after the adoption of AI for a delivery process that leads to production change.

To decide if your use of AI is making your delivery better or worse you need to measure your own situation to define a baseline.

To do this, you need to collect metrics that reflect the changing health of your delivery process and its effects on your production environment.

Four metrics #

To surface a picture of your delivery and production health, you need to capture:

  • Change failure rate (CFR)
  • Mean time to recovery (MTTR)
  • Mean time to detection (MTTD)
  • Incident volume and severity

In the same way you have observability in place, the above metrics capture the relationship between how you produce solutions and the effect they have on your production system.

Change failure rate

This tells you how often your changes break production.

Production reliability problems ultimately begin with one of two things:

  • A change that should not have been deployed
  • A change that was deployed correctly but behaved incorrectly

The first is a decision failure as the change was wrong before it reached production: the code was incorrect, incomplete or logically flawed.

The second is a system-interaction failure. The change was valid in isolation (all quality checks showed the change was good to go) but, once deployed, the change interacted with production in an unexpected way that was not picked up earlier.

When using AI, generated code can be:

  • plausible but wrong
  • incomplete
  • inconsistent
  • missing edge cases
  • violating invariants

Such code will increase the number of changes that should not have been deployed.

AI use also increases change volume and this will affect more parts of your system. This increases:

  • integration risk
  • emergent behaviour
  • subtle regressions
  • interactions with legacy code

This increases the number of changes that behave incorrectly only after deployment.

Interpreting CFR

If change failure rate (CFR) rises, you need to know: did your use of AI generate more incorrect changes, or did AI accelerate delivery and expose more integration failures?

Without separating the two, you cannot attribute the cause.
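As a minimal sketch of that separation, the Python below computes an overall CFR and the share due to each cause. It assumes every failed change has already been labelled during incident review; the labels "decision" and "interaction", the Change record and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical failure labels, mirroring the two causes described above.
DECISION = "decision"        # the change was wrong before it reached production
INTERACTION = "interaction"  # valid in isolation, failed against production


@dataclass
class Change:
    change_id: str
    failure: Optional[str] = None  # None means the change caused no failure


def change_failure_rates(changes: list[Change]) -> dict[str, float]:
    """Return the overall CFR plus the rate attributable to each cause."""
    total = len(changes)
    if total == 0:
        return {"overall": 0.0, DECISION: 0.0, INTERACTION: 0.0}
    decision = sum(1 for c in changes if c.failure == DECISION)
    interaction = sum(1 for c in changes if c.failure == INTERACTION)
    return {
        "overall": (decision + interaction) / total,
        DECISION: decision / total,
        INTERACTION: interaction / total,
    }


# Illustrative data only: ten changes, three of which failed.
changes = [
    Change("c1"),
    Change("c2", DECISION),
    Change("c3"),
    Change("c4", INTERACTION),
    Change("c5"),
    Change("c6"),
    Change("c7", DECISION),
    Change("c8"),
    Change("c9"),
    Change("c10"),
]
rates = change_failure_rates(changes)
print(rates)
```

The hard part is the labelling, not the arithmetic. A consistent rule for deciding which bucket a failure belongs in matters more than the code.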

Mean time to recovery

MTTR tells you how long you stay broken. It is the single best measure of operational resilience: if a fault persists for 24 hours but your customers can still use your systems effectively, you have a degree of resilience, because the fault does not negatively affect them.

Even if CFR stays constant, MTTR can become worse because using AI can introduce more subtle defects (that are likely to be missed in a large volume of code generation), and your system becomes harder to correct.

Subtle defects

Imagine your AI rewrites a database query for readability:

SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)

You test the above and everything passes. The test uses 50 rows for users and 10 for sessions.

But, in production, users contains 10,000,000 rows and sessions contains 50,000,000.

Given this different data environment, the first thing that happens is that the subquery (in parentheses) becomes unbounded. It will return every user_id in sessions.

The database will create an in-memory version of sessions to check for value membership. This is because the SQL query tests for this using IN. Every value from sessions must be read.

Even if we assume that sessions has been built once in memory (so that a membership test costs one unit of time) and that users is indexed, every row in users must still be scanned. And sequential scans are inherently expensive.

The performance outcome in this case is that a sequential scan of users is expensive, but when the IN list is huge (50 million values), every alternative database query plan is even more expensive, so the database optimiser chooses the scan as the least costly option. A scan will be faster than the alternatives but such a scan is still costly in terms of input/output and CPU use.

A large sessions table leads to a huge IN list, which means any index on users is of less value. Because of this, the database query optimiser scans the whole of users. And in production, users contains 10,000,000 rows.

In short, the size of sessions triggers the database to choose a query plan that scans 10,000,000 rows.

This is a subtle issue to catch in test as test has used a tiny dataset.

But the key here is that the AI has no awareness of your production table sizes.

AI has written a theoretically correct query that breaks down when exposed to the realities of production.

There is more to writing code than just the text. A full awareness of the environment in which that text is running is required. And the AI does not have that awareness.

Your business becomes dependent on generated code that is not fully understood.

The interaction of two table sizes on performance

Engineers have an appreciation of this matrix.

The same query will operate differently each time it is run as the sizes of users and sessions vary. If they are both large, a worst case performance may occur.

Large here depends on the database you are using and the hardware environment it is running within.

The eventual performance of your code is dependent on factors outside of the code. This is why it is crucial to check the behaviour of your code in a test environment that is an accurate reflection of your production environment. Your engineers and QA staff are aware of this. Your AI is not.

Table size | Small sessions | Large sessions
Small users | Fast query plans; index use likely | Subquery grows large but outer scan still cheap; hash table from sessions is large but, overall, still manageable
Large users | Index use leads to good performance; optimiser avoids full scan | This is the worst case: a huge IN list and a full scan of a large users table; the optimiser is forced into a sequential scan on users; result: a high query time
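One way to act on this is to inspect the query plan at more than one data volume before shipping. The Python sketch below uses the standard-library sqlite3 module as a stand-in database. That is an assumption made so the example can run anywhere: SQLite's optimiser, its plan wording and the sizes at which plans change will all differ from your production database, and the row counts here are scaled far below the production figures quoted above.

```python
import sqlite3

QUERY = "SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)"


def build_db(n_users: int, n_sessions: int) -> sqlite3.Connection:
    """Create an in-memory database with the given (illustrative) table sizes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sessions (user_id INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        ((i, f"user{i}") for i in range(n_users)),
    )
    conn.executemany(
        "INSERT INTO sessions VALUES (?)",
        ((i % n_users,) for i in range(n_sessions)),
    )
    conn.execute("ANALYZE")  # gather statistics for the query planner
    return conn


def query_plan(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the human-readable steps of the plan the optimiser chose."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


small = build_db(n_users=50, n_sessions=10)
larger = build_db(n_users=20_000, n_sessions=100_000)
for label, conn in (("test-sized", small), ("larger", larger)):
    print(label, query_plan(conn, QUERY))
```

The point is the habit rather than the specific output: run the same plan check against production-like volumes on the database engine you actually deploy, and compare the result with what the test-sized data showed.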

Interpreting MTTR

If MTTR rises, users experience longer outages. If MTTR falls, reliability is improving even if the change failure rate is unchanged.

Mean time to detection

MTTD tells you how long you remain unaware that you are broken.

AI can affect this in two ways:

  • more subtle regressions so a broken production is harder to detect
  • more automated monitoring so a broken production is easier to detect

More subtle regressions

AI can generate code that can hide defects. And the defect may be subtle, as we have seen:

  • it only appears under real production data, not engineer-run tests
  • it only appears under real concurrency, not local runs
  • it only appears under real load, not pre-production quality staging
  • logs may not show anything suspicious
  • the behaviour may be intermittent, so alerts do not fire

Interpreting MTTD

The system is broken, but nobody realises for longer. That is an increase in MTTD.

If MTTD increases, you are blind for longer. If MTTD decreases, you catch issues earlier.

Using MTTD to interpret MTTR

Without MTTD, you cannot interpret MTTR correctly. This is because MTTD is a component of MTTR.

MTTD is the time between a failure occurring and the failure being detected. This is the blindness window.

MTTR after failure detection is the time from failure detection to recovery. This is the repair window.

MTTR can refer to the sum of both these windows of time. But the two windows behave differently, and they are influenced by AI in different ways.

Separating the two is important because if you only look at MTTR, you cannot tell whether:

  • detection slowed down
  • recovery slowed down
  • both slowed down
  • one improved while the other got worse

Two organisations with the same MTTR can have two totally different operational realities.

Affecting MTTD with two types of MTTR

There are two types of MTTR:

  • detection-inclusive MTTR that includes the time you were unaware of the issue
  • post-detection MTTR: how fast you fix things once you are aware of the issue

The first shows how long users were affected. The second describes how fast engineering can recover.
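The decomposition can be sketched in a few lines of Python. This assumes each incident record carries three timestamps (failure start, detection, recovery); the Incident type, the field names and the two sample organisations are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the failure began
    detected: datetime   # when someone (or something) noticed
    recovered: datetime  # when service was restored


def mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def recovery_metrics(incidents: list[Incident]) -> dict[str, timedelta]:
    """Split recovery time into the blindness window and the repair window."""
    return {
        "mttd": mean([i.detected - i.started for i in incidents]),
        "post_detection_mttr": mean([i.recovered - i.detected for i in incidents]),
        "detection_inclusive_mttr": mean([i.recovered - i.started for i in incidents]),
    }


t0 = datetime(2026, 1, 1, 9, 0)
# Two hypothetical organisations with the same detection-inclusive MTTR.
slow_to_detect = [Incident(t0, t0 + timedelta(minutes=50), t0 + timedelta(minutes=60))]
slow_to_repair = [Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=60))]

for name, incidents in (("A", slow_to_detect), ("B", slow_to_repair)):
    print(name, recovery_metrics(incidents))
```

Both organisations report 60 minutes end to end, but A spends 50 of them blind while B spends 55 of them repairing. The single headline figure hides which window needs attention.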

AI's effect on MTTD

AI can both increase and decrease MTTD, depending on how you use it.

AI can introduce subtle defects (as above with the users and sessions database tables) that can:

  • pass tests
  • look plausible and so pass code review
  • only appear under real load or real data
  • fail to trigger alerts immediately

These will increase your blindness window.

AI can improve:

  • anomaly detection
  • log analysis
  • metric correlation (not causation)
  • alert generation

The improvements reduce the blindness window.

AI may increase or decrease the time for MTTR after detection, depending on whether:

  • your use of AI helps engineers debug faster
  • your use of AI produces code that is harder to reason about when run in production

And you cannot know which effect dominates unless you measure the components separately.

Incident Volume and Severity

This tells you how often and how badly things go wrong.

Even if your change failure rate and your mean time to recovery look stable, incident volume can rise because:

  • deployment frequency increases
  • AI accelerates code generation
  • system complexity increases
  • more third‑party dependencies fail

Incident volume is the only metric that captures the total operational load on the organisation.
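A simple tally is enough to start tracking this. The Python sketch below counts incidents per month and severity; the incident log, its dates and the P1/P2 labels are hypothetical sample data.

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (date opened, severity label).
incidents = [
    (date(2026, 4, 3), "P1"),
    (date(2026, 4, 17), "P2"),
    (date(2026, 5, 2), "P2"),
    (date(2026, 5, 9), "P2"),
    (date(2026, 5, 21), "P1"),
]


def volume_by_month(log):
    """Count incidents per (year-month, severity) pair."""
    return Counter((opened.strftime("%Y-%m"), severity) for opened, severity in log)


counts = volume_by_month(incidents)
for (month, severity), n in sorted(counts.items()):
    print(month, severity, n)
```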

How has your use of AI affected your business? #

If you are using AI operationally, you can measure what effect it is having by considering these metrics. Each metric must be normalised.

Metric | Before AI | After AI | Normalisation explanation
Deployments/week | X | Y | Adjusted for team size and release cadence
Change Failure Rate | A% | B% | Calculated as failures per change, not absolute counts
MTTR | M | N | Split into detection and recovery components
P1/P2 incidents/month | U | V | Adjusted for deployment volume and service footprint
Lines of code changed | L1 | L2 | Normalised per engineer to remove team size effects
AI‑generated code (%) | 0% | K% | Expressed as a proportion of total code changes

Normalisation is required to show that metrics before and after the use of AI reflect real changes in performance and are not due to unrelated factors such as:

  • team size
  • deployment frequency
  • service footprint
  • code volume
  • organisational growth

For example, without normalisation, the metrics might be misleading:

  • if deployments double, incident count may rise even if quality improves
  • if the team grows, code generation volume increases even without AI
  • if the system footprint expands, MTTR may rise simply because more services exist

Consider this: you double deployments, and your number of incidents doubles. Your quality has remained the same. But if you double the number of deployments, and the number of incidents increases by 50%, your quality has improved. Normalisation is essential to interpret metrics within the context of overall values.
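The same arithmetic as a short Python sketch. The counts are illustrative only, chosen so that deployments double while incidents rise by 50%; the function name is hypothetical.

```python
def incidents_per_deployment(incidents: int, deployments: int) -> float:
    """Normalise an incident count by deployment volume."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return incidents / deployments


# Illustrative numbers only: deployments double, incidents rise by 50%.
before = incidents_per_deployment(incidents=10, deployments=100)
after = incidents_per_deployment(incidents=15, deployments=200)

print(f"before: {before:.3f} incidents per deployment")
print(f"after:  {after:.3f} incidents per deployment")
print("improved" if after < before else "not improved")
```

The raw count went up from 10 to 15, yet the normalised rate fell from 0.100 to 0.075 incidents per deployment.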

Why collect metrics? #

Regularly collecting and publishing metrics is essential because it replaces subjective, individual experience with objective, organisation‑wide evidence. Occasional anecdotes such as "my use of AI has made it better for me" cannot explain what is happening across your entire delivery system.

Overall, reliability might be decreasing #

Trends in commercial deployed systems have led to:

  • system complexity growth — more services, more dependencies, more failure modes
  • higher change velocity — more deployments equates to more opportunities for defects
  • increased integration surface — more APIs, more third‑party systems
  • cloud‑native architectures — distributed systems fail in subtle ways
  • operational load on teams — alert fatigue, burnout, staffing constraints
  • legacy systems under strain — older systems interacting with new ones
  • rising customer expectations — smaller incidents now count as "outages"
