AI has enabled businesses to generate code at an elevated rate. There is anecdotal industry evidence that production reliability has decreased. But how can you know whether this is true for your business?
Collect and act on metrics.
Why metrics matter #
Metrics matter because they turn an unpredictable system into a manageable one. They give you reliable sight of what is really happening.
Organizational risk is increasing due to AI-driven delivery, and this can impact customer experience and your personal accountability.
Metrics support defensible governance during a period of rapid industry change.
Failure rate metrics #
There is no industry-wide dataset that directly measures change failure rate before and after the adoption of AI for a delivery process that leads to production change.
To decide if your use of AI is making your delivery better or worse you need to measure your own situation to define a baseline.
To do this, you need to collect metrics that reflect the changing health of your delivery process and its effects on your production environment.
Four metrics #
To surface a picture of your delivery and production health, you need to capture:
- Change failure rate
- MTTR
- MTTD
- Incident volume and severity
Just as you have observability in place for your systems, these metrics capture the relationship between how you produce solutions and the effect they have on your production system.
Change failure rate
This tells you how often your changes break production.
Production reliability problems ultimately begin with one of two things:
- A change that should not have been deployed
- A change that was deployed correctly but behaved incorrectly
The first is a decision failure as the change was wrong before it reached production: the code was incorrect, incomplete or logically flawed.
The second is a system-interaction failure. The change was valid in isolation (all quality checks showed the change was good to go) but, once deployed, the change interacted with production in an unexpected way that was not picked up earlier.
When using AI, generated code can be:
- plausible but wrong
- incomplete
- inconsistent
- missing edge cases
- violating invariants
Such code will increase the number of changes that should not have been deployed.
AI use also increases change volume and this will affect more parts of your system. This increases:
- integration risk
- emergent behaviour
- subtle regressions
- interactions with legacy code
This increases the number of changes that behave incorrectly only after deployment.
Interpreting CFR
If CFR rises, you need to know which of two things happened: did your use of AI generate more incorrect changes, or did AI accelerate delivery and expose more integration failures?
Without separating the two, you cannot attribute the cause.
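A minimal sketch of that separation, assuming each change record carries a flag for whether it failed in production and, for failures, a classification assigned at post-incident review. The `Change` record, its field names, and the labels `decision` and `interaction` are hypothetical, chosen to mirror the two failure types described earlier. The ten records are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    change_id: str
    failed: bool                        # did this change cause a production failure?
    failure_kind: Optional[str] = None  # "decision" or "interaction" (hypothetical labels)


def change_failure_rates(changes: List[Change]) -> Dict[str, float]:
    """Return overall CFR and the rate per failure kind, as fractions of all changes."""
    total = len(changes)
    if total == 0:
        return {"overall": 0.0, "decision": 0.0, "interaction": 0.0}
    failed = [c for c in changes if c.failed]
    decision = sum(1 for c in failed if c.failure_kind == "decision")
    interaction = sum(1 for c in failed if c.failure_kind == "interaction")
    return {
        "overall": len(failed) / total,
        "decision": decision / total,
        "interaction": interaction / total,
    }


# Invented sample: ten changes, three of which failed in production.
changes = [
    Change("c1", failed=False),
    Change("c2", failed=True, failure_kind="decision"),
    Change("c3", failed=False),
    Change("c4", failed=True, failure_kind="interaction"),
    Change("c5", failed=False),
    Change("c6", failed=False),
    Change("c7", failed=False),
    Change("c8", failed=True, failure_kind="interaction"),
    Change("c9", failed=False),
    Change("c10", failed=False),
]
rates = change_failure_rates(changes)
print(rates)  # {'overall': 0.3, 'decision': 0.1, 'interaction': 0.2}
```

Tracking the two rates over time, rather than only the overall figure, is what lets you say which kind of failure is driving a rise in CFR.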
Mean time to recovery
MTTR tells you how long you stay broken. It is the single best measure of operational resilience. Impact matters too: if you are broken for 24 hours but your customers can still use your systems effectively, then you have a degree of resilience to system flaws, as they do not negatively impact your customers.
Even if CFR stays constant, MTTR can become worse, because using AI can introduce more subtle defects (which are likely to be missed in a large volume of generated code) and your system becomes harder to correct.
Subtle defects
Imagine your AI rewrites a database query for readability:
SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)
You test the query and everything passes. The test uses 50 rows for users and 10 for sessions.
But, in production, users contains 10,000,000 rows and sessions contains 50,000,000.
Given this different data environment, the first thing that happens is that the subquery (in parentheses) becomes unbounded. It will return every user_id in sessions.
The database will create an in-memory version of sessions to check for value membership. This is because the SQL query tests for this using IN. Every value from sessions must be read.
Even if we assume that the in-memory version of sessions is built only once (and that each membership test costs one unit of time) and that users is indexed, every row in users must still be scanned. And sequential scans are inherently expensive.
The performance outcome is that the database optimiser chooses a sequential scan of users. The scan is expensive, but when the IN list is huge (50 million values), every alternative query plan is even more expensive, so the scan is the least costly option. It will be faster than the alternatives, but it is still costly in terms of input/output and CPU use.
A large sessions table leads to a huge IN list, which means any index on users is of less value. Because of this, the database query optimiser scans the whole of users. And in production, users contains 10,000,000 rows.
In short, the size of sessions triggers the database to choose a query plan that scans 10,000,000 rows.
This is a subtle issue to catch in testing because the test used a tiny dataset.
But the key here is that the AI has no awareness of your production table sizes.
AI has written a theoretically correct query that breaks down when exposed to the realities of production.
There is more to writing code than just the text. A full awareness of the environment in which that text is running is required. And the AI does not have that awareness.
Your business becomes dependent on generated code that is not fully understood.
The interaction of two table sizes on performance
Engineers have an appreciation of this matrix.
The same query will operate differently each time it is run as the sizes of users and sessions vary. If they are both large, a worst case performance may occur.
What counts as large depends on the database you are using and the hardware environment it is running within.
The eventual performance of your code is dependent on factors outside of the code. This is why it is crucial to check the behaviour of your code in a test environment that is an accurate reflection of your production environment. Your engineers and QA staff are aware of this. Your AI is not.
| Table size | Small Sessions | Large Sessions |
|---|---|---|
| Small Users | • Fast query plans • Index use likely | • Subquery grows large but outer scan still cheap • Hash table from sessions is large but, overall, still manageable |
| Large Users | • Index use leads to good performance • Optimiser avoids full scan | • This is the worst case • A huge IN list and a full scan of a large users table • The optimiser is forced into a sequential scan on users • Result: a high query time |
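One way to explore this matrix before production is to run the query against synthetic tables of different sizes and look at both the plan and the elapsed time. The sketch below uses SQLite only because it ships with Python. The row counts, the `build_db` and `profile` helpers, and the `name` and `sessions.id` columns are assumptions for illustration. SQLite's planner will not necessarily make the same choices as your production database, so treat this as a template for repeating the experiment on your own engine with production-sized data.

```python
import sqlite3
import time

QUERY = "SELECT * FROM users WHERE id IN (SELECT user_id FROM sessions)"


def build_db(n_users, n_sessions):
    """Create an in-memory database with synthetic users and sessions rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        ((i, f"user{i}") for i in range(1, n_users + 1)),
    )
    # Assign sessions to users round-robin so user_id always references a real user.
    conn.executemany(
        "INSERT INTO sessions (user_id) VALUES (?)",
        (((i % n_users) + 1,) for i in range(n_sessions)),
    )
    conn.commit()
    return conn


def profile(n_users, n_sessions):
    """Return the planner's description and the wall-clock time for QUERY."""
    conn = build_db(n_users, n_sessions)
    # The fourth column of EXPLAIN QUERY PLAN output is the human-readable detail.
    plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    start = time.perf_counter()
    rows = conn.execute(QUERY).fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return {
        "users": n_users,
        "sessions": n_sessions,
        "matched": len(rows),
        "seconds": elapsed,
        "plan": plan,
    }


SMALL, LARGE = 100, 20_000  # illustrative sizes only, far below production scale
for n_users in (SMALL, LARGE):
    for n_sessions in (SMALL, LARGE):
        result = profile(n_users, n_sessions)
        print(result["users"], result["sessions"], result["matched"],
              f"{result['seconds']:.4f}s", result["plan"])
```

The point of the exercise is the comparison across the four combinations, not any single timing: a plan or timing that changes sharply between quadrants is a signal that test-sized data is hiding production behaviour.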
Interpreting MTTR
If MTTR rises, users experience longer outages. If MTTR falls, reliability is improving even if the change failure rate is unchanged.
Mean time to detection
MTTD tells you how long you remain unaware that you are broken.
AI can affect this in two ways:
- more subtle regressions so a broken production is harder to detect
- more automated monitoring so a broken production is easier to detect
More subtle regressions
AI can generate code that can hide defects. And the defect may be subtle, as we have seen:
- it only appears under real production data, not engineer-run tests
- it only appears under real concurrency, not local runs
- it only appears under real load, not pre-production quality staging
- logs may not show anything suspicious
- the behaviour may be intermittent, so alerts do not fire
Interpreting MTTD
The system is broken, but nobody realises for longer. That is an increase in MTTD.
If MTTD increases, you are blind for longer. If MTTD decreases, you catch issues earlier.
Using MTTD to interpret MTTR
Without MTTD, you cannot interpret MTTR correctly. This is because MTTD is a component of MTTR.
MTTD is the time between a failure occurring and the failure being detected. This is the blindness window.
MTTR after failure detection is the time from failure detection to recovery. This is the repair window.
MTTR can refer to the sum of both these windows of time. But, the two windows behave differently, and they are influenced by AI in different ways.
Separating the two is important because if you only look at MTTR, you cannot tell whether:
- detection slowed down
- recovery slowed down
- both slowed down
- one improved while the other got worse
Two organisations with the same MTTR can have two totally different operational realities.
Affecting MTTD with two types of MTTR
There are two types of MTTR:
- detection-inclusive MTTR that includes the time you were unaware of the issue
- post-detection MTTR: how fast you fix things once you are aware of the issue
The first shows how long users were affected. The second describes how fast engineering can recover.
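The blindness window, the repair window, and their sum can each be computed from three timestamps per incident. This is a minimal sketch: the `Incident` record, its field names, and the timestamps are hypothetical, and it assumes your incident tooling records when a failure started, when it was detected, and when service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Incident:
    started: datetime    # when the failure began
    detected: datetime   # when someone (or something) noticed
    recovered: datetime  # when service was restored


def mean(deltas: List[timedelta]) -> timedelta:
    # Assumes at least one incident; an empty list would divide by zero.
    return sum(deltas, timedelta()) / len(deltas)


def windows(incidents: List[Incident]) -> Dict[str, timedelta]:
    """Return MTTD, post-detection MTTR and detection-inclusive MTTR."""
    return {
        "mttd": mean([i.detected - i.started for i in incidents]),
        "post_detection_mttr": mean([i.recovered - i.detected for i in incidents]),
        "detection_inclusive_mttr": mean([i.recovered - i.started for i in incidents]),
    }


# Two invented organisations, each with one four-hour incident.
t0 = datetime(2024, 1, 1, 9, 0)
org_a = [Incident(t0, t0 + timedelta(hours=3, minutes=30), t0 + timedelta(hours=4))]
org_b = [Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(hours=4))]

for name, incidents in (("A", org_a), ("B", org_b)):
    print(name, windows(incidents))
```

Both organisations report the same detection-inclusive MTTR of four hours, yet A was blind for most of that time and fixed the fault quickly, while B noticed within minutes and then took hours to recover.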
AI's effect on MTTD
AI can both increase and decrease MTTD, depending on how you use it.
AI can introduce subtle effects (as above with users and sessions database tables) that can:
- pass tests
- look plausible and so pass code review
- only appear under real load or real data
- not trigger alerts immediately
These will increase your blindness window.
AI can improve:
- anomaly detection
- log analysis
- metric correlation (not causation)
- alert generation
The improvements reduce the blindness window.
AI may increase or decrease the time for MTTR after detection, depending on whether:
- your use of AI helps engineers debug faster
- your use of AI produces code that is harder to reason about when run in production
And you cannot know which effect dominates unless you measure the components separately.
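As a rough illustration, comparing the mean detection and repair windows across two periods shows which component moved. The `component_deltas` helper, the dictionary keys, and the minute values are all invented for this sketch.

```python
def component_deltas(before, after):
    """Compare mean detection and repair windows (in minutes) across two periods."""
    detect = after["detect_minutes"] - before["detect_minutes"]
    repair = after["repair_minutes"] - before["repair_minutes"]
    return {
        "detect_delta": detect,
        "repair_delta": repair,
        "total_delta": detect + repair,
    }


# Invented figures: detection got slower, repair got faster.
before_ai = {"detect_minutes": 20, "repair_minutes": 60}
after_ai = {"detect_minutes": 50, "repair_minutes": 30}

deltas = component_deltas(before_ai, after_ai)
print(deltas)  # {'detect_delta': 30, 'repair_delta': -30, 'total_delta': 0}
```

In this invented case the total is unchanged, so a single MTTR figure would report no movement even though detection worsened by 30 minutes and repair improved by 30.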
Incident Volume and Severity
This tells you how often and how badly things go wrong.
Even if your change failure rate and your mean time to recovery look stable, incident volume can rise because:
- deployment frequency increases
- AI accelerates code generation
- system complexity increases
- more third‑party dependencies fail
Incident volume is the only metric that captures the total operational load on the organisation.
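A minimal sketch of tracking volume and severity together, assuming your incident log can be exported as pairs of date opened and severity label. The `volume_by_month` helper and the sample log are hypothetical.

```python
from collections import Counter
from datetime import date


def volume_by_month(incidents):
    """Count incidents per (month, severity) pair."""
    counts = Counter()
    for opened_on, severity in incidents:
        counts[(opened_on.strftime("%Y-%m"), severity)] += 1
    return counts


# Invented incident log: (date opened, severity)
incident_log = [
    (date(2024, 1, 5), "P1"),
    (date(2024, 1, 19), "P2"),
    (date(2024, 2, 2), "P2"),
    (date(2024, 2, 14), "P2"),
    (date(2024, 2, 27), "P1"),
]

monthly = volume_by_month(incident_log)
for (month, severity), count in sorted(monthly.items()):
    print(month, severity, count)
```

Keeping severity in the key, rather than summing everything into one monthly total, preserves the distinction between how often and how badly things go wrong.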
How has your use of AI affected your business? #
If you are using AI operationally, you can measure what effect it is having by considering these metrics. Each metric must be normalised.
| Metric | Before AI | After AI | Normalisation explanation |
|---|---|---|---|
| Deployments/week | X | Y | Adjusted for team size and release cadence |
| Change Failure Rate | A% | B% | Calculated as failures per change, not absolute counts |
| MTTR | M | N | Split into detection and recovery components |
| P1/P2 incidents/month | U | V | Adjusted for deployment volume and service footprint |
| Lines of code changed | L1 | L2 | Normalised per engineer to remove team size effects |
| AI‑generated code (%) | 0% | K% | Expressed as a proportion of total code changes |
Normalisation is required to show that metrics before and after the use of AI reflect real changes in performance and are not due to unrelated factors such as:
- team size
- deployment frequency
- service footprint
- code volume
- organisational growth
For example, without normalisation, the metrics might be misleading:
- if deployments double, incident count may rise even if quality improves
- if the team grows, code generation volume increases even without AI
- if the system footprint expands, MTTR may rise simply because more services exist
Consider this: you double deployments, and your number of incidents doubles. Your quality has remained the same. But if you double the number of deployments, and the number of incidents increases by 50%, your quality has improved. Normalisation is essential to interpret metrics within the context of overall values.
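The arithmetic behind that example can be sketched as incidents per deployment. The starting counts (10 incidents across 100 deployments) and the `incidents_per_deployment` helper are invented for illustration; only the ratios (doubling, and a 50% increase) come from the example above.

```python
def incidents_per_deployment(incidents, deployments):
    """Normalise an incident count by the number of deployments in the same period."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return incidents / deployments


baseline = incidents_per_deployment(10, 100)      # starting point
same_quality = incidents_per_deployment(20, 200)  # deployments and incidents both double
improved = incidents_per_deployment(15, 200)      # deployments double, incidents up 50%

print(baseline, same_quality, improved)  # 0.1 0.1 0.075
```

The raw incident count rises in both scenarios, but the normalised rate is flat in the first and lower in the second, which is the distinction the raw count hides.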
Why collect metrics? #
Regularly collecting and publishing metrics is essential because it replaces subjective, individual experience with objective, organisation‑wide evidence. Occasional anecdotes such as "my use of AI has made it better for me" cannot explain what is happening across your entire delivery system.
Overall, reliability might be decreasing #
Trends in commercial deployed systems have led to:
- system complexity growth — more services, more dependencies, more failure modes
- higher change velocity — more deployments equates to more opportunities for defects
- increased integration surface — more APIs, more third‑party systems
- cloud‑native architectures — distributed systems fail in subtle ways
- operational load on teams — alert fatigue, burnout, staffing constraints
- legacy systems under strain — older systems interacting with new ones
- rising customer expectations — smaller incidents now count as "outages"