The Root Cause of Never Learning

Adrian Hornsby, former AWS principal engineer and founder of Resilium Labs, argues that engineering organizations fail to learn from incidents because they seek a single root cause rather than understanding the complex interactions that lead to failure. He warns that AI adoption widens the gap between how systems are imagined and how they actually work, undermining resilience.

How we can learn from the gap between “Work as Imagined” and “Work as Done”

In this episode of Root Cause we sit down with Adrian Hornsby (former AWS Principal Engineer, founder of Resilium Labs, and author of Why We Still Suck at Resilience) to get to the root cause of what’s quietly breaking inside engineering organizations as AI absorbs more of the thinking. We dig into the gap between how we imagine our systems work and how they actually work, why that gap is where all the real learning lives, and what happens to a team when the thinking itself gets delegated to something that sounds confident but doesn’t know which walls are load-bearing.

Below you’ll find the text version of this episode, for those who prefer reading.

Guest: Adrian Hornsby, resilience engineer, former AWS principal engineer, and founder of Resilium Labs

Adrian Hornsby spent over two decades in software engineering and operations: from researcher at Nokia working on distributed systems, to building messaging systems for millions of users, to nine years at AWS as a principal engineer on the fault injection team. About a year ago he left to found Resilium Labs and write a book, Why We Still Suck at Resilience. Its premise is uncomfortable: organizations have poured a decade into chaos engineering, game days, operational readiness reviews and incident retros, and the same incidents keep happening. In this conversation, he and Nune get to the root cause of the gap between how we think systems work and how they actually work, and why AI is widening it.

Why “root causes,” plural

Nune: When you learned the name of my show, we joked that it should have been “root causes,” plural. Can you unpack that?

Adrian: The short answer is that it’s comfortable to think there’s one reason something fails, but in practice it’s never like that. It’s the accumulation of small things happening at the same time that creates the conditions under which the system fails.

The long answer is about the nature of complex systems, and how systems fail through what’s called emergence: the interactions of components. It isn’t really the components themselves that fail, it’s their interaction. You can have two components working very well by every measure of the components themselves, but the interaction creates the problem. Once you start to think about systems like that, you can unpack an almost unlimited number of reasons why a system fails.

Root cause really comes from the old industrial era. You had, for example, an assembly line, and something would fail, and you’d say, okay, let’s go fix that; that was the root cause. It got carried over to software systems through Safety-I. If you’re interested in resilience engineering, you hear the terms Safety-I and Safety-II. For some reason people have liked having one root cause. I kept using the term, and it’s still in use today. So, sorry, I got carried away again.

Nune: No, it’s perfect. I personally picked “root cause” because I like to dig into things, to find the reason things happen and not just stay at the high level. You keep asking why, why, why, and hopefully you get to one or many root causes. But now I’m a bit stuck with the name, because whenever I choose a topic for an episode I have to formulate it in a negative way; you don’t usually go looking for the root cause of a good thing. Which is probably also not correct. We should also look into things that went well and try to unpack why.

Adrian: You’re touching on something so important to resilience engineering. The default when there’s an incident is to look at what went wrong. But if you think about systems, they go well most of the time. It’s only at the moment when emergence happens that the context starts to create failure. And that’s what we want to understand: what worked well in most of the scenarios, and then the context when things went wrong, so we can really understand what happened.

So saying “root cause” is actually a problem in incident analysis, because it biases engineers. It anchors them to look at what went wrong, when we should also ask, what went well? And then see how we can do more of that.

Nune: Even in our day-to-day life, every person is wired to think about the negative things that happened. But when you think about it, how amazing is it that the day goes by smoothly? How many millions of possibilities there are for things to go wrong when we leave the house, and yet they didn’t happen. We need to appreciate all the good stuff.

Work as imagined, work as done

Nune: There’s a term I think we’ll use a lot today, so for people who aren’t familiar with the terminology: work as imagined and work as done. Can you explain the gap between them, and maybe give a disastrous example if you have one?

Adrian: Work as imagined is the version of work that lives in documents: diagrams, runbooks, organization charts, slides. It’s really clean. It’s what we write when we’re caffeinated, during the day, all happy and smiling. It’s very logical most of the time, and it’s defensible to auditors and to leadership, because that’s what we design.

But work as done is what actually happens. It’s what you do at three in the morning when you’re on call and you skip a few steps in your runbook because you’ve learned those steps aren’t the right ones; they haven’t been updated since the last version was deployed six months ago. It’s the engineer who learns that a particular alert is more important than another. You start to learn how the system behaves in production. And all of that is often not in the design docs, not in the documentation, and certainly not in the architecture.

What’s really important to understand is that work as done is not a deviation from work as imagined; it’s a structural gap that happens naturally in every system. First you need an idea of what you’re going to build, and between the idea and what gets implemented there’s already a difference. Then you make shortcuts, you forget to implement something, or you implement something else. So naturally there’s a difference between what you imagined in your head and what gets done.

As an engineer you need to understand that, because we often think about our system from the imagined world, but we rarely look at it from the done. The only situation where you look at the done is when there’s an incident. That’s reality knocking at the door, saying, it’s not what you imagined.

So when you have an incident, you don’t necessarily want to understand what went wrong. You want to understand what the incident surfaced between what you imagined and what is done: what surprised you, and what changed in your mental model. That’s the only thing you can do in that situation: learn from it. And that’s why I often say the only answer to understanding that gap is learning. That’s the thesis of the book.

When seniority detaches you from reality

Nune: There are so many layers to this. When you start in IT and you’re not leading teams yet, you’re solo: you’re implementing, you deploy the thing, you test it. So you naturally know every step of what’s done. What always surprised me is the role of the software architect. I’ve seen organizations where the software architect doesn’t come from software development; they’re only an architect.

That’s surprising, because it means this person lives in that quite perfect world you described: not tired, having time to think about the system and its good or bad outcomes, and then architecting it. Maybe that’s a good thing, because they need that time to think. But on the other hand, how can you be a good architect without ever touching the code? Do you think you can be one?

Adrian: There’s a lot to unpack here. Let me go back to the first point about architects having a version of the imagined world. It’s not just architects. Often you have senior engineers too, where naturally, as you grow in seniority, your areas expand across the organization. So you spend less and less time on the day-to-day operation of the service. And because of that, your version of the imagined reality, your version of the software in your mind, is slowly detaching from work as done, because your work is spreading. I was a principal engineer at AWS, and I had a really big gap between my imagined version of the software and what was done on the ground, because I was dealing with a lot of different things.

So it’s not the role. It’s a structural feature of organizational growth and promotion. And it has a cost. That’s why I say that the more senior you get, when you start to realize you’re detached from reality, you need to switch from decision-maker to decision-enabler. You change your role from making the decisions about how things should be done in the real, done world of work, to helping people at the edge make better decisions through your knowledge and your experience.

And often this is where you see it not happening in organizations, because when you move away from the done, you lose grip on it, but you want to keep making decisions, keep being relevant. It’s a resistance to being detached from reality. So the long story short is, yes, those kinds of roles are detached from reality, and they make decisions that impact other people at 3am. That’s a big problem.

But it’s not just architects; it’s any role that is detached from reality and that keeps making decisions. The team will absorb the difference between what the decision-making team thinks it does and what actually gets done.

Nune: It does create that tension, but I think we also need to understand that’s what abstraction is for. For any human to think and make decisions about a system, we have to abstract it and create models of it. There’s no way around it. There’s also something Charity Majors introduced, the manager pendulum, where you need to come back to development every two or three years so you don’t become too detached. From my own background (I had a previous startup and now I’m building one) it happens naturally. First you develop it yourself, then you have to find a team around what you’ve built. So you go higher and higher in the abstraction. And then if you build another component or another department, you’re back in the trenches doing hands-on work, and then you get out of it again. So for me it’s hard to accept that this gap exists. You need to embrace the gap instead of trying to fix it, because the gap is not going anywhere. There must be people who deal with the high level and people who do the hands-on work. It’s a different mindset, a different style of working. It’s a different role.

Adrian: It’s a different role. I’m not sure that going back to development as a manager is the right thing. I see a lot of CTOs vibe coding now and talking about it, and I’m not sure that’s right. The easier thing to do is to recognize that you’re no longer the right person to make the decision, to delegate to the people at the edge, and to free them from the roadblocks so they can make those decisions. I love the idea of “you build it, you run it, you operate it, you fix it,” because then it’s exactly the people who actually run the system who make those decisions. It’s not a bad thing to be detached.

It’s just recognizing that it’s difficult, and that people cling to making decisions because it feels powerful, it feels like you’re relevant. But it’s a mindset. You can be just as relevant as a coach, helping others, and it’s even more rewarding once you realize your job has changed and you’re enabling others. That flip is very, very hard. Having been involved in a lot of promotions at Amazon, I think that’s the hardest part of going from senior engineer to principal: people flipping from being the expert in the room to being the enabler.

The hard part is letting go

Nune: That’s something I struggle with a lot. To be honest, it’s very hard for me to delegate, and it’s a completely different skill set. When you work technically and get good at it, this other role of enabling people has a lot less to do with technical abilities and a lot more to do with empathy: knowing when to delegate and when not to. A friend of mine has an article on how that’s more similar to raising a child than to software engineering, because you need to be strict when you need to be, but also say it in a way that doesn’t hurt the other person. It’s hard for me to grasp. And on the technical side, you shouldn’t vibe code into production, but if you’re about to introduce a new technology into your team, shouldn’t you at least know more about that technology? Or do you always just find a person who knows it and enable that person?

Adrian: That’s a good question. I’ll put it in two buckets. The first bucket: maybe the fact that you can’t let go of being involved in the day-to-day technical details is because you’re building and you’re not sure where you’re going. That’s what I see most: people have an idea of what they want to build, but it’s not very well defined.

The feeling you have is that by not being there making the decision, somebody is going to change the idea of what you think you’re building.

Nune: I’m not that worried about changing the idea; I hope I’m not that clingy. It’s more that I feel highly responsible that I haven’t thought everything through yet. So how can I ask another person to implement it when I don’t have all the answers, but the only way for me to get the answers is to actually build the thing?

Adrian: It’s the same thing. Somebody else can build the thing and work with you so that, together as a team, you’re building the right thing. For me that was the first bucket: when we’re not clear on the idea, we get more attached to the details, because we’re worried the idea won’t go the way we intuitively think in our head, even though it’s not clear. The second bucket is simply a trust problem. Can you trust the people you hire and work with to make the right decision? One of my bosses back in the day told me, always hire people smarter than you, so you don’t have to do their jobs.

Nune: That’s also been an AWS motto: always hire people who are better than the people already in the company. Raise the bar. I completely agree. When I look back, the easiest things to delegate have been things I was never good at myself. UI development, for example: I know how to do high-level stuff, but I’ve never become an expert in React. When I had a senior on my team it was easy to say, you make the decisions, you know better than me. But on the things I’ve tried myself, I’ve been over-controlling, and I’m sorry to all my teammates, because they’ve felt that.

Adrian: It’s a normal thing. I’ve been very controlling on the things I feel most comfortable with. I don’t think it’s about comfort; it’s about trusting that other people can help you.

Nune: So the solution is to know less.

Adrian: I always use a mental model when I coach engineers.

I explain it like building a snowman. Your idea is the body of the snowman. You start by rolling the big body, but you need the head, the branches, the carrot, all of that to make the idea better. You might have the central 80% of the body done by yourself, but all those additions are what make the snowman a snowman. Without them you just have a ball, and that doesn’t get you anywhere. I like to think of our ideas like that: the central body of a snowman, and then everybody else in the company, on your team, adds to it. The only thing you can do is enable that. It gives permission for others to make your idea better.

Practices for learning from the gap

Nune: Let’s assume we accept that this gap exists and that the gap is where the learning is. Can you give some practical tips for what it means for an organization to learn from the gap? It sounds good, but what does it mean on an average Tuesday morning, not as a philosophy? Because we say you can write better documentation, have a stricter process, more controls, but how do you not get into compliance theater? How do you actually learn from it?

Adrian: First, an incident is absolutely a mirror of that gap, because it happened. If you really spend time studying it, you’ll learn about the done. But if you think a bit upstream, in the day-to-day of a running system, without waiting for an incident (which is the truth happening to you), chaos engineering, load tests, and game days are three really good practices for understanding some of the gaps.

Chaos engineering gives you a very good understanding of how you think your system is going to degrade or fail. You start from a failure you plan to inject and you make a hypothesis: you explain beforehand how you think your system is going to behave. Then you actually inject the fault and see what happens in reality.

You study the observability and the impact of the experiment, and you have two versions to compare: your mental model pre-experiment and the post-experiment reality. That’s usually a structural gap focused on degradation and failure modes, a very technical gap.

Load testing gives you another angle on the gap, more focused on whether your system scales as you imagine. Before injecting load you have your load test model, and you try to anticipate what will happen, what the load characteristics of your system will be. You run the load test and then you see the result. That gives you a good idea of where your mental model about scaling is off. What you especially want from a load test is to find the cliff: the moment where the system just completely collapses. Any system has a maximum it can handle, and then it collapses, and you want a very good understanding of that collapse. Is it at 120% of your scaling model? Is it at 80%? Does it actually break much earlier than you anticipated? That’s a structural gap that load testing reveals.

A game day surfaces a gap that is more about human coordination. Say you simulate a disaster recovery scenario, and you don’t prepare the team; you just tell them the whole region is down, let’s implement the DR. Then you see what happens. Is the team able to coordinate? Do they know what to do? Do they know which service to fail over? Do they communicate well with each other? Do they know how to come back? All of this is human coordination. When you designed the exercise, you hoped the team would do well, and in reality they do things quite differently. One of the best games I do at the beginning of an engagement is to remove the team lead, the hero, from the team, and then see what happens. That’s a structural gap between the imagined world and the done, a different version of it.
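
Going back to the load-testing cliff described above, here is a minimal sketch in Python of how a team might compare the cliff they observe with the capacity they imagined. Everything in it is an assumption made for illustration: the LoadStep record, the find_cliff and cliff_vs_model helpers, the 5% error-rate threshold, and the sample numbers are hypothetical and do not come from the conversation or from any particular load-testing tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LoadStep:
    """One step of a stepped load test (hypothetical record shape)."""
    rps: float         # offered load during this step, requests per second
    error_rate: float  # fraction of failed requests, from 0.0 to 1.0


def find_cliff(steps: Sequence[LoadStep],
               max_error_rate: float = 0.05) -> Optional[LoadStep]:
    """Return the lowest-load step whose error rate exceeds the threshold.

    The threshold is an assumed definition of "collapse"; a real team
    would pick its own signal (errors, latency, saturation).
    """
    for step in sorted(steps, key=lambda s: s.rps):
        if step.error_rate > max_error_rate:
            return step
    return None  # no cliff found within the tested range


def cliff_vs_model(steps: Sequence[LoadStep],
                   modelled_capacity_rps: float,
                   max_error_rate: float = 0.05) -> Optional[float]:
    """Express the observed cliff as a percentage of the imagined capacity."""
    cliff = find_cliff(steps, max_error_rate)
    if cliff is None:
        return None
    return 100.0 * cliff.rps / modelled_capacity_rps


# Made-up results: the scaling model imagined 1000 rps of capacity.
steps = [
    LoadStep(rps=200, error_rate=0.001),
    LoadStep(rps=400, error_rate=0.002),
    LoadStep(rps=600, error_rate=0.004),
    LoadStep(rps=800, error_rate=0.31),
    LoadStep(rps=1000, error_rate=0.92),
]
print(cliff_vs_model(steps, modelled_capacity_rps=1000))  # 80.0
```

With these made-up numbers the system collapses at 80% of what the scaling model promised, which is the kind of difference between imagined and done that the test is meant to surface.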

And then you have operational readiness reviews, which also expose a structural gap between how the team thinks the system is going to be operated in production and how it actually is operated.

Nune: So chaos engineering, load testing, game days: those are the tools an organization can use to bridge the gap.

Adrian: Practices. I’d prefer to call them practices.

Nune: And I think one important event is also the incident itself. When somebody wakes up at 3am, they have to be the hero who fixes the thing no matter what, whether they followed the script or not. In my experience, the next day, or the day after, is very important: that person can say which of the things they followed worked, which didn’t work for them, and the team can optimize the process using their first-hand experience. If you agree with that.

Adrian: Correct. We’ve all had the experience where, during an incident, you try to follow the runbook and you notice on step two it’s already wrong. So you use your intuition, or what you’ve learned from another team, and you go fix it. The next day is exactly that. You’re not going back to root cause; you don’t really care what the problem was. What you want to understand is why you had to skip the runbook, what was different, what the context and the environment were, and then capture that, so you actually have more information about the real version of your system. But it’s not intuitive for people to do that. They just want to go and fix the root cause.

How AI widens the gap

Nune: We’ve established that the gap has always been there. We touched a bit on vibe coding, but I think we can generally say AI has widened it. Previously the gap was between departments, or between different roles inside the team. Now I feel we also have a gap between the person implementing and the AI that’s actually writing the code, as more and more people use AI for coding itself. Do you see that?

And second, when a CTO starts vibe coding, plus the gazillion articles about how you can build something in five minutes, it creates a narrative that engineering is easy. Why does it take a sprint or two? Any advice for people actually building things on how to explain that it’s not easy, how to push back in a way that doesn’t get them into trouble, but also explains that it’s still hard?

Adrian: It’s a good question. The biggest problem with AI is that it hides the gap. Now, all of a sudden, you’re asking AI to make decisions and to create things, and it does all of that for you. But the day after, or after the incident, you can’t recall the mental model of the AI: what it did and why. With a human, you could trace back the thinking to some extent. You had a trace of the mental model of the person doing it. So the gap was visible when you looked. Now it’s not.

And even if AI is really good (it writes better code than me, and I use it, no problem), and even if it’s right 99% of the time, the problem is the 1%. At 3am, your director is not going to blame AI for something that breaks. He’s going to go to a human and tell the human to fix the problem. If AI can’t fix it, it always falls back to a human at the end of the day. So if AI wrote 99% of the system and at 3am you’re thrown under the bus to go fix something that was never written by you, you’re going to spend quite a bit of time acquiring the knowledge to debug, trace, identify, and potentially fix the problem.

That’s the risk. It’s not that AI isn’t good; it’s what happens to the human when it breaks. It’s so tempting to deploy stuff that AI wrote, and it works most of the time, which gives you a signal that it’s good. But then comes the 1% where emergence happens, or different components interact in a way the AI didn’t plan for, and a human has to go and debug it.

Nune: I notice it a lot myself.

I’ve always been proud of how fast I can answer questions about the system, exactly because of the delegation problem, because I did everything myself. Now that I’m delegating more of the development to AI, I feel I’m not as fast. Most of the time the answer is, give me a couple of minutes, I’ll chat with my AI, and then I can answer. Which is peinlich (embarrassing, as they say in German), because you’re supposed to know how your system works on different layers. From one side it’s a bottleneck: you’re limited by how much you can keep in your head and reason about. On the other hand, as a lot of people are saying now, that’s the load-bearing wall, because it keeps the system working.

Adrian: I don’t have all the answers on that. We want to play devil’s advocate and stay in control, and that’s part of feeling uncomfortable when you don’t know why something went wrong. But what really worries me isn’t understanding what’s written. It’s that when something goes wrong, it goes to a human, and at that moment, under pressure, you’ll be asked about a system when you have no idea how it was built. That worries me. I don’t know how we’re going to solve it, because it’s a human problem. Humans are going to be blind; they are today and they will be in the future, regardless of whether you have an AI agent helping you. Something is going to give, something is going to break, and it’s not new. Lisanne Bainbridge talked about the ironies of automation in 1983, and she said that even back then, the more complex the automation, the more skillful your engineers have to be to fix the automation when it breaks. AI is the next level of this. It’s a 40-year-old theory.

Don’t delegate the thinking

Nune: That brings me to the thought that AI performs better in the hands of more experienced people. So what do juniors do?

People who are just starting in IT, how can they learn? Do they force themselves not to use AI? It’s difficult for them: on one side, people tell them to learn the fundamentals; on the other, there are all these tools that also take time to learn. You don’t just write a prompt and that’s it; you also need to find your way of working with all the AI tools. Do you have advice for people who are just starting?

Adrian: It’s a good question again. The cynical part of me feels like it’s old-man-me talking about the good old days. I’d say younger engineers will figure things out. They’ll find a way. There are so many unknowns at the moment. They’ll figure out how to work with these technologies far better than we do. When you’re AI-native, you’ll probably develop mental models that are very different from ours, and that’s okay.

The only thing I’d say is: be careful of efficiency. Efficiency drives speed, and trying not to spend time learning can bite back. Understanding how to debug a system without AI is going to be an important skill. Imagine it’s 3am and Claude is not available and you can’t ask it what to do. What are you going to do? Are you going to refit the whole context of your application into another model? Probably not.

Flying blind is not something new. When you don’t have monitoring, or you don’t have Slack, or your team breaks during an outage, you have to find ways around it. If you completely rely on AI for everything, that becomes your single point of failure, and it’s a big one, in the case of an outage where you don’t have AI to help you, or AI fails and tells you, sorry, I don’t know. Then what? You tell your boss, I’m sorry, I don’t know, we’re going to stay down. Claude doesn’t know.

Nune: Actually, when I started working on my current startup, OpsWorker, the original intention was that a lot of development would use AI, there would be a lot of code coming into production, and you’d have even more observability tools and dashboards, so the load on SRE would grow. So why not also use AI to gather all that information, present it to the person, and help them deal with the complexity? But I’m very worried that people will use tools to delegate thinking as well. Is there a version of the tool where this can be prevented? Or is that not even a function of a tool, but rather training people, asking them not to delegate thinking?

Adrian: There’s a version of tools that’s related to efficiency, and there’s a version of tools that can push back and enforce learning mechanisms, like struggle. I just published the Resilience Companion as open source today. It’s a companion I built to demonstrate how AI can actually help you learn. It’s so easy to delegate everything to AI, but then you don’t learn anything. And again, going back to when something breaks: if you haven’t learned anything, there’s very little you’re going to be useful for. So you want to keep learning, which means using AI in a way that challenges you, that puts you in a mode where you struggle a little and have to go find the answer yourself. That helps you build mental models about the system, because you have to think about it.

That’s exactly how I built the companion. It asks you what you think the system does, and it wants detail: it’ll push back when you try to stay too high-level. Then it’ll go and verify when you’re really uncertain, and that gives you a very good indication of where you need to spend time, because your mental model of how part of the system works and how it’s actually coded are so different. It’s possible. But it’s not going to become a standard in the industry, because no one wants to spend more time doing things.

The industry, the capitalists, love efficiency. They want to go as fast as possible. It’s so tempting and so easy. You offload cognitive struggle, and the brain loves that. If you can make things easier, the brain says, hell yeah, jump in the wagon. You need people who are aware that they need to struggle to learn, and you need to tell them why it’s important. But I don’t see that coming naturally.

Nune: To be honest, this is the first year in the industry that I’m feeling old. Every time I think I’ve delegated too much to AI and I need to slow down, I read an article or talk to a peer who says I need to delegate even more, because otherwise I’ll stay behind and the world will have gone ahead while I’m still here. It’s a permanent FOMO. I thought I was in FOMO, and then after 2021, 2022, when all this started, it became ten times the FOMO.

Learning through crisis

Nune: So my question is: does it even make sense to advise somebody to slow down, or, as you said, is it not going to happen? And how do companies that now have this pressure (you must bring AI into the organization) do that in a way that doesn’t bite them later for being too fast into unknown territory?

Adrian: The answer is crisis. They’ll be faced with the reality that they just don’t understand what they’ve deployed in production. And at that moment, you don’t have a choice. You face reality and look at what you’ve done and how your workflows have been, and if everything was over-delegated to AI, you can make one of two decisions: continue and keep getting into crisis situations, or slow down. I already see customers in what I call undeclared crisis: they have similar incidents happening, AI is exacerbating that, and they’ve come to the realization that they need to change, that something needs to give. Often it’s spending time learning, and they’re making those decisions.

The thing is, it’s learning through crisis. It rarely comes without the crisis. Crisis is a very good incentive to change, because you’re in that situation. But that’s not much different from anything else we do in life. We often learn through failure or crisis, like touching the hot oven.

Nune: So you mean they’ll adopt AI so much that it will bring them to a new crisis and they’ll learn from it? Or that they need to learn already from the incident?

Adrian: They won’t have a choice. The more they adopt AI, the more they’ll be in a crisis they can’t explain, can’t debug, can’t recover from. At some point, the outages are going to be so much longer, because they can’t understand the system. It wasn’t written by them; it was delegated to AI. Thinking was delegated to AI. The documentation is completely outdated. A human will have to go and understand it at that moment, under pressure, at 3am, and that takes time. I’m already starting to see this with customers.

When you write the code, you intuitively know from the error what happened. You receive the alert and you go, damn it. You intuitively think about what might have gone wrong, and you start investigating there. That’s intuition, and it’s so important in incident response. If AI writes everything, you don’t have intuition anymore. You receive an alert and you go, all right, let’s build the mental model of how the system works, let’s dig into the logs and the code and try to understand what happened. And it’s going to take a lot longer. Your customers are going to scream at you. They’ll tolerate it a few times, and the third or fourth time they’re going to tell you, either you change or we change provider. At that moment you’ll have to make some choices.
Either use AI but figure out a way to spend time understanding what AI is writing, or there’ll be new disciplines: let AI write, but do more game days to understand how your system is failing, and learn how to debug a system you never wrote. There’s a lot of that already today. It’s rare that you’ve written everything in a system you debug at 3am, but you have intuition because you built part of it. So maybe the only viable solution is to do a lot more chaos engineering, a lot more load tests and game days, so you can learn from that gap, because now the gap is even more present. Is that going to happen? I don’t know. Maybe a new discipline shows up.

Nune: I’m sure there will be remixes of previous disciplines, now adapted for AI. And coming back to the open-source project you mentioned that you published today: if the tools themselves can help you build that intuition, that would be interesting, to extract the exact amount of information needed to build intuition without knowing every single detail. If we try to root-cause it one last time and summarize: we talked about the gap between the imagined and the done, how AI widens it, how nobody wants to fall behind, how everyone is in FOMO. If you had to find the root causes of this gap, and of why engineering organizations are losing their cognitive capacity, is there something we missed? Or is the answer really, go full speed and learn from your mistakes?

Adrian: I wouldn’t say learn full speed. I’ll actually point people to your Substack. The article I read about your journey to delegating to AI was wonderful, because you talked emotionally about what you were giving up. People need to read things like this. We need to stop telling only the stories that give everybody FOMO, the 10x that just goes full speed and works. I want to see stuff in production that has been completely written by AI.
I want to see an enterprise system with millions of customers, managed and run by AI without a problem, and without having humans to delegate to at 3am. Then I’ll be convinced. At the moment, I’m not. It’s easy to write on Substack or LinkedIn that you’ve written an amazing tool with AI. Great, that’s awesome. But putting that into production, operationally production-ready, is a completely different beast.

Code is the small part

Adrian: It’s the same when you build a prototype yourself. The code works, awesome. But now a million customers are going to start using that code, and it’s a completely different story, because you have to start thinking about everything around it: the monitoring, the deployment, the operating, the alarms, the on-call rotation, the documentation, the onboarding. The code is such a small part. So maybe all these AI tools are going to start tackling everything around it. But as long as you only do the code and tell me it’s ready to go to production, I’m going to say, show me. Show me. Because I don’t yet believe it. I’ve spent a lot of time running systems in production, and the code is...

Nune: I always think back to (I think this is the Unix philosophy) make it work, make it right, make it fast. The “make it work” part is something AI can help you with a lot. But then making it right (architecturally clean, expandable, maintainable, observable) still requires human involvement, and not just one human but a team. And making it fast: all those game days and load tests, making sure it works smoothly for the number of users you expect. That’s something we still have to do.

Adrian: And that your customer found useful. You can build as much code as you want with AI, but if your product isn’t what the customer wanted, no amount of AI is going to help. It’s about talking to your customer.
And often customers express something they want, but they actually want something slightly different; they just didn’t have the words to express it. So a lot of the work as a software organization is trying to understand what the customer wants first, then providing answers, then figuring out which answer is better, then iterating and making compromises, because often you can’t serve everything. All of that is human. It’s a relationship with your customers, and it’s prompting your AI correctly. Your AI can be very, very correct, but wrongly correct. Confidently correct, or rather, confidently wrong is a much better way to say it.

Nune: Yeah, very confidently wrong.

What to read

Nune: Do you have any practical recommendations? You recommended my blog post, thank you for that. And there’s your book. Any other books or resources on how to implement technologies within an organization the right way, be mindful of these gaps, and resilience in general?

Adrian: The whole resilience engineering field is interesting; you’ll learn a lot. I think cognitive science is super important; I’m fascinated with psychology, because it has a lot of impact. I wouldn’t necessarily recommend one book. I think it’s better to go explore other disciplines, other industries, other sciences, because you find a lot of good in there, especially around human behavior. It’s so fascinating, because it’s at the center of everything we do. Even though AI is going to be pushing us to write code fast, at the end we’re still in the business of human relations with our customers. No amount of AI is going to fix that. Actually, it can make it worse. I hear a lot of comments from customers who can’t get a human contact when there’s an outage: it’s a Google form, or it’s an AI chat, and it frustrates them insanely when there’s a problem. Humans are going to have to find their place there as well.
So I’d say the Bible is How Complex Systems Fail by Richard Cook: amazing stuff. John Allspaw wrote a lot of cool stuff around incident response. Hollnagel writes about Safety-I and Safety-II. Read a bit of all of that, and be curious about other areas. I think that’s a good thing.

On fear and staying relevant

Nune: We have this thing (which is also a bit of a problem for building podcast episodes, because now I have to produce them strictly in the order I create them) where every guest leaves a question for the next guest. The previous guest, Marc Babin, left a question that I think should be difficult for you, because you already share a lot of what you’re learning. But maybe you can answer: is there something you’ve learned this week that you haven’t shared yet?

Adrian: I’m still scared a lot, about opening up. I open-sourced the companion today, and it’s very scary to expose yourself. And it was written with Claude, whereas all my other repos were written entirely by me without AI. So I’m jumping into open-sourcing something I worked on with Claude, and I have to say it took me quite some time to feel comfortable with that. So I’d say I learned that I’m still scared about exposing things to the world. After 20 years.

Nune: Even after 20-something years in IT, you can still be scared of posting something online.

Adrian: And it’s tough, because you expose yourself to the world. Being on a podcast, sharing ideas, writing a book: it’s scary. Kudos to you, starting your own podcast. That’s something as well.

Nune: I’m horribly scared as well. One of the things is that the more you’re able to analyze any question (and I wrote about this lately), the more you’re able to see both the positive and the negative sides. I can critique myself endlessly.
So whenever you put something out there, you’re cutting that critique out and leaving only the positive thinking: my opinion matters, my open source matters, my product matters. And subconsciously, I think, we’re afraid of all the critique that can come, which we already know is coming, because there’s no perfect system.

Adrian: So why did you do it then?

Nune: That’s a good question. As I mentioned with Marc, there should be content out there that isn’t “10x improvement in four days, do this and that.” I wasn’t seeing that content, and I thought, as I think any person who writes a book says: I searched for this, didn’t find it, so I wrote it. That’s one reason. And another reason is, selfishly, to get better at expressing myself, better at asking and answering questions in a way people understand, so I can bring out more awesome ideas and have people understand them.

Adrian: On your first point, this is why I liked your post. It was just so different from the rest. It was honest. You shared emotion, your struggles, and that works. It spoke to me, because you wrote about delegation, about slowly giving it away at each step of the way. I haven’t managed to go all the way yet: the last steps in your post, with open Claude and doing the thing there. I’m old school, I still run locally, very controlled, human in the loop. And it made me worried: if I don’t evolve the same way you have, am I still going to be relevant in a few years? That was a moment of reflection I had.

Nune: I hope so. I think the industry is going to hit the wall and realize you can’t keep delegating more and more of the thinking. And people who have built something themselves will always stay relevant, because of how much intuition they have. Even if it’s not a system you built, you’ve seen ten other systems similar to it, so you do have the intuition. So I’m hoping you will be.
Adrian: Maybe the relevance is more that I don’t understand the development workflows anymore: multiple agents, the whole mental model of the relationship between the human and the act of shipping product. That’s what your post made me question: my relationship with building product, and whether I’ll be able to advise my customers in five years if I don’t adopt that, because the problems are going to be different. If I’m not there, well, younger engineers are already there. It’s natural for them, and it isn’t for me; I’ve resisted to some extent. So that’s what I meant: am I going to be able to understand that?

Nune: Only time will tell. I also read the joke that we’re the generation that didn’t want to accept cookies, but we accept giving our code and ideas to AI. So that’s kind of contradictory. I don’t think anybody knows if tomorrow everybody will say this was all a mistake, or if we’ll continue and build more and more agentic systems: supporting agentic systems, fixing other agentic systems. I’m excited to see.

Adrian: It’s extremely tiring as well, because it’s a completely new space, and it’s exhausting.

Nune: Of course. Finally we thought we’d understand everything. We’re in our 30s and 40s, and we can just enjoy the years.

Adrian: The biggest revolution in software happened at 46, when I thought, after 25 years... Actually, I’ve never thought about it like this, but I think you’ve got it so right. That’s why it’s exhausting. Finally I thought I had it right, and then boom, in a couple of months I have to relearn. You’ve identified the problem. Thank you.

Nune: Well, it keeps us on our toes. The root cause.

Adrian: That’s the root cause.

Nune: Do you have a question for our next guest?

Adrian: What surprised you the most in the discipline you were most comfortable with?
Nune: That’s generic enough that anyone who’s good at something can answer what surprised them in the thing they’re good at. Thanks for that.

Adrian's book, "Why We Still Suck at Resilience", is available at a discount at https://leanpub.com/whywestillsuckatresilience/c/rootcausebynune