Presentation: AI Works, Pull Requests Don’t: How AI Is Breaking the SDLC and What To Do About It


Transcript #

Michael Webster: My name is Michael. I'm a Principal Engineer at CircleCI. I'm going to talk to you today about how we're using AI at CircleCI as a company, and also maybe give you a sneak peek of what we're seeing coming in the rest of software delivery right now as more and more folks adopt AI agents. CircleCI is a continuous integration, continuous delivery platform. We sit across thousands of organizations. We run millions of jobs on a monthly basis. Basically, if you want to ship code at the speed that AI can write it, I think we're probably the best software delivery platform out there. That also gives us some really good perspective in terms of having to stay on top of trends and also identifying what's happening in the industry.

Epochs of AI at CircleCI #

I want to go through a couple of epochs of AI at CircleCI. Like a lot of folks, when ChatGPT came out, we jumped on that, and we started back in what I call the Control-C, Control-V era, where you're just juggling context yourself. You're copying and pasting things between various tools. Then we get the Cursor IDEs and all of the VS Code copilots, which are now capable of agentic, long-range tasks. More recently, and when I say recently, I mean products that shipped in the last six months, we're seeing this rise of headless agents, where we have these agents that we just decide to strap inside of a computer. Once you can strap the thing inside of a computer, you can run it on schedules, you can trigger it on webhooks. I'd say assessing where we are generally in the industry, and definitely at CircleCI, we're in this middle ground, we're playing around with these headless agents.

To do a quick overview, based on where we're at, where we're still primarily working inside of these IDEs, maybe folks are using Claude Code or Codex, what we're finding is at this point, mechanical transformations work pretty well. This was my favorite Slack message or Zoom call ever, when one of our internal AI skeptics was like, yes, once I got the acceptance tests where I wanted them, Claude is just able to figure it out and follow the pattern, and so now I just have it follow the pattern. For us at this point, presuming that you have all of the rules and the criteria in place, I think mechanical transformations actually work pretty well. The models, especially when you're running on a laptop with full project context, are able to produce code that suits our most senior engineers' standards, largely because those folks have invested in making sure their repositories are structured in a way that is easy to follow.

We've also had some really good success with non-developer contributions. We had a very long-standing request to add dark mode to our UI, and we could never really get it prioritized. If you're not familiar with doing something like dark mode, you basically have to go touch every single component in your UI and have it change how it displays based on the user's settings. It's the classic thing that no one would actually ever prioritize, but everyone would complain about, why can't we have dark mode? What was really cool about this project is our design team led this. They had a lot of domain expertise on how they wanted the UI to look, how they wanted dark mode to behave. They were able to go and create all of the stylings, and then let the agents largely do most of the mechanical transformation work. They had to get some assists with engineering. There were some components in there that were super clunky, but all in all, we've seen that this pairing of domain expertise plus AI is a really powerful organizational attribute because it allows more people to contribute who normally wouldn't have.

I just gave some case studies on where it works pretty well. Not everything is sunshine and roses, though. Results are still highly variable. Justin from DX talked about this: between companies, there's a lot of variation. I found that even within our organization, there's variation between teams. If you've got really good patterns to work from, you can get these agents to work really well. You have to write the rules. Everything is a learning curve. That's one of the things that we've invested in, giving people access to the tools and access to training to use those tools, but getting the patterns in place is still a much more difficult task than it would seem.

What Isn't Working Well? #

Now let's talk about what really isn't working well. Has anybody talked about a Jira ticket in the morning and then shown up in the afternoon with a PR that has about this many lines of changed code? I'm not even specifically talking about AI slop here. This is referring to pull requests that are very large. If you have a team that has well-structured rules that can produce code, 1,500 lines of code might be appropriate in some cases if you're bootstrapping a brand-new feature, a brand-new landing page, but it is way too much for a human to review. Most of the rules of thumb and studies that I've used historically in my career say 500 lines is about the max that a human can review in an hour. Somebody can produce 1,500 lines of code in probably 10 minutes with Claude. If you're lucky and they massage it, maybe in 30 minutes, so they're producing something like six times what you can actually review as a human. Even when the code is good, the volume that is produced is way too much.

That's not the only problem with PRs. Size is definitely a problem, but PRs are inefficient generally. When you look across some research that some companies have done around median PR review time, you see it can range between 3 hours and 14 hours. The median PR merge time is 14 hours. You can get it down to 3 hours if you're a senior engineer who won't wait for someone to ship the PR and you just merge it yourself, and that's not really a great spot to be in. We want people to have code reviews, but there are a lot of things about the mechanics of PRs that are quite complicated. They tend to happen asynchronously, and they're usually required no matter the size of the change. People have talked about how 5 lines get you 50 comments, 10 lines get you 5, 500 lines get you LGTM. This is a very real effect. It is something that we're trying to improve on with our teams, but we still struggle to get appropriately sized PRs out of agents. At this point, this is a far-from-solved problem for us.

Where we have had a little bit of success is in applying our review bot to help some teams who are working in unfamiliar codebases. Consider a hypothetical Go developer who needs to go work in a React codebase. Organizationally speaking, the reason a Go developer is working inside of the React codebase most of the time is because you couldn't get the people who would normally own the React codebase to do the work. They're doing something to push their objectives forward, probably because they couldn't necessarily find the UI expert. For those folks in those situations, code review is particularly painful. If you're having to wait for someone to review code that isn't as critical to what they're trying to accomplish in their sprints, going back and forth on basic style and design guide issues is a major problem.

That's where we have seen success with our internal review bot. If you're able to encode your rules and apply those in a review in a deterministic fashion, so whatever rules you have, for loop, apply them, that's really useful for someone who is unfamiliar with the codebase. Again, the problem we run into is more just the mechanics of the PR. It takes you out of flow. You probably could have gotten the feedback even earlier. We would really like this to behave more like a linter. For now, we're seeing some success. We're catching some bugs, but it's definitely one of the areas that we are, I don't want to say struggling with, but investing in as an organization.

Looking Ahead, Things Get Interesting #

Looking ahead, though, so far we've been talking largely about developers working solo inside of IDEs and CLIs. Like I said, one of the nice things about working at CircleCI is we get a sneak peek into what's happening in the industry. Because we run so many builds and we have such a wide variety of customers and languages, it gives us an opportunity to spot some trends. The trend with headless agents is what's keeping me up at night these days. To start with, let's take a look at some activity on GitHub. In Google BigQuery there's an archive of all public activity on GitHub, called the GitHub Archive, that you can query with SQL. It's incredibly cool and interesting to explore.

As soon as these headless agents came out, run an agent in Actions, run an agent in CircleCI, I wanted to see if people were actually using them. If you look at the trend, you see a really big spike in activity right around the time most of these agents launched. For this analysis, I'm looking at well-known agents. They have well-defined GitHub usernames. I can definitively say it's the agents. You can see the trend here. Very quickly after they launched, we're up to hundreds of thousands of activities in GitHub a week. If you've ever looked at GitHub's webhooks and the events that they support, there's a lot of them, so I figured let's drill down even deeper. That presents a really interesting picture. When you look at when these agents launched, pretty much all of them, their primary use case was pull request review or issue triage. It was rewriting issues, finding duplicates, reviewing PRs. You see that in the data. The big spikes that you see are in reviews and issue comments, but around May of this year, you started seeing a spike of push events.

Now we're at the point where, on a weekly basis, you have AIs that are pushing just as much code as they're reviewing. I haven't rerun this analysis lately, so it's only increased since then. I just don't have the charts for it. This is really concerning because now we've really decoupled this thing. It's not just a friendly bot that's trying to help you with reviews. It's pushing code into your repos. So far we've looked at the public timeline data. Maybe that's how people are experimenting with these tools, but we see the same picture when we look internally within CircleCI. This is our usage level. It lags GitHub. We're one CI provider. We don't have every open-source project on us, but the key thing here is people paid for these builds. This is the same kind of activity, attributable to agents, that just came out of nowhere, and people actually set up a pipeline to process those changes. This isn't just someone updating a README file. These are real customers applying this to real-world software projects.

We've achieved the headless agents. They're all working autonomously. Wonderful. Of course, that's not what happens in reality. If you look at the most recent DORA report, which focused on the state of AI-assisted software development, what they found was, yes, these agents are increasing velocity. Totally makes sense. We're seeing an increase in velocity in our customers. You see it across GitHub overall. They also found an increase in instability, so higher defect rates, higher change fail rates. Again, this is all variable based on the organization. The point is, while we have a bunch of this stuff that's pushing code, it doesn't always work very well, and in some cases it makes our products even more unstable.

This is another study that I found that looked at some open-source projects using AI IDEs. What was unique about this study was that it focused on long-term impact. There had been studies that asked things like, do people merge the PRs from these AI bots, but this was looking at what happens over a several-month period across several thousand open-source projects. The big takeaway is you got a month of increased velocity from adopting Cursor. That was the specific tool they studied. After that month, it went away. Not like you flatlined or you didn't keep growing. You went back to baseline. What they attributed this to was this quote, which is terrifying, "We observe persistent technical debt accumulation in the changes from the AI assistants." I think this is probably something that folks intuitively think or understand is happening, but this is a study that was showing this with a causal relationship to a reduction in velocity. Again, the net effect here was that these projects saw a meaningful but very temporary bump in velocity. Then they basically went back to where they were before.

The output isn't great. The benefits are unrealized. Unfortunately, it gets even worse, because at some point, even if we fix those issues, we have some basic math that we have to deal with. This is a very simplified idea of queuing theory. How do systems that have work arriving and being processed behave? If you've ever been at a very busy store checkout that only has one person on the register, you probably get this. If you've got work coming in faster than you can process it, you get delays. One of the nice things about queuing theory is because we have this math, we can do some simulations. We don't actually have to wait to see what happens when we have hundreds of thousands of PRs being opened a week. We can simulate this.

In this simulation, we assume an organization that is able to process code changes and deliver them to their customers twice as fast as they're able to write them. Then we crank up the scenario under various conditions, depending on how much faster AI makes you. If AI gets you up to about a 75% increase in throughput, which, to be fair, would be a very large increase but not an unreasonable one, the delays that you would see waiting for someone to look at your change basically go to infinity. What we're seeing is these tools are being more and more adopted. The usage is only growing. The rates of input are probably even more than 75% faster, but the delivery rates are nowhere near that speed. This plays out some scenarios, and basically, if you're not able to speed up your delivery comparably to the rate at which AI can write the code, you're really not going to see any benefit. It's all going to get washed out by all of the delays in your delivery process.
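The talk does not say which queuing model the simulation used. As a rough sketch only, assume the simplest textbook model, a single-server M/M/1 queue, where the mean wait before service is ρ / (μ − λ) with utilization ρ = λ / μ. The names `BASE_ARRIVAL`, `SERVICE_RATE`, and `wait_after_speedup` are illustrative and not from the talk.

```python
import math

# Assumed baseline from the talk: delivery is twice as fast as writing.
BASE_ARRIVAL = 1.0   # changes written per hour (hypothetical unit)
SERVICE_RATE = 2.0   # changes delivered per hour


def mean_queue_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a change waits before service in an M/M/1 queue."""
    if arrival_rate >= service_rate:
        # Work arrives at least as fast as it is processed:
        # the queue never drains and the wait is unbounded.
        return math.inf
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)


def wait_after_speedup(speedup: float) -> float:
    """Mean wait when AI raises the rate of written changes by `speedup`."""
    return mean_queue_wait(BASE_ARRIVAL * (1 + speedup), SERVICE_RATE)


if __name__ == "__main__":
    for speedup in (0.0, 0.5, 0.75, 0.95, 1.0):
        print(f"{speedup:>4.0%} faster writing -> wait {wait_after_speedup(speedup):.2f} h")
```

Under these assumptions the wait is already seven times the baseline at a 75% increase and unbounded once writing catches up with delivery. Where exactly the curve turns vertical depends on the model and its variability assumptions, which may differ from the simulation shown in the talk.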

The reality is, for most organizations, even if you did have AI going as fast as you wanted it to, given the objectives that you're trying to achieve as an organization, you probably couldn't go faster, even if you wanted to. Right now, everybody wants to, and they're putting a lot of money behind it. This is not a great situation. There are a lot of things that we can try to do about this. We can try to optimize our pipelines. We do this internally: rewriting slow scripts, optimizing how you parallelize your tests, being really efficient about code reviews. I think a lot of those end up being band-aids. To me, the higher leverage point that you have is focusing on validating the outputs of the agents. This is a very simplified view of an agentic loop. We're going to give it some tools. We're going to give it a task. Then while the task isn't complete, we're going to keep running.

I think a lot of times in the industry, we focus a lot on the task piece, and we talk a lot about the tools piece, both of which are important, but I think we undervalue the check piece. Because if you have the ability to validate the AI, you can let it run as fast as possible. Not only that, but when it fails, you now have a dataset that you can go and train the AI to be better as it occurs. I think a lot of our time really needs to be spent on this check and validate phase, not on the particular task and tool definitions. We've seen this happen a lot with AI, where there's a particular prompting strategy that becomes dominant. I'm glad people are doing that research, but very quickly what happens is those prompting strategies get trained into the model. They just always use chain of thought. You don't have to ask them to do that.
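A minimal sketch of that loop with the check piece made explicit might look like the following. The slide from the talk is not reproduced in the transcript, so everything here is an assumption: `propose` stands in for the model plus its tools, `check` stands in for whatever validation you trust (tests, linters), and `LoopResult` is a hypothetical container for the failure dataset mentioned above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LoopResult:
    succeeded: bool
    attempts: int
    # Every rejected attempt is kept, so failures become a dataset.
    failures: list[str] = field(default_factory=list)


def run_agent_loop(
    task: str,
    propose: Callable[[str, list[str]], Any],
    check: Callable[[Any], tuple[bool, str]],
    max_attempts: int = 5,
) -> LoopResult:
    """Keep asking for a candidate until the check accepts it."""
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        # The agent sees the task plus everything the check rejected so far.
        candidate = propose(task, failures)
        ok, detail = check(candidate)
        if ok:
            return LoopResult(True, attempt, failures)
        failures.append(detail)
    return LoopResult(False, max_attempts, failures)
```

The design choice worth noting is that the loop's exit condition is owned by `check`, not by the agent's own claim that it is done.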

I think, in terms of our investment priority, focusing more and more on the validation is really the only way we can get the speedups in delivery to make it possible. Because once you have a good validation, once you have a test set that says, if these tests pass, this code is good to go to production, you've basically eliminated all the barriers and the bottlenecks. You still have to make that run really fast, but you do need that suite to begin with.

Let's talk about some of the usual approaches that we would take here. Unit testing is a really good example of this. The typical way you run tests is you run the full suite. Every commit, you run the full suite. This is fine up to a pretty high scale. We do this internally on very large monoliths. It works really well, but it starts to buckle under the pressure of 10 times more commits than you're seeing with humans. What if there was a different way? We're implementing this feature right now that we've used on one of our large UI monoliths. It's an old idea called test impact analysis. The idea here is your coverage can guide you in terms of speeding up your testing. If you think of your coverage as a dependency graph between your source code and your tests, you can use that dependency graph to prune the space of tests that you have to run. Here's a picture of what that looks like.

Here we've got a frontend application. You've got some frontend components for a user and a todo. You have some APIs to fetch that information. Based on the names of these files here, we can make some inferences about their relationships. Let's imagine that we want to change the user code. We can immediately eliminate the todo-related tests, as long as there's no dependency between them, and if there is a dependency, our coverage tracing will catch it. In some cases, we may even be able to run a subset of a subset of the user component tests. Consider the case where all we do is change the API for fetching the users. In that case, we probably need to test the component because it's going to have a dependency on the API to fetch the data, but if we just change the component and there's no dependency on the fetch API itself, then we've cut down our test run even further.
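The selection step described here can be sketched in a few lines. This is not CircleCI's implementation; the file names and the `COVERAGE` map are made up to mirror the user/todo example, and in practice the map would come from per-test coverage tracing.

```python
# Hypothetical coverage map: each test file -> source files it exercised.
COVERAGE = {
    "UserCard.test.tsx": {"UserCard.tsx", "fetchUsers.ts"},
    "fetchUsers.test.ts": {"fetchUsers.ts"},
    "TodoList.test.tsx": {"TodoList.tsx", "fetchTodos.ts"},
    "fetchTodos.test.ts": {"fetchTodos.ts"},
}


def select_tests(changed_files, coverage):
    """Return the tests whose coverage touches any changed file."""
    changed = set(changed_files)
    known = set().union(*coverage.values()) | set(coverage)
    if changed - known:
        # A file we have no coverage data for changed (config, new file):
        # fall back to the full suite rather than guess.
        return sorted(coverage)
    return sorted(
        test
        for test, sources in coverage.items()
        if test in changed or sources & changed
    )


if __name__ == "__main__":
    print(select_tests(["fetchUsers.ts"], COVERAGE))  # API change
    print(select_tests(["UserCard.tsx"], COVERAGE))   # component-only change
```

Changing the fetch API selects both the API test and the component test that depends on it, while changing only the component selects just the component test. Falling back to the full suite for unknown files is one conservative choice; it trades some speed for not silently skipping a relevant test.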

This is the basic concept of test impact analysis. It felt a little too simple to work, but then we tried it out on one of our large frontend monoliths. This is the main monolith for CircleCI's UI application. We were running 10 parallel instances of the test jobs. It was taking about four minutes. It wasn't an unbearable length of time, but we thought we could do better. When we applied this test impact analysis approach to it, we were able to cut that entire build time with no parallelism on a single machine down to one and a half minutes. This was something like 7,500 tests, so not an unreasonable number of tests to have inside of a monolith either.

This was very encouraging because now what this means is we can take an AI agent, have it work as fast as we're willing to spend money on the tokens, and give it a tool to run only the tests that it needs to run on the changes that it needs. Instead of having to figure out how do we parallelize and farm out all of these agents and how do we parallelize all their test runs, and should they really be using CI this way, we can just make it so that it's tractable to run them inside of a Docker container on a laptop or in the cloud.

The same principle applies to code review. This is pretty experimental and theoretical where we're thinking about this, so I'm not going to go into too much detail on how this might work. The idea of test impact analysis is that not all code has the same level of risk or the same relationship to each other. Same thing applies to code review. Not every single change is impacting how you salt passwords in a database, or how you do authentication and authorization checks. Some of it probably doesn't need the same level of scrutiny as everything else. This again gives us a spot where we can prune down how much we have to do every time we have a change from the agent.

If the agent is touching a part of the codebase that is particularly sensitive, obviously flag that for review. If it's not, maybe that can be automatically shipped as long as the tests pass. That's the theory, at least. We're also playing around with this idea of reviewing the reasoning of the agent versus the output. This is a screenshot from our UI where we're going reasoning first. This is mostly a UX thing, but instead of the first thing you see here being diffs, what you see is a log of what the agent did. I think ultimately this is probably where things end up, where we're examining reasoning traces and then relying on test suites and linters and static analysis tools to validate the diffs.
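Since the talk describes this as experimental, the following is only a guess at what such routing could look like, not a description of anything CircleCI has built. The path patterns in `SENSITIVE_PATTERNS` and the three outcome strings are assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for code that always deserves human eyes.
SENSITIVE_PATTERNS = ("auth/*", "*/passwords/*", "migrations/*")


def route_change(changed_files, tests_passed, sensitive_patterns=SENSITIVE_PATTERNS):
    """Decide what happens to an agent's change based on risk."""
    if not tests_passed:
        # Nothing ships without a green test run, sensitive or not.
        return "reject"
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in sensitive_patterns):
            return "human_review"
    return "auto_merge"


if __name__ == "__main__":
    print(route_change(["src/ui/Button.tsx"], tests_passed=True))
    print(route_change(["auth/session.py", "src/ui/Button.tsx"], tests_passed=True))
```

Path matching is the crudest possible risk signal. A real system would likely combine it with dependency or coverage data, in the same spirit as test impact analysis.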

Keeping It All Working #

Now, of course this is not the only thing that we have to do. If you recall back from the earlier graphs that we had, this delivery pipeline is now critically important to getting the value out of AI. It's not just something that like, the build's annoying or slow. Many companies want to run agents 24/7. They want to churn tokens. They want to have AI teammates that are able to work. The pipelines that we build aren't just like, guess who gets to play foosball today if CI is down? They become a critical piece of our infrastructure. I really think that more of our time as engineers likely goes into this area, where how do we make sure that the overall factory, the machinery that's surrounding the AI, the environment that it operates in, is conducive for code changes versus worrying quite so much about the individual changes themselves.

Let me talk about an example of this. Anyone here ever dealt with flaky tests? A flaky test is a test that sometimes passes and sometimes fails even when the code hasn't changed. This is a really hard problem to fix. I will tell you how I know it's a hard problem to fix. Conceptually, the idea behind a flaky test and why they're so problematic is they end up being really corrosive to developer trust. When your developers don't trust the test suite, when they don't trust that a red build is actually a red build, they're much less responsive to what's happened. It's also just really annoying. If you ever had to deal with a flaky test, oftentimes the way that people fix it is they rerun the build and they just roll the dice again and hope that it's really not that flaky. When we look at this, this seemed like an area where improving the validation pipeline itself was a really good investment for us, because if we can reduce the flaky tests, then we can make sure that the pipelines themselves run consistently, and we can make sure that we get a clean signal every time. That's really important if we want these agents to write code, is giving them a clean signal of if something was good or bad, because as soon as you delve into, was it a flaky test, or let me go look at the logs, or, this dependency was down again, you're back to having to manually dig through everything, and everything grinds to a halt. Our thought process here was like, can we, in the background, keep improving the test suites across all of our projects?

It turned out that we actually had all the data that we needed to do this already. One of the things that our platform does is we record your test results. You can upload them to us, and we record the SHA that it was at, as well as the result of the test. Based on that, we can do some work around the git SHAs that were involved and what jobs were running at the time, where we can infer whether this test was flaky or not. What it might look like is, I push it, and it failed. Someone reruns it, and it passed. Probably flaky. All you really need to do this is a database. If you need to do this for thousands of people, you need a pretty efficient database. You can do this yourself. All of the major test result reporters output XML or JSON, and you can use this to load up your database.
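The "failed, then passed on rerun at the same SHA" heuristic can be sketched as below. The record shape (`RecordedRun`) is hypothetical; a real pipeline would load these rows from parsed test reports into a database and would probably also consider which jobs were running.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RecordedRun(NamedTuple):
    test_name: str
    sha: str       # git commit the test ran against
    passed: bool


def find_flaky_tests(runs: Iterable[RecordedRun]) -> list[str]:
    """Tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    for run in runs:
        outcomes[(run.test_name, run.sha)].add(run.passed)
    # Same code, different results: probably flaky.
    return sorted(
        {name for (name, _sha), seen in outcomes.items() if seen == {True, False}}
    )


if __name__ == "__main__":
    history = [
        RecordedRun("test_login", "abc123", False),
        RecordedRun("test_login", "abc123", True),   # rerun passed
        RecordedRun("test_signup", "abc123", False),
        RecordedRun("test_signup", "def456", True),  # different code, not evidence
    ]
    print(find_flaky_tests(history))
```

Here only `test_login` is flagged, because `test_signup` changed outcome across different commits, which could be a genuine fix.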

Then you can point your agent at this dataset and just have it start trying to fix things. I will emphasize the try. Again, this was one of the first things that we thought about applying an AI agent to directly, and the first attempt totally failed. We started with a prompt. We let the agent be very agentic. The gist of this summarized version was we could tell it a test that we knew was flaky. We tell it to run it some number of times, try and diagnose the failure, and then write results to a file. The main issue we hit with this was hallucinations. What would happen is the model would say it ran tests when it didn't actually run those tests. We would then present that PR to our customers, saying, we had a fix for you, and the linter would fail. Or the tests would actually fail entirely, which is really not a great experience for anyone, and not what we wanted either.

We came up with a more robust solution, which looks a lot more like code. This is on the spectrum of building agents. This falls more on the workflow side of things. If you're familiar with the Anthropic paper, it's less agent and more predefined workflow. There's still a lot of AI in here, but what we would do is we'd take the list of tests that we knew were problematic, and for each one of them, we would go and plan a fix. One of the issues we had with the initial suites was actually isolating the changes that the agent was doing between tests, so you couldn't actually tell if things were broken or which tests were broken. Some fixes might have been good, some might have been bad, but we'd batch them all together. This is a really simple fix, just changing it to an actual loop.

Then we added some explicit steps to make sure that we were saving context between runs of the agent. We would plan a fix for a specific test. We would attempt to reproduce a fix for a specific test. Then, most importantly, we had this apply_plan feature. The thing that we added to the apply_plan feature, which was most effective here, was we actually started using the CI pipelines of the projects as a way to prove to the agent that it worked. Our reasoning on this was, you've already built this pipeline. You've put all of this time in to have it tell you if the code is good or bad. It's got to be green to merge, so let's just use that thing as the judge to tell if things are fixed.

Instead of LLM-as-a-judge or even trying to get the LLM to rerun the test, when we thought we had a fix, we would go run it in CI. At the very least we can verify that there are no regressions, and it gives us the ability to loop back. If there is a failure, we can look at the output of that failure, and then rerun it again and just have it reapply, taking into account the issues that it ran into.
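The shape of that workflow, one test at a time with CI as the judge, could be sketched roughly as follows. Only the name `apply_plan` comes from the talk; `plan_fix`, `run_ci`, their signatures, and the retry limit are assumptions, and each would wrap an agent call or a real pipeline trigger in practice.

```python
def fix_flaky_tests(tests, plan_fix, apply_plan, run_ci, max_attempts=3):
    """Try to fix each flaky test in isolation, verified by CI."""
    results = {}
    for test in tests:  # an actual loop: one test per change, never batched
        results[test] = False
        feedback = None  # CI output from the previous failed attempt
        for _ in range(max_attempts):
            plan = plan_fix(test, feedback)    # agent step: plan a fix
            change = apply_plan(test, plan)    # agent step: produce the change
            passed, output = run_ci(change)    # the project's own pipeline judges
            if passed:
                results[test] = True
                break
            feedback = output                  # loop back with the failure
    return results
```

Because each test gets its own change and its own CI verdict, a bad fix for one test cannot hide behind a good fix for another, and the agent's claim that it "ran the tests" is never taken on trust.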

Where Is This Going? #

I'm going to try and connect some threads here. If you go back to the concept of the PR and the issues that we hit with that, it's not even so much the size of the change that's the problem, it's the timeliness of the feedback. When you look at agents that are running all of the time and constantly producing code, this really creates a problem for the way that most of our models of software delivery work. In our heads, and I think it is in our heads, a lot of these processes are modeled as linear and discrete processes. You push your code, then you build, then you test, then you deploy. If you've worked in a CI pipeline before, you know that most pipelines are much more complicated than this.

You want to fail fast. You want to run your linters before you run your expensive integration tests. No sense doing that if the code won't even compile. I still think in our heads the mental model is, again, this very linear process. I think that that model doesn't really hold up with AI. When you create these big boxes and processes that tend to be quite slow, you really limit your throughput. The worst thing about having a parallel system that's providing work is forcing everything through a serialized pipeline. That will just never work.

I think where this goes, and it's very forward-looking, there's been a lot of talk about whether we should get rid of CI or not, is this idea that a lot of this stuff doesn't actually have to be linear in time. There's no fundamental law of the universe that says that a PR review or a code review has to happen before the code is deployed. There are a lot of compliance rules that say that that has to happen, and those are important. It's not a fundamental fact of software delivery. There's nothing that says a developer running on their laptop or in the cloud can't produce a signed git commit that has passed all of the unit tests, with an attestation covering the contents of the commit and the output of the tests, and have that count as the tests having passed.

The same thing applies with everything we do. About the only thing you have to do is probably push. At some point in time, the code has to be pushed. I think the model is breaking down these very linear processes that have this very straight-line flow and turning them into something where things are happening at various points in the process across time. Then instead of having a linear yes-no, we combine these things into a single gate where all we have to do is keep track of what has occurred. Really, what more do you need? For continuous delivery, if the tests pass, you should deploy the code. Everything else besides that is really us being concerned about other things, which really come down to us not being confident in the tests. That's why I think more and more of our effort and energy is likely going to be spent in this testing and validation layer, less so thinking about the specific designs of low-level details of our services.
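As a thought experiment only, a single gate that just keeps track of what has occurred for a commit might look like this. The class `DeliveryGate`, the check names, and the idea of keying everything by SHA are assumptions for illustration. Verifying that an attestation is genuine (signatures and so on) is deliberately left out.

```python
REQUIRED_CHECKS = frozenset({"unit_tests", "lint"})  # hypothetical policy


class DeliveryGate:
    """Tracks which checks have been attested for each commit."""

    def __init__(self, required=REQUIRED_CHECKS):
        self.required = frozenset(required)
        self._passed = {}  # sha -> set of check names that passed

    def record(self, sha, check, passed):
        """Record an attestation; order and origin do not matter here."""
        checks = self._passed.setdefault(sha, set())
        if passed:
            checks.add(check)
        else:
            checks.discard(check)

    def missing(self, sha):
        return self.required - self._passed.get(sha, set())

    def can_deploy(self, sha):
        return not self.missing(sha)


if __name__ == "__main__":
    gate = DeliveryGate()
    gate.record("abc123", "lint", True)        # ran on a laptop
    print(gate.can_deploy("abc123"), sorted(gate.missing("abc123")))
    gate.record("abc123", "unit_tests", True)  # ran later, in the cloud
    print(gate.can_deploy("abc123"))
```

The point of the sketch is that the gate does not care when or where each check ran, only that every required one has been recorded for that exact commit.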

Chunk by CircleCI #

Now, like a lot of folks here, at CircleCI we're trying to take a lot of these learnings that we've had over the last six to nine months and package them up. Like I said, things like test impact analysis, flaky test analysis, improving your code review process. These are all fun exercises, but they may not be the most efficient ones. One of the things that we're working on is this agent we're calling Chunk. The idea of Chunk is to really take all of these learnings that we've had internally. How do you get the models to write good code? How do you get the models to review the code in a way that is conducive to your organizational standards? Then, how do you keep that delivery pipeline fast so that the agents can be writing code as quickly as possible? We're putting that into practice and wrapping it all up together.

When we talk about Chunk, the idea here is it's built to be validation first. For us, the rules that you define as good enough for your production environment are the standard that we hold ourselves to. We're building this to keep your software production ready. Starting with flaky tests, moving towards having the right tests and the right levels of coverage, and then, of course, learning as we go. One of the nice things about a CI system is it knows when you break things. We know when you merge. We know when you revert. We know when you roll back.


source & further reading

infoq.com — original article