{"slug": "ai-s-overnight-solution-for-our-flaky-tests-took-two-weeks-to-adopt", "title": "AI's \"overnight\" solution for our flaky tests took two weeks to adopt", "summary": "A developer used Anthropic's Opus 4.6 model in Claude Code to fix a group of flaky tests that caused 60% of CI runs to fail. The AI analyzed hundreds of test runs overnight and identified root causes, but it took two additional weeks to manually refine the AI-generated code changes into production-ready fixes. The restored test group now runs with 0% failures on main, though the developer notes that the non-flaky tests now produce more false positives than the previously flaky group.", "body_md": "Recently I stopped a group of flaky tests from running in CI. 60% of CI runs were failing because of this group, which was unsustainable. Three weeks later I was able to restore that group to CI, with 0% failures on main as a result. Our “non-flaky” tests now give more false positives than the (previously) flaky group.\n\nThis is not really a post about tests, though. It’s about AI’s contribution (a lot) and what it took to make that contribution usable (also a lot).\n\n## The hardest problem\n\nDevelopers on this project had been quarantining tests with a `:flaky` label for several years. The strategy was to quarantine a small group which could be expected to fail randomly but could also be re-run easily and separately from the full suite. Apart from the flakiness, the test suite is comprehensive and gives us high confidence that if we merge something after tests pass, it works.\n\nOver the years, several developers had each spent a week at a time trying to reduce flakiness, and every attempt failed. 
In our defense, the flaky tests centred around interactive pages using Stimulus or Hotwire, and online discussion of this topic is a combination of ideas we had already tried, plus someone saying: “I tried a lot, it doesn’t work, I think there’s a bug”.\n\nThe most promising angle was adopting Playwright, which did improve some things but also left us with some tests that failed permanently and needed to be skipped. There’s a dissatisfying way in which this is better than tests that only fail some of the time.\n\nThe problem started to look more and more like a trap set for enthusiastic developers. As a manager I always had to urge caution: “sure, you can see some approaches that could help, but bear in mind the last five times anyone tried they found very promising angles that didn’t change the stats in GitHub at all”. Developers whom I trust were seriously recommending deleting the entire group.\n\n## Opus “solved it” overnight\n\nOne night, Opus 4.6 running in Claude Code solved “the problem” by running the flaky test group hundreds of times and analyzing failures. There was some prompting to help Claude avoid premature conclusions and be aware that the problems could not be reproduced without repetition, plus a markdown file where it would record progress. Otherwise, no special magic.\n\nI could see Claude’s progress over time because it needed to run the flaky group in larger and larger batches. At first, five times was sufficient because the errors it found occurred 20% of the time. As those were fixed, I had to tell it to use batches of ten, fifty, and then one hundred. Finally, it reached a point where zero errors were found.\n\nA “nice” thing about needing such large batches is that I could leave Claude alone for hours at a time while my normal evening continued. Flaky specs may be a problem uniquely suited to coding agents in that way. 
There’s not even much token use: it just kicks off a long run and surfaces for an internal conversation, then kicks off the next batch.\n\n## Two weeks to make the results useful\n\nThis isn’t a post about test failure strategy, so I’ll spare you details of what was flaky and what fixes applied. Instead I’ll try to communicate some of the meta concerns I had with the resulting code changes.\n\nGiven a test that looked something like this:\n\n```\n1 create objects\n2 visit page\n3 click A\n4 click B\n5 expect expression 1 to be true\n6 click C\n7 expect expression 2 to be true\n```\n\nUnchecked, Claude would have turned it into something like this:\n\n```\n1  create objects in a slightly different way that makes no difference\n2  visit page\n3  explicit sleep\n4  unnecessary scoping to a specific section of the page\n5    click A\n6  end of unnecessary scoping\n7  click B, with 3 second wait passed as option arg\n8  a clever improvement that should have been on line 3\n9  expect expression 1 to be true\n10 click C\n11 an improvement that worked in other tests but was irrelevant here\n12 expect expression 2 to be true\n```\n\nUltimately the changes added up to a good improvement, usually because of one crucial addition per test (in our fictional example, line 8) that was on the wrong line and hidden in a mountain of garbage (lines 3, 4, 6, 7, 11).\n\nIt took two weeks to:\n\n- separate coincidence from real results\n- remove the things that didn’t make a difference\n- apply good practice to the important differences\n- unify slight variations on the same changes\n- generalise to other parts of the test suite\n- make sensible commits\n\nSome of this work was just a matter of applying good practice (e.g. 
any explicit sleep call is immediately suspect), and other times it was sending Claude back to hundreds of test runs to prove that something it had added made no difference.\n\n## Conclusion: processing my reactions\n\nI see in myself three reactions.\n\n### 1. Hooray, I’m still useful as a programmer!\n\nI think it would have been impossible without lots of experience working with Rails and RSpec to move from what Claude was suggesting initially towards something sustainable. The exact amount of experience necessary is uncertain, but I’m on more than ten years. It took a lot to move beyond the optimism and false positives, and it would have taken more if I didn’t already have a reasonable gut instinct about these things.\n\n### 2. Boy, AI is awful! Why bother with it if it takes so long to use the results?\n\nI would absolutely use (and recommend) Claude for analysing flaky tests again. I think it would be a mistake not to do so. Accurately running long processes with tiny changes in between multi-hour waits is not a strength for humans.\n\nIn addition, Claude did reason through code running in parallel processes in a way that no human had managed for years. That particular part of our code is complex, but has not had active work for years, meaning that no human has good context. Claude probably caught up in 10 minutes.\n\nAn interesting aside here is that I find Claude does much better work when it has tests to help it reason about application code. The tests were flaky, but they were still a good record of what the code was supposed to do.\n\n### 3. 
Why keep going for two weeks after AI clearly fixed the problem I care about in one evening?\n\nI could have taken the win, ignored the cruft, and gained two weeks. If I had, I would have lost those two weeks and more later on. Humans and AI agents would cargo cult the new (anti) patterns, falsely claiming victory over any future flakiness, and making it harder to identify the real problems.\n\nAs with all programming, eventually “tidy first, then do the work” ends up being faster than “just do the work”. There’s no escaping the tidying if I want good results; the question is whether I do it at a predictable time and pace or when there’s an emergency (like no one being able to deploy any code because CI keeps failing).\n\nThat includes tidying up after AI.", "url": "https://wpnews.pro/news/ai-s-overnight-solution-for-our-flaky-tests-took-two-weeks-to-adopt", "canonical_source": "https://feed.thoughtbot.com/link/24077/17364998/what-it-took-to-use-this-overnight-ai-solution", "published_at": "2026-06-22 00:00:00+00:00", "updated_at": "2026-06-22 00:25:46.345955+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-tools", "ai-agents", "developer-tools"], "entities": ["Claude Code", "Opus 4.6", "Anthropic", "Playwright", "Stimulus", "Hotwire"], "alternates": {"html": "https://wpnews.pro/news/ai-s-overnight-solution-for-our-flaky-tests-took-two-weeks-to-adopt", "markdown": "https://wpnews.pro/news/ai-s-overnight-solution-for-our-flaky-tests-took-two-weeks-to-adopt.md", "text": "https://wpnews.pro/news/ai-s-overnight-solution-for-our-flaky-tests-took-two-weeks-to-adopt.txt", "jsonld": "https://wpnews.pro/news/ai-s-overnight-solution-for-our-flaky-tests-took-two-weeks-to-adopt.jsonld"}}