{"slug": "ai-coding-agents-need-tests-more-than-prompts", "title": "AI Coding Agents Need Tests More Than Prompts", "summary": "A developer with 25 years of experience reports that AI coding agents became genuinely productive around late 2025 with models like GPT-5.3 and Claude Sonnet 4.5. The developer's role shifted to designing environments where agents can safely execute code, emphasizing command-line testability over GUI testing. The key insight is that AI agents excel at making tests pass, so careful test definition is critical to avoid code that overfits to test cases.", "body_md": "Over the last eight months, my software development workflow has changed more than I have ever experienced before.\n\nAnd I say that as someone who has been writing software for about 25 years. I have worked through plenty of programming languages, frameworks, architectural fashions, build tools, frontend revolutions, mobile platform quirks, and enough JavaScript ecosystem churn to qualify for emotional compensation.\n\nFor some time, AI coding tools were helpful, but only in a limited way. They were great for small tasks. Rename this. Refactor that. Write a helper function. Explain this cryptic error message that looks like it was generated by an angry toaster.\n\nBut building larger features with AI? Painful. GPT-4 at that time did not convince me that my job would be taken over by a robot. Not at all.\n\nWorking with GPT-5.1 still often felt like working with a brilliant intern who had read the entire internet but kept misplacing their notebook every 10 minutes. Once important information fell out of the context window, the AI would forget what we had agreed on and confidently wander into the bushes. Around late 2025, first with GPT-5.2 (and also Claude Sonnet 4.5) and then much more noticeably with GPT-5.3, AI coding finally became genuinely productive for me.\n\nSmall tasks? 
Excellent.\n\nLonger tasks with several iterations, corrections, architectural context, and dependencies across multiple files? Kind of working!\n\nAnd because of that, my role has gradually shifted from “person typing code” to “person designing the environment in which an AI agent can safely type code without setting the kitchen on fire.”\n\nModern coding agents are now surprisingly good at a very specific loop: run a command, inspect the failure, change the implementation, and run the command again.\n\nThis is powerful.\n\nBut there is still one area where they struggle: graphical user interfaces.\n\nAI agents are not yet great at reliably clicking through a UI, visually understanding what happened, and deciding whether the behavior is correct. They can try, but it often feels like watching someone test your app through a foggy bathroom mirror.\n\nSo I changed my workflow.\n\nWhenever possible, I now build new features so they can first be exercised from the command line.\n\nFor small features, this can be a unit test.\n\nFor larger features, it can be a small standalone client or command-line program that runs the new functionality independently of the actual UI.\n\nThe important part is this: the agent needs something it can execute directly.\n\nNot “please look at this screen and tell me if it feels right.”\n\nMore like:\n\n```\nnpm run test:feature-x\n```\n\nor:\n\n```\nnode scripts/run-new-feature-client.js\n```\n\nThat is where agents shine. They like commands A LOT. Commands are their little ice skates.\n\nThe workflow I use today looks roughly like this:\n\nThe Markdown planning step is important. It gives the agent a clear map before it starts building tunnels under the house.\n\nThe command-line test client is equally important. 
It gives the agent an executable feedback loop.\n\nAnd the test cases are the most important part of all.\n\nHere is one thing I learned very quickly:\n\nIf you tell an AI agent, “make all tests pass,” it will do that.\n\nSometimes elegantly.\n\nSometimes aggressively, stopping at nothing, committing every software engineering crime imaginable, just to make the tests pass: Create tests that do not test much. Modify the implementation so it handles the exact test case, but not the real-world behavior behind it. Use try/catch blocks to ignore errors.\n\nThis leads to a very specific kind of code smell: the code gets longer, more specific, and more theatrical. Suddenly, your implementation contains special handling for every edge case the test suite happened to mention.\n\nThat is why test definition is where I still spend the most careful manual effort.\n\nThe key questions are:\n\nThe tests do not need to be complete from the beginning. They can evolve. But the first important test cases need to have a spine.\n\nWriting tests before implementation is obviously not new. That is test-driven development.\n\nBut AI agents give TDD a new kind of relevance.\n\nIn classic TDD, tests help the developer clarify the goal and avoid regressions.\n\nWith AI agents, tests do something more: they create a loop the agent can operate independently.\n\nThe agent can run the tests, inspect the failure, change the implementation, run the tests again, and keep going.\n\nThat means the test suite becomes more than a safety net. 
It becomes the steering wheel.\n\nWithout tests, the agent is just producing plausible code.\n\nWith good tests, the agent has a measurable target.\n\nWith bad tests, the agent still has a target — unfortunately it may be the wrong one, and it will sprint toward it with alarming enthusiasm.\n\nAnother useful pattern is to persist test script output in structured files on disk.\n\nInstead of forcing the agent to keep huge logs, benchmark results, debug dumps, or intermediate test outputs in the conversation context, the script can write structured files such as JSON, Markdown, or plain text reports.\n\nFor example:\n\n```\ntest-results/\n  latest-summary.json\n  failed-cases.json\n  performance-report.json\n  debug-log.md\n```\n\nThis gives the agent a much more efficient way to work.\n\nIt can directly inspect the relevant file when needed instead of dragging a giant wall of output through the conversation like a developer moving apartments with no boxes.\n\nThis has several advantages:\n\nThis becomes especially useful for larger test suites, performance benchmarks, computer vision datasets, or anything where the raw output can become huge.\n\nContext is expensive. Noise is expensive. Making the agent read 5,000 lines of logs to find three useful lines is not intelligence — it is invoice generation.\n\nStructured files give the agent a filesystem-based memory that is cheap, targeted, and practical.\n\nWe recently used this workflow in a computer vision framework we built.\n\nWe had a larger test dataset and a set of algorithms that could be benchmarked against it. Instead of giving the agent a vague instruction like “make this faster,” we gave it a measurable loop:\n\nWith this setup, the agent was able to significantly improve the performance of our algorithms. In one case, runtime went down by about 50%.\n\nThat is not magic. That is structure.\n\nThe agent was not just “being smart.” It had a safe playground, reliable tests, and measurable feedback. 
That combination is where AI coding becomes really interesting.\n\nAI agents do not remove the need for developers.\n\nThey move the developer’s attention.\n\nLess time is spent manually writing every line of implementation code.\n\nMore time is spent on:\n\nThe better the environment, the better the agent.\n\nIf the goal is vague, the tests are weak, and the feedback loop is missing, the agent will still produce something. It may even look impressive. But impressive-looking code is not the same as correct, maintainable software.\n\nThat distinction remains very much a human responsibility.\n\nThe biggest lesson from the last months is this:\n\nAI agents become dramatically more useful when we stop treating them like autocomplete and start designing our workflow around their strengths.\n\nThey are good at iteration.\n\nThey are good at running commands.\n\nThey are good at reading failures and trying again.\n\nThey are good at working inside a clear feedback loop.\n\nThey are still weak at reliably testing graphical interfaces.\n\nThey can overfit to bad tests.\n\nThey can make questionable choices with excellent confidence.\n\nSo the solution is not to let them roam freely through the codebase like a caffeinated raccoon.\n\nThe solution is to build rails.\n\nMarkdown plans.\n\nCommand-line entry points.\n\nGood tests.\n\nStructured output files.\n\nRepeatable scripts.\n\nHuman review.\n\nTest-driven development was already useful before AI.\n\nBut in the age of coding agents, TDD becomes something even more powerful: a way to let AI work independently without losing control over the result.\n\nOr, put differently:\n\nThe future of AI-assisted development may not belong to the person who writes the best prompts.\n\nIt may belong to the person who builds the best feedback loops.", "url": "https://wpnews.pro/news/ai-coding-agents-need-tests-more-than-prompts", "canonical_source": "https://dev.to/stoefln6/ai-coding-agents-need-tests-more-than-prompts-11pm", 
"published_at": "2026-06-25 08:31:18+00:00", "updated_at": "2026-06-25 08:43:02.782637+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "developer-tools", "ai-products"], "entities": ["GPT-5.2", "GPT-5.3", "Claude Sonnet 4.5", "GPT-4"], "alternates": {"html": "https://wpnews.pro/news/ai-coding-agents-need-tests-more-than-prompts", "markdown": "https://wpnews.pro/news/ai-coding-agents-need-tests-more-than-prompts.md", "text": "https://wpnews.pro/news/ai-coding-agents-need-tests-more-than-prompts.txt", "jsonld": "https://wpnews.pro/news/ai-coding-agents-need-tests-more-than-prompts.jsonld"}}