# AI Coding Agents Need Tests More Than Prompts

> Source: <https://dev.to/stoefln6/ai-coding-agents-need-tests-more-than-prompts-11pm>
> Published: 2026-06-25 08:31:18+00:00

Over the last eight months, my software development workflow has changed more than at any other point in my career.

And I say that as someone who has been writing software for about 25 years. I have worked through plenty of programming languages, frameworks, architectural fashions, build tools, frontend revolutions, mobile platform quirks, and enough JavaScript ecosystem churn to qualify for emotional compensation.

For some time, AI coding tools were helpful, but only in a limited way. They were great for small tasks. Rename this. Refactor that. Write a helper function. Explain this cryptic error message that looks like it was generated by an angry toaster.

But building larger features with AI? Painful. GPT-4 at that time did not convince me that my job would be taken over by a robot. Not at all.

Working with GPT-5.1 still often felt like working with a brilliant intern who had read the entire internet but kept misplacing their notebook every 10 minutes. Once important information fell out of the context window, the AI would forget what we had agreed on and confidently wander into the bushes.

Around late 2025, first with GPT-5.2 (and also Claude Sonnet 4.5) and then much more noticeably with GPT-5.3, AI coding finally became genuinely productive for me.

Small tasks? Excellent.

Longer tasks with several iterations, corrections, architectural context, and dependencies across multiple files? Kind of working!

And because of that, my role has gradually shifted from “person typing code” to “person designing the environment in which an AI agent can safely type code without setting the kitchen on fire.”

Modern coding agents are now surprisingly good at a very specific loop: run a command, read the failure, change the code, and run the command again.

This is powerful.

But there is still one area where they struggle: graphical user interfaces.

AI agents are not yet great at reliably clicking through a UI, visually understanding what happened, and deciding whether the behavior is correct. They can try, but it often feels like watching someone test your app through a foggy bathroom mirror.

So I changed my workflow.

Whenever possible, I now build new features so they can first be exercised from the command line.

For small features, this can be a unit test.

For larger features, it can be a small standalone client or command-line program that runs the new functionality independently of the actual UI.

The important part is this: the agent needs something it can execute directly.

Not “please look at this screen and tell me if it feels right.”

More like:

```
npm run test:feature-x
```

or:

```
node scripts/run-new-feature-client.js
```

That is where agents shine. They like commands A LOT. Commands are their little ice skates.
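As a minimal sketch, such a command-line client could look like the following. The `slugify` feature, the `runClient` helper, and the sample inputs are invented for illustration; in a real project the feature would be imported from the application code instead of being defined inline.

```javascript
// scripts/run-new-feature-client.js (hypothetical sketch)
// Exercises a feature from the command line, independent of any UI.

// Stand-in for the real feature (an assumption for this example).
function slugify(title) {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into "-"
    .replace(/^-+|-+$/g, "");    // strip leading and trailing dashes
}

// Runs the feature against the given inputs and returns a plain result object.
// An empty slug counts as a failure in this sketch.
function runClient(inputs) {
  const results = inputs.map((input) => ({ input, output: slugify(input) }));
  const failed = results.filter((r) => r.output.length === 0);
  return { total: results.length, failed: failed.length, results };
}

// Inputs come from the command line; fall back to a small built-in sample.
const args = process.argv.slice(2);
const report = runClient(args.length > 0 ? args : ["Hello World", "  AI & Tests!  "]);

// Machine-readable output plus a non-zero exit code on failure:
// both are easy for an agent to check.
console.log(JSON.stringify(report, null, 2));
process.exitCode = report.failed === 0 ? 0 : 1;
```

The exit code matters as much as the printed output: it lets the agent (or a CI job) decide whether the run succeeded without having to interpret any text.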

The workflow I use today has three main ingredients: a Markdown plan, a command-line test client, and test cases.

The Markdown planning step is important. It gives the agent a clear map before it starts building tunnels under the house.

The command-line test client is equally important. It gives the agent an executable feedback loop.

And the test cases are the most important part of all.

Here is one thing I learned very quickly:

If you tell an AI agent, “make all tests pass,” it will do that.

Sometimes elegantly.

Sometimes aggressively, stopping at nothing and committing every software engineering crime imaginable just to make the tests pass: creating tests that do not test much, modifying the implementation so it handles the exact test case but not the real-world behavior behind it, or using try/catch blocks to ignore errors.

This leads to a very specific kind of code smell: the code gets longer, more specific, and more theatrical. Suddenly, your implementation contains special handling for every edge case the test suite happened to mention.

That is why test definition is where I still spend the most careful manual effort.

The key questions are:

The tests do not need to be complete from the beginning. They can evolve. But the first important test cases need to have a spine.

Writing tests before implementation is obviously not new. That is test-driven development.

But AI agents give TDD a new kind of relevance.

In classic TDD, tests help the developer clarify the goal and avoid regressions.

With AI agents, tests do something more: they create a loop the agent can operate independently.

The agent can run the tests, inspect the failure, change the implementation, run the tests again, and keep going.

That means the test suite becomes more than a safety net. It becomes the steering wheel.

Without tests, the agent is just producing plausible code.

With good tests, the agent has a measurable target.

With bad tests, the agent still has a target — unfortunately it may be the wrong one, and it will sprint toward it with alarming enthusiasm.

Another useful pattern is to persist test script output in structured files on disk.

Instead of forcing the agent to keep huge logs, benchmark results, debug dumps, or intermediate test outputs in the conversation context, the script can write structured files such as JSON, Markdown, or plain text reports.

For example:

```
test-results/
  latest-summary.json
  failed-cases.json
  performance-report.json
  debug-log.md
```

This gives the agent a much more efficient way to work.

It can directly inspect the relevant file when needed instead of dragging a giant wall of output through the conversation like a developer moving apartments with no boxes.
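A test script could produce files like these with a few lines of Node. This is only a sketch: the shape of `results` (`name`, `passed`, `durationMs`, `error`) and the `writeReports` helper are assumptions made for the example, not part of any particular test framework.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Writes a short summary and the full failure details to separate files,
// so a reader can open only what it needs.
// `results` is a hypothetical array of { name, passed, durationMs, error? }.
function writeReports(outDir, results) {
  fs.mkdirSync(outDir, { recursive: true });

  const failed = results.filter((r) => !r.passed);
  const summary = {
    total: results.length,
    passed: results.length - failed.length,
    failed: failed.length,
    totalDurationMs: results.reduce((sum, r) => sum + r.durationMs, 0),
  };

  fs.writeFileSync(
    path.join(outDir, "latest-summary.json"),
    JSON.stringify(summary, null, 2)
  );
  fs.writeFileSync(
    path.join(outDir, "failed-cases.json"),
    JSON.stringify(failed, null, 2)
  );

  // Print only a one-line pointer to the details, not the full output.
  console.log(
    `${summary.failed} of ${summary.total} failed, details in ` +
      path.join(outDir, "failed-cases.json")
  );
  return summary;
}
```

Calling `writeReports("test-results", results)` at the end of a test run keeps the console output down to one line, while the details stay on disk for whenever the agent actually needs them.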

This has several advantages:

This becomes especially useful for larger test suites, performance benchmarks, computer vision datasets, or anything where the raw output can become huge.

Context is expensive. Noise is expensive. Making the agent read 5,000 lines of logs to find three useful lines is not intelligence — it is invoice generation.

Structured files give the agent a filesystem-based memory that is cheap, targeted, and practical.

We recently used this workflow in a computer vision framework we built.

We had a larger test dataset and a set of algorithms that could be benchmarked against it. Instead of giving the agent a vague instruction like “make this faster,” we gave it a measurable loop:

With this setup, the agent was able to significantly improve the performance of our algorithms. In one case, runtime went down by about 50%.

That is not magic. That is structure.

The agent was not just “being smart.” It had a safe playground, reliable tests, and measurable feedback. That combination is where AI coding becomes really interesting.
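To make the idea concrete, here is a generic sketch of a benchmark loop in that spirit. It is not our computer vision setup: the two `sumOfSquares` functions, the tiny dataset, and the `benchmark` helper are toy stand-ins chosen so the example stays self-contained. The point is the order of checks: correctness first, speed second.

```javascript
const { performance } = require("node:perf_hooks");

// Hypothetical reference and candidate implementations of the same operation.
function sumOfSquaresBaseline(values) {
  return values.map((v) => v * v).reduce((a, b) => a + b, 0);
}
function sumOfSquaresCandidate(values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i] * values[i];
  return total;
}

// Median wall-clock time of `fn(input)` over several runs, in milliseconds.
function medianMs(fn, input, runs = 15) {
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn(input);
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}

// A faster but wrong candidate is rejected before any timing happens.
function benchmark(baseline, candidate, dataset) {
  const correct = dataset.every((input) => baseline(input) === candidate(input));
  if (!correct) return { correct, baselineMs: null, candidateMs: null };
  const largest = dataset[dataset.length - 1];
  return {
    correct,
    baselineMs: medianMs(baseline, largest),
    candidateMs: medianMs(candidate, largest),
  };
}

// Toy dataset: a small case plus a larger one used for timing.
const dataset = [
  [1, 2, 3],
  Array.from({ length: 100000 }, (_, i) => i % 100),
];
console.log(JSON.stringify(benchmark(sumOfSquaresBaseline, sumOfSquaresCandidate, dataset), null, 2));
```

Because the result is a small JSON object, it can be written to a report file and compared between runs, which gives the agent a number to improve rather than a feeling.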

AI agents do not remove the need for developers.

They move the developer’s attention.

Less time is spent manually writing every line of implementation code.

More time is spent on:

The better the environment, the better the agent.

If the goal is vague, the tests are weak, and the feedback loop is missing, the agent will still produce something. It may even look impressive. But impressive-looking code is not the same as correct, maintainable software.

That distinction remains very much a human responsibility.

The biggest lesson from the last few months is this:

AI agents become dramatically more useful when we stop treating them like autocomplete and start designing our workflow around their strengths.

They are good at iteration.

They are good at running commands.

They are good at reading failures and trying again.

They are good at working inside a clear feedback loop.

They are still weak at reliably testing graphical interfaces.

They can overfit to bad tests.

They can make questionable choices with excellent confidence.

So the solution is not to let them roam freely through the codebase like a caffeinated raccoon.

The solution is to build rails.

Markdown plans.

Command-line entry points.

Good tests.

Structured output files.

Repeatable scripts.

Human review.

Test-driven development was already useful before AI.

But in the age of coding agents, TDD becomes something even more powerful: a way to let AI work independently without losing control over the result.

Or, put differently:

The future of AI-assisted development may not belong to the person who writes the best prompts.

It may belong to the person who builds the best feedback loops.