My AI Agent can finally write data β assign beds, update predictions, create alerts, place orders.
Code done, I got stuck on a question:
How do you test something like this?
Regular functions are easy to test. add(1, 1)
always returns 2
. Same every time. I can write assert add(1, 1) == 2
.
But Agents are different. Ask it "find me an empty bed":
Results are all correct, but the process differs.
You can't write:
assert agent("find an empty bed") == "some specific sentence"
Because it says something different every time.
Thought about it for a while. Found the answer somewhere unexpected: news fact-checking.
How do journalists verify a report's accuracy?
They don't verify "how the reporter gathered information" β how many calls they made, how many sites they visited, how many people they talked to. Process is too complex. Every reporter does it differently.
What they verify is results:
Doesn't matter how the reporter got the information. As long as the final reported facts are accurate, it passes.
AI Agents can be tested the same way.
The core framework is just three steps:
| Step | What to Do | Analogy |
|---|---|---|
| Before | Check initial system state | What things looked like before the event |
| Action | Let Agent execute the operation | Reporter goes to investigate |
| After | Check final system state | Verify if the report is accurate |
Example with "transfer bed" functionality:
before_bed = query("SELECT bed_id FROM admission WHERE patient_name='Zhang San'")
available_beds = query("SELECT * FROM bed WHERE room_type='postpartum' AND status='empty'")
agent("Transfer Zhang San to postpartum ward")
after_bed = query("SELECT bed_id FROM admission WHERE patient_name='Zhang San'")
assert after_bed in available_beds
The key here:
I don't care how the Agent did it. I only care whether the system state is correct after it's done.
Agent wants to query bed table first or room table first? Up to it. As long as Zhang San ends up in the right bed, it's correct.
This method has several advantages:
Agent's "process" might differ each time, but the "result" is deterministic.
No need to mock Agent's internal behavior. No intercepting API calls. No verifying generated SQL.
Just query database state directly: what it looked like before, what it looks like after.
If the Agent upgrades later (new model, changed prompt), as long as results are still correct, tests pass.
Internal implementation can change freely. Interface contract stays stable.
One detail worth mentioning: test fixtures.
Write operations change database state. Without isolation between tests, later tests get polluted by earlier ones.
Solution: restore database to initial state before each test:
@pytest.fixture(autouse=True)
def reset_database():
restore_from_backup()
yield
Like fact-checking: each time you verify a new piece of news, you start fresh, unaffected by previous checks.
Testing solves "verifying correctness during development."
But there's another problem: nobody runs tests in production.
Agent might execute hundreds of operations daily after going live. How do you review when something goes wrong?
Like news organizations needing to keep interview recordings, raw files, edit history β audit trails.
AI Agents need the same:
I researched this β the field is called Governance or Audit. The Strands framework has interfaces for writing logs to S3, DynamoDB, etc.
Didn't implement this time β current priority is getting Agent core working. But I noted this requirement. Must do before launch.
Knowing what needs to be done matters more than doing it right now.
How to test systems with uncertainty:
Don't test process, test results. Before β Action β After.
This framework applies beyond AI Agents. Any system that "changes world state" can be tested this way:
Simple frameworks are often the most useful.