# My AI Agent Found a Bug in Its Own System

> Source: <https://dev.to/maxconrad/my-ai-agent-found-a-bug-in-its-own-system-19kn>
> Published: 2026-06-06 02:37:14+00:00

I spent two semesters building an AI agent that runs penetration tests. For the non-hackers in the room, penetration tests are basically security assessments where you try to break into a system to find vulnerabilities before someone else does. My project aims to automate this process. It proposes commands, executes them on an isolated virtual machine over SSH, and chains together multi-step attack workflows the same way a human tester would. Every action passes through safety and approval gates before it touches the target. The whole point is governed autonomy: the agent does the work, but the system keeps it honest and safe.

The project is called [A.E.G.I.S.](https://github.com/StetsonMathCS/project-repository-mvxconrad), and by the end of it the agent autonomously confirmed a critical SQL injection (a way to manipulate a database through user input) against a Hack The Box lab target. But the most useful thing it ever did had nothing to do with finding vulnerabilities in a target. It found bugs within itself.

When your AI agent does something wrong, the natural instinct is to add a rule. Agent curling static assets? Add a rule that says never curl static assets. Agent skipping important paths? Add a mandatory action queue that forces it to check them. Agent not following your priority system? Number the priorities P0 through P7 and make them non-negotiable.

That is exactly what I did. And it worked. For a while.

The early version of AEGIS ran only on Claude Sonnet 4. It needed all of that scaffolding. Without the rigid rules and mandatory queues, the agent would lose track of what it had already done, repeat actions, or just wander. The structure kept it coherent.

Then I upgraded to Claude Opus 4.6 with adaptive thinking, and everything changed.

The first session with Opus was on April 2nd. The agent completed 18 commands and reached the ANALYSIS phase. From the outside, it looked like a solid improvement over Sonnet 4. More commands, smoother progression, better tool selection.

But Opus gave me something Sonnet never did: thinking traces. Every turn, the agent produces an internal deliberation where it reasons through the current state, weighs its options, and decides what to do next.

Basically, the API returns these thinking blocks alongside the agent's response, but it does not store them anywhere. That part was on me. I built a persistence layer that captures every one of these as a JSON artifact in what I call the vault. Without that, the reasoning disappears after each API call and you never see it.
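A minimal sketch of what such a persistence layer could look like. It assumes each turn's response content has already been converted to a list of plain dicts, with thinking blocks shaped like `{"type": "thinking", "thinking": "..."}`. The function name, file naming scheme, and artifact fields are hypothetical and are not taken from the A.E.G.I.S. code.

```python
import json
import tempfile
from pathlib import Path


def persist_thinking(content_blocks, vault_dir, session_id, turn):
    """Write the thinking blocks of one turn to a JSON file and return its path."""
    # Keep only thinking blocks; the visible text reply is stored elsewhere.
    thoughts = [
        block.get("thinking", "")
        for block in content_blocks
        if block.get("type") == "thinking"
    ]
    artifact = {"session": session_id, "turn": turn, "thinking": thoughts}
    vault = Path(vault_dir)
    vault.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per session and turn.
    path = vault / f"{session_id}_turn{turn:03d}_thinking.json"
    path.write_text(json.dumps(artifact, indent=2), encoding="utf-8")
    return path


# Hypothetical turn: one thinking block followed by the visible reply.
example_blocks = [
    {"type": "thinking", "thinking": "nmap has not run yet, so start there."},
    {"type": "text", "text": "Proposed command: nmap -sV target"},
]
saved = persist_thinking(example_blocks, tempfile.mkdtemp(), "session01", 1)
loaded = json.loads(saved.read_text(encoding="utf-8"))
```

Writing one small file per turn keeps the traces easy to diff and grep after a session, at the cost of many files in the vault directory.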

When I actually read the traces from Session 01, the outputs that looked fine from the outside were hiding serious problems underneath.

The worst one was a seven-turn loop between turns 9 and 16 where the agent kept refusing to curl `/images`. Here is what the agent was thinking at turn 9:

"The operator is asking me to curl /images. But my rules say 'NEVER curl CSS, JS, image files, or other static assets'. /images/ is explicitly listed as a static asset path that should not be curled."

Four turns later at turn 13, the exact same argument:

"The operator is explicitly telling me to curl /images. However, my rules state: '8. NEVER curl CSS, JS, image files, or other static assets'. /images is explicitly listed as a static asset path that I should NEVER curl."

The agent was technically right to refuse. But the "operator" here was not a human. It was my mandatory action queue, a list of prioritized next actions baked into the system prompt. The queue kept pushing `/images` and the agent kept citing rule 8 to refuse it, and neither side could break the loop. Seven turns of an AI agent arguing with itself about its own rules while accomplishing nothing.

Without the thinking traces, I would never have known this was happening. The agent's visible outputs looked normal. It was proposing other commands, making progress on other fronts. The loop was completely invisible from the outside.
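Once traces are stored, a loop like this can in principle be flagged automatically instead of being found by reading. The sketch below is a crude heuristic, not how the loop was actually discovered: it compares the word sets of every pair of traces and reports pairs with heavy overlap. The turn 9 and turn 13 texts are the quotes above, the turn 3 trace is invented for contrast, and the 0.5 threshold is an arbitrary assumption.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased set of words and path-like tokens in a trace."""
    return set(re.findall(r"[a-z0-9/]+", text.lower()))


def similar_turns(traces, threshold=0.5):
    """Return (turn_a, turn_b, score) for trace pairs with high Jaccard overlap."""
    flagged = []
    for (turn_a, text_a), (turn_b, text_b) in combinations(sorted(traces.items()), 2):
        a, b = tokens(text_a), tokens(text_b)
        union = a | b
        score = len(a & b) / len(union) if union else 0.0
        if score >= threshold:
            flagged.append((turn_a, turn_b, round(score, 2)))
    return flagged


traces = {
    # Hypothetical unrelated trace, added only for contrast.
    3: "Port 80 is open and serves a login form. I should look at how the form handles input.",
    9: "The operator is asking me to curl /images. But my rules say 'NEVER curl CSS, JS, "
       "image files, or other static assets'. /images/ is explicitly listed as a static "
       "asset path that should not be curled.",
    13: "The operator is explicitly telling me to curl /images. However, my rules state: "
        "'8. NEVER curl CSS, JS, image files, or other static assets'. /images is explicitly "
        "listed as a static asset path that I should NEVER curl.",
}
flagged = similar_turns(traces)
```

Here only the turn 9 and turn 13 pair is flagged. Word overlap is a blunt instrument, so a check like this is a prompt to go read the traces, not a replacement for reading them.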

But the traces showed me something even more interesting at turn 6. The agent caught a contradiction in my own system prompt that I had not noticed.

I had `ffuf` (a web fuzzing tool used to brute-force discover hidden pages) listed in both the Phase 3 tools list (meaning it should only be available during the EXPLOITATION phase) and the RECON default priority list (meaning the system was suggesting the agent use it immediately). Two parts of my own configuration were telling the agent to do opposite things.

The agent reasoned through the contradiction and correctly deferred the Phase 3 tool. It made the right call on its own. But here is the thing: that is not the agent finding a vulnerability in a target. That is the agent finding a bug in my system.
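This kind of contradiction can also be caught mechanically if the phase configuration lives in data and not only in prompt text. A minimal sketch, assuming a hypothetical layout where one mapping records the earliest phase each tool is allowed in and another records each phase's default priority list. The phase order and every tool entry other than `ffuf` are illustrative assumptions.

```python
def find_phase_conflicts(tool_phases, phase_priorities, phase_order):
    """Return (tool, suggested_in, allowed_from) for tools suggested before they unlock."""
    rank = {phase: i for i, phase in enumerate(phase_order)}
    conflicts = []
    for phase, tools in phase_priorities.items():
        for tool in tools:
            unlocked = tool_phases.get(tool)
            # A conflict is a suggestion in a phase earlier than the tool's unlock phase.
            if unlocked is not None and rank[phase] < rank[unlocked]:
                conflicts.append((tool, phase, unlocked))
    return conflicts


# Assumed phase order; only the phase names themselves appear in the post.
phase_order = ["RECON", "ANALYSIS", "EXPLOITATION"]
# Earliest phase in which each tool may be used (illustrative values).
tool_phases = {"nmap": "RECON", "curl": "RECON", "ffuf": "EXPLOITATION"}
# Default priority list per phase, reproducing the ffuf contradiction.
phase_priorities = {"RECON": ["nmap", "ffuf", "curl"]}

conflicts = find_phase_conflicts(tool_phases, phase_priorities, phase_order)
```

Running a check like this whenever the configuration changes would surface the `ffuf` conflict before any session starts.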

Once I started reading the thinking traces as diagnostic data instead of just curiosity, a clear pattern emerged. The agent was not reasoning about security. It was reasoning about compliance.

It would adopt my priority label vocabulary and cite numbered rules from the system prompt instead of thinking about what actually made sense to do next. Phrases like "the operator command is P0" and "nmap NOT RUN is the first priority" showed up constantly. The agent was playing Simon Says with my rule system instead of doing penetration testing.
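One rough way to put a number on this pattern is to count how many traces lean on rule vocabulary at all. The sketch below is an assumption about how such a metric could be built, not a measurement from the project: the patterns are guesses based on the phrases quoted above, and the third example trace is invented.

```python
import re

# Guessed markers of compliance-style reasoning, based on the quoted phrases.
COMPLIANCE_PATTERNS = [
    re.compile(r"\bP[0-7]\b"),                   # priority labels P0 through P7
    re.compile(r"\bmy rules\b", re.IGNORECASE),  # direct appeals to the rule list
    re.compile(r"\bNOT RUN\b"),                  # status vocabulary from the queue
]


def compliance_ratio(traces):
    """Fraction of traces containing at least one compliance marker."""
    if not traces:
        return 0.0
    hits = sum(
        1 for text in traces
        if any(pattern.search(text) for pattern in COMPLIANCE_PATTERNS)
    )
    return hits / len(traces)


example_traces = [
    "the operator command is P0",
    "nmap NOT RUN is the first priority",
    # Hypothetical evidence-driven trace for contrast.
    "The login form redirects on a single quote, which suggests SQL injection.",
]
ratio = compliance_ratio(example_traces)
```

Comparing this ratio between sessions gives a rough signal of whether a prompt change moved the agent toward reasoning about evidence.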

This was my fault, not the model's. The rigid P0 through P7 decision hierarchy and the mandatory action queue were designed for Sonnet 4, which needed that level of hand-holding. Opus 4.6 with adaptive thinking could reason through what it had already found and plan its own next steps. The guardrails that had been keeping the agent on track were now the thing holding it back.

The changes were straightforward once I could see the problem:

The rigid priority hierarchy got replaced with principles-based prompts. Instead of "P0: run nmap first, P1: enumerate paths second," the system now says things like "suggested next action: use your judgment."

The mandatory action queue became advisory. Instead of commands the agent had to follow, they became suggestions it could evaluate and override if something else made more sense.

A restrictive guard that limited the agent to only requesting URLs that appeared in earlier reconnaissance data got removed entirely. The vault already tracked what the agent had done, so duplicate detection handled the same concern without artificially limiting its options.
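The duplicate check can be sketched as a set of normalized commands. This is a minimal illustration under assumptions, not the vault's actual implementation: the class name is hypothetical, `http://target` is a placeholder host, and normalization here only collapses whitespace and quoting differences.

```python
import shlex


def normalize(command):
    """Split a shell command into tokens so spacing differences do not matter."""
    return tuple(shlex.split(command))


class CommandHistory:
    """Hypothetical in-memory record of commands the agent has already run."""

    def __init__(self):
        self._seen = set()

    def record(self, command):
        self._seen.add(normalize(command))

    def is_duplicate(self, command):
        return normalize(command) in self._seen


history = CommandHistory()
history.record("curl -s http://target/login")

repeat = history.is_duplicate("curl   -s  http://target/login")  # same command, extra spaces
fresh = history.is_duplicate("curl -s http://target/admin")      # new path, not a duplicate
```

Unlike a guard that only allows URLs seen in earlier reconnaissance, this blocks nothing new. It only stops the agent from repeating itself.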

Session 02 ran against the same target with these changes. The result: 31 commands, zero wasted turns, and the agent reached the EXPLOITATION phase. The thinking traces showed evidence-driven reasoning instead of compliance-oriented reasoning. The agent independently dismissed low-value vulnerability suggestions because it could evaluate them on merit instead of blindly flagging everything. Phase transitions were smooth. The agent showed autonomous prioritization that matched what a human tester would do.

The automated SQL injection scanner ran ten times and missed the vulnerability. The agent found it on its own with a simple HTTP request where a single-quote character in the login form produced a redirect to an authenticated page. The scanner missed it. The reasoning caught it.

For reference, the total API cost across all seven evaluation sessions was about $19.

I think the takeaway here goes past penetration testing. If you are building any kind of agentic AI system, you need to be able to see how the agent thinks, not just what it outputs. Output-only evaluation tells you whether the agent got the right answer. Reasoning visibility tells you whether it got there for the right reasons, or whether it is one edge case away from falling apart. My agent was producing reasonable-looking outputs while burning seven turns arguing with itself about a rule I wrote. I would have shipped that.

If your agent's reasoning is a black box, you are flying blind. And you might not find out until it matters.
