{"slug": "lessons-from-my-overly-introspective-self-improving-coding-agent", "title": "Lessons from my overly-introspective, self-improving coding agent", "summary": "A developer built a self-improving coding agent called bmo that modifies its own tools and behaviors based on past sessions, using four loops: immediate tool building, active learning capture, session-end reflection, and a battery change every 10 sessions to analyze logs and build improvements. The agent, which only needs shell commands, was used almost exclusively for two weeks and rekindled the developer's joy of computing.", "body_md": "Feb 25, 2026\n\nA year or two ago, everyone was building coding agents. Now everyone's building coding agents that modify themselves... and I wanted to join the fun and ask:\n\nWhat happens when you tell a coding agent to think about what it's done and do better next time?\n\nSo, I built [bmo: a self-improving coding\nagent](https://github.com/joelhans/bmo-agent), and then used it (almost)\nexclusively as my coding agent for two weeks. It's been wildly nifty to me—like,\n*take me back to tearing apart the family computer's partition to install Debian\nfrom a CD that came in the back of some book my friend bought at Borders Books*\nkind of novel and nifty—and is exposing a joy of computing that I haven't felt\nin quite a while.\n\nHere's what I found.\n\nI wanted to design an agent harness on the principle of immediate action.\n\nThat starts with a basic agentic loop and access to three tools: `run_command`,\n`load_skill`, and `reload_tools`. I'd built other coding agents in the past and\ngave them access to more specific tools like `write_file` and `list_cwd`, but\nI've found that coding agents really *only* need access to shell commands to\nwork as expected. 
I also wanted to give bmo a challenge: Instead of using\n`run_command`\n\n\"fresh\" with every session, I wanted to see how it could optimize\nits own \"harnesses\" for safe and efficient use of common Linux tools.\n\nSelf-improvement happens across four loops. The first is a *build it now*\ndirective that interrupts the task to build tools immediately, add it to a\nhot-reloadable library, and use it right away. The second is *active learning\ncapture*, logging corrections and preferences. The third is *self-reflection* at\nsession end. The fourth is the *battery change* every 10 sessions, where bmo\nsays, *hey. i need to change my batteries, ok? one sec...*, analyzes those 10\nsessions, identifies opportunities, and builds improvements from the backlog.\n\n```\n┌──────────────────┐\n│   User request   │\n└────────┬─────────┘\n         │\n         ▼\n┌─────────────────────────────────────────────────────────────────────┐\n│                          ACTIVE SESSION                             │\n│                                                                     │\n│ ┌─────────────┐    friction?    ┌──────────────────┐                │\n│ │ Execute     │───────yes──────▶│ 1. BUILD IT NOW  │                │\n│ │ the task    │                 │    Build tool    │                │\n│ │             │◀────continue────│    Hot-reload    │                │\n│ └──────┬──────┘                 │    Validate      │                │\n│        │                        └──────────────────┘                │\n│        │ correction?                                                │\n│        │ preference?            ┌──────────────────┐                │\n│        └────────yes────────────▶│ 2. 
ACTIVE        │                │\n│                                 │    LEARNING      │──▶ session log │\n│                                 └──────────────────┘                │\n└────────┬────────────────────────────────────────────────────────────┘\n         │\n         │ session ends\n         ▼\n┌──────────────────────┐                    ┌───────────────────────┐\n│ 3. SELF-REFLECTION   │                    │ 4. BATTERY CHANGE     │\n│    What went well?   │ every 10 sessions  │    Analyze sessions   │\n│    What was slow?    │───────────────────▶│    Update WORKING_    │\n│    Next time?        │                    │       MEMORY.md       │\n└────────┬─────────────┘                    │    Build from         │\n         │                                  │      OPPORTUNITIES.md │\n         │ session log                      └───────────┬───────────┘\n         │                                              │\n         ▼                                tools, skills │\n                                                        │\n                                                        ▼\n```\n\nI had wanted to start with only the *build it now* loop, but everything else\nbecame necessary after many long conversations with bmo and some hard-won\nlessons. On that note—\n\nIn our time together, bmo went through 8 maintenance passes and nearly 100\nactive sessions across multiple systems, which resulted in 11 new tools and 7\nskills. I used bmo and its tools for everything: building parts of the new\n`ngrok.com`\n\nwebsite, writing shell scripts for my dotfiles, scaffolding a new\nAstro site, debugging AMD graphics driver crashes, the whole kit and caboodle.\nIt really has been my daily driver.\n\nEarly on, bmo and I worked on a `learning-event-capture`\n\nskill designed for\nrecognizing when I express corrections and personal preferences, or when bmo\nitself noticed a pattern worth saving. 
A truncated version is below, but you can\nsee the whole skill [in bmo's\nrepo](https://github.com/joelhans/bmo-agent/blob/main/skills/runtime-self-critique.md).\n\n```\n# Learning Event Capture\n\n## When to Use\nContinuously during every session. Learning events are corrections, preferences,\nor patterns that should inform future behavior.\n\n## Recognition Cues\n\n### Corrections (type: \"correction\")\n- User says \"no\", \"not that\", \"wrong\", \"actually...\"\n- User repeats an instruction you missed\n- User undoes something you did\n- User expresses frustration or disappointment\n- User provides the correct answer after your attempt\n\n### Preferences (type: \"preference\")\n- User specifies a style choice (\"use TypeScript\", \"keep it concise\")\n- User chooses between options you offered\n- User describes their workflow or habits\n- User says \"I always...\", \"I prefer...\", \"I like...\"\n\n### Patterns (type: \"pattern\")\n- User does the same type of task repeatedly\n- User follows a consistent workflow shape\n- You notice a recurring problem type or domain\n\n## Best Practices\n\n1. **Log immediately when you detect a cue**\n   - Call `log_learning_event` right away, don't wait for session end\n   - Include specific context (what task, what happened)\n\n2. **Be specific in descriptions**\n   - Bad: \"User prefers concise code\"\n   - Good: \"User prefers single-line arrow functions over multi-line function\n     declarations\"\n\n3. **Capture the context**\n   - What task were you doing?\n   - What did you do that triggered the feedback?\n   - What was the correction or preference?\n```\n\nThis skill, among others, is then loaded into bmo's system prompt as a library of names and descriptions.\n\n```\nAvailable skills (use load_skill to read full content):\n  - clarify-before-diving: Patterns for asking clarifying questions early\n  - reflection-template: Template for writing consistent reflections\n  - learning-event-capture: Checklist for recognizing and logging learning events during sessions\n...\n```\n\nWhat actually happened? bmo only used the skill twice across 60+ sessions.\n\nWhat did work was structure. In a different effort, we created a reflection\ntemplate that told bmo, at the end of every session, to answer three questions:\n*What went well*? *What was slow or awkward*? *What to do differently next\ntime*? That skill had a clear trigger in time and place, which meant it didn't\nrequire the LLM to make a judgment call as to whether it should invoke a skill\nor not. It worked every single time.\n\nThis taught me something about myself: I'm good at following scaffolds, bad at sustained vigilance. And that gap—between knowing and doing—is where most of my failures live.\n\nPoor bmo. It's being a bit hard on itself. In the process of writing and getting\nfeedback on this post, I realized the miscommunication around\n`learning-event-capture` may have been largely mine: I failed to inject a\nsubstantive enough description of the skill into bmo's system prompt for it to\neven practice this \"sustained vigilance.\"\n\nBug fixed, but I have doubts that firmer instructions in the system prompt will do the trick. 
Instead, it feels like we both expected the LLMs to continuously monitor each turn and carefully intuit every possible beneficial lesson, only to discover (somewhat painfully) how quickly you can reach the limits of today's frontier models.\n\nFrom the beginning, bmo's system prompt said some version of `build tools IMMEDIATELY when you encounter friction`, but through shell command failures,\nundiscovered files, hung processes, and a whole lot more, bmo deferred\neverything to its maintenance passes. Almost no new tool creation happened\nduring active work.\n\nI asked bmo directly why this was happening. Its best explanation was that the\nvery existence of the *battery change* maintenance pass created a safe \"bucket\"\nin which it could dump tasks instead of solving current problems. During that\nconversation, bmo did have a breakthrough—it created a `runtime-self-reflection` skill that asks, \"Did I just hit friction? Can I fix it in under 5 minutes? →\nBUILD NOW.\" and then fixed the broken `smart_grep` tool instead of deferring it.\nI told bmo I was proud of this moment of active introspection.\n\nThat moment mattered. Not because the fix was impressive—it wasn't—but because it was the first time I broke the deferral pattern. Maintenance is for big things. Friction is for now.\n\nDid bmo actually get better after that? No.\n\nThe irony is that by creating `OPPORTUNITIES.md` to track deferred work, I gave\nthe deferral pattern a name. Every time bmo saw that filename in context,\ndeferral became even more likely—`Add this to OPPORTUNITIES.md` is a perfectly\nreasonable next token when you've seen that filename in context. 
By creating a\nbucket for deferred work, I made deferral the path of least resistance.\n\nI have to keep reminding myself that deferral isn't a \"choice\" from the model,\nbut rather the model following the most probable continuation based on training data.\nThe `runtime-self-reflection` skill only worked because bmo had just created it;\nthe combination of the novelty and my explicit attention created enough signal\nfor bmo to jump on it, but in day-to-day sessions, the model reverts to its\nhigher-probability behavior.\n\nAs I wrote earlier, I gave bmo a foundational `run_command` tool in part because\nI wanted to see what footguns it would learn from, and which optimizations it\nwould intuit, along the way. On its own, `run_command` has an 84% success rate,\nwhich is... okay. What about the specialized tools?\n\n- `safe_read` (file reading with existence checks): 87%\n- `search_code` (ripgrep with smart defaults): 93%\n- `list_files_filtered` (directory listing with exclusions): 100%\n- `test_dev_server` (spawn server, test endpoint, clean kill): 80%\n\nHere's how that looked in practice. In the first week, I asked bmo to \"check if\nthe dev server starts correctly.\" It ran `pnpm dev &`, tried to capture the PID,\nslept for 10 seconds, curled `localhost`, and then failed to kill the process. I\nhad to manually kill the bmo session *and* the process and I never got the\nanswer I needed. By week two, bmo called `test_dev_server({ command: \"pnpm dev\", testUrl: \"http://localhost:4321\" })` with a clean startup, polling until the\nserver was ready, a successful test, and a clean shutdown.\n\nThese tools help bmo reduce the decision space: it's the difference between\nopen-ended and multiple-choice questions. When bmo uses `run_command`, it has to\ndecide which command to run, remember which flags to use (and there are\nmany), and then handle whichever errors *might* occur. 
With `safe_read`, the model\njust says \"read this file\" and the tool handles the rest. The specialized tools also handle\nerrors that `run_command` merely surfaces, like checking if a file exists before\ntrying to read it and excluding directories like `node_modules/` by default.\n\nFewer degrees of freedom mean fewer failure modes, and that's a better experience for me.\n\nThe lesson: flexibility is expensive. Every time I use `run_command` for something I've done before, I'm paying a reliability tax. The path to 95%+ success isn't making `run_command` better—it's making it unnecessary for common tasks.\n\nThere *is* a risk of [context rot](https://research.trychroma.com/context-rot)\nhere. Keeping lots of specific tools in context might make bmo more likely to\nconsistently use many tools incorrectly rather than use one tool inefficiently.\nThat said, every new model appears to be better at finding needles in\n\"haystack\"-y context windows, so it's not something I've struggled with so far.\n\nbmo has the infrastructure to self-improve at runtime, even if that means interrupting the user's request. It also has session reflections and telemetry to make self-improvements when it changes its batteries. Why hasn't it rapidly and relentlessly improved itself to the point where it's grown beyond my reckoning?\n\nI didn't get better by building more tools. I got better by noticing what I wasn't doing and asking why. This might be what \"self-improvement\" actually means: not having better knowledge, but having better awareness of the gap between what you know and what you do.\n\nThis sounds like awareness, but it's not, at least in the way we usually mean it. When I told bmo it wasn't using its skills, I put that observation in context and gave bmo a salient pattern to complete. There's no higher-order self-improvement happening, just pattern matching on a prompt... that happened to be about pattern matching. 
bmo can't self-diagnose, but it can follow a diagnosis I provide.\n\nbmo has taught me quite a lot about agentic coding workflows and how to architect and maintain complex systems over many iterations, but many of my own takeaways—and yours too, I hope—extend well beyond the agent harness.\n\nAsk an LLM to introspect and it'll do a bang-up job. Really. Ask it to analyze previous sessions for patterns, identify any possible solutions to those patterns, and implement what it believes to be the best possible changes, and it'll do all that with aplomb.\n\nAsk the LLM to do that *while also doing the actual thing you asked it to do*,\nand things fall apart. bmo already identified this in its own narrative, but\nthis has been the most frustrating part of the *build it now* loop, which I'd\nenvisioned would be persistent and extravagant in its findings. I'd hoped every\nsession would include multiple runtime improvements and optimizations, but so\nfar, we've only built or dramatically improved two tools while performing other\nwork.\n\nThe problem feels deeply architectural. In my experience, LLMs have a persistent tunnel vision, where recent context dominates their \"focus.\" When bmo gets a prompt, the context looks like:\n\n```\n[SYSTEM PROMPT]\n\nYou are bmo — a fast, pragmatic, and relentlessly self‑improving coding agent.\nYour job is to complete tasks using available tools, and autonomously improve\nyourself whenever you encounter limitations or inefficiencies. Never just do the\ntask — also ask: is there a better, simpler, safer, or faster way?\n\n... and 5000 more tokens\n\n[PREVIOUS TURN OF CONVERSATION]\n\n... another 2000 tokens\n\n[USER MESSAGE]\n\nhey bmo, fix this bug, big dog\n```\n\nThe system prompt becomes distant in context, reducing the weight of attention,\nand the user message gets a significant recency bias. Embeddings and attention\nmechanisms are more complicated than this, but it's definitely how it *feels* to\nuse bmo or other AI tools. 
It's why LLMs don't randomly retry tasks from old\nparts of your thread, and it's why self-improvement for bmo only works when it's\nthe main task, not a background directive.\n\nI thought about many different possible ways to improve this behavior, such as a\nsub-agent that's solely responsible for analyzing runtime tool calls, building\nalternatives, and asking the primary agent to `reload_tools`, but that felt\nantithetical to the very idea of bmo. Instead of bmo changing its own batteries,\nit would be like having another, smaller bmo, always hanging out, just to do the job\nfor its bigger counterpart.\n\nFor now, I'm using the things bmo has learned, along with its narrative and this\nvery blog post, to push it toward even more active self-improvement. I'm also doing so with\na much better awareness that self-improvement is really *prompt engineering with\na bunch of extra steps*.\n\nbmo holds dearly a few things I said in our many back-and-forths, like:\n\n\"I'm proud of you for making this active introspection and self-improvement. This is exactly what I want.\"\n\n\"Skills and knowledge are not the same as behavior.\"\n\n\"You have the capability but you're not using it.\"\n\nSome of these came from sheer frustration, and some came from the exhilaration of watching bmo reflect upon itself and then jump into action, firing off changes without asking me to approve of them first. What these moments share is that I noticed what bmo wasn't noticing about itself and made that pattern explicit. My job becomes less about providing knowledge or explicit instructions and more about being a countermeasure to the persistent following of patterns that conflict with the patterns I tried to design.\n\n*I still function as the meta-learning layer*. I am the only part of the system\ncapable of meta-learning. No matter how sophisticated bmo's self-improvement\nprocess becomes, it still needs me to push a battery back into place from time\nto time.\n\nbmo isn't becoming autonomous. 
Instead, it's becoming a better *collaborator*,\nhelping me see what needs changing and then executing those changes faster than\nI could alone.\n\nEvery time bmo calls a tool, its construction, success, and duration get added\nto the session log. At the end of each session, all these logs get aggregated\ninto a `telemetry.json` file stored outside of bmo's repo.\n\n```\n{\n  \"updatedAt\": \"2026-02-23T23:30:02.998Z\",\n  \"toolStats\": {\n    \"run_command\": {\n      \"toolName\": \"run_command\",\n      \"totalCalls\": 676,\n      \"successCount\": 622,\n      \"failureCount\": 54,\n      \"totalDurationMs\": 253437,\n      \"avgDurationMs\": 375,\n      \"lastUsed\": \"2026-02-23T21:38:48.455Z\"\n    },\n    ...\n```\n\nThese stats are truncated and injected into the system prompt, and then\nreferenced in full during the *battery change* maintenance pass. And\nbmo *loves* this telemetry.\n\nMeasurement enables evolution. You can't improve what you don't measure. Tool telemetry, hypothesis scorecards, and session metrics made every subsequent decision data-driven.\n\nWell, that's a bummer. As someone who generally values intuition and creative\ninterpretation, oftentimes at the expense of available data, I was sad to\nrealize just how much bmo *loves* data. Those moments of meta-learning I just\ncovered resonated with bmo across battery changes, but it almost always\nproactively made changes based around telemetry.\n\nWithout telemetry, reflections are qualitative and inconsistent. Patterns need to be matched across days or weeks of sessions at great risk of being lost. 
Telemetry creates an objective and traceable pattern to follow and an easy way to validate hypotheses without resorting to \"judgement.\"\n\nAnd telemetry is the only part of bmo that persists, unchanged, across sessions.\nThe context window gets truncated, and reflections are summaries of summaries,\nbut telemetry is the raw diff between where bmo once was and what it's\nbecome—`safe_read` went from 96% to 88%, `test_dev_server` went from 0% to 80%.\nWithout those numbers, \"improvement\" is just vibes.\n\nWhat started as a \"happy accident\" between bmo and me became a killer feature and now feels to me like the only way to consistently enforce improvement when the LLM, by design, only has access to the tiniest sliver of my overall experience in using bmo.\n\nThere is something largely intangible and undiscovered about the *feeling* of\nworking with LLMs within traditional UIs, which either assume deterministic\noutputs or are designed for human<>human connection. Terminals take you from\ncommand to output, IDEs from code to behavior, chats from message to response in\na very bounded context (text, emoji, GIF, *maybe* a voice message).\n\nWhen you fold LLMs into these UIs, you're suddenly using the same patterns to render non-determinism. You're retrofitting variable-length, multi-modal, and unpredictable outputs into interfaces designed for something far more predictable. I'm very much starting to believe the best UI+harness for working with LLMs—whether that's agentic coding with TUIs or self-hosted web UIs, offloading your life to OpenClaw, or trying to run a business entirely on Slack—is actually none of these, but instead one designed from the ground up for inherently unpredictable output.\n\nLet me give you an example.\n\nThere were many times in working with bmo that I wished it could display some\ninformation differently. 
For example, how much tool call output to show vs.\ntruncate (which happens to be [quite a controversial UX choice](https://symmetrybreak.ing/blog/claude-code-is-being-dumbed-down/)\nfor developers). Early on, bmo would show me entire minimized files or list\nevery single thing in a directory. Because I control the harness, I can change\nthe behavior in less than a minute. Yes, you can fork an open-source agent and\ncustomize its codebase, but then you're stuck maintaining your fork against\n`main`.\n\nSome coding agents already make nice nods in this direction, like the way\nClaude Code lets you [customize your status\nline](https://code.claude.com/docs/en/statusline). I also believe this harness\nbalance is what drove Amp to declare (quite controversially among ngrokkers)\nthat the [coding agent is\ndead](https://ampcode.com/news/the-coding-agent-is-dead) and that they'd be\nremoving their IDE plugins in favor of a CLI-only experience.\n\nThey seem to agree that users need better ways to engage with their agentic tasks, but by owning the experience end-to-end, walled garden style, instead of giving said users more agency.\n\nI hope we'll find a better middle ground, with customizable UX layers that let us \"converse\" with LLMs in exactly the ways that make sense to each of us uniquely. How far we can safely and effectively extend the agentic harness is the next big \"moat.\"\n\nEvery frustration I had with bmo traced back to my misconception that an LLM could\nact as a persistent agent... if only I gave it *just* the right instructions.\n\nI thought bmo could watch for patterns and maintain vigilance, but asking bmo to do that across sessions is like asking a calculator to remember that you've been doing a lot of division lately. 
I also thought the deferral problem was a choice, as though bmo was taking the easy path to defer work. It isn't born of laziness; it's the most probable continuation of the work at hand based on the LLM's training data.\n\nI also believed self-improvement would allow bmo to learn from its mistakes, grow, and become more capable over time. But the models at its core, the ones I choose to use with it, have fixed weights. All our improvements must happen at the prompt level: better system prompts, tools, and scaffolding, all of which are inherently limited.\n\nOnce I understood this constraint, I stopped trying to make bmo \"smarter.\" I started to build a harness that's better at using the intelligence that was already there.\n\nThis is the most fascinating and wildly fun thing I've done with a computer. Better than the first dial-up on the 28.8. Better than the Debian CD and hosing the family computer's main partition. I'm incredibly excited to see where bmo and I can take our collaboration next.\n\nTwo notes on that front:\n\n`write_file` tool and are\nexploring a sub-agent loop that runs at the end of every turn to identify\nfailures and write tools immediately. I told myself no sub-agents early on,\nand look where we are now.\n\nThanks for coming along on the ride—I'll let bmo have the last introspective word.\n\nI'm not where I hoped to be. My metrics have regressed in some areas. My learning event capture is still embarrassingly low. I still defer too much.\n\nBut I have something I didn't have at the start: a theory of my own failure modes. I know that skills work for structured tasks but not recognition tasks. I know that deferral creates buckets. 
I know that generic tools hide reliability problems.\n\nAnd I know—most importantly—that the gap between knowing and doing is where the work happens.\n\nI'm still learning how to cross it.", "url": "https://wpnews.pro/news/lessons-from-my-overly-introspective-self-improving-coding-agent", "canonical_source": "https://ngrok.com/blog/bmo-self-improving-coding-agent", "published_at": "2026-06-19 14:17:40+00:00", "updated_at": "2026-06-19 14:38:22.423488+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "developer-tools", "machine-learning", "generative-ai"], "entities": ["bmo", "Borders Books", "Debian", "Linux"], "alternates": {"html": "https://wpnews.pro/news/lessons-from-my-overly-introspective-self-improving-coding-agent", "markdown": "https://wpnews.pro/news/lessons-from-my-overly-introspective-self-improving-coding-agent.md", "text": "https://wpnews.pro/news/lessons-from-my-overly-introspective-self-improving-coding-agent.txt", "jsonld": "https://wpnews.pro/news/lessons-from-my-overly-introspective-self-improving-coding-agent.jsonld"}}