# I have not written a line of code in five months

> Source: <https://blog.grod.es/i-have-not-written-a-line-of-code-in-five-months>
> Published: 2026-06-20 10:35:11+00:00

During the last five months, I have barely written a line of code by hand. I say this for better and for worse: using AI for programming now feels obviously faster in many cases, to the point where manual coding can sometimes feel a bit absurd. You explain the intention, guide the architecture, review the output, ask for changes, ask for tests, ask for a refactor, and suddenly something that would have taken days takes hours.

At the same time, the engineering work has not disappeared; it has moved somewhere else. You are still responsible for the code. You still need architecture, tests, taste, and the ability to understand what is being produced. I would even say all of that matters more now, because generating a lot of code very quickly also means generating a lot of garbage very quickly if you don't know how to guide the process.

I am not trying to write a grand theory of AI here, only my honest view after using LLMs seriously for work projects and personal projects, and after pushing them further than I expected to push them.

## Smart and dumb at the same time

The most frustrating part for me is how unstable the experience feels. You can ask a model to reason about a complex codebase, propose a structure, connect ideas that you had not connected, and sometimes it will one-shot something better than what you had in mind. Five minutes later, you ask it to do one tiny deterministic thing and it fails in the stupidest possible way.

I am not exaggerating here; for me, this has become the normal experience. It feels like working with something that can be one or two levels smarter than you at one moment and completely unable to follow a simple instruction the next. It can design a decent architecture for a service, and then get stuck piling workaround on workaround for a small bug that you would have fixed manually in ten minutes.

For that reason, I don't like the simple question of whether AI is good or bad for coding; the answer depends too much on which part of the work you are looking at. In some dimensions it is very good, and in others it is absurdly bad. If you want to use it seriously, you need to learn where those borders are.

## Prompting is not programming

One of the things I built during this period was an incident investigation bot. The idea was simple: given an alert, the bot should inspect metrics, logs, traces, dashboards, and downstream dependencies until it reaches a useful hypothesis about the root cause.

My main problem with LLMs in this kind of task is laziness. They tend to stop at the first plausible explanation. If an endpoint is failing because a downstream dependency is failing, the model will often say: "the downstream dependency is failing", and stop there. The problem is that a downstream failure is usually only the next node in the investigation, not the actual root cause.

A human keeps going to the downstream service, checks its logs, checks its traces, checks whether it has another downstream dependency, and keeps following the error until reaching a dead end, or at least until reaching the deepest useful explanation.

So I tried to encode that into the prompt. I described the investigation as a graph. Services are nodes. Signals are edges. A log, a metric spike, a span error, a dependency failure: all of those are edges that let you move to another node. The bot has to keep traversing the graph until the path stops being useful.
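The same idea can be sketched in code. This is only an illustration of the graph framing, not how the bot works (the bot was driven by a prompt, not by this function). The service names, the `Signal` type, the `SIGNALS` map, and the `max_depth` limit are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """An edge: evidence seen on one service that points at another."""
    kind: str    # e.g. "log", "metric_spike", "span_error"
    target: str  # the service this signal leads to


# Hypothetical signals per service; a real agent would get these from tools.
SIGNALS = {
    "checkout-api": [Signal("span_error", "payments")],
    "payments": [Signal("dependency_failure", "ledger-db")],
    "ledger-db": [],  # dead end: nothing points further downstream
}


def investigate(start, signals, max_depth=5):
    """Breadth-first walk from the alerting service.

    Returns every path that ends either at a dead end (no unvisited
    signal leads anywhere) or after max_depth hops.
    """
    paths = []
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if len(path) - 1 >= max_depth:
            paths.append(path)
            continue
        extended = False
        for signal in signals.get(node, []):
            if signal.target in visited:
                continue
            visited.add(signal.target)
            queue.append(path + [signal.target])
            extended = True
        if not extended:
            paths.append(path)
    return paths
```

With the sample data, `investigate("checkout-api", SIGNALS)` follows the failure through `payments` down to `ledger-db` instead of stopping at the first failing dependency. The depth limit is a crude guard against walking further than is useful.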

The bot got much better after that, although new problems appeared immediately. Sometimes it would over-traverse. Sometimes it would treat the deepest branch as the root cause, even when another branch was more directly related to the user impact. Sometimes fixing one failure mode in the prompt would break another one.

The strange thing with prompts is that they are not code and they are not deterministic. You cannot write a prompt the same way you write a function and expect the same input to always produce the same behavior. You are not programming a machine in the classical sense. You are shaping the behavior of a probability machine, and the more instructions you add, the more strange interactions you can create.

I don't think this problem is solved. If it were solved, prompt injection and jailbreaks would not exist in the way they exist today. A longer prompt is not necessarily better. Often, what you really want is a prompt that communicates the final objective clearly enough for the model to keep rediscovering the right behavior by itself. Finding that sentence is very hard.

## Context is everything

The next thing I learned, or maybe relearned, was that context matters more than the prompt itself. The model answers your last message through everything around it: its weights, your prompt, the system instructions, and the whole current context of the conversation.

When the context is clean, it behaves better. When the context is full of trash, half-finished attempts, wrong assumptions, noisy tool outputs, and stale conclusions, it degrades.

Todo lists are one of the simplest ways I have found to control the behavior of an agent. The UI is cute, but the important part is the control mechanism behind it.

When I want an agent to do a big task, I don't just say "do it". I first ask it to plan with me, convert that plan into exhaustive checklist items, mark each item as done while it works, and continue until there are no checkboxes left. Small trick, big difference in behavior. The checklist prevents the model from stopping early, gives the agent a local memory of what it is doing, and gives me a way to inspect whether the plan is sane before the agent starts producing code.
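To make the mechanism concrete, here is a minimal sketch that assumes the plan lives in a Markdown file with `- [ ]` checkboxes. The file format, the sample plan, and the function names are my own illustration, not a feature of any specific agent.

```python
import re

# Matches "- [ ] title" and "- [x] title" (also "*" bullets).
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")


def parse_plan(text):
    """Return (done, pending) lists of checklist item titles."""
    done, pending = [], []
    for line in text.splitlines():
        match = CHECKBOX.match(line)
        if not match:
            continue  # headings and free-form notes are ignored
        bucket = done if match.group(1).lower() == "x" else pending
        bucket.append(match.group(2).strip())
    return done, pending


def mark_done(text, title):
    """Tick the first unchecked item whose title matches exactly."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = CHECKBOX.match(line)
        if match and match.group(1) == " " and match.group(2).strip() == title:
            lines[i] = line.replace("[ ]", "[x]", 1)
            break
    return "\n".join(lines)


# Hypothetical plan, only for illustration.
PLAN = """# Plan: add retry to the HTTP client
- [x] Write failing test for timeout retry
- [ ] Implement retry with backoff
- [ ] Update README
"""
```

The stopping rule is then trivial to state: the task is finished only when `parse_plan(plan)[1]` is empty, which is much harder for an agent to fudge than a vague "I think I am done".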

For very long sessions, the plan becomes even more important. The model needs to compact its context from time to time, and before doing that it should write back the important findings, decisions, open questions, and completed work into the plan. After compaction or restart, the plan becomes the brain of the session. The agent wakes up again, reads the plan, and remembers who it is. I know this sounds ridiculous, but after using it for long-running tasks it becomes very clear why it works.

## Tools are a way of protecting context

The context window is precious, and filling it with garbage makes everything worse. Tools matter so much for exactly this reason.

If an agent needs to rename a symbol, I don't want it reading the whole file as text, generating the whole file again, and hoping that the patch applies. I want it to use an LSP, the same way my editor does: ask for the symbol, rename it, inspect references, and move on.

Something similar happens with MCP servers. I know MCP sounds new, but in my head MCP is basically SOAP for LLMs. SOAP, usually paired with WSDL, was the old way of exposing self-describing APIs through XML. MCP is a way of exposing tools to an LLM with enough structure that the model can discover what exists, understand the input schema, and call the right method.
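As a rough illustration of what "enough structure" means, here is a sketch of a tool descriptor. The field names (`name`, `description`, `inputSchema`) follow the MCP tool definition, but the `summarize_trace` tool and the tiny validator are hypothetical, and the validator covers only a small subset of JSON Schema.

```python
# Hypothetical tool descriptor; only the field names come from MCP.
TRACE_TOOL = {
    "name": "summarize_trace",
    "description": "Return the error path of a trace as a compact tree.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "trace_id": {"type": "string"},
            "max_depth": {"type": "integer"},
        },
        "required": ["trace_id"],
    },
}

# Only the two JSON types used above are mapped.
JSON_TYPES = {"string": str, "integer": int}


def validate_arguments(tool, arguments):
    """Check required keys and basic types; return a list of problems."""
    schema = tool["inputSchema"]
    errors = []
    for key in schema.get("required", []):
        if key not in arguments:
            errors.append(f"missing required argument: {key}")
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument: {key}")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            errors.append(f"{key} should be {spec['type']}")
    return errors
```

The point is that the descriptor is small and machine-readable: the model sees a name, one sentence, and a schema, not the implementation behind it.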

For me, the useful part of MCP is very boring: the model can interact with the world without needing the whole world copied into the prompt. The most useful tools are often tiny utilities built specifically to reduce context waste, not huge integrations. For the incident bot, traces were a good example. A trace can be massive: thousands of lines of JSON, spans, attributes, metadata, timestamps, and repeated fields. But for an investigation, the first useful thing is often much smaller: which spans are in error, how the error propagated, and which service boundaries are involved.

So I built a small utility that takes the raw trace JSON and produces a compact ASCII tree of the error path. Instead of giving the model 12,000 lines of JSON, I give it a few lines that show the shape of the failure. If it needs more detail, it can inspect the original span later.
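A minimal sketch of that kind of utility could look like the following. It is not the actual tool: the span format (`span_id`, `parent_id`, `service`, `name`, `status`) is a simplified assumption, and real trace exports carry far more fields.

```python
from collections import defaultdict


def error_tree(spans):
    """Render only the spans on an error path as an indented ASCII tree."""
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    for s in spans:
        children[s.get("parent_id")].append(s)

    # Keep every errored span plus its ancestors, so the path stays connected.
    keep = set()
    for s in spans:
        if s["status"] != "error":
            continue
        current = s
        while current is not None and current["span_id"] not in keep:
            keep.add(current["span_id"])
            current = by_id.get(current.get("parent_id"))

    lines = []

    def walk(span, depth):
        if span["span_id"] not in keep:
            return  # healthy branch: drop it to save context
        marker = "ERROR" if span["status"] == "error" else "ok"
        prefix = "" if depth == 0 else "  " * (depth - 1) + "+-- "
        lines.append(f"{prefix}{span['service']}: {span['name']} [{marker}]")
        for child in children[span["span_id"]]:
            walk(child, depth + 1)

    # Roots are spans whose parent is not present in this trace.
    for s in spans:
        if s.get("parent_id") not in by_id:
            walk(s, 0)
    return "\n".join(lines)


# Hypothetical trace, only for illustration.
TRACE = [
    {"span_id": "a", "parent_id": None, "service": "checkout-api", "name": "POST /orders", "status": "error"},
    {"span_id": "b", "parent_id": "a", "service": "cart", "name": "GET /cart", "status": "ok"},
    {"span_id": "c", "parent_id": "a", "service": "payments", "name": "charge", "status": "error"},
    {"span_id": "d", "parent_id": "c", "service": "ledger-db", "name": "INSERT", "status": "error"},
]
```

For the sample trace, `error_tree(TRACE)` produces three lines: the failing request, the `payments` call under it, and the `ledger-db` insert under that. The healthy `cart` span is dropped entirely.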

That utility is boring, but it made a huge difference, because good agent tooling is often about the shape of the information more than about the amount of information.

## The model can raise your level, but it cannot replace your taste

My current mental model is that an LLM can raise you one or two levels in an area. If you are a 2 out of 10 in something, maybe with an LLM you can produce something that looks like a 3 or a 4.

The usefulness has a dangerous side: if you are bad at judging the result, you will still produce bad work, only now it will look better to you than what you could have produced alone.

If you don't know Terraform, the model can generate Terraform for you. But you still need to know enough to recognize duplication, bad abstractions, broken state assumptions, weird module boundaries, and hidden operational risk. If you don't know Go, the model can generate Go for you. But you still need to know enough to recognize bad error handling, confused ownership, unnecessary interfaces, leaky abstractions, and tests that don't actually test anything.

The model does not remove the need for expertise; it amplifies the expertise you already have. Because of this, I started using multiple conversations against each other. One agent writes the code. Another agent reviews it harshly. Another one checks it against my guidelines. Another one tries to simplify it. Sometimes I ask the same model to behave as if the previous output was produced by a very dumb AI and it has to find everything wrong with it.

Surprisingly, this works quite well. I don't think the model becomes more truthful because of this. I think the role and the context change the distribution of the answer. If you ask it "is this good?", it will often be too nice. If you ask it for a harsh code review and tell it to assume the code was generated by a lazy AI, it finds much more. A bit stupid, but useful.

## Speed creates a new problem: understanding

The first versions of some of my projects appeared absurdly fast. In the past, speed was mostly limited by typing, searching, boilerplate, and the time needed to discover APIs. With LLMs, that part almost disappears.

The bottleneck moves to understanding what now exists. You can wake up with thousands of lines of code that were generated while you were sleeping, which is impressive and dangerous for exactly the same reason.

If something breaks, your first instinct may be to ask the agent for one more fix, and after that another one, and after that another one. At some point you are not engineering anymore. You are prompting around a codebase you don't understand.

A strong test suite saves you there. If you use AI to move faster, you should also use AI to create more tests than you would have written manually. Every bug found should become a test. Every weird piece of business logic should have comments and regression coverage. Every refactor should be backed by something stronger than "the agent says it is fine".

Refactoring helps with the same problem. An LLM will usually not stop and say: "dude, this code is becoming horrible, let's clean it up". If you ask it to add a feature, it will add the feature on top of the current state. It will try to reach the end result, because that is what you asked for.

So you need to explicitly ask for refactoring passes. Ask it to find duplication. Ask it where responsibilities are leaking. Ask it which abstractions are fake. Ask it what can be deleted. Ask it where the tests are too coupled to implementation details.

For me, one of the best uses of AI is doing the boring cleanup that everybody knows should happen and nobody wants to spend a week doing. Before AI, large refactors often died because they were too annoying. After AI, the annoying part is much smaller. The refactor still needs review and tests, of course, although the cost of attempting it is lower.

## Incident investigation is mostly graph traversal

The incident bot made me realize something obvious: when we investigate production issues, we are usually navigating a graph.

You start with an alert. That alert points to a service, an endpoint, a queue, a database, or some business metric. In practice, that becomes your first node.

From there, you look for signals: logs, metrics, traces, deployment events, saturation, dependency errors, feature flags, configuration changes. Each signal gives you an edge to another node.

You follow the edge and now you are in another service, another dependency, another metric, another part of the system. You repeat the process until you reach something that explains the impact well enough.

Thinking like this helped me write a better prompt, although it also made something else painfully clear: we only need this kind of bot because our observability is not good enough.

If traces were complete, consistent, and connected across services, many investigations would be much easier. You would open the trace, follow the failing span, and see the path. But in real systems, traces are not always connected. Logs don't always have the right attributes. Metrics don't always share the same labels. Dashboards are sometimes empty, misleading, duplicated, or owned by nobody. Some services follow OpenTelemetry conventions. Others don't. Some dependencies are visible. Others disappear at the boundary.

The bot has to compose a picture from broken pieces, which can be useful, but it is still a symptom. An AI incident bot is only as good as your telemetry. If the telemetry lies, the bot will reason from lies. If the traces are disconnected, the bot will invent bridges or stop too early. If ownership is unclear, the bot will not magically know who should act. If alerts are noisy, it will waste time on noise. A better bot can help, but the real fix is better observability.

## Token usage is part of the design

Long conversations are expensive, not only in money but in attention. Every time you send a new message, the system has to process the relevant context again. Caching helps, but the cache can break. Tool outputs, timestamps, changed messages, and client behavior can all make the provider recompute more than you expect.
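A toy model shows why a timestamp is enough to break things. It assumes a prefix-based cache, where only the identical leading part of two consecutive requests can be reused. Real providers work on tokens and have their own rules, so treat the message lists and the function below as hypothetical.

```python
def shared_prefix_length(previous, current):
    """Count the leading messages that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Stable system prompt: the second request only appends messages.
stable_1 = ["system: you are an incident bot", "user: alert on checkout-api"]
stable_2 = stable_1 + ["assistant: checking traces", "user: go deeper"]

# Same conversation, but the system prompt embeds the current time.
stamped_1 = ["system: now=10:35:11, you are an incident bot", "user: alert on checkout-api"]
stamped_2 = [
    "system: now=10:36:02, you are an incident bot",
    "user: alert on checkout-api",
    "assistant: checking traces",
    "user: go deeper",
]
```

In the stable case both earlier messages are reusable. In the stamped case the very first message differs, so under this model nothing after it can be reused, even though the rest of the conversation is identical.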

Shorter sessions sometimes work better: long sessions are useful for some work, but they also accumulate dirt.

For small tasks, I prefer starting fresh. For big tasks, I keep a plan file, compact aggressively, and make sure the durable memory lives outside the chat. The source of truth should be the plan, the tests, the code, and the documentation, not the chat itself.

The model can talk to itself for a while, and that can improve the answer. Reasoning tokens exist because of that. But there is a sweet spot: more thinking does not always improve the result, and faster answers are not automatically better either.

The useful question is less "how do I make the model faster?" and more "what is the minimum reasoning path that still gives me a good answer for this task?"

Sometimes you want the model to think deeply; other times you want a small deterministic edit and nothing else. The skill is knowing which mode you are asking for.

## So, should we use AI to code?

Yes, but not as a replacement for engineering. Use it to move faster. Use it to explore. Use it to generate tests. Use it to refactor boring code. Use it to critique your own work. Use it to build small tools that would otherwise not exist. Use it to turn vague ideas into prototypes. Use it to make yourself more ambitious.

Just don't use it as an excuse to stop understanding. The worst version of AI-assisted programming is a developer prompting changes into a codebase they no longer understand, trusting green tests that don't mean much, and accumulating abstractions that nobody would have written by hand.

The version that makes sense to me is the one where the developer still owns the architecture, still owns the trade-offs, still reads the important parts, still writes or reviews the tests, still simplifies, still deletes, still asks whether the thing should exist at all.

The code may be generated by the model, but the responsibility is still yours. Maybe I have not written a line of code in five months, but I have been coding in another place: in the architecture, in the prompts, in the tools, in the tests, in the refactors, in the shape of the context, and in the taste required to decide when the machine is right and when it is just confidently producing slop.

## Me again

This part is no longer generated by AI from the presentation transcript. I just wanted to add that, depending on the day, I feel bullish or bearish on AI: [https://notas.grod.es/the-rain-spell](https://notas.grod.es/the-rain-spell). Some days I feel excited about the future we are building with it, and other days I feel honestly depressed about it. I am still not convinced about what that future will look like, or even whether at some point I will decide to reject AI completely and go back to writing all code by myself.

And just for fun, sometimes I ask AI to "harshly criticize" the work I do. Here is what it said about this "article":

> It is a bit too forgiving of AI-assisted coding. The article says responsibility remains with the engineer, but it underplays the real danger: once you stop reading most of the generated code, you can easily become a manager of slop instead of a programmer. The "LLMs raise you one or two levels" idea is useful, but also risky, because in weak areas the model may only make bad judgment look more convincing. The incident-bot section is the strongest because it admits the real problem is bad observability, not lack of AI. I would make the article more suspicious of its own excitement.
