Slop is code you can't work with

The term "slop" is increasingly applied to codebases, defined as an LLM-generated codebase that no person or AI understands well enough to work with effectively, where adding features or fixing bugs takes more time and tokens than expected and frequently introduces unexpected issues. The concept is subjective and based on expectation, with a key indicator being the "WTFs per minute" experienced while working with the code, often hitting a "slop wall" where rapid progress suddenly stalls.

People use the term 'slop' (https://simonwillison.net/2024/May/8/slop/) more and more. Originally, this was mainly for content like blog articles, but now it's often used for codebases too. Like 'vibe coding', everyone has a slightly different definition of what it means. Is all AI-generated code slop? Can human-written code be slop? Is it slop if it works?

I think it's fine for everyone to have their own definition of slop. After all, one man's slop is another man's $1M ARR business, as the famous saying goes. But shared definitions are also useful, so if I call something slop I probably mean something like:

A (usually LLM-generated) codebase that no person or LLM understands well enough to work with effectively.

- Adding features takes more time and more tokens than you would expect.
- Fixing bugs takes more time and tokens than you would expect.
- Fixing bugs or adding features frequently introduces more bugs, and often ones that you wouldn't expect.

I put "usually" in brackets because I think the concept is useful for non-AI-generated code too. I really like this description (https://news.ycombinator.com/item?id=18442941) of how hard it is to work on the 25 million lines of C that make up the Oracle Database, for example. While agents have made slop a lot more prevalent, I'm not convinced that how the slop came to be is a useful distinction.

It's not an objective definition. You can't look at a codebase, run some analysis tools, and decide if it's slop or not, but usually after trying to work with any given codebase it doesn't take me long to assign my own personal 'slop' label to it.

It's also based around expectation, and I think that's the bit that makes the LLM-generated part more important. If you want to add a feature to the 25-million-line Oracle database you know it's going to take a long time, and you can plan accordingly. But when you're working with slop, you can't plan because things break in places that you're not expecting.
'Perplexity' is an overloaded term (https://dictionary.cambridge.org/dictionary/english/perplexity, https://en.wikipedia.org/wiki/Perplexity, https://www.perplexity.ai/) in ML and AI, but it's the word I often think of when evaluating whether something is slop. How perplexed am I, and how often?

WTFs per minute while vibe coding

I like the idea that the only valid measure of code quality is WTFs per minute (https://www.osnews.com/story/19266/wtfsm/). I don't think that's literally true, but I do think it's a useful heuristic. It's also something I notice while vibe coding. The WTFs per minute aren't from reading the code, but from reading agent output while it works with the code.

Most people who have used LLMs to code are probably familiar with a pattern of going from a prototype to slop. You have an idea and you decide to see if the agents can build it. You start from a blank slate and your agent makes incredibly fast progress at the start. Within a few hours, you have a working prototype of a system that would have taken you days or years to code in the pre-LLM days.

You get to the 80% mark and you're already thinking about all the things you can do with this system that you've just conjured into existence while you polish out the few remaining issues. Then, suddenly, it's 3 days later and you've gotten to 82%.

At some point, you're not sure exactly when, you hit the slop wall. You just pushed out 8 features that you thought were 'medium complex' in 15 minutes, so you quickly line up another 8. The first one you think is 'easy'. You expect the agent to bash it out in 2 minutes and you barely look at the output while reaching for the next one to start working on.

And then you notice something strange. The agent hasn't actually completed yet. The token count is still spinning, you're up to 5 minutes. What is this thing doing? You interrupt it to have a conversation but you don't understand what it's saying. You ask it to simplify, and it still makes no sense.
You're up to 6 WTFs per minute and rising. Eventually you let it continue. You just want to get this thing out the door so you can move on to the next one, which is more important anyway. The agent did some big refactoring thing, so you decide to try to preview your Trello clone or whatever super unique and creative project you're working on, and it crashes.

Noticing that something is slop can be gradual

Because part of what makes a codebase slop is breaking expectations, it's hard to say exactly at which point a codebase becomes slop. Sometimes surprises are just a normal part of software engineering. And sometimes you can have an unlucky streak and find five or six surprising negative things in a row. But by the time it becomes the default and you're not even that surprised anymore when the agent can't do something that "should be easy", then you know you're working with slop.

Slop can be surprisingly good when used as a spec

Engineers famously want to rewrite things, and it's famously almost always a bad idea (https://skamille.medium.com/avoiding-the-rewrite-trap-b1283b8dd39e). But now the calculus has changed. If it took you a year to build an alpha, you should usually try to evolve that into a beta rather than rewriting it. But if you can build an alpha every day for 10 days, then you can quite efficiently encode a lot of mistakes in 10 days of work.

Often when I'm building a new smallish project with agents, I'll work on something for a few hours spread over several days, until I feel like I've hit the slop wall. Then I'll tell the agent to start again. It can have v1 as input, it can steal the hard-won lessons we've learned and use them to avoid repeating the mistakes and bugs we've already fixed, but it can't copy the slop. We start with a new fundamental design difference, and write from scratch in a new folder, using the old version only as a reference to speedrun our mistakes from the previous version.
Some people spend hours in /plan mode with markdown files, trying to speedrun a waterfall-type setup where the agent has everything it needs before it starts coding. But if code is cheap, why not use code as the spec?

In a recent side project, I got to a state I'm happy with (for now), where it solves the problems I needed it to solve, in five versions spread over a few days.

- V1 looked great until we hit the slop wall and started seeing unexpected bugs everywhere.
- V2 never shipped. We took a different approach and hit the slop wall faster.
- V3 was looking good until I realized that the agent had ignored my instructions to keep it independent and had hard dependencies on some database tables from v1.
- V4 never even ran and we abandoned it after about 30 minutes.
- V5 is working now, with a different data structure than v1 that is a lot lighter. It makes sense to me and my agents, and so far we can add features and fix bugs in about the time I would expect.

Task estimation with LLMs is hard

After using Claude Code, Codex, Pi, and Amp with Opus, Sonnet, GPT, Kimi, and DeepSeek (and even a little bit of Gemini, but I was never sure what it was called or how to buy it, and I've dropped it completely since Google's recent (19th?) pivot) for several hours a day for the last year, I think the thing that still constantly surprises me is how hard estimation is. I'm still regularly surprised in both directions, e.g. I discover that:

- An agent can easily do something that I thought an agent would find difficult or impossible
- An agent struggles or fails at doing something that I thought an agent would find easy

So when I'm thinking about WTFs per minute or my perplexity factor to decide if a codebase is slop, I'm taking into account that my average perplexity hums along at a very elevated baseline. This is part of what makes vibe coding addictive: the slot machine element of not knowing if you're going to be wonderfully surprised or horribly disappointed on each push of the button.
But that doesn't mean that human perplexity stops being a useful metric for working effectively with AI. You just have to think about the rate of change of your WTFs per minute, rather than trying to find a single number that works across all projects, or across all timeframes on a single project.
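
To make "rate of change" a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not a tool described in this post: it assumes you jot down the minute at which each WTF happens during a session, and the `WtfLog` class, its method names, and the 10-minute window are all hypothetical choices.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WtfLog:
    """Hypothetical log of 'WTF' moments, stored as minutes since the session started."""

    minutes: list[float] = field(default_factory=list)

    def record(self, minute: float) -> None:
        """Note that a WTF happened at the given minute."""
        self.minutes.append(minute)

    def rate(self, start: float, end: float) -> float:
        """WTFs per minute in the half-open window [start, end)."""
        if end <= start:
            raise ValueError("window must have positive length")
        count = sum(1 for m in self.minutes if start <= m < end)
        return count / (end - start)

    def rate_change(self, now: float, window: float = 10.0) -> float:
        """Rate in the latest window minus the rate in the window before it.

        The window length is an arbitrary assumption; a positive result
        means the WTF rate is rising.
        """
        recent = self.rate(now - window, now)
        previous = self.rate(now - 2 * window, now - window)
        return recent - previous


log = WtfLog()
for minute in [3, 12, 14, 15, 17, 18, 19]:
    log.record(minute)

# One WTF in the first ten minutes, six in the next ten.
print(f"{log.rate_change(now=20):+.2f}")  # prints "+0.50"
```

The absolute rate in either window matters less than the sign and size of the difference: a project with an elevated but steady baseline looks very different from one where the rate has just jumped.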