Stop Shipping AI Slop: Build an Anti-Slop Harness Around Your LLM

A developer argues that "AI slop"—bland, off-voice, or hallucinated text from large language models—is an engineering problem solvable by wrapping models in a validation harness rather than relying on prompt engineering. The proposed system treats the LLM as an unreliable upstream dependency, using five layers of deterministic checks including structured output schemas, denylists for error-shaped strings, and automated rejection and retry before any output reaches a user. The key insight is that most slop is detectable with cheap, deterministic code, and the biggest reduction comes from demanding structured output like JSON schemas instead of free-form prose.

"AI slop" is not a model problem. It's an engineering problem you decided not to solve. The slop is the bland, off-voice, half-hallucinated, occasionally-just-an-error-message text that your LLM emits maybe 5% of the time, and that 5% is the part users screenshot.

The instinct is to fix it in the prompt: add three more sentences of "be concise, be accurate, match my tone." That treats a stochastic system as if it were deterministic. It isn't. You cannot prompt your way to a guarantee. What actually works is treating the model like any other unreliable upstream dependency: wrap it in a harness that validates, rejects, and retries before anything reaches a user. The model proposes; the harness disposes. Here's how to build one.

Every production LLM feature I've shipped converged on the same shape: the model is one stage in a pipeline, not the pipeline itself. You don't trust raw generation any more than you'd trust raw user input. You parse it, you validate it against constraints you can express in code, and you reject anything that fails, automatically, before a human ever sees it.

The key insight is that most slop is detectable. Empty output, a leaked stack trace, the wrong language, a 900-word answer when you asked for 200, a banned phrase like "in today's fast-paced world": these are all checkable with deterministic code. You don't need a judge model to catch them (though a judge model has its place at the end). You need a gate that runs on every generation, costs microseconds, and never gets tired. Think of it as five layers, each rejecting a different class of failure.

Layer 1 is structured output. The single biggest reduction in slop comes from refusing to accept prose where you can demand structure. If you ask for a JSON object with named fields and a schema, the failure modes collapse from "infinite" to "a handful you can enumerate." Use the provider's native structured-output / tool-calling mode and attach a real schema: Pydantic, Zod, JSON Schema, whatever your stack speaks.
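As a rough illustration of a parse step that rejects anything off-shape, here is a minimal sketch using only the Python standard library, standing in for a real schema library such as Pydantic. The `parse_article` helper and the `SchemaError` type are hypothetical names introduced for this example, and the two-field `Article` shape is an assumption.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("title", "body")  # assumed shape, for illustration only


@dataclass(frozen=True)
class Article:
    title: str
    body: str


class SchemaError(ValueError):
    """Raised when model output does not match the expected shape."""


def parse_article(raw: str) -> Article:
    """Parse raw model output into an Article, failing loudly on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Prose preambles and truncated JSON both end up here.
        raise SchemaError(f"not JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise SchemaError("top-level value must be an object")
    missing = set(REQUIRED_FIELDS) - set(data)
    extra = set(data) - set(REQUIRED_FIELDS)
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name in REQUIRED_FIELDS:
        if not isinstance(data[name], str) or not data[name].strip():
            raise SchemaError(f"{name} must be a non-empty string")
    return Article(title=data["title"], body=data["body"])
```

A caller would treat `SchemaError` as a rejected generation and retry, rather than catching it and carrying on with whatever text came back.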
Demanding structure does two things. First, it forces the model to commit to a shape, which kills rambling preambles ("Sure! Here's a great answer for you..."). Second, it gives you a parse step that fails loudly. If the model returns something that doesn't validate, that's not a soft warning; it's a rejected generation that triggers a retry. A parse failure is a quality signal, not an exception to swallow. The corollary: never wrap your parser in try/except: pass. A swallowed parse error is slop with the lights turned off.

Layer 2 surprises people. Models are trained on internet-scale text, which includes a lot of error messages, apology boilerplate, and refusal language. Under pressure (ambiguous input, a retrieval miss, a truncated context) the model will sometimes emit text that is syntactically valid but semantically garbage: "I'm sorry, I cannot access that file," "Error: undefined," "As an AI language model, I don't have the ability to...," or a half-rendered template with {{variable}} still in it.

Structured output won't catch these, because they fit the schema fine. You need an explicit denylist of error-shaped strings and patterns, checked against every field. It's crude and it works. Maintain it like you maintain a spam filter: every time a new flavor of garbage reaches production, it earns a line in the rejection list.

Layer 3 is where you encode the things that make output yours rather than generic. Most of it is deterministic and cheap: length bounds, a banned-phrase list, a check that the output is in the expected language.

Here's the core of a harness that strings these layers together with a bounded retry loop.

```python
import re

from pydantic import BaseModel, ValidationError


class Article(BaseModel):
    title: str
    body: str


ERROR_SHAPES = [
    r"as an ai language model",
    r"i (?:cannot|can't|am unable to) (?:access|comply)",
    r"\berror:\s",
    r"\b(?:undefined|null)\b",
    r"\{\{.*?\}\}",  # leaked template tokens
]

BANNED_PHRASES = [
    r"in today's fast-paced",
    r"delve into",
    r"unleash the power",
]


def gate(text: str) -> list[str]:
    """Deterministic checks. Returns a list of failures (empty == pass)."""
    fails = []
    if not text.strip():
        fails.append("empty output")
    n_words = len(text.split())
    if not (200 <= n_words <= 800):
        fails.append(f"length out of bounds: {n_words} words")
    for pat in ERROR_SHAPES:
        if re.search(pat, text, re.I):
            fails.append(f"error-shaped string: /{pat}/")
    for pat in BANNED_PHRASES:
        if re.search(pat, text, re.I):
            fails.append(f"banned phrase: /{pat}/")
    return fails


def generate(client, prompt: str, max_attempts: int = 3) -> Article:
    last_fails: list[str] = []
    for _ in range(max_attempts):
        # Feed the named failures from the previous attempt back to the model.
        feedback = "" if not last_fails else (
            "\n\nYour previous output was rejected for: "
            + "; ".join(last_fails)
            + ". Fix these and return only the schema."
        )
        # client.structured stands in for your provider's native structured mode.
        raw = client.structured(prompt + feedback, schema=Article)
        try:
            article = Article.model_validate(raw)
        except ValidationError as e:
            last_fails = [f"schema: {e.error_count()} errors"]
            continue
        # Only the body is gated here; the title needs the denylist pass too.
        last_fails = gate(article.body)
        if not last_fails:
            return article
    # Fail closed: never ship output that did not pass the gate.
    raise RuntimeError(f"slop after {max_attempts} attempts: {last_fails}")
```

Notice what the harness does on rejection: it feeds the specific failures back into the next attempt. The model is far better at fixing a named defect than at avoiding an abstract one. And notice the loop is bounded: after max_attempts it raises rather than shipping. Failing closed is the whole point.

Layers 1–3 catch format and surface defects. Layer 4 catches semantic invariants that are specific to your task and still checkable in code. If you generate a summary, assert every cited number appears in the source. If you generate SQL, run it through a parser and an EXPLAIN, not the model's confidence. If you generate code, compile it and run the linter. If you generate a translation, check that named entities survived.

These gates are where domain knowledge lives. They're unglamorous assert statements, and they're the difference between a demo and a product.
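To make the summary invariant concrete, here is a minimal sketch of a "every cited number appears in the source" check. The function names are hypothetical, and the normalization is an assumption: it strips thousands separators and compares numbers as strings, so "12%" and "12 percent" match but "1.5k" and "1500" would not.

```python
import re

# Digits with optional internal separators, e.g. 2023, 1,200, 3.14
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens, dropping commas so 1,200 and 1200 compare equal."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(summary: str, source: str) -> list[str]:
    """Return numbers cited in the summary that never appear in the source."""
    return sorted(numbers_in(summary) - numbers_in(source))


source = "Revenue grew 12% to 1,200 units in 2023."
print(unsupported_numbers("Revenue rose 12% to 1200 units.", source))   # []
print(unsupported_numbers("Revenue rose 15% to 1,200 units.", source))  # ['15']
```

A non-empty result would be treated like any other gate failure: reject the generation and name the offending numbers in the retry feedback.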
The rule: anything you can verify mechanically, you must, because the model will eventually get it wrong, and you want the gate to catch it, not the user.

The last layer, layer 5, is the only one that may use another model, and only for the things code genuinely can't judge: faithfulness, relevance, tone-match. A cheap judge model scoring "does this answer the question, grounded in the provided context, in the requested voice?" on a 1–5 scale, with a hard threshold below which you reject, catches the subtle slop that passes every deterministic check.

Keep this layer last and keep it skeptical. A judge model is itself an LLM and can be fooled, so it's a final filter on output that has already survived four deterministic gates, never a replacement for them. And log every rejection at every layer. Your rejection logs are the highest-signal dataset you own: they tell you exactly how your model fails in production, which feeds back into prompts, denylists, and gates.

None of these layers is clever. That's the point. Cleverness is fragile; a denylist and a bounded retry loop are not. What the harness gives you is a guarantee about what reaches the user (not a probability, a guarantee) for every failure mode you've chosen to encode. Slop stops being a vibe you argue about and becomes a set of named, logged, falsifiable conditions.

The model is a brilliant, unreliable intern. You don't fix an unreliable intern by writing a longer brief. You fix it by reviewing the work before it goes out.

The open question I keep circling: which of these checks genuinely belong in deterministic code, and which are you quietly outsourcing to a judge model because writing the real assertion was too hard?