{"slug": "stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm", "title": "Stop Shipping AI Slop: Build an Anti-Slop Harness Around Your LLM", "summary": "A developer argues that \"AI slop\"—bland, off-voice, or hallucinated text from large language models—is an engineering problem solvable by wrapping models in a validation harness rather than relying on prompt engineering. The proposed system treats the LLM as an unreliable upstream dependency, using five layers of deterministic checks including structured output schemas, denylists for error-shaped strings, and automated rejection and retry before any output reaches a user. The key insight is that most slop is detectable with cheap, deterministic code, and the biggest reduction comes from demanding structured output like JSON schemas instead of free-form prose.", "body_md": "\"AI slop\" is not a model problem. It's an engineering problem you decided not to solve.\n\nThe slop is the bland, off-voice, half-hallucinated, occasionally-just-an-error-message text that your LLM emits maybe 5% of the time — and that 5% is the part users screenshot. The instinct is to fix it in the prompt: add three more sentences of \"be concise, be accurate, match my tone.\" That treats a stochastic system as if it were deterministic. It isn't. You cannot prompt your way to a guarantee.\n\nWhat actually works is treating the model like any other unreliable upstream dependency: wrap it in a harness that validates, rejects, and retries before anything reaches a user. The model proposes; the harness disposes. Here's how to build one.\n\nEvery production LLM feature I've shipped converged on the same shape: the model is one stage in a pipeline, not the pipeline itself. You don't trust raw generation any more than you'd trust raw user input. 
You parse it, you validate it against constraints you can express in code, and you reject anything that fails — automatically, before a human ever sees it.\n\nThe key insight is that most slop is *detectable*. Empty output, a leaked stack trace, the wrong language, a 900-word answer when you asked for 200, a banned phrase like \"in today's fast-paced world\" — these are all checkable with deterministic code. You don't need a judge model to catch them (though a judge model has its place at the end). You need a gate that runs on every generation, costs microseconds, and never gets tired.\n\nThink of it as five layers, each rejecting a different class of failure.\n\nThe single biggest reduction in slop comes from refusing to accept prose where you can demand structure. If you ask for a JSON object with named fields and a schema, the failure modes collapse from \"infinite\" to \"a handful you can enumerate.\"\n\nUse the provider's native structured-output / tool-calling mode and attach a real schema — Pydantic, Zod, JSON Schema, whatever your stack speaks. This does two things. First, it forces the model to commit to a shape, which kills rambling preambles (\"Sure! Here's a great answer for you...\"). Second, it gives you a parse step that *fails loudly*. If the model returns something that doesn't validate, that's not a soft warning — it's a rejected generation that triggers a retry. A parse failure is a quality signal, not an exception to swallow.\n\nThe corollary: never `try/except: pass` around your parser. A swallowed parse error is slop with the lights turned off.\n\nThis one surprises people. Models are trained on the entire internet, which includes a lot of error messages, apology boilerplate, and refusal language. 
Under pressure — ambiguous input, a retrieval miss, a truncated context — the model will sometimes emit text that is *syntactically valid* but semantically garbage: \"I'm sorry, I cannot access that file,\" \"Error: undefined,\" \"As an AI language model, I don't have the ability to...,\" or a half-rendered template with `{{variable}}` still in it.\n\nStructured output won't catch these, because they fit the schema fine. You need an explicit denylist of error-shaped strings and patterns, checked against every field. It's crude and it works. Maintain it like you maintain a spam filter — every time a new flavor of garbage reaches production, it earns a line in the rejection list.\n\nThis is where you encode the things that make output *yours* rather than generic. Most of it is deterministic and cheap:\n\nHere's the core of a harness that strings these layers together with a bounded retry loop.\n\n``` python\nimport re\nfrom pydantic import BaseModel, ValidationError\n\nclass Article(BaseModel):\n    title: str\n    body: str\n\nERROR_SHAPES = [\n    r\"as an ai language model\",\n    r\"i (?:cannot|can't|am unable to) (?:access|comply)\",\n    r\"\\berror:\\s\",\n    r\"\\b(?:undefined|null)\\b\",  # whole words only\n    r\"\\{\\{.*?\\}\\}\",          # leaked template tokens\n]\nBANNED_PHRASES = [r\"in today's fast-paced\", r\"delve into\", r\"unleash the power\"]\n\ndef gate(text: str) -> list[str]:\n    \"\"\"Deterministic checks. 
Returns a list of failures (empty == pass).\"\"\"\n    fails = []\n    if not text.strip():\n        fails.append(\"empty output\")\n    if not (200 <= len(text.split()) <= 800):\n        fails.append(f\"length out of bounds: {len(text.split())} words\")\n    for pat in ERROR_SHAPES:\n        if re.search(pat, text, re.I):\n            fails.append(f\"error-shaped string: /{pat}/\")\n    for pat in BANNED_PHRASES:\n        if re.search(pat, text, re.I):\n            fails.append(f\"banned phrase: /{pat}/\")\n    return fails\n\ndef generate(client, prompt: str, max_attempts: int = 3) -> Article:\n    last_fails: list[str] = []\n    for attempt in range(max_attempts):\n        feedback = \"\" if not last_fails else (\n            \"\\n\\nYour previous output was rejected for: \"\n            + \"; \".join(last_fails) + \". Fix these and return only the schema.\"\n        )\n        raw = client.structured(prompt + feedback, schema=Article)  # native structured mode\n        try:\n            article = Article.model_validate(raw)\n        except ValidationError as e:\n            last_fails = [f\"schema: {e.error_count()} errors\"]\n            continue\n        last_fails = gate(article.body)\n        if not last_fails:\n            return article\n    raise RuntimeError(f\"slop after {max_attempts} attempts: {last_fails}\")\n```\n\nNotice what the harness does on rejection: it feeds the *specific failures* back into the next attempt. The model is far better at fixing a named defect than at avoiding an abstract one. And notice the loop is bounded — after `max_attempts` it raises rather than shipping. Failing closed is the whole point.\n\nLayers 1–3 catch format and surface defects. Layer 4 catches *semantic* invariants that are specific to your task and still checkable in code. If you generate a summary, assert every cited number appears in the source. If you generate SQL, run it through a parser and an `EXPLAIN`, not the model's confidence. 
If you generate code, compile it and run the linter. If you generate a translation, check that named entities survived.\n\nThese gates are where domain knowledge lives. They're unglamorous `assert` statements, and they're the difference between a demo and a product. The rule: anything you can verify mechanically, you must — because the model will eventually get it wrong, and you want the gate to catch it, not the user.\n\nThe last layer is the only one that may use another model, and only for the things code genuinely can't judge: faithfulness, relevance, tone-match. A cheap judge model scoring \"does this answer the question, grounded in the provided context, in the requested voice?\" on a 1–5 scale, with a hard threshold below which you reject, catches the subtle slop that passes every deterministic check.\n\nKeep this layer last and keep it skeptical. A judge model is itself an LLM and can be fooled, so it's a final filter on output that has already survived four deterministic gates — never a replacement for them. And log every rejection at every layer. Your rejection logs are the highest-signal dataset you own: they tell you exactly how your model fails in production, which feeds back into prompts, denylists, and gates.\n\nNone of these layers is clever. That's the point. Cleverness is fragile; a denylist and a bounded retry loop are not. What the harness gives you is a *guarantee about what reaches the user* — not a probability, a guarantee — for every failure mode you've chosen to encode. Slop stops being a vibe you argue about and becomes a set of named, logged, falsifiable conditions.\n\nThe model is a brilliant, unreliable intern. You don't fix an unreliable intern by writing a longer brief. 
You fix it by reviewing the work before it goes out.\n\nThe open question I keep circling: which of these checks genuinely belong in deterministic code, and which are you quietly outsourcing to a judge model because writing the real assertion was too hard?", "url": "https://wpnews.pro/news/stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm", "canonical_source": "https://dev.to/turacthethinker/stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm-273b", "published_at": "2026-05-30 21:48:26+00:00", "updated_at": "2026-05-30 22:11:31.975912+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "generative-ai", "ai-products", "mlops"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm", "markdown": "https://wpnews.pro/news/stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm.md", "text": "https://wpnews.pro/news/stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm.txt", "jsonld": "https://wpnews.pro/news/stop-shipping-ai-slop-build-an-anti-slop-harness-around-your-llm.jsonld"}}