[ARTICLE · art-18791] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

Stop Shipping AI Slop: Build an Anti-Slop Harness Around Your LLM

A developer argues that "AI slop"—bland, off-voice, or hallucinated text from large language models—is an engineering problem solvable by wrapping models in a validation harness rather than relying on prompt engineering. The proposed system treats the LLM as an unreliable upstream dependency, using five layers of deterministic checks including structured output schemas, denylists for error-shaped strings, and automated rejection and retry before any output reaches a user. The key insight is that most slop is detectable with cheap, deterministic code, and the biggest reduction comes from demanding structured output like JSON schemas instead of free-form prose.

6 min read · published May 30, 2026

"AI slop" is not a model problem. It's an engineering problem you decided not to solve.

The slop is the bland, off-voice, half-hallucinated, occasionally-just-an-error-message text that your LLM emits maybe 5% of the time — and that 5% is the part users screenshot. The instinct is to fix it in the prompt: add three more sentences of "be concise, be accurate, match my tone." That treats a stochastic system as if it were deterministic. It isn't. You cannot prompt your way to a guarantee.

What actually works is treating the model like any other unreliable upstream dependency: wrap it in a harness that validates, rejects, and retries before anything reaches a user. The model proposes; the harness disposes. Here's how to build one.

Every production LLM feature I've shipped converged on the same shape: the model is one stage in a pipeline, not the pipeline itself. You don't trust raw generation any more than you'd trust raw user input. You parse it, you validate it against constraints you can express in code, and you reject anything that fails — automatically, before a human ever sees it.

The key insight is that most slop is detectable. Empty output, a leaked stack trace, the wrong language, a 900-word answer when you asked for 200, a banned phrase like "in today's fast-paced world" — these are all checkable with deterministic code. You don't need a judge model to catch them (though a judge model has its place at the end). You need a gate that runs on every generation, costs microseconds, and never gets tired.

Think of it as five layers, each rejecting a different class of failure.

The single biggest reduction in slop comes from refusing to accept prose where you can demand structure. If you ask for a JSON object with named fields and a schema, the failure modes collapse from "infinite" to "a handful you can enumerate."

Use the provider's native structured-output / tool-calling mode and attach a real schema — Pydantic, Zod, JSON Schema, whatever your stack speaks. This does two things. First, it forces the model to commit to a shape, which kills rambling preambles ("Sure! Here's a great answer for you..."). Second, it gives you a parse step that fails loudly. If the model returns something that doesn't validate, that's not a soft warning — it's a rejected generation that triggers a retry. A parse failure is a quality signal, not an exception to swallow.

The corollary: never put a try/except: pass around your parser. A swallowed parse error is slop with the lights turned off.
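A minimal sketch of a parse step that fails loudly, written against the standard library only so it runs without extra dependencies. The `parse_article` helper, the `Rejected` exception, and the two-field shape are illustrative assumptions; in a real stack the schema library you already use (Pydantic, Zod, JSON Schema) would do this work.

```python
import json
from dataclasses import dataclass


class Rejected(Exception):
    """Raised when a generation fails a gate; the caller retries or fails closed."""


@dataclass(frozen=True)
class Article:
    title: str
    body: str


def parse_article(raw: str) -> Article:
    """Parse raw model output into an Article, or raise Rejected with a reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Convert, never swallow: the reason travels to the retry loop and the logs.
        raise Rejected(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise Rejected("top-level value is not an object")
    expected = {"title", "body"}
    if set(data) != expected:
        # Symmetric difference lists both missing and extra field names.
        raise Rejected(f"field mismatch: {sorted(set(data) ^ expected)}")
    for name in sorted(expected):
        if not isinstance(data[name], str):
            raise Rejected(f"field {name!r} is not a string")
    return Article(title=data["title"], body=data["body"])
```

The only `except` in the function re-raises with a named reason, so a chatty preamble or a missing field becomes a rejection the caller can count and retry on.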

This one surprises people. Models are trained on the entire internet, which includes a lot of error messages, apology boilerplate, and refusal language. Under pressure (ambiguous input, a retrieval miss, a truncated context) the model will sometimes emit text that is syntactically valid but semantically garbage: "I'm sorry, I cannot access that file," "Error: undefined," "As an AI language model, I don't have the ability to...," or a half-rendered template with {{variable}} still in it.

Structured output won't catch these, because they fit the schema fine. You need an explicit denylist of error-shaped strings and patterns, checked against every field. It's crude and it works. Maintain it like you maintain a spam filter — every time a new flavor of garbage reaches production, it earns a line in the rejection list.
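As a rough illustration of "checked against every field", the sketch below scans each string field of an already-parsed record. The `scan_fields` name and the specific patterns are assumptions made for the example; a production list would grow out of real rejections.

```python
import re

# Illustrative patterns only; extend the list as new garbage shows up.
DENYLIST = [
    re.compile(r"as an ai language model", re.I),
    re.compile(r"\bi(?:'m| am) sorry\b", re.I),
    re.compile(r"\berror:\s", re.I),
    re.compile(r"\{\{.*?\}\}"),  # unrendered template placeholder
]


def scan_fields(record: dict[str, str]) -> list[str]:
    """Return one failure per (field, pattern) hit; an empty list means pass."""
    failures = []
    for field, value in record.items():
        for pattern in DENYLIST:
            if pattern.search(value):
                failures.append(f"{field}: matched /{pattern.pattern}/")
    return failures
```

Reporting the field name alongside the pattern keeps the rejection specific enough to feed back into a retry or a log line.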

This is where you encode the things that make output yours rather than generic. Most of it is deterministic and cheap: length bounds, banned phrases, the expected language.

Here's the core of a harness that strings these layers together with a bounded retry loop.

import re
from pydantic import BaseModel, ValidationError

class Article(BaseModel):
    title: str
    body: str

ERROR_SHAPES = [
    r"as an ai language model",
    r"i (?:cannot|can't|am unable to) (?:access|comply)",
    r"\berror:\s",
    r"\b(?:undefined|null)\b",  # grouped so \b applies to both words
    r"\{\{.*?\}\}",          # leaked template tokens
]
BANNED_PHRASES = [r"in today's fast-paced", r"delve into", r"unleash the power"]

def gate(text: str) -> list[str]:
    """Deterministic checks. Returns a list of failures (empty == pass)."""
    fails = []
    if not text.strip():
        fails.append("empty output")
    if not (200 <= len(text.split()) <= 800):
        fails.append(f"length out of bounds: {len(text.split())} words")
    for pat in ERROR_SHAPES:
        if re.search(pat, text, re.I):
            fails.append(f"error-shaped string: /{pat}/")
    for pat in BANNED_PHRASES:
        if re.search(pat, text, re.I):
            fails.append(f"banned phrase: /{pat}/")
    return fails

def generate(client, prompt: str, max_attempts: int = 3) -> Article:
    last_fails: list[str] = []
    for attempt in range(max_attempts):
        feedback = "" if not last_fails else (
            "\n\nYour previous output was rejected for: "
            + "; ".join(last_fails) + ". Fix these and return only the schema."
        )
        raw = client.structured(prompt + feedback, schema=Article)  # native structured mode
        try:
            article = Article.model_validate(raw)
        except ValidationError as e:
            last_fails = [f"schema: {e.error_count()} errors"]
            continue
        last_fails = gate(article.body)
        if not last_fails:
            return article
    raise RuntimeError(f"slop after {max_attempts} attempts: {last_fails}")

Notice what the harness does on rejection: it feeds the specific failures back into the next attempt. The model is far better at fixing a named defect than at avoiding an abstract one. And notice the loop is bounded: after max_attempts it raises rather than shipping. Failing closed is the whole point.
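One way to check the two properties just described (feedback on rejection, failing closed) without calling a real provider is to drive the loop with a stub. The sketch below restates the loop in a provider-agnostic form; `generate_with_retries`, `make_stub`, `no_errors`, and `SlopError` are all hypothetical names introduced for the example.

```python
from typing import Callable


class SlopError(RuntimeError):
    """Raised when every attempt was rejected; nothing is shipped."""


def generate_with_retries(
    call_model: Callable[[str], str],
    gate: Callable[[str], list[str]],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Call the model until the gate passes, feeding failures back; fail closed."""
    failures: list[str] = []
    for _ in range(max_attempts):
        feedback = ""
        if failures:
            feedback = "\n\nYour previous output was rejected for: " + "; ".join(failures)
        text = call_model(prompt + feedback)
        failures = gate(text)
        if not failures:
            return text
    raise SlopError(f"rejected after {max_attempts} attempts: {failures}")


def make_stub(outputs: list[str]) -> Callable[[str], str]:
    """Stand-in for a provider client: replays canned outputs in order."""
    remaining = iter(outputs)
    seen_prompts: list[str] = []

    def call(prompt: str) -> str:
        seen_prompts.append(prompt)
        return next(remaining)

    call.seen_prompts = seen_prompts  # exposed so a test can inspect the feedback
    return call


def no_errors(text: str) -> list[str]:
    """Toy gate: reject anything containing an error-shaped prefix."""
    return ["error-shaped string"] if "error:" in text.lower() else []
```

With a stub that returns "Error: undefined" and then a clean answer, the second prompt carries the rejection reason; with a stub that only ever returns errors, the call raises `SlopError` instead of returning text.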

Layers 1–3 catch format and surface defects. Layer 4 catches semantic invariants that are specific to your task and still checkable in code. If you generate a summary, assert every cited number appears in the source. If you generate SQL, run it through a parser and an EXPLAIN rather than trusting the model's confidence. If you generate code, compile it and run the linter. If you generate a translation, check that named entities survived.
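The summary invariant can be approximated in a few lines. This is a sketch under stated assumptions: numbers are compared as digit strings, commas are treated as thousands separators (so it would mishandle decimal commas), and `unsupported_numbers` is a hypothetical helper name.

```python
import re

# Digits with optional . or , groups: matches 12, 3.5, 1,200
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens, dropping commas so 1,200 and 1200 compare equal."""
    return {match.replace(",", "") for match in NUMBER.findall(text)}


def unsupported_numbers(summary: str, source: str) -> list[str]:
    """Numbers that appear in the summary but nowhere in the source."""
    return sorted(numbers_in(summary) - numbers_in(source))
```

A non-empty result is a rejection reason in its own right: the summary cites a figure the source never states.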

These gates are where domain knowledge lives. They're unglamorous assert statements, and they're the difference between a demo and a product. The rule: anything you can verify mechanically, you must, because the model will eventually get it wrong, and you want the gate to catch it, not the user.
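Two of the mechanical checks mentioned above, sketched with the standard library. The SQL gate assumes a SQLite-compatible dialect and an in-memory copy of the schema, which may not match a production database; the code gate assumes the generated code is Python. Both function names are hypothetical.

```python
import sqlite3


def sql_failures(query: str, schema_ddl: str) -> list[str]:
    """Plan the query against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        # EXPLAIN compiles the statement and returns its plan instead of running it.
        conn.execute("EXPLAIN " + query)
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return [f"sql: {exc}"]
    finally:
        conn.close()
    return []


def python_failures(source: str) -> list[str]:
    """Byte-compile generated Python without executing it."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return [f"python: {exc.msg} (line {exc.lineno})"]
    return []
```

Both gates return the same list-of-failures shape as the deterministic checks earlier in the article, so they can slot into the same retry loop.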

The last layer is the only one that may use another model, and only for the things code genuinely can't judge: faithfulness, relevance, tone-match. A cheap judge model scoring "does this answer the question, grounded in the provided context, in the requested voice?" on a 1–5 scale, with a hard threshold below which you reject, catches the subtle slop that passes every deterministic check.
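A hedged sketch of the threshold logic, with the judge passed in as a plain callable so no particular provider API is implied. The prompt wording, the `judge_failures` name, and the default threshold of 4 are assumptions for illustration; the article only specifies a 1–5 scale with a hard cutoff.

```python
import re
from typing import Callable

JUDGE_PROMPT = (
    "Score 1-5: does the ANSWER address the QUESTION, grounded in the CONTEXT, "
    "in the requested voice? Reply with the digit only.\n\n"
    "QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
)


def judge_failures(
    ask_judge: Callable[[str], str],
    question: str,
    context: str,
    answer: str,
    threshold: int = 4,
) -> list[str]:
    """Reject when the judge's score is below threshold or cannot be parsed."""
    reply = ask_judge(
        JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    )
    match = re.fullmatch(r"\s*([1-5])\s*", reply)
    if match is None:
        # The judge is an LLM too: an unparseable verdict counts as a rejection.
        return [f"judge: unparseable reply {reply[:40]!r}"]
    score = int(match.group(1))
    if score < threshold:
        return [f"judge: score {score} below threshold {threshold}"]
    return []
```

Treating a malformed judge reply as a failure keeps this layer failing closed, in the same way as the deterministic gates.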

Keep this layer last and keep it skeptical. A judge model is itself an LLM and can be fooled, so it's a final filter on output that has already survived four deterministic gates — never a replacement for them. And log every rejection at every layer. Your rejection logs are the highest-signal dataset you own: they tell you exactly how your model fails in production, which feeds back into prompts, denylists, and gates.
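Rejection logging can be as small as one JSON line per event plus a counter over those lines. In this sketch the logger name, the field names, and the 200-character sample cap are assumptions, not part of the original harness.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("slop_harness")  # hypothetical logger name


def log_rejection(layer: str, reason: str, attempt: int, sample: str) -> str:
    """Emit one JSON line per rejection and return it for convenience."""
    line = json.dumps(
        {"layer": layer, "reason": reason, "attempt": attempt, "sample": sample[:200]},
        sort_keys=True,
    )
    logger.warning(line)
    return line


def top_reasons(lines: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Count rejections by 'layer: reason' to show where the model fails most."""
    counts = Counter(
        f"{rec['layer']}: {rec['reason']}" for rec in map(json.loads, lines)
    )
    return counts.most_common(n)
```

Aggregating by layer and reason is what turns the log into a worklist: the most frequent entry is the next prompt fix, denylist line, or gate to write.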

None of these layers is clever. That's the point. Cleverness is fragile; a denylist and a bounded retry loop are not. What the harness gives you is a guarantee about what reaches the user — not a probability, a guarantee — for every failure mode you've chosen to encode. Slop stops being a vibe you argue about and becomes a set of named, logged, falsifiable conditions.

The model is a brilliant, unreliable intern. You don't fix an unreliable intern by writing a longer brief. You fix it by reviewing the work before it goes out.

The open question I keep circling: which of these checks genuinely belong in deterministic code, and which are you quietly outsourcing to a judge model because writing the real assertion was too hard?
