LLM output validation: 5 patterns that actually work in production

This post presents five practical patterns for validating LLM outputs in production, addressing common failures such as malformed JSON, incorrect length, and hallucinated fields. It favors schema validation with retries and error hints over silent error handling, and gives concrete code for each pattern. The core approach is to parse, validate against a schema, and retry with specific error feedback to the model.

LLMs are probabilistic text generators. In a notebook demo, that's fine. In production, it means your pipeline will occasionally receive a Python-style dict literal where you expected JSON, a 900-word paragraph where you asked for three bullet points, or a hallucinated field name that breaks your downstream schema. This post is not about theory. It covers five concrete patterns, each with working code, that handle these failures reliably.

You're calling an LLM API expecting structured output. The model has been prompted carefully. But over thousands of calls, you'll still see failures like the ones above. Each pattern below addresses one failure mode.

All of the examples share this setup:

```python
import json
import re
import time
import hashlib

from openai import OpenAI

llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",
)


def call_llm(
    messages: list[dict],
    model: str = "gpt-4o-mini",
    temperature: float = 0.3,
    max_tokens: int = 1000,
) -> str:
    """Send a chat request and return the stripped text of the first choice."""
    response = llm_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # content can be None (for example on a refusal), so fall back to "".
    return (response.choices[0].message.content or "").strip()
```

Pattern 1: JSON schema validation with retry

Problem: The model returns valid JSON 98% of the time and something subtly broken the other 2%. Your parser crashes and you lose the request.

Bad solution: `json.loads` with a bare `except` that returns `None`. You swallow errors silently and downstream code explodes later.

Good solution: Parse, validate against a schema, and retry with an error hint that tells the model exactly what went wrong.
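To make the difference concrete, here is a minimal sketch that contrasts the silent-failure anti-pattern with a parser that fails loudly. The names `parse_silently` and `parse_strictly` are hypothetical and exist only for this illustration; the point is that the strict version keeps the parser's message, which is what a retry hint needs.

```python
import json


def parse_silently(raw: str):
    """Anti-pattern: every failure collapses to None, hiding what went wrong."""
    try:
        return json.loads(raw)
    except Exception:
        return None


def parse_strictly(raw: str) -> dict:
    """Raise a descriptive error that a retry loop can feed back to the model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"JSON parse error: {e}. Raw output was: {raw[:200]}") from e
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object, got {type(data).__name__}")
    return data


# A Python-dict-style reply (single quotes) is not valid JSON.
bad_reply = "{'title': 'Hello'}"

silent_result = parse_silently(bad_reply)  # None, and the reason is lost

try:
    parse_strictly(bad_reply)
    strict_error = None
except ValueError as e:
    strict_error = str(e)  # carries the parser's message for the retry hint
```

With the silent version, the caller only learns that something failed when a later step touches `None`. With the strict version, the error text can be sent straight back to the model.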
```python
import jsonschema

ARTICLE_SCHEMA = {
    "type": "object",
    "required": ["title", "summary", "tags", "difficulty"],
    "properties": {
        "title": {"type": "string", "minLength": 10, "maxLength": 120},
        "summary": {"type": "string", "minLength": 50},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "difficulty": {"type": "string", "enum": ["beginner", "intermediate", "advanced"]},
    },
    "additionalProperties": False,
}


def extract_json_from_response(text: str) -> str:
    """Strip markdown code fences if present."""
    # `{3} matches a literal triple-backtick fence, optionally tagged "json".
    match = re.search(r"`{3}(?:json)?\s*([\s\S]*?)`{3}", text)
    if match:
        return match.group(1).strip()
    # Try to find a raw JSON object
    match = re.search(r"\{[\s\S]*\}", text)
    if match:
        return match.group(0)
    return text


def call_with_json_schema(prompt: str, schema: dict, max_retries: int = 3) -> dict:
    messages = [
        {"role": "system", "content": (
            "You are a data extraction assistant. "
            "Always respond with valid JSON matching the requested schema. "
            "No prose, no markdown fences, just the JSON object."
        )},
        {"role": "user", "content": prompt},
    ]
    last_error = None

    for attempt in range(max_retries):
        raw = call_llm(messages)
        json_str = extract_json_from_response(raw)
        try:
            data = json.loads(json_str)
            jsonschema.validate(instance=data, schema=schema)
            return data
        except json.JSONDecodeError as e:
            last_error = f"JSON parse error: {e}. Raw output was: {json_str[:200]}"
        except jsonschema.ValidationError as e:
            last_error = f"Schema validation failed: {e.message}"

        # Append error feedback and retry
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": (
            f"That response had an error: {last_error}\n"
            "Please fix it and return only the corrected JSON."
        )})
        time.sleep(0.5 * (attempt + 1))  # back off slightly

    raise ValueError(f"Failed after {max_retries} attempts. Last error: {last_error}")
```

Pattern 2: Length constraints with corrective retries

Problem: You ask for a 2-sentence summary and get a paragraph. Or you ask for 500 words and get 80. Downstream rendering breaks.

Bad solution: Truncate with `response[:500]`.
You cut mid-sentence and produce garbage.

Good solution: Measure, then retry with a correction hint that quantifies the delta.

```python
def count_words(text: str) -> int:
    return len(text.split())


def call_with_length_constraint(
    prompt: str, min_words: int, max_words: int, max_retries: int = 3
) -> str:
    messages = [
        {"role": "system", "content": (
            f"Write responses between {min_words} and {max_words} words. "
            "Count carefully before submitting."
        )},
        {"role": "user", "content": prompt},
    ]

    for attempt in range(max_retries):
        response = call_llm(messages, max_tokens=max_words * 2)
        word_count = count_words(response)
        if min_words <= word_count <= max_words:
            return response

        delta = word_count - max_words if word_count > max_words else min_words - word_count
        direction = "shorter" if word_count > max_words else "longer"
        hint = (
            f"Your response was {word_count} words. "
            f"It needs to be {abs(delta)} words {direction}. "
            f"Target: {min_words}–{max_words} words. Rewrite it."
        )
        messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user", "content": hint})

    # Last resort: one more attempt, then hard-truncate at the word limit.
    # This can still cut mid-sentence, and a too-short answer is returned as-is.
    final = call_llm(messages, max_tokens=max_words * 2)
    words = final.split()
    if len(words) > max_words:
        return " ".join(words[:max_words])
    return final
```

Pattern 3: Regex fallback extraction

Problem: The model consistently wraps values in prose ("The severity is: HIGH") instead of returning a clean value. JSON parsing fails; you can't proceed.

Good solution: Regex extraction as a structured fallback. It is not a replacement for JSON, but a recovery layer for when JSON fails.

```python
FIELD_PATTERNS = {
    "severity": r"\b(LOW|MEDIUM|HIGH|CRITICAL)\b",
    "score": r"\b(\d+(?:\.\d+)?)\s*(?:/\s*10)?",
    "category": r"\b(spam|phishing|legitimate|malware|unknown)\b",
    "confidence": r"confidence[:\s]+(\d+(?:\.\d+)?)%?",
}


def extract_fields_with_regex(text: str, fields: list[str]) -> dict:
    """
    Attempt to extract structured fields from prose output using regex.
    Returns None for fields that cannot be extracted.
    """
    result = {}
    text_upper = text.upper()
    for field in fields:
        pattern = FIELD_PATTERNS.get(field)
        if not pattern:
            result[field] = None
            continue
        # Severity is matched against upper-cased text so the value comes back
        # in canonical upper case.
        match = re.search(pattern, text_upper if field == "severity" else text, re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


def classify_with_fallback(text_to_classify: str) -> dict:
    prompt = (
        f'Classify this text:\n\n"{text_to_classify}"\n\n'
        'Return JSON: {"category": "spam|phishing|legitimate", '
        '"severity": "LOW|MEDIUM|HIGH|CRITICAL", "confidence": 0-100}'
    )
    messages = [{"role": "user", "content": prompt}]
    raw = call_llm(messages, temperature=0.1)
    try:
        json_str = extract_json_from_response(raw)
        return json.loads(json_str)
    except (json.JSONDecodeError, ValueError):
        # Fallback: extract fields with regex
        extracted = extract_fields_with_regex(raw, ["category", "severity", "confidence"])
        if extracted["confidence"] is not None:
            # The regex yields a string; convert so numeric comparisons work later.
            extracted["confidence"] = float(extracted["confidence"])
        extracted["_extraction_method"] = "regex_fallback"
        return extracted
```

Pattern 4: Self-evaluation in a separate call

Problem: The model answers confidently even when it's guessing. You need a signal to route low-confidence answers to human review.

Key insight: Ask the model to evaluate its own answer in a separate call. Self-evaluation in the same call is biased upward.

```python
def get_answer_with_confidence(question: str, context: str) -> dict:
    # Step 1: Generate answer
    answer_messages = [
        {"role": "system", "content": "Answer based strictly on the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    answer = call_llm(answer_messages, temperature=0.2)

    # Step 2: Evaluate in a separate call
    eval_messages = [
        {"role": "system", "content": (
            "You are an impartial evaluator. Assess answer quality strictly. "
            "Return JSON: {\"confidence\": 0-100, \"issues\": [list of concerns], "
            "\"grounded\": true/false}"
        )},
        {"role": "user", "content": (
            f"Question: {question}\n\n"
            f"Context provided:\n{context}\n\n"
            f"Answer given:\n{answer}\n\n"
            "Evaluate: Is this answer fully supported by the context? "
            "Are there unsupported claims? Score 0-100."
        )},
    ]
    eval_raw = call_llm(eval_messages, temperature=0.0)
    try:
        eval_data = json.loads(extract_json_from_response(eval_raw))
    except (json.JSONDecodeError, ValueError):
        eval_data = {"confidence": 50, "issues": ["evaluation parse failed"], "grounded": None}

    return {
        "answer": answer,
        "confidence": eval_data.get("confidence", 50),
        "issues": eval_data.get("issues", []),
        "grounded": eval_data.get("grounded"),
        "needs_review": eval_data.get("confidence", 50) < 70,
    }
```

Pattern 5: Deduplicating batch outputs

Problem: You process 50 documents in batch and ask the model to extract key entities from each. You get overlapping, near-duplicate entries that pollute your downstream data.

Good solution: Hash-based exact dedup combined with a lightweight similarity check for near-duplicates.

```python
from difflib import SequenceMatcher


def deduplicate_outputs(items: list[str], similarity_threshold: float = 0.85) -> list[str]:
    """
    Remove exact duplicates (hash) and near-duplicates (sequence similarity).
    Keeps the first occurrence of each unique item.
    """
    seen_hashes: set[str] = set()
    unique_items: list[str] = []

    for item in items:
        normalized = item.strip().lower()
        item_hash = hashlib.md5(normalized.encode()).hexdigest()
        if item_hash in seen_hashes:
            continue  # exact duplicate

        # Check near-duplicate against existing unique items
        is_near_dup = any(
            SequenceMatcher(None, normalized, existing.strip().lower()).ratio()
            >= similarity_threshold
            for existing in unique_items
        )
        if not is_near_dup:
            unique_items.append(item)
            seen_hashes.add(item_hash)

    return unique_items


def batch_extract_entities(documents: list[str], entity_type: str) -> list[str]:
    all_entities = []
    for doc in documents:
        messages = [
            {"role": "system", "content": (
                f"Extract all {entity_type} from the text. "
                "Return a JSON array of strings. Nothing else."
            )},
            {"role": "user", "content": doc},
        ]
        raw = call_llm(messages, temperature=0.1)
        try:
            entities = json.loads(extract_json_from_response(raw))
            if isinstance(entities, list):
                # Keep only strings; deduplicate_outputs calls .strip() on each item.
                all_entities.extend(e for e in entities if isinstance(e, str))
        except (json.JSONDecodeError, ValueError):
            pass  # log and continue; one bad doc shouldn't stop the batch
    return deduplicate_outputs(all_entities)
```

Putting it together

These patterns compose. A production pipeline for classifying user-submitted content might chain them:

```python
def robust_classify(text: str) -> dict:
    try:
        result = call_with_json_schema(
            prompt=f'Classify this text: "{text}"',
            schema={
                "type": "object",
                "required": ["category", "severity", "confidence"],
                "properties": {
                    "category": {"type": "string", "enum": ["spam", "phishing", "legitimate", "toxic"]},
                    "severity": {"type": "string", "enum": ["LOW", "MEDIUM", "HIGH", "CRITICAL"]},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 100},
                },
            },
            max_retries=3,
        )
    except ValueError:
        # Pattern 3 fallback
        result = classify_with_fallback(text)

    # Pattern 4: flag for human review if uncertain.
    # The fallback path can return a missing or non-numeric confidence,
    # so anything that is not a number is also sent to review.
    confidence = result.get("confidence")
    result["needs_review"] = not isinstance(confidence, (int, float)) or confidence < 65
    return result
```

These five patterns cover the vast majority of production failures. Start with Pattern 1 (JSON schema + retry) and Pattern 3 (regex fallback); they handle 80% of output issues. Add Pattern 4 (self-evaluation) when you have a human review queue and need to route intelligently. For content pipelines like the moderation system described in practical security guides, Patterns 1 and 5 together eliminate most of the noise from batch LLM processing.
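As a rough illustration of the routing step, the sketch below sends low-confidence results to a review queue and accepts the rest. Everything here is an assumption made for the example: the `route_result` helper, the in-memory `deque` standing in for a real review queue, and the decision labels. The threshold of 65 mirrors the one used in the classification pipeline.

```python
from collections import deque


def route_result(result: dict, review_queue: deque, threshold: float = 65) -> str:
    """Send uncertain results to human review; auto-accept the rest.

    A missing or non-numeric confidence is treated as uncertain.
    """
    confidence = result.get("confidence")
    is_numeric = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_numeric or confidence < threshold:
        review_queue.append(result)
        return "human_review"
    return "auto_accept"


# Hypothetical in-memory queue; a real system would use a persistent one.
review_queue: deque = deque()
decisions = [
    route_result({"category": "spam", "confidence": 92}, review_queue),
    route_result({"category": "phishing", "confidence": 40}, review_queue),
    route_result({"category": None, "confidence": None}, review_queue),
]
```

Treating a missing confidence as uncertain is a deliberate choice in this sketch: a result that came back without a score is exactly the kind a reviewer should see.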