Structured Outputs: How We Stopped Parsing LLM Responses by Hand

A developer describes how their team uses OpenAI's structured outputs feature to enforce JSON schema compliance in LLM responses, eliminating parsing failures in production Django applications. By passing Pydantic models directly to the API, they obtain typed data objects instead of raw strings, which they use in a document processing pipeline for contract extraction.

Every team we talk to has a version of the same story. They built an LLM integration that works well in testing. Then, three weeks into production, something comes back slightly different: the model wraps the JSON in a code block, or uses `"status": "Completed"` instead of `"status": "complete"`, or includes an extra key that breaks the downstream parser. The whole pipeline falls over.

This post is about how we handle that problem: specifically, how we use structured outputs to get reliable, typed data from LLMs in production Django applications, and where the approach still has limits.

When you ask an LLM to "return JSON", it usually does. Until it doesn't. The failure modes are predictable once you've seen them enough times:

- The response arrives wrapped in a Markdown code fence tagged `json`.
- Key names drift between responses: `customer_id` vs `customerId` vs `customer id`.

None of this is surprising: the model is a text predictor, not a JSON serialiser. Treating its output as reliable structured data requires you to either enforce structure at generation time, or write defensive parsing code that handles every variant. The second path is a maintenance problem that compounds over time.

The cleaner approach is to constrain what the model can generate. OpenAI's structured outputs feature, available since August 2024, lets you pass a JSON schema to the API, and the API constrains the model's output to conform to it. No code fences, no stray fields, no type mismatches.

We define our schemas with Pydantic and pass them directly to the API:

```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ExtractionResult(BaseModel):
    company_name: str
    industry: str
    annual_revenue_usd: int | None
    employee_count: int | None
    confidence: Literal["high", "medium", "low"]
    notes: str


def extract_company_info(raw_text: str) -> ExtractionResult:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract structured company information from the provided text. "
                    "Use null for fields you cannot determine with reasonable confidence."
                ),
            },
            {"role": "user", "content": raw_text},
        ],
        response_format=ExtractionResult,
    )
    return response.choices[0].message.parsed
```

The return value is a proper Pydantic model instance. You can access `result.company_name` directly, pass it to a Django serializer, or store it in a `JSONField`. It is typed data, not a string you have to parse.

We use this pattern in a document processing pipeline where we extract key fields from uploaded contracts and business documents before routing them for human review.

```python
# models.py
from django.db import models


class Document(models.Model):
    STATUS_CHOICES = [
        ("pending", "Pending"),
        ("processing", "Processing"),
        ("extracted", "Extracted"),
        ("failed", "Failed"),
        ("needs_review", "Needs Review"),
    ]

    file = models.FileField(upload_to="documents/")
    raw_text = models.TextField(blank=True)
    extracted_data = models.JSONField(null=True, blank=True)
    extraction_confidence = models.CharField(max_length=10, blank=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="pending")
    created_at = models.DateTimeField(auto_now_add=True)
```

```python
# tasks.py (Celery)
import logging
from typing import Literal

from celery import shared_task
from openai import OpenAI
from pydantic import BaseModel

logger = logging.getLogger(__name__)
client = OpenAI()


class ContractExtraction(BaseModel):
    counterparty_name: str
    contract_value_usd: int | None
    start_date: str | None  # ISO 8601
    end_date: str | None
    auto_renewal: bool
    governing_law: str | None
    confidence: Literal["high", "medium", "low"]


@shared_task
def extract_document_fields(document_id: int):
    from .models import Document

    doc = Document.objects.get(id=document_id)
    doc.status = "processing"
    doc.save(update_fields=["status"])

    try:
        response = client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key fields from this contract. "
                        "Use null for fields not present or unclear. "
                        "Set confidence to 'low' if you are uncertain about any critical field."
                    ),
                },
                {"role": "user", "content": doc.raw_text[:8000]},  # Stay within context
            ],
            response_format=ContractExtraction,
        )
        result = response.choices[0].message.parsed
        doc.extracted_data = result.model_dump()
        doc.extraction_confidence = result.confidence
        # Low-confidence extractions go to a human instead of straight through
        doc.status = "needs_review" if result.confidence == "low" else "extracted"
    except Exception as e:
        logger.error(f"Extraction failed for document {document_id}: {e}")
        doc.status = "failed"

    doc.save()
```

The key decision here: low-confidence extractions automatically route to human review. The `confidence` field is part of the schema. We instruct the model to self-report uncertainty, and we act on it. This is the same principle as our agent designs: the human review path is first-class, not a fallback.

One case structured outputs cannot prevent is a model refusal. If the model decides the input violates its content policy, `response.choices[0].message.parsed` will be `None` and `response.choices[0].message.refusal` will contain the refusal message. This needs explicit handling:

```python
message = response.choices[0].message
if message.refusal:
    logger.warning(f"Model refused extraction for document {document_id}: {message.refusal}")
    doc.status = "needs_review"
    doc.save(update_fields=["status"])
    return

result = message.parsed
```

In practice, refusals are rare for document extraction tasks. They are more common when you are doing classification or analysis on content that might be flagged: customer support tickets, forum posts, unmoderated user content. If your pipeline processes that kind of input, test refusal handling early.

If you are using Anthropic's Claude models (which we also use for some tasks), the equivalent mechanism is tool use. You define a tool with a JSON schema, instruct the model to always call it, and get structured output through the tool call rather than the message content.
With the Anthropic SDK, the forced tool call looks like this:

```python
import anthropic

client = anthropic.Anthropic()

extraction_tool = {
    "name": "extract_contract_fields",
    "description": "Extract structured fields from the contract text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "counterparty_name": {"type": "string"},
            "contract_value_usd": {"type": ["integer", "null"]},
            "start_date": {"type": ["string", "null"]},
            "end_date": {"type": ["string", "null"]},
            "auto_renewal": {"type": "boolean"},
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
        },
        "required": ["counterparty_name", "auto_renewal", "confidence"],
    },
}


def extract_with_claude(raw_text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "extract_contract_fields"},
        messages=[
            {"role": "user", "content": f"Extract fields from this contract:\n\n{raw_text}"}
        ],
    )
    # With a forced tool_choice, the response contains a tool_use block
    tool_use_block = next(b for b in response.content if b.type == "tool_use")
    return tool_use_block.input  # Already a dict, no JSON parsing needed
```

The `tool_choice` parameter forces the model to always call the specified tool rather than choosing to respond in prose. Without it, the model might sometimes call the tool and sometimes answer in text, which is not useful in a production pipeline.

A few things worth being clear about regarding what structured outputs do not do:

- They do not fix bad prompts. If your system prompt is vague about what a field should contain, you will get consistent structure but inconsistent semantics. `confidence: "high"` means whatever the model inferred it means, not whatever you intended. Schema design and prompt design go together.
- They do not prevent hallucination. The model can still make up a contract value or misattribute a date. You are getting reliably shaped data, but its accuracy still depends on the model's reasoning and the quality of the source text. For high-stakes fields, add a verification step that cross-checks extracted values against source text.
- They add latency. Structured output generation with constrained decoding can be slightly slower than unconstrained generation. For real-time user-facing features, measure this before committing to the pattern. For background processing pipelines, it generally does not matter.

Structured outputs are not exotic. They are just the right default when you need typed data from an LLM. Free-text parsing is a trap that costs you maintenance time and production incidents over the long run.

If you are building an LLM integration that outputs data to a database, an API, or another system: define a Pydantic schema, use `response_format`, handle refusals, and route low-confidence results to human review. That is the pattern. It is not complicated once you have seen it, but it makes a meaningful difference in how reliably the system runs.

Lycore builds [production AI systems](https://www.lycore.com/ai-development-services/) for businesses: document intelligence, agents, RAG pipelines, and custom LLM integrations on Django, React, Flutter, and .NET. [Get in touch](https://www.lycore.com/contact-us/) if you want to talk through your use case.
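As a closing illustration of the cross-check suggested earlier for high-stakes fields, here is a minimal sketch that uses only the standard library. The helper names (`value_appears_in_text`, `unverified_fields`) and the sample contract text are hypothetical, and the matching rules are deliberately naive: integers are compared against comma-grouped numbers in the text, strings by case-insensitive substring match.

```python
import re


def value_appears_in_text(value, source_text: str) -> bool:
    """Return True if an extracted value can be found in the source text."""
    if value is None or isinstance(value, bool):
        return True  # nothing to look up in the text
    if isinstance(value, int):
        # Compare numerically so "1,250,000" in the text matches 1250000
        numbers = re.findall(r"\d[\d,]*", source_text)
        return any(int(n.replace(",", "")) == value for n in numbers)
    # Strings: ignore case and collapse whitespace before searching
    needle = " ".join(str(value).lower().split())
    haystack = " ".join(source_text.lower().split())
    return needle in haystack


def unverified_fields(extracted: dict, source_text: str, fields: list[str]) -> list[str]:
    """List the checked fields whose values were not found in the source."""
    return [
        name for name in fields
        if not value_appears_in_text(extracted.get(name), source_text)
    ]


source = (
    "This Agreement is made between Acme Ltd and Globex Corp. "
    "The total contract value is USD 1,250,000."
)
extracted = {"counterparty_name": "Globex Corp", "contract_value_usd": 1200000}

suspect = unverified_fields(
    extracted, source, ["counterparty_name", "contract_value_usd"]
)
```

Here the contract value does not match any number in the text, so that field would be flagged. A check like this errs toward false alarms (a value written as "1.25 million" would also be flagged), which suits a pipeline where flagged documents simply go to human review.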