Structured Outputs: How We Stopped Parsing LLM Responses by Hand

wpnews.pro

Every team we talk to has a version of the same story. They built an LLM integration that works well in testing. Then, three weeks into production, something comes back slightly different: the model wraps the JSON in a code block, or uses "status": "Completed" instead of "status": "complete", or includes an extra key that breaks the downstream parser. The whole pipeline falls over.

This post is about how we handle that problem — specifically, how we use structured outputs to get reliable, typed data from LLMs in production Django applications, and where the approach still has limits.

When you ask an LLM to "return JSON", it usually does. Until it doesn't.

The failure modes are predictable once you've seen them enough times:

- The response arrives wrapped in a Markdown code fence (json ...)
- Key names drift between calls (customer_id vs customerId vs customer id)

None of this is surprising: the model is a text predictor, not a JSON serialiser. Treating its output as reliable structured data requires you to either enforce structure at generation time, or write defensive parsing code that handles every variant. The second path is a maintenance problem that compounds over time.
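To make the maintenance cost concrete, here is a minimal sketch of what the defensive-parsing path tends to look like. The function name parse_llm_json and the KEY_ALIASES table are hypothetical, invented for illustration; the point is that every new variant the model produces means another hand-written entry or branch.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker

# Hypothetical alias table: each key variant seen in production is added by hand.
KEY_ALIASES = {"customerid": "customer_id", "customer_id": "customer_id"}

FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def parse_llm_json(raw: str) -> dict:
    """Best-effort parse of a free-text model reply that should contain JSON."""
    text = raw.strip()
    match = FENCE_RE.match(text)
    if match:  # the model wrapped the payload in a code fence
        text = match.group(1)
    data = json.loads(text)
    normalised = {}
    for key, value in data.items():
        # Collapse case, spaces and hyphens so "customerId" and "customer id" match.
        lookup = re.sub(r"[\s\-]+", "", key).lower()
        normalised[KEY_ALIASES.get(lookup, key)] = value
    return normalised
```

This handles only the two variants listed above. Wrong value types, trailing prose, and unexpected extra keys would each need further code.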

The cleaner approach is to constrain what the model can generate. OpenAI's structured outputs feature (available since August 2024) lets you pass a JSON schema to the API, and the model is guaranteed to return output that conforms to it. No code fences, no stray fields, no type mismatches.
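For orientation, the sketch below shows roughly what a raw schema-based response_format value can look like. The schema name ticket_status, its fields, and the helper strict_mode_problems are illustrative assumptions. The helper only checks two top-level requirements that strict mode is documented to impose (every property listed as required, and additional properties disallowed); it is not a complete validator.

```python
# Illustrative schema; the name "ticket_status" and its fields are assumptions.
ticket_schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["open", "complete"]},
        "customer_id": {"type": "string"},
        "notes": {"type": ["string", "null"]},  # optional values become nullable
    },
    "required": ["status", "customer_id", "notes"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ticket_status", "strict": True, "schema": ticket_schema},
}


def strict_mode_problems(schema: dict) -> list[str]:
    """Hypothetical pre-flight check for the top level of an object schema."""
    problems = []
    missing = set(schema.get("properties", {})) - set(schema.get("required", []))
    if missing:
        problems.append(f"not in required: {sorted(missing)}")
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be False")
    return problems
```

Note how the enum on status rules out the "Completed" versus "complete" drift at the schema level rather than in parsing code.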

We define our schemas with Pydantic and pass them directly to the API:

from pydantic import BaseModel
from openai import OpenAI
from typing import Literal

client = OpenAI()

class ExtractionResult(BaseModel):
    company_name: str
    industry: str
    annual_revenue_usd: int | None
    employee_count: int | None
    confidence: Literal["high", "medium", "low"]
    notes: str

def extract_company_info(raw_text: str) -> ExtractionResult:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract structured company information from the provided text. "
                    "Use null for fields you cannot determine with reasonable confidence."
                ),
            },
            {"role": "user", "content": raw_text},
        ],
        response_format=ExtractionResult,
    )
    return response.choices[0].message.parsed

The return value is a proper Pydantic model instance. You can access result.company_name directly, pass it to a Django serializer, or store it in a JSONField: it is typed data, not a string you have to parse.

We use this pattern in a document processing pipeline where we extract key fields from uploaded contracts and business documents before routing them for human review.

# models.py
from django.db import models

class Document(models.Model):
    STATUS_CHOICES = [
        ("pending", "Pending"),
        ("processing", "Processing"),
        ("extracted", "Extracted"),
        ("failed", "Failed"),
        ("needs_review", "Needs Review"),
    ]

    file = models.FileField(upload_to="documents/")
    raw_text = models.TextField(blank=True)
    extracted_data = models.JSONField(null=True, blank=True)
    extraction_confidence = models.CharField(max_length=10, blank=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="pending")
    created_at = models.DateTimeField(auto_now_add=True)

# tasks.py
import logging
from typing import Literal

from celery import shared_task
from openai import OpenAI
from pydantic import BaseModel

logger = logging.getLogger(__name__)
client = OpenAI()

class ContractExtraction(BaseModel):
    counterparty_name: str
    contract_value_usd: int | None
    start_date: str | None  # ISO 8601
    end_date: str | None
    auto_renewal: bool
    governing_law: str | None
    confidence: Literal["high", "medium", "low"]

@shared_task
def extract_document_fields(document_id: int):
    from .models import Document

    doc = Document.objects.get(id=document_id)
    doc.status = "processing"
    doc.save(update_fields=["status"])

    try:
        response = client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key fields from this contract. "
                        "Use null for fields not present or unclear. "
                        "Set confidence to 'low' if you are uncertain about any critical field."
                    ),
                },
                {"role": "user", "content": doc.raw_text[:8000]},  # Truncate to 8,000 characters to stay within context
            ],
            response_format=ContractExtraction,
        )

        result = response.choices[0].message.parsed

        doc.extracted_data = result.model_dump()
        doc.extraction_confidence = result.confidence
        doc.status = "needs_review" if result.confidence == "low" else "extracted"

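    except Exception:
        # Log with traceback. A refusal (parsed is None) also lands here
        # unless it is handled explicitly before model_dump() is called.
        logger.exception("Extraction failed for document %s", document_id)
        doc.status = "failed"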

    doc.save()

The key decision here: low-confidence extractions automatically route to human review. The confidence field is part of the schema — we instruct the model to self-report uncertainty, and we act on it. This is the same principle as our agent designs: the human review path is first-class, not a fallback.
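One way to keep that routing rule easy to test is to pull it out of the task into a pure function. The sketch below is an assumption about how that could be organised, not a description of the pipeline above; the names AUTO_ACCEPT_CONFIDENCE and next_status are hypothetical.

```python
# Confidence values that may skip human review; anything else goes to a person.
AUTO_ACCEPT_CONFIDENCE = {"high", "medium"}


def next_status(confidence: str) -> str:
    """Map a self-reported confidence value to a Document status."""
    if confidence in AUTO_ACCEPT_CONFIDENCE:
        return "extracted"
    # "low" and any unexpected value fail safe into the review queue.
    return "needs_review"
```

Inside the task this would replace the inline conditional with doc.status = next_status(result.confidence). The fail-safe default means a new confidence level added to the schema later is reviewed by a human until someone explicitly allows it through.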

The one case structured outputs cannot prevent is a model refusal. If the model decides the input violates its content policy, response.choices[0].message.parsed will be None and response.choices[0].message.refusal will contain the refusal message.

This needs explicit handling:

message = response.choices[0].message

if message.refusal:
    logger.warning(f"Model refused extraction for document {document_id}: {message.refusal}")
    doc.status = "needs_review"
    doc.save(update_fields=["status"])
    return

result = message.parsed

In practice, refusals are rare for document extraction tasks. They are more common when you are doing classification or analysis on content that might be flagged — customer support tickets, forum posts, unmoderated user content. If your pipeline processes that kind of input, test refusal handling early.
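Refusals are hard to trigger on demand, so one option is to test the handling against a stand-in object instead of the live API. The sketch below assumes only the attribute path used above (choices[0].message.parsed and .refusal); the helpers parsed_or_none and fake_response are hypothetical names.

```python
from types import SimpleNamespace


def parsed_or_none(response):
    """Return the parsed result, or None when the model refused."""
    message = response.choices[0].message
    if getattr(message, "refusal", None):
        return None
    return message.parsed


def fake_response(parsed=None, refusal=None):
    """Build a stand-in with the same attribute path as the SDK response."""
    message = SimpleNamespace(parsed=parsed, refusal=refusal)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def test_refusal_returns_none():
    assert parsed_or_none(fake_response(refusal="I can't help with that.")) is None


def test_success_returns_parsed():
    payload = {"counterparty_name": "Acme Ltd"}
    assert parsed_or_none(fake_response(parsed=payload)) == payload
```

The two test functions follow pytest naming conventions but do not depend on pytest; they can be called directly.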

If you are using Anthropic's Claude models (which we also use for some tasks), the equivalent mechanism is tool use. You define a tool with a JSON schema, instruct the model to always call it, and get structured output through the tool call rather than the message content.

import anthropic

client = anthropic.Anthropic()

extraction_tool = {
    "name": "extract_contract_fields",
    "description": "Extract structured fields from the contract text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "counterparty_name": {"type": "string"},
            "contract_value_usd": {"type": ["integer", "null"]},
            "start_date": {"type": ["string", "null"]},
            "end_date": {"type": ["string", "null"]},
            "auto_renewal": {"type": "boolean"},
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
        },
        "required": ["counterparty_name", "auto_renewal", "confidence"],
    },
}

def extract_with_claude(raw_text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "extract_contract_fields"},
        messages=[
            {"role": "user", "content": f"Extract fields from this contract:\n\n{raw_text}"}
        ],
    )

    tool_use_block = next(b for b in response.content if b.type == "tool_use")
    return tool_use_block.input  # Already a dict; validate it before trusting it

The tool_choice parameter forces the model to always call the specified tool rather than choosing to respond in prose. Without it, the model might sometimes call the tool and sometimes answer in text, which is not useful in a production pipeline.
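Because the tool input comes back as a plain dict, it may be worth checking it against the schema before saving it. The following is a hand-rolled sketch that covers only required fields, unknown keys, and enum values; check_tool_input is a hypothetical helper, and a full JSON Schema validator or a Pydantic model would be more thorough.

```python
def check_tool_input(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means these basic checks passed."""
    problems = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, value in data.items():
        spec = properties.get(key)
        if spec is None:
            problems.append(f"unexpected field: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} not in {spec['enum']}")
    return problems


# Trimmed copy of the tool schema above, kept small for illustration.
example_schema = {
    "properties": {
        "auto_renewal": {"type": "boolean"},
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
    },
    "required": ["auto_renewal", "confidence"],
}
```

In the function above, this could be called as check_tool_input(tool_use_block.input, extraction_tool["input_schema"]), routing any non-empty result to human review.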

A few things structured outputs do not do are worth being clear about:

They do not fix bad prompts. If your system prompt is vague about what a field should contain, you will get consistent structure but inconsistent semantics. confidence: "high" means whatever the model inferred it means, not whatever you intended. Schema design and prompt design go together.

They do not prevent hallucination. The model can still make up a contract value or misattribute a date. You are getting reliably shaped data — its accuracy still depends on the model's reasoning and the quality of the source text. For high-stakes fields, add a verification step that cross-checks extracted values against source text.
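A verification step can start very simply. The sketch below flags extracted values that cannot be found in the source text at all; the function name unsupported_fields is hypothetical, and it only looks at the two fields shown. It will miss paraphrased values such as "1.25 million", so treat a flag as a reason for review rather than proof of an error.

```python
import re


def digits_only(text: str) -> str:
    """Strip everything except digits, so "1,250,000" becomes "1250000"."""
    return re.sub(r"\D", "", text)


def unsupported_fields(extracted: dict, source_text: str) -> list[str]:
    """Names of fields whose values do not appear in the source text."""
    flagged = []
    name = extracted.get("counterparty_name")
    if name and name.lower() not in source_text.lower():
        flagged.append("counterparty_name")
    value = extracted.get("contract_value_usd")
    if value is not None:
        # Collect every number in the text, ignoring thousands separators.
        numbers = {digits_only(m) for m in re.findall(r"\d[\d,]*", source_text)}
        if str(value) not in numbers:
            flagged.append("contract_value_usd")
    return flagged
```

A non-empty result could feed the same needs_review route used for low-confidence extractions.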

They add latency. Structured output generation with constrained decoding is slightly slower than unconstrained generation, and OpenAI documents additional latency on the first request with a new schema while that schema is processed. For real-time user-facing features, measure this before committing to the pattern. For background processing pipelines, it generally does not matter.
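Measuring does not need special tooling. Here is a minimal timing sketch using only the standard library; time_calls is a hypothetical helper, and the callable you pass in would wrap whichever API call you want to compare.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], runs: int = 5) -> dict:
    """Call fn repeatedly and report wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "first": samples[0],  # reported separately; the first call may be slower
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

For example, time_calls(lambda: extract_company_info(sample_text)) could be compared against the same prompt sent without response_format, where sample_text is a representative document of your own. Network variance is large, so a handful of runs gives only a rough picture.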

Structured outputs are not exotic — they are just the right default when you need typed data from an LLM. Free-text parsing is a trap that costs you maintenance time and production incidents over the long run.

If you are building an LLM integration that outputs data to a database, an API, or another system: define a Pydantic schema, use response_format, handle refusals, and route low-confidence results to human review. That is the pattern. It is not complicated once you have seen it, but it makes a meaningful difference in how reliably the system runs.

Lycore builds production AI systems for businesses — document intelligence, agents, RAG pipelines, and custom LLM integrations on Django, React, Flutter, and .NET. Get in touch if you want to talk through your use case.

source & further reading

dev.to (original article)
