LLMs suck at generating large, structured data. Tips on how to get your AI agent to do it reliably

A developer building production-grade AI applications has found that large language models fail at generating structured data reliably, encountering issues like schema drift, hallucinated fields, and all-or-nothing failures with JSON outputs. The engineer implemented an object-oriented Builder pattern approach where the model incrementally constructs structured output through tool calls rather than producing a final JSON blob, with the structured data accumulating outside the token window. This method, which the developer applied to an AI app processing insurance claims and legal documents, solves problems with context limits and allows agents to compress conversations mid-flight without losing collected data.

LLMs are great at generating text. They're terrible at generating structured data reliably. If you've ever tried to get an agent to produce a JSON object with a specific schema, you know the pain: missing fields, hallucinated keys, inconsistent types, and outputs that break your downstream pipeline.

As I got past toy examples and labs to work on real, production-grade AI apps, I ran into this problem and found an approach that works remarkably well for an AI app I'm building: use tools the way object-oriented programming uses the Builder pattern. Instead of asking the model to produce a final JSON blob, you give it tools that incrementally build the output, like calling methods on an object. The model never sees or produces the final structure directly. It just calls functions, and the structured output emerges as a side effect.

This matters especially when your agent processes large documents (insurance forms, legal filings, medical records) that eat up most of the context window. When the input is big and the task is multi-step, you can't afford to also reserve space for a massive structured output at the end. The accumulator pattern lets you compress the conversation mid-flight without losing any of the structured data you've already collected, because that data lives outside the token window entirely.

The naive approach, asking a model to output a complete JSON structure, fails in predictable ways:

Schema drift. The model forgets required fields, invents new ones, or changes types between runs. A date field might be a string one time and an object the next.

All-or-nothing failure. If the model makes one mistake in a 200-line JSON output, the entire thing is unparseable. You either retry the whole generation or write brittle fixup code.

No incremental progress. If the model hits a context limit or stops mid-generation, you lose everything. There's no partial result to recover from.

Hallucination in structure. Models are more likely to hallucinate when producing structured output in one shot. They fill in fields they're uncertain about rather than leaving them empty, because the structure demands completeness.

Coupling research and output. When an agent needs to gather information and produce structured output, asking it to do both in one pass means it can't iterate. It commits to a structure before it has all the facts.

`response_format` and function-calling schemas aren't enough

Structured output modes like OpenAI's `response_format: json_schema` or Bedrock's tool result schemas help with syntax: you'll get valid JSON. But they don't solve the semantic problem. The model still has to produce the entire structure in one shot, and it still hallucinates content to fill required fields.

Any team building autonomous or semi-autonomous agents faces this, not just me. Kiro CLI, AWS's agentic dev companion, for instance, struggled hard with large data structures when first launched. Since then, its maintainers have equipped its harness with JSON capabilities (jq manipulations, for instance) and multiple strategies (extensive use of grep, glob, tail...) to avoid filling the context window. Still, I'm happy to know I'm not alone in facing this.

Here are a few tricks I have used successfully to control both agent output and context window. I don't claim to have all the recipes, so don't hesitate to comment with your own or tag me in your own posts.

The core idea: define tools that act like OOP builder methods. Each tool call adds one well-typed element to an accumulator. The model's job shifts from "produce this structure" to "call these functions in the right order."
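To make that shift concrete before bringing in any SDK, here is a minimal, framework-free sketch contrasting the two approaches. Everything in it is made up for illustration: the truncated blob, the sample names, and the `parse_blob` and `add_item` helpers are hypothetical, not part of any library.

```python
import json

# Hypothetical model output, cut off mid-generation (e.g. at a token limit).
truncated_blob = (
    '{"parties": [{"name": "Ada Lopez", "role": "claimant"}, '
    '{"name": "Bo Chen", "ro'
)

def parse_blob(text: str) -> list[dict]:
    """One-shot approach: a single syntax error loses every item."""
    try:
        return json.loads(text)["parties"]
    except json.JSONDecodeError:
        return []

# Builder-style approach: each call validates and stores exactly one item.
VALID_ROLES = {"claimant", "insured", "witness"}
accumulator: list[dict] = []

def add_item(name: str, role: str) -> str:
    """Validate one party and append it; return feedback for the model."""
    if role not in VALID_ROLES:
        return f"Error: invalid role '{role}'."
    accumulator.append({"name": name, "role": role})
    return f"Added {role}: {name}"

# The same kind of information arriving as separate calls; the second is invalid.
calls = [("Ada Lopez", "claimant"), ("Bo Chen", "bystander"), ("Cy Dunn", "witness")]
results = [add_item(name, role) for name, role in calls]

print(parse_blob(truncated_blob))  # what survives the one-shot parse
print(accumulator)                 # what survives the builder approach
print(results[1])                  # feedback returned for the bad call
```

The one-shot parse recovers nothing from the truncated blob, while the builder keeps every valid item and hands back an error message for the invalid one only.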
Here's the pattern. Imagine an agent that processes insurance claims by reading documents and building a structured claim assessment:

```python
from strands import tool

# The accumulator: this is your structured output
claim_output = {
    "parties": [],
    "events": [],
    "damages": [],
    "evidence": [],
    "assessment": None,
}


def reset_output():
    """Clear the accumulator in place so existing references stay valid."""
    claim_output["assessment"] = None
    for k in ("parties", "events", "damages", "evidence"):
        claim_output[k] = []


@tool
def add_party(name: str, role: str, policy_id: str = "") -> str:
    """Register a party involved in the claim.

    Args:
        name: Full name of the person or organization.
        role: One of: claimant, insured, witness, adjuster, third_party
        policy_id: Policy number if applicable.

    Returns:
        Confirmation with party details.
    """
    if role not in ("claimant", "insured", "witness", "adjuster", "third_party"):
        return f"Error: invalid role '{role}'. Must be one of: claimant, insured, witness, adjuster, third_party"
    claim_output["parties"].append({
        "name": name,
        "role": role,
        "policy_id": policy_id,
    })
    return f"Added {role}: {name}"


@tool
def add_event(description: str, date: str, location: str = "") -> str:
    """Record a chronological event relevant to the claim.

    Args:
        description: What happened (1-3 sentences).
        date: ISO date string (YYYY-MM-DD).
        location: Where it happened (optional).
    """
    claim_output["events"].append({
        "description": description,
        "date": date,
        "location": location,
    })
    return f"Recorded event on {date} ({len(claim_output['events'])} events total)"


@tool
def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    """Register a damage item with estimated cost.

    Args:
        item: Description of the damaged item or cost.
        amount: Estimated cost in dollars.
        category: One of: property, medical, liability, lost_income
        evidence_ref: Reference to supporting evidence (optional).
    """
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."
    claim_output["damages"].append({
        "item": item,
        "amount": amount,
        "category": category,
        "evidence_ref": evidence_ref,
    })
    # Report the running total so the model can track progress
    total = sum(d["amount"] for d in claim_output["damages"])
    return f"Added damage: {item} (${amount:.2f}). Running total: ${total:.2f}"
```

The agent is given these tools and a system prompt that tells it to process a claim. As it reads documents and discovers information, it calls `add_party`, `add_event`, and `add_damage`. The structured output builds up incrementally.

Each tool call is a validation checkpoint. You can reject bad input immediately:

```python
@tool
def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."
    if amount <= 0:
        return f"Error: amount must be positive, got {amount}."
    # Referential check: evidence must be registered before it is cited
    if evidence_ref and evidence_ref not in [e["id"] for e in claim_output["evidence"]]:
        return f"Error: evidence '{evidence_ref}' not registered. Call add_evidence first."
    ...
```

The model gets instant feedback. If it tries to reference evidence it hasn't registered yet, the tool tells it, and the model can self-correct on the next turn. Compare this to validating a 500-line JSON blob after the fact: by then, the model has moved on and can't fix its mistakes in context.

A key benefit: the same agent can have reading tools and writing tools. Reading tools fetch and explore data. Writing tools construct the output.
The model interleaves them naturally:

```python
from strands import Agent

# model, prompt, claim_text and the reading tools are assumed to be defined elsewhere
agent = Agent(
    model=model,
    system_prompt=prompt,
    tools=[
        # Reading tools
        read_document, search_policy, get_weather_report,
        # Writing tools (builder methods)
        add_party, add_event, add_damage, add_evidence, set_assessment,
        # Progress tracking
        mark_step_done,
    ],
)

# One call: the agent reads documents AND builds structured output
agent("Process this claim: " + claim_text)

# Output is ready
print(claim_output)
```

The model reads a police report, extracts a party, reads a medical bill, registers a damage item, cross-references the policy, and so on. Research and output construction are interleaved rather than sequential.

Because output accumulates incrementally, you get crash recovery for free:

```python
STEPS = [
    "1. Identify all parties",
    "2. Establish timeline of events",
    "3. Catalog damages with evidence",
    "4. Cross-reference policy coverage",
    "5. Produce assessment",
]
completed_steps: list[int] = []


@tool
def mark_step_done(step_number: int) -> str:
    """Mark a processing step as completed."""
    completed_steps.append(step_number)
    # Steps are numbered from 1, so enumerate from 1 as well
    remaining = [s for i, s in enumerate(STEPS, 1) if i not in completed_steps]
    return f"Step {step_number} done. Remaining: {', '.join(remaining)}"
```

If the agent hits a context window limit or errors out, you already have partial results: every party identified, every event recorded, every damage item cataloged up to that point. You can resume or use what you have.

Here's where this pattern really pays off. When your agent ingests a 30-page document and then makes dozens of tool calls to fetch additional sources, the context window fills up fast. In a traditional approach, you'd lose your structured output along with the conversation when you hit the limit. But because the accumulator lives in Python memory, not in the message history, you can aggressively compress the conversation without losing a single data point.
A custom conversation manager (a possibility offered, for instance, by the Strands Agents SDK: https://strandsagents.com/docs/user-guide/concepts/agents/conversation-management/#creating-a-conversationmanager) replaces old messages with a compact state summary derived from the accumulator:

```python
from strands.agent.conversation_manager import ConversationManager


class ClaimConversationManager(ConversationManager):
    # Assumption: the SDK's base class may also require a reduce_context()
    # method (used when the context overflows); it is not shown here.

    def apply_management(self, agent, **kwargs):
        messages = agent.messages
        # Nothing sits between the first message and the last two
        if len(messages) <= 3:
            return

        # Keep first message + last 2 messages.
        # Replace everything in between with a state summary.
        first_msg = messages[0]
        recent = messages[-2:]
        state = self._build_state_summary()
        state_msg = {
            "role": "user",
            "content": [{"text": f"[STATE]\n{state}\n\nContinue."}],
        }
        # Slice assignment mutates the agent's own message list in place
        messages[:] = [first_msg, state_msg] + recent

    def _build_state_summary(self) -> str:
        """Summarize what's been done using the accumulator state."""
        lines = []
        if claim_output["parties"]:
            parties = [f"{p['name']} ({p['role']})" for p in claim_output["parties"]]
            lines.append(f"Parties: {', '.join(parties)}")
        if claim_output["damages"]:
            total = sum(d["amount"] for d in claim_output["damages"])
            lines.append(f"Damages: {len(claim_output['damages'])} items, ${total:.2f} total")
        if claim_output["events"]:
            lines.append(f"Events: {len(claim_output['events'])} recorded")
        return "\n".join(lines)
```

Because the structured output lives in Python (not in the conversation), context compression doesn't lose any data. The model can always see what it has already produced by reading the state summary.

Each tool has typed parameters enforced by the framework. The model must provide a category that's one of property, medical, liability, lost_income, not because you're parsing JSON and checking after the fact, but because the tool checks it at call time. Invalid calls get rejected with clear error messages.

Tools compose naturally. You can add new output fields by adding new tools without changing existing ones. Want to track evidence attachments? Add an `add_evidence` tool. Want a final recommendation? Add a `set_assessment` tool.
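The two tools just mentioned are never shown in this post, so here is a minimal sketch of what they might look like. Only the `"id"` key on evidence entries comes from the validation code earlier; everything else is an assumption: the parameter names (`evidence_id`, `kind`, `summary`, `decision`, `rationale`), the allowed decisions, and the rule that an assessment requires at least one damage item. The functions are plain Python so the sketch runs without the SDK; in the agent they would also carry the `@tool` decorator.

```python
# Hypothetical sketch: plain functions, no SDK required.
claim_output = {
    "parties": [],
    "events": [],
    "damages": [],
    "evidence": [],
    "assessment": None,
}

# Assumed set of outcomes; adapt to your own domain.
ASSESSMENT_DECISIONS = ("approve", "deny", "needs_review")


def add_evidence(evidence_id: str, kind: str, summary: str) -> str:
    """Register a piece of evidence so damage items can reference it by id."""
    if any(e["id"] == evidence_id for e in claim_output["evidence"]):
        return f"Error: evidence '{evidence_id}' already registered."
    claim_output["evidence"].append(
        {"id": evidence_id, "kind": kind, "summary": summary}
    )
    return f"Registered evidence {evidence_id} ({len(claim_output['evidence'])} total)"


def set_assessment(decision: str, rationale: str) -> str:
    """Set the final assessment; refuses to run before any damage is recorded."""
    if decision not in ASSESSMENT_DECISIONS:
        allowed = ", ".join(ASSESSMENT_DECISIONS)
        return f"Error: invalid decision '{decision}'. Must be one of: {allowed}"
    if not claim_output["damages"]:
        return "Error: no damages recorded yet. Call add_damage first."
    claim_output["assessment"] = {"decision": decision, "rationale": rationale}
    return f"Assessment set: {decision}"


print(add_evidence("E1", "photo", "Roof after the storm"))
# Rejected here because no damage item has been recorded yet
print(set_assessment("approve", "Covered under the policy"))
```

Neither function touches the existing tools: each one only writes to its own key of the accumulator, which is what keeps additions like these cheap.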
The model discovers new capabilities through its tool list.

Each tool is a pure function (or close to it). You can unit test them independently:

```python
# Assumes add_damage, reset_output and claim_output are imported
# from the module that defines the tools.

def test_add_damage_rejects_invalid_category():
    reset_output()
    result = add_damage(item="Roof repair", amount=5000, category="cosmetic")
    assert "Error" in result
    assert len(claim_output["damages"]) == 0


def test_add_damage_tracks_total():
    reset_output()
    add_damage(item="Roof repair", amount=5000, category="property")
    add_damage(item="Water damage", amount=2000, category="property")
    assert len(claim_output["damages"]) == 2
    assert sum(d["amount"] for d in claim_output["damages"]) == 7000
```

The output schema is defined by your Python code, not by the model's interpretation of a prompt. `claim_output` always has the same keys with the same types. Downstream consumers can rely on the structure unconditionally.

If the model runs out of context or hits an error, you have everything it produced up to that point. You can even detect empty output and retry with a nudge:

```python
try:
    agent(claim_text)
except Exception:
    # Deliberately swallowed: partial results already live in claim_output
    pass

if not claim_output["parties"] and not claim_output["events"]:
    agent("You haven't started processing. Begin by identifying the parties involved.")
```

The model doesn't have to context-switch between "thinking" and "formatting." It thinks by calling tools. The structured output is a byproduct of the agent doing its job, not an additional formatting burden layered on top.

This pattern (tools as Builder, accumulator as output, validation at the boundary) has been the most reliable way I've found to get structured data out of an agentic workflow. It works because it aligns with how tool-calling models already behave: they reason, they act, they observe results, and they act again. You're just making "act" mean "build one piece of the output."
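As a closing illustration of the stable-schema and partial-result points, here is one way the accumulator could be handed to downstream code. This is a sketch under assumptions: the `finalize` helper and its "missing sections" report are hypothetical, not something the SDK provides.

```python
import json

LIST_KEYS = ("parties", "events", "damages", "evidence")


def finalize(output: dict) -> tuple[str, list[str]]:
    """Serialize the accumulator and list the sections that are still empty."""
    # Always emit every expected key so consumers see the same shape,
    # even when the agent stopped early.
    normalized = {k: output.get(k) or [] for k in LIST_KEYS}
    normalized["assessment"] = output.get("assessment")
    missing = [k for k, v in normalized.items() if not v]
    return json.dumps(normalized, indent=2), missing


# A partial result, as left behind by an agent that stopped early
partial = {
    "parties": [{"name": "Ada Lopez", "role": "claimant", "policy_id": ""}],
    "events": [],
    "damages": [],
}
payload, missing = finalize(partial)
print(payload)
print("Still missing:", ", ".join(missing))
```

The list of missing sections can double as the nudge for a follow-up run, so the agent is told exactly which part of the output it still has to build.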