[ARTICLE · art-17581] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

LLMs suck at generating large, structured data. Tips on how to get your AI agent to do it reliably

A developer building production-grade AI applications has found that large language models fail at generating structured data reliably, encountering issues like schema drift, hallucinated fields, and all-or-nothing failures with JSON outputs. The engineer implemented an object-oriented Builder pattern approach where the model incrementally constructs structured output through tool calls rather than producing a final JSON blob, with the structured data accumulating outside the token window. This method, which the developer applied to an AI app processing insurance claims and legal documents, solves problems with context limits and allows agents to compress conversations mid-flight without losing collected data.

9 min read · published May 29, 2026

LLMs are great at generating text. They're terrible at generating structured data reliably. If you've ever tried to get an agent to produce a JSON object with a specific schema, you know the pain: missing fields, hallucinated keys, inconsistent types, and outputs that break your downstream pipeline.

As I moved past toy examples and labs to real, production-grade AI apps, I ran into this problem and found an approach that works remarkably well for an app I'm building: use tools the way object-oriented programming uses the Builder pattern. Instead of asking the model to produce a final JSON blob, you give it tools that incrementally build the output - like calling methods on an object. The model never sees or produces the final structure directly. It just calls functions, and the structured output emerges as a side effect.

This matters especially when your agent processes large documents (like insurance forms, legal filings, medical records) that eat up most of the context window. When the input is big and the task is multi-step, you can't afford to also reserve space for a massive structured output at the end. The accumulator pattern lets you compress the conversation mid-flight without losing any of the structured data you've already collected, because that data lives outside the token window entirely.

The naive approach - asking a model to output a complete JSON structure - fails in predictable ways:

Schema drift. The model forgets required fields, invents new ones, or changes types between runs. A date field might be a string one time and an object the next.

All-or-nothing failure. If the model makes one mistake in a 200-line JSON output, the entire thing is unparseable. You either retry the whole generation or write brittle fixup code.

No incremental progress. If the model hits a context limit or stops mid-generation, you lose everything. There's no partial result to recover from.

Hallucination in structure. Models are more likely to hallucinate when producing structured output in one shot. They fill in fields they're uncertain about rather than leaving them empty, because the structure demands completeness.

Coupling research and output. When an agent needs to gather information and produce structured output, asking it to do both in one pass means it can't iterate. It commits to a structure before it has all the facts.
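To make the all-or-nothing failure from the list above concrete, here's a minimal sketch using only the standard library. The claim payload is invented for illustration; the point is that a single truncation is enough for json.loads to reject the whole document, including the parts that were emitted correctly.

```python
import json

# Hypothetical model output (payload invented for illustration).
complete = json.dumps({
    "parties": [{"name": "Jane Doe", "role": "claimant"}],
    "damages": [{"item": "Roof repair", "amount": 5000.0}],
})
# Simulate the model stopping ten characters before the end.
truncated = complete[:-10]


def try_parse(raw: str):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


parsed_ok = try_parse(complete)
# The party was fully emitted, yet it is lost together with the rest.
parsed_truncated = try_parse(truncated)
```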

response_format and function-calling schemas aren't enough

Structured output modes (like OpenAI's response_format: json_schema or Bedrock's tool result schemas) help with syntax - you'll get valid JSON. But they don't solve the semantic problem. The model still has to produce the entire structure in one shot, and it still hallucinates content to fill required fields.
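The gap between syntax and semantics can be sketched in a few lines. This is not how any provider validates output; is_structurally_valid is a hypothetical stand-in for "what a schema can check", and the data is invented. It shows a made-up policy number passing the structural check without trouble.

```python
# Hypothetical required fields and their types.
REQUIRED = {"claimant": str, "incident_date": str, "policy_id": str}


def is_structurally_valid(obj: dict) -> bool:
    """Check only what a schema can check: required keys and their types."""
    return all(isinstance(obj.get(key), expected) for key, expected in REQUIRED.items())


# Invented example: the source document never mentioned a policy number,
# but policy_id is required, so the model filled in a placeholder.
model_output = {
    "claimant": "Jane Doe",
    "incident_date": "2024-03-02",
    "policy_id": "POL-000000",
}
facts_in_document = {"claimant": "Jane Doe", "incident_date": "2024-03-02"}

structurally_valid = is_structurally_valid(model_output)
# Fields present in the output but not supported by the document.
unsupported_fields = sorted(k for k in model_output if k not in facts_in_document)
```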

Any team building autonomous or semi-autonomous agents faces this, not just me. Kiro CLI, AWS's agentic dev companion, for instance, struggled hard with large data structures when first launched.

Since then, its maintainers have equipped its harness with JSON capabilities (jq manipulations, for instance) and multiple strategies (extensive use of grep, glob, tail, and so on) to avoid filling the context window.

Still, happy to know I'm not alone in facing this :)

Here are a few tricks I have used successfully to control both agent output and the context window. I don't claim to have all the recipes, so don't hesitate to share your own in the comments or tag me in your own posts :)

The core idea: define tools that act like OOP builder methods. Each tool call adds one well-typed element to an accumulator. The model's job shifts from "produce this structure" to "call these functions in the right order."

Here's the pattern - imagine an agent that processes insurance claims by reading documents and building a structured claim assessment:

from strands import tool

claim_output = {
    "parties": [],
    "events": [],
    "damages": [],
    "evidence": [],
    "assessment": None,
}

def reset_output():
    claim_output["assessment"] = None
    for k in ["parties", "events", "damages", "evidence"]:
        claim_output[k] = []

@tool
def add_party(name: str, role: str, policy_id: str = "") -> str:
    """Register a party involved in the claim.

    Args:
        name: Full name of the person or organization.
        role: One of: claimant, insured, witness, adjuster, third_party
        policy_id: Policy number if applicable.

    Returns:
        Confirmation with party details.
    """
    if role not in ("claimant", "insured", "witness", "adjuster", "third_party"):
        return f"Error: invalid role '{role}'. Must be one of: claimant, insured, witness, adjuster, third_party"

    claim_output["parties"].append({
        "name": name,
        "role": role,
        "policy_id": policy_id,
    })
    return f"Added {role}: {name}"

@tool
def add_event(description: str, date: str, location: str = "") -> str:
    """Record a chronological event relevant to the claim.

    Args:
        description: What happened (1-3 sentences).
        date: ISO date string (YYYY-MM-DD).
        location: Where it happened (optional).
    """
    claim_output["events"].append({
        "description": description,
        "date": date,
        "location": location,
    })
    return f"Recorded event on {date} ({len(claim_output['events'])} events total)"

@tool
def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    """Register a damage item with estimated cost.

    Args:
        item: Description of the damaged item or cost.
        amount: Estimated cost in dollars.
        category: One of: property, medical, liability, lost_income
        evidence_ref: Reference to supporting evidence (optional).
    """
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."

    claim_output["damages"].append({
        "item": item,
        "amount": amount,
        "category": category,
        "evidence_ref": evidence_ref,
    })
    total = sum(d["amount"] for d in claim_output["damages"])
    return f"Added damage: {item} (${amount:.2f}). Running total: ${total:.2f}"

The agent is given these tools and a system prompt that tells it to process a claim. As it reads documents and discovers information, it calls add_party, add_event, and add_damage. The structured output builds up incrementally.
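The module-level claim_output dict above is shared by every call in the process. If you handle several claims at once, one option is to hold the same state in a per-request object. The sketch below is a variation, not the article's setup: ClaimBuilder and its build method are hypothetical names, and how you register bound methods as tools depends on your framework and is not shown.

```python
from dataclasses import dataclass, field

VALID_ROLES = ("claimant", "insured", "witness", "adjuster", "third_party")


@dataclass
class ClaimBuilder:
    """Hypothetical per-request accumulator mirroring part of claim_output."""
    parties: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def add_party(self, name: str, role: str, policy_id: str = "") -> str:
        if role not in VALID_ROLES:
            return f"Error: invalid role '{role}'. Must be one of: {', '.join(VALID_ROLES)}"
        self.parties.append({"name": name, "role": role, "policy_id": policy_id})
        return f"Added {role}: {name}"

    def add_event(self, description: str, date: str, location: str = "") -> str:
        self.events.append({"description": description, "date": date, "location": location})
        return f"Recorded event on {date} ({len(self.events)} events total)"

    def build(self) -> dict:
        """Return a snapshot of everything accumulated so far."""
        return {"parties": list(self.parties), "events": list(self.events)}


# Two claims processed side by side do not share state.
claim_a, claim_b = ClaimBuilder(), ClaimBuilder()
claim_a.add_party("Jane Doe", "claimant")
claim_b.add_party("Acme Insurance", "adjuster")
```

The trade-off is a little more wiring in exchange for not needing reset_output between runs.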

Each tool call is a validation checkpoint. You can reject bad input immediately:

@tool
def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    """Register a damage item, rejecting invalid input before it is stored."""
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."
    if amount <= 0:
        return f"Error: amount must be positive, got {amount}."
    # Assumes every registered evidence entry carries an "id" key.
    if evidence_ref and evidence_ref not in [e["id"] for e in claim_output["evidence"]]:
        return f"Error: evidence '{evidence_ref}' not registered. Call add_evidence first."

    # ... then append to claim_output["damages"] and return a confirmation, as before.

The model gets instant feedback. If it tries to reference evidence it hasn't registered yet, the tool tells it. The model self-corrects on the next turn. Compare this to validating a 500-line JSON blob after the fact - by then, the model has moved on and can't fix its mistakes in context.
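The same checkpoint idea applies to add_event, which asks for an ISO date but doesn't check it in the version shown earlier. Here's a minimal sketch of how that check could look, using a plain function and a local events list as a stand-in for claim_output["events"]. The future-date rule is my own assumption about what makes sense for a claim.

```python
import datetime

events: list[dict] = []  # stand-in for claim_output["events"]


def add_event(description: str, date: str, location: str = "") -> str:
    """Record an event, rejecting strings that are not valid ISO dates."""
    try:
        parsed = datetime.date.fromisoformat(date)
    except ValueError:
        return f"Error: date '{date}' is not a valid ISO date (YYYY-MM-DD)."
    # Assumed business rule: a claim event cannot lie in the future.
    if parsed > datetime.date.today():
        return f"Error: date {date} is in the future."
    events.append({
        "description": description,
        "date": parsed.isoformat(),  # store a normalized YYYY-MM-DD string
        "location": location,
    })
    return f"Recorded event on {parsed.isoformat()} ({len(events)} events total)"
```

A call with "03/02/2024" comes back as an error string the model can react to, while "2024-03-02" is stored.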

A key benefit: the same agent can have reading tools and writing tools. Reading tools fetch and explore data. Writing tools construct the output. The model interleaves them naturally:

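from strands import Agent

# model, prompt, claim_text, and the read/search/get tools are assumed to be defined elsewhere.
agent = Agent(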
    model=model,
    system_prompt=prompt,
    tools=[
        read_document,
        search_policy,
        get_weather_report,
        add_party,
        add_event,
        add_damage,
        add_evidence,
        set_assessment,
        mark_step_done,
    ],
)

agent("Process this claim: " + claim_text)

print(claim_output)

The model reads a police report, extracts a party, reads a medical bill, registers a damage item, cross-references the policy, and so on. Research and output construction are interleaved rather than sequential.

Because output accumulates incrementally, you get crash recovery for free:

STEPS = [
    "1. Identify all parties",
    "2. Establish timeline of events",
    "3. Catalog damages with evidence",
    "4. Cross-reference policy coverage",
    "5. Produce assessment",
]
completed_steps: list[int] = []

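@tool
def mark_step_done(step_number: int) -> str:
    """Mark a processing step as completed."""
    # Reject step numbers that don't exist, and don't record a step twice.
    if not 1 <= step_number <= len(STEPS):
        return f"Error: step_number must be between 1 and {len(STEPS)}."
    if step_number not in completed_steps:
        completed_steps.append(step_number)
    remaining = [s for i, s in enumerate(STEPS, 1) if i not in completed_steps]
    return f"Step {step_number} done. Remaining: {', '.join(remaining) or 'none'}"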

If the agent hits a context window limit or errors out, you already have partial results - every party identified, every event recorded, every damage item cataloged up to that point. You can resume or use what you have.
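In-memory state survives an agent error but not a process crash. If you want to resume after a restart, one option is to write the accumulator to disk after each tool call or step. The sketch below is an assumption on my side, not part of the original setup: save_checkpoint and load_checkpoint are hypothetical helpers, and the file layout is made up.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, output: dict, completed: list[int]) -> None:
    """Write accumulator and progress atomically: temp file first, then rename."""
    payload = {"claim_output": output, "completed_steps": completed}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    # os.replace is atomic on the same filesystem, so readers never see a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> tuple[dict, list[int]]:
    """Return (claim_output, completed_steps), or empty defaults if no checkpoint exists."""
    if not os.path.exists(path):
        empty = {"parties": [], "events": [], "damages": [], "evidence": [], "assessment": None}
        return empty, []
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    return payload["claim_output"], payload["completed_steps"]
```

On restart you would load the checkpoint, restore claim_output and completed_steps from it, and tell the agent which steps remain.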

Here's where this pattern really pays off. When your agent ingests a 30-page document and then makes dozens of tool calls to fetch additional sources, the context window fills up fast. In a traditional approach, you'd lose your structured output along with the conversation when you hit the limit. But because the accumulator lives in Python memory - not in the message history - you can aggressively compress the conversation without losing a single data point.

A custom conversation manager (a possibility offered, for instance, by the Strands Agents SDK) replaces old messages with a compact state summary derived from the accumulator:

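from strands.agent.conversation_manager import ConversationManager

# Note: depending on the SDK version, the base class may also require a
# reduce_context() override; check the ConversationManager interface you have installed.
class ClaimConversationManager(ConversationManager):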
    def apply_management(self, agent, **kwargs):
        messages = agent.messages
        if len(messages) <= 2:
            return

        first_msg = messages[0]
        recent = messages[-2:]

        state = self._build_state_summary()
        state_msg = {
            "role": "user",
            "content": [{"text": f"[STATE]\n{state}\n\nContinue."}],
        }
        messages[:] = [first_msg, state_msg] + recent

    def _build_state_summary(self) -> str:
        """Summarize what's been done using the accumulator state."""
        lines = []
        if claim_output["parties"]:
            parties = [f"{p['name']} ({p['role']})" for p in claim_output["parties"]]
            lines.append(f"Parties: {', '.join(parties)}")
        if claim_output["damages"]:
            total = sum(d["amount"] for d in claim_output["damages"])
            lines.append(f"Damages: {len(claim_output['damages'])} items, ${total:.2f} total")
        if claim_output["events"]:
            lines.append(f"Events: {len(claim_output['events'])} recorded")
        return "\n".join(lines)

Because the structured output lives in Python (not in the conversation), context compression doesn't lose any data. The model can always see what it's already produced by reading the state summary.

Each tool has typed parameters enforced by the framework. The model must provide a category that's one of property, medical, liability, lost_income - not because you're parsing JSON and checking after the fact, but because the tool demands it: the signature fixes the types and the function body checks the allowed values. Invalid calls get rejected with clear error messages.
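If you'd rather state the allowed values once, in the signature, typing.Literal is one way to do it. This is a sketch with a plain function and a local damages list standing in for claim_output["damages"]. Whether your tool framework turns a Literal into an enum in the schema it sends to the model is an assumption you should verify, so the runtime check stays in place.

```python
from typing import Literal, get_args

# Single source of truth for the allowed categories.
Category = Literal["property", "medical", "liability", "lost_income"]

damages: list[dict] = []  # stand-in for claim_output["damages"]


def add_damage(item: str, amount: float, category: Category) -> str:
    """Register a damage item; allowed categories come from the type alias."""
    allowed = get_args(Category)
    # Runtime guard: type hints alone do not stop a bad value at call time.
    if category not in allowed:
        return f"Error: invalid category '{category}'. Must be one of: {', '.join(allowed)}"
    damages.append({"item": item, "amount": amount, "category": category})
    return f"Added damage: {item} (${amount:.2f})"
```

The error message now lists the valid options, which gives the model what it needs to self-correct on the next turn.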

Tools compose naturally. You can add new output fields by adding new tools without changing existing ones. Want to track evidence attachments? Add an add_evidence tool. Want a final recommendation? Add a set_assessment tool. The model discovers new capabilities through its tool list.
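The two tools named above aren't spelled out in this post, so here's a rough sketch of what they could look like. Everything about their parameters is hypothetical (evidence_id, kind, summary, the three recommendation values); the only thing taken from the earlier code is that evidence entries carry an "id" key, which add_damage looks up. In a real agent they would carry the @tool decorator like the others.

```python
claim_output = {"parties": [], "events": [], "damages": [], "evidence": [], "assessment": None}


def add_evidence(evidence_id: str, kind: str, summary: str) -> str:
    """Register a piece of evidence so damage items can reference it by id."""
    if any(e["id"] == evidence_id for e in claim_output["evidence"]):
        return f"Error: evidence '{evidence_id}' is already registered."
    claim_output["evidence"].append({"id": evidence_id, "kind": kind, "summary": summary})
    return f"Registered evidence {evidence_id} ({len(claim_output['evidence'])} total)"


def set_assessment(recommendation: str, rationale: str) -> str:
    """Set the final recommendation; refuses to run before any damages exist."""
    # Hypothetical set of outcomes, chosen for illustration.
    if recommendation not in ("approve", "deny", "escalate"):
        return f"Error: invalid recommendation '{recommendation}'."
    if not claim_output["damages"]:
        return "Error: no damages recorded yet. Call add_damage first."
    claim_output["assessment"] = {"recommendation": recommendation, "rationale": rationale}
    return f"Assessment set: {recommendation}"
```

Note that neither function touches add_party, add_event, or add_damage: the new fields arrive with the new tools.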

Each tool is a pure function (or close to it). You can unit test them independently:

def test_add_damage_rejects_invalid_category():
    reset_output()
    result = add_damage(item="Roof repair", amount=5000, category="cosmetic")
    assert "Error" in result
    assert len(claim_output["damages"]) == 0

def test_add_damage_tracks_total():
    reset_output()
    add_damage(item="Roof repair", amount=5000, category="property")
    add_damage(item="Water damage", amount=2000, category="property")
    assert len(claim_output["damages"]) == 2
    assert sum(d["amount"] for d in claim_output["damages"]) == 7000

The output schema is defined by your Python code, not by the model's interpretation of a prompt. claim_output always has the same keys with the same types. Downstream consumers can rely on the structure unconditionally.
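The keys are fixed by the code, but whether each section was actually filled in still depends on the run. A Builder usually ends with a build step, and the sketch below shows one possible version. build_claim and REQUIRED_SECTIONS are hypothetical, and which sections count as mandatory is an assumption you'd adapt to your own app.

```python
import copy

# Assumed minimum for a usable claim.
REQUIRED_SECTIONS = ("parties", "events", "damages")


def build_claim(output: dict) -> dict:
    """Validate the accumulator and return an independent copy for downstream use."""
    missing = [name for name in REQUIRED_SECTIONS if not output.get(name)]
    if missing:
        raise ValueError(f"Claim is incomplete, empty sections: {', '.join(missing)}")
    if output.get("assessment") is None:
        raise ValueError("Claim is incomplete, no assessment was set.")
    # Deep copy so later tool calls cannot mutate what was handed downstream.
    return copy.deepcopy(output)
```

Calling it on a half-finished accumulator raises instead of quietly passing an empty list to the next stage.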

If the model runs out of context or hits an error, you have everything it produced up to that point. You can even detect empty output and retry with a nudge:

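try:
    agent(claim_text)
except Exception as exc:
    # Keep whatever the tools accumulated; report the failure instead of hiding it.
    print(f"Agent run failed: {exc!r}")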

if not claim_output["parties"] and not claim_output["events"]:
    agent("You haven't started processing. Begin by identifying the parties involved.")
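A single nudge can be generalized into a bounded retry loop. Here's a sketch under a few assumptions: run_with_nudges is a hypothetical helper, run_agent is any callable that takes a prompt (for instance the agent object itself), and "has parties or events" is the same progress test as above.

```python
from typing import Callable

NUDGE = "You haven't started processing. Begin by identifying the parties involved."


def run_with_nudges(run_agent: Callable[[str], object], first_prompt: str,
                    output: dict, max_attempts: int = 3) -> bool:
    """Call run_agent until the accumulator shows progress; return True on success."""
    prompt = first_prompt
    for attempt in range(1, max_attempts + 1):
        try:
            run_agent(prompt)
        except Exception as exc:
            # Partial results survive in `output`, so a failed run is not fatal.
            print(f"Attempt {attempt} failed: {exc!r}")
        if output["parties"] or output["events"]:
            return True
        prompt = NUDGE  # nothing accumulated yet: retry with the nudge
    return False
```

The cap on attempts matters: without it, an agent that never calls a tool would loop forever and keep burning tokens.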

The model doesn't have to context-switch between "thinking" and "formatting." It thinks by calling tools. The structured output is a byproduct of the agent doing its job, not an additional formatting burden layered on top.

This pattern - tools as Builder, accumulator as output, validation at the boundary - has been the most reliable way I've found to get structured data out of an agentic workflow. It works because it aligns with how tool-calling models already behave: they reason, they act, they observe results, and they act again. You're just making "act" mean "build one piece of the output."
