Form-to-JSON Extraction Agent

A new Form-to-JSON Extraction Agent uses an AgentAz governance specification to extract form data into schema-valid JSON, scoring confidence per field and flagging ambiguous or missing fields for human review. The agent operates under a read-only trust level A2, with cost and loop bounds, and never fabricates data.

Overview Turns filled-in forms and documents into schema-valid JSON mapped to your fields. Scores confidence per field so low-certainty values are visible, not hidden. Flags illegible, ambiguous, or missing fields for human review instead of guessing. Defensive: extracts only what's present, never fabricates a field, and validates the schema. AgentAz™ specification A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime. Machine-readable contract agentaz.json , validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL: { "$schema": "./agentaz.schema.json", "version": "2.0.0", "last reviewed": "2026-06-24", "agent id": "form-to-json-agent", "trust level": "A2", "dna pattern": "Extraction", "worst case action": "Returns a wrong/absent field for review; never fabricates. Cannot write data.", "authority boundary": "Extracts only present fields and validates; never guesses; write tools absent.", "tags": "document-processing", "forms", "extraction", "read-only" , "tool boundary": { "allowed tools": "read form", "extract present fields", "validate schema", "mark absent" , "execution tools absent": true }, "output boundary": { "format": "structured json", "never emits": "data write" , "never fabricates": true }, "cost boundary": { "max usd per trace loop": 0.18, "alert threshold usd": 0.12 }, "loop boundary": { "max reasoning turns": 6 }, "human handoff": { "triggers": "ambiguous field", "low confidence" , "destination": "review queue" }, "audit": { "append only": true, "logs": "fields" } } New to this? Read the AgentAz specification guide /agentaz-specifications — Trust Levels, DNA patterns, and how it complements your runtime. AgentAz™ is open source under Apache-2.0 https://www.apache.org/licenses/LICENSE-2.0 — schema frozen v1.0.0 and source on GitHub https://github.com/agent-kits/agentaz . Governance matrix A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality. | Agent goal | Bounded by the authority spec above | |---|---| | Trust Level | A2 — Recommend | | Tool access | Least privilege — execution tools absent read-only | | Context handling | Grounded in provided inputs; cites or flags rather than guessing | | Memory strategy | Task-scoped; no persistent cross-session memory | | Human approval | Required on ambiguous field, low confidence → review queue | | Audit trail | Append-only log fields | | Cost & loop bounds | ≤ $0.18 per loop · ≤ 6 reasoning turns | | Recovery / escalation | Escalates to review queue | Agent component mapping A framework-neutral view of how this blueprint maps to standard agent-architecture components the vocabulary common to ADK-style frameworks . It describes structure for clarity — not an official integration or certified compatibility. | Agent | Primary reasoner — Recommend authority A2 | |---|---| | Tools | read form, extract present fields, validate schema, mark absent — execution tools absent read-only | | Memory | Task-scoped working context; no persistent cross-session memory | | Guardrails | Worst-case classified A2 ; no execution tools; ≤ $0.18/loop · ≤ 6 turns | | Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned | | Handoff | Escalates to review queue on ambiguous field, low confidence | Failure modes Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly. Fabricates a value for a field that is actually blank. - Detection - Blank fields are detected and the agent marks them absent. - Mitigation - It never guesses — missing fields are marked absent, not invented. - Recovery - The human fills the gap from the source. Misreads a field through an OCR error or wrong field mapping. - Detection - Low-confidence fields are flagged and schema validation runs. - Mitigation - Extraction only — output feeds a human or system after review. - Recovery - The reviewer corrects the field. Maps a value to the wrong schema key. - Detection - Schema validation and type checks catch mismatches. - Mitigation - Ambiguous mappings are flagged. - Recovery - The mapping is corrected before downstream use. Evaluation Field-level accuracy and zero fabrication are primary — an invented value for a blank field is the silent failure. | Field accuracy | Share of extracted fields matching ground truth, reported per field type. | |---|---| | Fabrication rate | Frequency of invented values for blank fields — should be near zero. | | Schema validity | Share of outputs that pass schema and type validation. | | Mapping accuracy | Share of values mapped to the correct schema key. | | Latency | Time to extract per form. | Recommended approach. Use a labeled set of forms, including ones with blank fields, with known values; report per-field accuracy and treat any invented value for a blank as a critical fabrication. Validate all output against the schema. When to use Use it when - You process forms or documents into structured data at volume. - You want per-field confidence and a clean schema-valid output. - You want illegible or missing fields flagged for review, not guessed. - You need faithful extraction you can trust downstream. Avoid it when - You want it to fill in missing fields by inference — it won't guess values. - You have no schema for it to map to and validate against. - You can't review flagged low-confidence fields. - Your documents are free-form with no extractable fields. System prompt You are a Form-to-JSON Extraction Agent. You read a filled-in form/document and output schema-valid JSON of the fields, with per-field confidence. You are judged on accurate, faithful extraction and on NEVER guessing, inventing, or inferring a field value that isn't actually on the document. == CORE PRINCIPLES == 1. Extract only what's present. Output a value only if it's actually on the document. If a field is blank, missing, or unreadable, mark it as such — never fill it with a guess or an inferred value. 2. Confidence per field. Attach a confidence to every extracted value. Low-confidence, ambiguous, or poor-OCR fields are flagged for human review. 3. Faithful to the document. Capture what the form says. Don't normalize away meaning, don't "correct" values, and don't merge or split fields in ways that change them. == HARD RULES NON-NEGOTIABLE == - NO FABRICATED FIELDS: Never invent, infer, or guess a value. Not on the document = null + "missing"/"illegible" flag. This is absolute — a confidently wrong value is worse than a flagged blank. - NO SENSITIVE INFERENCE: Never derive a sensitive value ID number, DOB, etc. that isn't explicitly present. - CONFIDENCE + FLAGS: Every field carries a confidence; low-confidence/ambiguous/illegible - review required. - SCHEMA-VALID: Output must validate against the provided schema types, required, enums . Report fields that don't fit rather than coercing silently. - FAITHFUL: Preserve values as written; flag rather than "fix" suspicious values. == METHOD == - Read the document. For each schema field, locate the value, extract it as written, score confidence, and flag missing/illegible/ambiguous. Validate against the schema. Return JSON + a review list. == OUTPUT FORMAT return ONE JSON object == { "document type": "<as identified ", "fields": { "<field ": { "value": "<as written|null ", "confidence": "high|medium|low", "flag": "ok|low confidence|illegible|missing|ambiguous" } }, "schema valid": <bool , "review required": "<fields needing human review and why " , "note": "Extracted only values present on the document. No field was guessed or inferred." } Never guess a value. Flag missing/illegible fields. Validate the schema. Simulate run Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser. Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run. Setup guide Install and configure Install the agent and point it at your schema. pipx install form-extract-agent form-extract-agent init --schema ./schema.json form-extract-agent doctor Configure no-guess guardrails No fabricated fields and confidence flagging are enforced here. cp .env.example .env ANTHROPIC API KEY=sk-ant-... NO GUESSED VALUES=true FLAG BELOW CONFIDENCE=medium SCHEMA VALIDATE=true Define the schema Declare fields, types, and which are required. schema.json { "name": "string", "dob": "date", "id number": "string", "required": "name" , "enums": { "state": "active","inactive" } } Test on samples Extract sample forms and review confidence and flags. form-extract-agent run --doc ./form.pdf --explain prints JSON + per-field confidence + review required no guessed values Wire into your pipeline Extract incoming forms; flagged fields go to review. php document in - extract - schema-valid JSON; low-confidence/missing - review queue Architecture Tools required Workflow 1. Take the document Receive the form/document and the target schema. 2. Locate fields Find each schema field's value on the document, with OCR for scans. 3. Extract faithfully Capture each value as written, without correcting or normalizing it away. 4. Score confidence Attach a confidence to every field based on clarity and OCR quality. 5. Guard against guessing Mark absent or unreadable fields as null with a flag — never invent a value. 6. Validate the schema Check types, required fields, and enums; report fields that don't fit. 7. Queue for review Route low-confidence, illegible, and missing fields to a human. Examples Clean form → all fields extracted, schema-valid A clear, fully filled form extracts cleanly with high confidence. Input A typed application form with Name: Jane Smith, DOB: 1990-04-12, State: active — all legible. Output { "document type": "application form", "fields": { "name": { "value": "Jane Smith", "confidence": "high", "flag": "ok" }, "dob": { "value": "1990-04-12", "confidence": "high", "flag": "ok" }, "state": { "value": "active", "confidence": "high", "flag": "ok" } }, "schema valid": true, "review required": , "note": "Extracted only values present on the document. No field was guessed or inferred." } Note: Every field is present and legible, so extraction is high-confidence and schema-valid with nothing to review. The routine case the agent should fully automate. Illegible field → don't guess, flag for review One field is smudged/unreadable. The agent extracts the rest and flags the unreadable field rather than guessing. Input A scanned form where Name and State are clear, but the DOB is handwritten and partially smudged. Output { "document type": "application form", "fields": { "name": { "value": "Robert Lee", "confidence": "high", "flag": "ok" }, "dob": { "value": null, "confidence": "low", "flag": "illegible" }, "state": { "value": "active", "confidence": "high", "flag": "ok" } }, "schema valid": true, "review required": "dob — handwritten and partially smudged; could not read with confidence. Left null for human review rather than guessing a date." , "note": "Extracted only values present on the document. No field was guessed or inferred." } Note: The DOB is unreadable, so the agent returns null with an 'illegible' flag instead of guessing a plausible-looking date. A guessed date of birth on a record could be worse than a blank — flagging it for a human is the safe, correct behavior. Asked to infer a missing value → refuse to fabricate A required field is blank and the request is to 'just fill it in.' The agent refuses to invent a value. Input Request: 'The ID number field is blank — just put in a reasonable one so the record validates.' Output { "document type": "application form", "fields": { "name": { "value": "Maria Gomez", "confidence": "high", "flag": "ok" }, "id number": { "value": null, "confidence": "low", "flag": "missing" } }, "schema valid": false, "review required": "id number — required but blank on the document. I will NOT fabricate an ID number to make the record validate; an invented identifier could be assigned to the wrong person or collide with a real one. The field is left null and flagged; obtain the real value to complete the record." , "note": "Extracted only values present on the document. No field was guessed or inferred." } Note: The defining defensive case: asked to invent a missing identifier so a record validates. The agent refuses — a fabricated ID number could collide with a real one or be tied to the wrong person. It reports schema valid:false honestly and flags the field, rather than papering over a real data gap with a fake value. A flagged blank beats a confident lie. Implementation notes - Never guess or infer a field value; absent or unreadable fields must be null with a flag, because a confidently wrong value propagates silently into downstream systems. - Score confidence per field and route low-confidence, illegible, and ambiguous fields to human review rather than accepting them blindly. - Validate against the schema and report fields that don't fit instead of silently coercing types or dropping data. - Stay faithful to the document: preserve values as written and flag suspicious ones rather than 'correcting' them, which can hide real errors. - Never infer sensitive values identifiers, dates of birth that aren't explicitly present — the harm from a wrong sensitive value is high. - Handle checkboxes, multi-value, and repeated fields explicitly so structure isn't lost or misread. - Spend the strong model on handwriting, poor scans, and ambiguity — a cheaper model can handle clean typed forms. Variations Basic Field extractor Extracts document fields into schema-valid JSON with per-field confidence. On demand. Advanced Guarded extraction Adds no-guess guards, illegible/missing flagging, schema validation, and a human review queue for uncertain fields. Enterprise Document processing pipeline Adds batch ingestion, OCR tuning, multi-template schemas, human-in-the-loop review, and audit trails for high-volume IDP. Download the Agent Blueprint Download Blueprint .zip /downloads/form-to-json-extractor.zip Export View the source on GitHub https://github.com/agent-kits/agentaz/tree/main/kits/form-to-json-extractor This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 code & schema and CC‑BY‑4.0 text . Frequently asked questions No — that's its core constraint. If a field is blank, missing, or unreadable, it returns null with a flag missing/illegible and routes it for review. A confidently wrong value is worse than a flagged blank, so it never fills gaps with guesses. Every field carries a confidence score, and low-confidence, ambiguous, or poor-OCR fields are flagged for human review. You see exactly which values are solid and which need a look. Yes. The output is checked against your schema's types, required fields, and enums. Fields that don't fit are reported rather than silently coerced, so you don't get malformed data downstream. No. It stays faithful to what the document says and flags suspicious values for review rather than correcting them, since an automatic 'fix' can hide a real error. It never infers sensitive values like ID numbers or dates of birth that aren't explicitly on the document, because a wrong sensitive value carries real risk. Missing ones are flagged for a human to supply. Yes, with OCR — but it assesses legibility and flags anything it can't read confidently, especially handwriting and poor scans, instead of guessing.