I kept seeing the same gap: people ship LLM-based document extractors (invoices, receipts, forms) with no systematic way to know how accurate they actually are. So I built doceval — point it at your extractor function + a labeled dataset and get back field-level accuracy, a failure taxonomy (missed_field / hallucination / wrong_format / wrong_value), and optional per-document cost tracking.
Works with any extractor (Claude, GPT, regex, rules) and any document schema. One JSON label file per document, one Python function, one CLI command.
Includes a working 20-document invoice example with a Claude Haiku extractor so you can run it immediately.