doceval — eval harness for LLM document extraction pipelines

A developer built doceval, an evaluation harness for LLM document extraction pipelines that provides field-level accuracy, failure taxonomy, and optional cost tracking. The tool works with any extractor and document schema, requiring only a JSON label file, a Python function, and a CLI command. It includes a working 20-document invoice example with a Claude Haiku extractor.

I kept seeing the same gap: people ship LLM-based document extractors (invoices, receipts, forms) with no systematic way to know how accurate they actually are. So I built doceval: point it at your extractor function + a labeled dataset and get back field-level accuracy, a failure taxonomy (missed field / hallucination / wrong format / wrong value), and optional per-document cost tracking. Works with any extractor (Claude, GPT, regex, rules) and any document schema. One JSON label file per document, one Python function, one CLI command. Includes a working 20-document invoice example with a Claude Haiku extractor so you can run it immediately.
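
To make the idea concrete, here is a rough sketch of what field-level scoring with that four-way failure taxonomy can look like. This is not doceval's actual code or API: the names `classify_field` and `evaluate`, the rules that decide each category, and the naive normalization are all assumptions made for illustration.

```python
import re
from collections import Counter


def _normalize(value):
    """Lowercase and drop everything except letters and digits.

    Deliberately naive (an assumption of this sketch): it treats
    "INV-001" and "inv001" as the same value in a different format,
    but it can also conflate genuinely different numbers.
    """
    return re.sub(r"[^a-z0-9]", "", str(value).lower())


def classify_field(expected, predicted):
    """Return 'correct' or one of four hypothetical failure categories."""
    if expected is None and predicted is None:
        return "correct"
    if predicted is None:
        return "missed_field"      # label has a value, extractor returned nothing
    if expected is None:
        return "hallucination"     # extractor produced a value the label lacks
    if expected == predicted:
        return "correct"
    if _normalize(expected) == _normalize(predicted):
        return "wrong_format"      # same content, different surface form
    return "wrong_value"


def evaluate(extractor, dataset):
    """Score an extractor over (document_text, label_dict) pairs.

    Returns (accuracy_by_field, outcome_counts_by_field).
    """
    outcomes = {}
    for text, label in dataset:
        predicted = extractor(text)
        # Union of keys so extra predicted fields count as hallucinations.
        for field in set(label) | set(predicted):
            outcome = classify_field(label.get(field), predicted.get(field))
            outcomes.setdefault(field, Counter())[outcome] += 1
    accuracy = {
        field: counts["correct"] / sum(counts.values())
        for field, counts in outcomes.items()
    }
    return accuracy, outcomes
```

Any callable that takes document text and returns a dict of field values fits the `extractor` slot here, whether it wraps an LLM call, a regex, or hand-written rules.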