doceval — eval harness for LLM document extraction pipelines

wpnews.pro


[ARTICLE · art-29445] src=dev.to ↗ pub=2026-06-16T12:29Z topic=large-language-models verified=true sentiment=· neutral


A developer built doceval, an evaluation harness for LLM document extraction pipelines that provides field-level accuracy, failure taxonomy, and optional cost tracking. The tool works with any extractor and document schema, requiring only a JSON label file, a Python function, and a CLI command. It includes a working 20-document invoice example with a Claude Haiku extractor.

1 min read · 26 views · published Jun 16, 2026

I kept seeing the same gap: people ship LLM-based document extractors (invoices, receipts, forms) with no systematic way to know how accurate they actually are. So I built doceval — point it at your extractor function + a labeled dataset and get back field-level accuracy, a failure taxonomy (missed_field / hallucination / wrong_format / wrong_value), and optional per-document cost tracking.
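To make the four failure categories concrete, here is a minimal sketch of how a single field could be sorted into them. The category definitions are my assumptions, not taken from doceval: a missing prediction counts as missed_field, a value where the label has none counts as hallucination, a value that matches only after stripping punctuation and case counts as wrong_format, and anything else counts as wrong_value. The helper names `normalize`, `classify_field` and `classify_document` are hypothetical.

```python
import re


def normalize(value):
    """Lowercase the value and drop everything except letters and digits."""
    return re.sub(r"[^0-9a-z]", "", str(value).lower())


def classify_field(expected, predicted):
    """Return 'correct' or one of the four failure labels for one field.

    Assumed definitions (not doceval's own): None means "no value".
    """
    if expected is None and predicted is None:
        return "correct"
    if predicted is None:
        return "missed_field"      # label has a value, extractor returned none
    if expected is None:
        return "hallucination"     # extractor produced a value with no label
    if expected == predicted:
        return "correct"
    if normalize(expected) == normalize(predicted):
        return "wrong_format"      # same content, different surface form
    return "wrong_value"


def classify_document(label, prediction):
    """Classify every field that appears in either the label or the prediction."""
    fields = sorted(set(label) | set(prediction))
    return {f: classify_field(label.get(f), prediction.get(f)) for f in fields}


label = {
    "invoice_number": "INV-001",
    "total": "1,250.00",
    "vendor": "Acme Corp",
    "due_date": "2026-07-01",
    "po_number": None,
}
prediction = {
    "invoice_number": "INV-001",
    "total": "1250.00",
    "vendor": "Acme Ltd",
    "due_date": None,
    "po_number": "PO-77",
}

result = classify_document(label, prediction)
for field, outcome in result.items():
    print(f"{field}: {outcome}")
```

The normalization step is deliberately crude. It would not recognize two date formats as the same date, so a real harness would likely need per-field comparison rules.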

Works with any extractor (Claude, GPT, regex, rules) and any document schema. One JSON label file per document, one Python function, one CLI command.
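As a rough illustration of the "one function plus labels" idea, the sketch below pairs a regex extractor with JSON labels and computes per-field accuracy by exact match. This is not doceval's actual interface: the function names, the label layout and the sample invoices are all hypothetical, and the labels are inlined as JSON strings rather than read from files to keep the example self-contained.

```python
import json
import re
from collections import defaultdict


def regex_extractor(text):
    """Hypothetical extractor: document text in, dict of field values out."""
    number = re.search(r"Invoice\s+#?(\S+)", text)
    total = re.search(r"Total:\s*\$?([\d.,]+)", text)
    return {
        "invoice_number": number.group(1) if number else None,
        "total": total.group(1) if total else None,
    }


def field_accuracy(extractor, dataset):
    """Exact-match accuracy per field over (document_text, label_dict) pairs."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for text, label in dataset:
        prediction = extractor(text)
        for field, expected in label.items():
            seen[field] += 1
            if prediction.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / seen[field] for field in seen}


# Made-up sample documents with their labels (assumed layout: flat JSON object).
dataset = [
    (
        "Invoice #INV-001\nTotal: $120.50",
        json.loads('{"invoice_number": "INV-001", "total": "120.50"}'),
    ),
    (
        "Invoice #INV-002\nAmount due: $80.00",
        json.loads('{"invoice_number": "INV-002", "total": "80.00"}'),
    ),
]

accuracy = field_accuracy(regex_extractor, dataset)
print(accuracy)
```

Because the harness only calls the extractor and compares dictionaries, swapping the regex function for one that calls an LLM would not change the evaluation loop.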

Includes a working 20-document invoice example with a Claude Haiku extractor so you can run it immediately.


Read original on dev.to → dev.to/dave8172/show-hn-doceval-eval-harness-for…

