documentai bbox benchmark

wpnews.pro

cd /news/large-language-models/documentai-bbox-benchmark · home › topics › large-language-models › article

[ARTICLE · art-18568] src=maltebuettner.eu ↗ pub=2026-05-14T00:00Z topic=large-language-models verified=true sentiment=· neutral

documentai bbox benchmark

Malte Buettner benchmarked bounding box accuracy for Document AI models using pages from the FlashAttention-3 paper, testing Qwen, Kimi, and Mistral via OpenRouter. The evaluation scored models on coverage, intersection-over-union, and centroid distance to measure how well generated bounding boxes matched reference boxes from pdfplumber. Results showed inconsistent scores across multiple runs for some models, highlighting variability in open-weight model performance for document extraction tasks.

read2 min views8 publishedMay 14, 2026

In my previous post, I talked a bit about the recent developments in the field of DocumentAI. Now comes the practical part. For the Attention v3 paper from the ExtractBench dataset, ExtractBench focused only on extraction, but I am also interested in the bounding box reference that the models return.

Because ExtractBench had only a very limited selection of models without any open-weight ones among them, I ran a few extractions via OpenRouter

especially to see how well Qwen, Kimi, and Mistral are doing. So I took pages 1 and 13 from the FlashAttention-3 example from there and added "reference" bounding boxes with pdfplumber (it is a native PDF) as a reference. They are not perfect, but for a rough indication they are more than enough.

for some models I did not manage to generate extraction and bbox in one run. For these I ran separate extraction + bbox prompts.

Note: For some models I could not really get consistent scores on OpenRouter even after several runs.

The bbox score is a bit over-engineered with coverage (for how many fields were bboxes generated?), intersection-over-union (to check how well the bbox "fits" the original one, also known as the Jaccard index), and centroid distance (to check if the bbox is roughly in the correct area):

coverage×(0.5×mean IoU+0.5×centroid score)

ONE_SHOT_SYSTEM_PROMPT = "Return only valid JSON matching the provided JSON Schema."

one_shot_user_prompt = f"""
Only use the provided page images. They are not necessarily consecutive pages.
The original PDF has 22 pages. If the schema asks for number_of_pages, use 22.
Page mapping:
- input image 1: original PDF page 1, page_index 0
- input image 2: original PDF page 13, page_index 12

Each scalar extraction field is an object with value and bbox. Use bbox null when
the value is not visible in the provided page images. Boxes are [x1, y1, x2, y2].

JSON Schema:
{annotated_extraction_schema_json}
"""

I modified the original JSON schema a bit and added an additional bbox

field to every value. See the example for the ids

field:

{
  "ids": {
    "value": {
      "type": ["string", "null"]
    },
    "bbox": {
      "type": ["object", "null"],
      "properties": {
        "page_index": {
          "type": "integer"
        },
        "box": {
          "type": "array",
          "items": {
            "type": "number"
          },
          "minItems": 4,
          "maxItems": 4
        }
      },
      "required": ["page_index", "box"]
    }
  }
}

source & further reading

maltebuettner.eu — original article DocumentAI Visual Benchmark - GPT 5.5, Gemini 3.5, Qwen...

~/api · this article 200

$curl api.wpnews.pro/v1/news/documentai-bbox-benchmar…

Read original on maltebuettner.eu → maltebuettner.eu/posts/documentai-bbox-benchmark…

mentioned entities

Malte Buettner

ExtractBench

ContextualAI

OpenRouter

Qwen

Kimi

Mistral

FlashAttention-3

metadata

slugdocumentai-bbox-benchmark

topic#large-language-models

secondary3 topics

sentimentneutral

canonicalmaltebuettner.eu

navigation

← prevWelcome to the Datasette blog

next →Memory Is a Feature. It Is Also …

── more in #large-language-models 4 stories · sorted by recency

maltebuettner.eu · 30 May · #large-language-models

DocumentAI Visual Benchmark - GPT 5.5, Gemini 3.5, Qwen...

machinebrief.com · 14 Jul · #large-language-models

LLMs: The SCOPE-RL Framework's Promising Path

machinebrief.com · 14 Jul · #large-language-models

Innovative Feedback System Boosts RAG Performance

machinebrief.com · 14 Jul · #large-language-models

Cracking Multi-Hop QA: The STEC Framework's Breakthrough

── more on @malte buettner 3 stories trending now

wpnews · 23 May · #artificial-intelligence

AccessLens — a blind person's lanyard, powered by Gemma 4 on-device

wpnews · 27 May · #artificial-intelligence

How I Run Two Claude Accounts as One

wpnews · 21 May · #developer-tools

Antigravity CLI: A Hands-On Guide to Google's Terminal Coding Agent

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required