cd /news/large-language-models/documentai-bbox-benchmark · home topics large-language-models article
[ARTICLE · art-18568] src=maltebuettner.eu pub= topic=large-language-models verified=true sentiment=· neutral

documentai bbox benchmark

Malte Buettner benchmarked bounding box accuracy for Document AI models using pages from the FlashAttention-3 paper, testing Qwen, Kimi, and Mistral via OpenRouter. The evaluation scored models on coverage, intersection-over-union, and centroid distance to measure how well generated bounding boxes matched reference boxes from pdfplumber. Results showed inconsistent scores across multiple runs for some models, highlighting variability in open-weight model performance for document extraction tasks.

read2 min publishedMay 14, 2026

In my previous post, I talked a bit about the recent developments in the field of DocumentAI. Now comes the practical part. For the Attention v3 paper from the ExtractBench dataset, ExtractBench focused only on extraction, but I am also interested in the bounding box reference that the models return.

Because ExtractBench had only a very limited selection of models without any open-weight ones among them, I ran a few extractions via OpenRouter

especially to see how well Qwen, Kimi, and Mistral are doing. So I took pages 1 and 13 from the FlashAttention-3 example from there and added "reference" bounding boxes with pdfplumber (it is a native PDF) as a reference. They are not perfect, but for a rough indication they are more than enough.

  • for some models I did not manage to generate extraction and bbox in one run. For these I ran separate extraction + bbox prompts.

Note: For some models I could not really get consistent scores on OpenRouter even after several runs.

The bbox score is a bit over-engineered with coverage (for how many fields were bboxes generated?), intersection-over-union (to check how well the bbox "fits" the original one, also known as the Jaccard index), and centroid distance (to check if the bbox is roughly in the correct area):

coverage×(0.5×mean IoU+0.5×centroid score)

ONE_SHOT_SYSTEM_PROMPT = "Return only valid JSON matching the provided JSON Schema."

one_shot_user_prompt = f"""
Only use the provided page images. They are not necessarily consecutive pages.
The original PDF has 22 pages. If the schema asks for number_of_pages, use 22.
Page mapping:
- input image 1: original PDF page 1, page_index 0
- input image 2: original PDF page 13, page_index 12

Each scalar extraction field is an object with value and bbox. Use bbox null when
the value is not visible in the provided page images. Boxes are [x1, y1, x2, y2].

JSON Schema:
{annotated_extraction_schema_json}
"""

I modified the original JSON schema a bit and added an additional bbox

field to every value. See the example for the ids

field:

{
  "ids": {
    "value": {
      "type": ["string", "null"]
    },
    "bbox": {
      "type": ["object", "null"],
      "properties": {
        "page_index": {
          "type": "integer"
        },
        "box": {
          "type": "array",
          "items": {
            "type": "number"
          },
          "minItems": 4,
          "maxItems": 4
        }
      },
      "required": ["page_index", "box"]
    }
  }
}
── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/documentai-bbox-benc…] indexed:0 read:2min 2026-05-14 ·