# DocumentAI Visual Benchmark - GPT 5.5, Gemini 3.5, Qwen...

> Source: <https://www.maltebuettner.eu/posts/documentai-bbox-benchmark>
> Published: 2026-05-30 12:32:36+00:00

# # documentai bbox benchmark

In my [previous post](../documentai/), I talked a bit about the recent developments in the field of DocumentAI. Now comes the practical part. For the Attention v3 paper from the [ExtractBench](https://github.com/ContextualAI/extract-bench) dataset, ExtractBench focused only on extraction, but I am also interested in the bounding box reference that the models return.

Because ExtractBench had only a very limited selection of models without any open-weight ones among them, I ran a few extractions via `OpenRouter`

especially to see how well Qwen, Kimi, and Mistral are doing. So I took pages 1 and 13 from the [FlashAttention-3](https://arxiv.org/pdf/2407.08608) example from there and added "reference" bounding boxes with [pdfplumber](https://github.com/jsvine/pdfplumber) (it is a native PDF) as a reference. They are not perfect, but for a rough indication they are more than enough.

## leaderboard

**67.7%**

*100%*

**67.5%**

*100%*

**66.1%**

*100%*

**61.5%**

*100%*

**40.4%**

*100%*

**38.9%**

*90.6%**

**34%**

*100%*

**30.6%**

*100%*

**21.8%**

*90.6%**

**17.9%**

*100%*

**17.8%**

*43.8%*

**10.8%**

*100%*

**5%**

*76%*

* for some models I did not manage to generate extraction and bbox in one run. For these I ran separate extraction + bbox prompts.

**Note: For some models I could not really get consistent scores on OpenRouter even after several runs.**

The bbox score is a bit over-engineered with coverage (for how many fields were bboxes generated?), intersection-over-union (to check how well the bbox "fits" the original one, also known as the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)), and centroid distance (to check if the bbox is roughly in the correct area):

## prompts

```
ONE_SHOT_SYSTEM_PROMPT = "Return only valid JSON matching the provided JSON Schema."

one_shot_user_prompt = f"""
Only use the provided page images. They are not necessarily consecutive pages.
The original PDF has 22 pages. If the schema asks for number_of_pages, use 22.
Page mapping:
- input image 1: original PDF page 1, page_index 0
- input image 2: original PDF page 13, page_index 12

Each scalar extraction field is an object with value and bbox. Use bbox null when
the value is not visible in the provided page images. Boxes are [x1, y1, x2, y2].

JSON Schema:
{annotated_extraction_schema_json}
"""
```

I modified the original JSON schema a bit and added an additional `bbox`

field to every value. See the example for the `ids`

field:

```
{
  "ids": {
    "value": {
      "type": ["string", "null"]
    },
    "bbox": {
      "type": ["object", "null"],
      "properties": {
        "page_index": {
          "type": "integer"
        },
        "box": {
          "type": "array",
          "items": {
            "type": "number"
          },
          "minItems": 4,
          "maxItems": 4
        }
      },
      "required": ["page_index", "box"]
    }
  }
}
```


