DocumentAI Visual Benchmark - GPT 5.5, Gemini 3.5, Qwen...

A new benchmark evaluating DocumentAI models on bounding box accuracy shows GPT-5.5 and Gemini 3.5 leading with 67.7% and 67.5% scores respectively, while Qwen, Kimi, and Mistral trail significantly. The test, run on OpenRouter using pages from the FlashAttention-3 paper, measures how well models return bounding box coordinates for extracted fields, with scores ranging from 67.7% down to 5%.

documentai bbox benchmark In my previous post ../documentai/ , I talked a bit about the recent developments in the field of DocumentAI. Now comes the practical part. For the Attention v3 paper from the ExtractBench https://github.com/ContextualAI/extract-bench dataset, ExtractBench focused only on extraction, but I am also interested in the bounding box reference that the models return. Because ExtractBench had only a very limited selection of models without any open-weight ones among them, I ran a few extractions via OpenRouter especially to see how well Qwen, Kimi, and Mistral are doing. So I took pages 1 and 13 from the FlashAttention-3 https://arxiv.org/pdf/2407.08608 example from there and added "reference" bounding boxes with pdfplumber https://github.com/jsvine/pdfplumber it is a native PDF as a reference. They are not perfect, but for a rough indication they are more than enough. leaderboard 67.7% 100% 67.5% 100% 66.1% 100% 61.5% 100% 40.4% 100% 38.9% 90.6% 34% 100% 30.6% 100% 21.8% 90.6% 17.9% 100% 17.8% 43.8% 10.8% 100% 5% 76% for some models I did not manage to generate extraction and bbox in one run. For these I ran separate extraction + bbox prompts. Note: For some models I could not really get consistent scores on OpenRouter even after several runs. The bbox score is a bit over-engineered with coverage for how many fields were bboxes generated? , intersection-over-union to check how well the bbox "fits" the original one, also known as the Jaccard index https://en.wikipedia.org/wiki/Jaccard index , and centroid distance to check if the bbox is roughly in the correct area : prompts ONE SHOT SYSTEM PROMPT = "Return only valid JSON matching the provided JSON Schema." one shot user prompt = f""" Only use the provided page images. They are not necessarily consecutive pages. The original PDF has 22 pages. If the schema asks for number of pages, use 22. Page mapping: - input image 1: original PDF page 1, page index 0 - input image 2: original PDF page 13, page index 12 Each scalar extraction field is an object with value and bbox. Use bbox null when the value is not visible in the provided page images. Boxes are x1, y1, x2, y2 . JSON Schema: {annotated extraction schema json} """ I modified the original JSON schema a bit and added an additional bbox field to every value. See the example for the ids field: { "ids": { "value": { "type": "string", "null" }, "bbox": { "type": "object", "null" , "properties": { "page index": { "type": "integer" }, "box": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 } }, "required": "page index", "box" } } }