{"slug": "documentai-visual-benchmark-gpt-5-5-gemini-3-5-qwen", "title": "DocumentAI Visual Benchmark - GPT 5.5, Gemini 3.5, Qwen...", "summary": "A new benchmark evaluating DocumentAI models on bounding box accuracy shows GPT-5.5 and Gemini 3.5 leading with 67.7% and 67.5% scores respectively, while Qwen, Kimi, and Mistral trail significantly. The test, run on OpenRouter using pages from the FlashAttention-3 paper, measures how well models return bounding box coordinates for extracted fields, with scores ranging from 67.7% down to 5%.", "body_md": "# documentai bbox benchmark\n\nIn my [previous post](../documentai/), I talked a bit about recent developments in the field of DocumentAI. Now comes the practical part. On the Attention v3 paper from its dataset, [ExtractBench](https://github.com/ContextualAI/extract-bench) focused only on extraction, but I am also interested in the bounding box references that the models return.\n\nBecause ExtractBench covered only a very limited selection of models, none of them open-weight, I ran a few extractions via `OpenRouter`, especially to see how well Qwen, Kimi, and Mistral are doing. I took pages 1 and 13 from the [FlashAttention-3](https://arxiv.org/pdf/2407.08608) example there and added reference bounding boxes with [pdfplumber](https://github.com/jsvine/pdfplumber) (it is a native PDF). They are not perfect, but they are more than enough for a rough indication.\n\n## leaderboard\n\n**67.7%**\n\n*100%*\n\n**67.5%**\n\n*100%*\n\n**66.1%**\n\n*100%*\n\n**61.5%**\n\n*100%*\n\n**40.4%**\n\n*100%*\n\n**38.9%**\n\n*90.6%**\n\n**34%**\n\n*100%*\n\n**30.6%**\n\n*100%*\n\n**21.8%**\n\n*90.6%**\n\n**17.9%**\n\n*100%*\n\n**17.8%**\n\n*43.8%*\n\n**10.8%**\n\n*100%*\n\n**5%**\n\n*76%*\n\n* for some models I did not manage to generate extraction and bbox in one run. 
For these I ran separate extraction and bbox prompts.\n\n**Note: For some models I could not get consistent scores on OpenRouter, even after several runs.**\n\nThe bbox score is a bit over-engineered. It combines coverage (for how many fields was a bbox generated?), intersection-over-union (how well does the bbox fit the reference one? This is also known as the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)), and centroid distance (is the bbox roughly in the correct area?).\n\n## prompts\n\n```\nONE_SHOT_SYSTEM_PROMPT = \"Return only valid JSON matching the provided JSON Schema.\"\n\none_shot_user_prompt = f\"\"\"\nOnly use the provided page images. They are not necessarily consecutive pages.\nThe original PDF has 22 pages. If the schema asks for number_of_pages, use 22.\nPage mapping:\n- input image 1: original PDF page 1, page_index 0\n- input image 2: original PDF page 13, page_index 12\n\nEach scalar extraction field is an object with value and bbox. Use bbox null when\nthe value is not visible in the provided page images. Boxes are [x1, y1, x2, y2].\n\nJSON Schema:\n{annotated_extraction_schema_json}\n\"\"\"\n```\n\nI modified the original JSON schema slightly and added a `bbox` field to every value. 
See the example for the `ids` field:\n\n```\n{\n  \"ids\": {\n    \"value\": {\n      \"type\": [\"string\", \"null\"]\n    },\n    \"bbox\": {\n      \"type\": [\"object\", \"null\"],\n      \"properties\": {\n        \"page_index\": {\n          \"type\": \"integer\"\n        },\n        \"box\": {\n          \"type\": \"array\",\n          \"items\": {\n            \"type\": \"number\"\n          },\n          \"minItems\": 4,\n          \"maxItems\": 4\n        }\n      },\n      \"required\": [\"page_index\", \"box\"]\n    }\n  }\n}\n```\n\n", "url": "https://wpnews.pro/news/documentai-visual-benchmark-gpt-5-5-gemini-3-5-qwen", "canonical_source": "https://www.maltebuettner.eu/posts/documentai-bbox-benchmark", "published_at": "2026-05-30 12:32:36+00:00", "updated_at": "2026-05-30 13:16:37.972725+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-research", "computer-vision"], "entities": ["OpenRouter", "Qwen", "Kimi", "Mistral", "FlashAttention-3", "ExtractBench", "pdfplumber", "ContextualAI"], "alternates": {"html": "https://wpnews.pro/news/documentai-visual-benchmark-gpt-5-5-gemini-3-5-qwen", "markdown": "https://wpnews.pro/news/documentai-visual-benchmark-gpt-5-5-gemini-3-5-qwen.md", "text": "https://wpnews.pro/news/documentai-visual-benchmark-gpt-5-5-gemini-3-5-qwen.txt", "jsonld": "https://wpnews.pro/news/documentai-visual-benchmark-gpt-5-5-gemini-3-5-qwen.jsonld"}}