DocumentAI bbox benchmark

Malte Buettner benchmarked bounding box accuracy for Document AI models using pages from the FlashAttention-3 paper, testing Qwen, Kimi, and Mistral via OpenRouter. The evaluation scored models on coverage, intersection-over-union, and centroid distance to measure how well generated bounding boxes matched reference boxes from pdfplumber. Results showed inconsistent scores across multiple runs for some models, highlighting variability in open-weight model performance for document extraction tasks.

In my previous post (https://www.maltebuettner.eu/documentai/), I talked a bit about the recent developments in the field of DocumentAI. Now comes the practical part. My starting point is the Attention v3 paper from the ExtractBench dataset (https://github.com/ContextualAI/extract-bench). ExtractBench focused only on extraction, but I am also interested in the bounding box references that the models return. Because ExtractBench covered only a very limited selection of models, with no open-weight ones among them, I ran a few extractions via OpenRouter, especially to see how well Qwen, Kimi, and Mistral are doing.

So I took pages 1 and 13 from the FlashAttention-3 example (https://arxiv.org/pdf/2407.08608) in that dataset and added "reference" bounding boxes with pdfplumber (https://github.com/jsvine/pdfplumber), which works as a reference here because it is a native PDF. The reference boxes are not perfect, but for a rough indication they are more than enough.

For some models I did not manage to generate extraction and bboxes in one run. For these I ran separate extraction and bbox prompts.

Note: For some models I could not get consistent scores on OpenRouter, even after several runs.

The bbox score is a bit over-engineered. It combines three parts:

- coverage: for how many fields were bboxes generated?
- intersection-over-union (IoU), also known as the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index): how well does the bbox "fit" the original one?
- centroid distance: is the bbox roughly in the correct area?

The parts are combined as:

    bbox score = coverage × (0.5 × mean IoU + 0.5 × centroid score)

The one-shot prompts:

    ONE_SHOT_SYSTEM_PROMPT = "Return only valid JSON matching the provided JSON Schema."

    # annotated_extraction_schema_json holds the annotated JSON Schema as a string.
    one_shot_user_prompt = f"""
    Only use the provided page images. They are not necessarily consecutive pages.
    The original PDF has 22 pages. If the schema asks for number of pages, use 22.

    Page mapping:
    - input image 1: original PDF page 1, page_index 0
    - input image 2: original PDF page 13, page_index 12

    Each scalar extraction field is an object with value and bbox.
    Use bbox null when the value is not visible in the provided page images.
    Boxes are [x1, y1, x2, y2].

    JSON Schema:
    {annotated_extraction_schema_json}
    """

I modified the original JSON schema a bit and added an additional bbox field to every value. See the example for the ids field:

    {
      "ids": {
        "value": { "type": ["string", "null"] },
        "bbox": {
          "type": ["object", "null"],
          "properties": {
            "page_index": { "type": "integer" },
            "box": {
              "type": "array",
              "items": { "type": "number" },
              "minItems": 4,
              "maxItems": 4
            }
          },
          "required": ["page_index", "box"]
        }
      }
    }
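To make the bbox score more concrete, here is a minimal Python sketch of how the three parts could be combined. It is not the code used for the benchmark. The function names (`iou`, `centroid_score`, `bbox_score`) and the `page_diagonal` parameter are hypothetical. The post does not say how centroid distance is turned into a score, so the sketch assumes a linear falloff from 1 (same centroid) to 0 (one page diagonal apart). It also assumes that both boxes of a field are on the same page, and that the means are taken only over fields for which the model returned a bbox.

```python
from math import hypot


def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def centroid_score(a, b, page_diagonal):
    """ASSUMPTION: 1.0 at zero centroid distance, linearly down to 0.0
    at a distance of one page diagonal or more."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    dist = hypot(ax - bx, ay - by)
    return max(0.0, 1.0 - dist / page_diagonal)


def bbox_score(predicted, reference, page_diagonal):
    """coverage * (0.5 * mean IoU + 0.5 * mean centroid score).

    predicted / reference map field names to [x1, y1, x2, y2] or None.
    """
    fields = [f for f, ref in reference.items() if ref is not None]
    hits = [f for f in fields if predicted.get(f) is not None]
    if not fields or not hits:
        return 0.0
    coverage = len(hits) / len(fields)
    mean_iou = sum(iou(predicted[f], reference[f]) for f in hits) / len(hits)
    mean_centroid = sum(
        centroid_score(predicted[f], reference[f], page_diagonal) for f in hits
    ) / len(hits)
    return coverage * (0.5 * mean_iou + 0.5 * mean_centroid)


# Toy example: one of two fields gets a perfect box, the other gets none.
reference = {"title": [0, 0, 10, 10], "ids": [0, 20, 10, 30]}
predicted = {"title": [0, 0, 10, 10], "ids": None}
print(bbox_score(predicted, reference, page_diagonal=100.0))  # 0.5
```

In the toy example the model is penalized only through coverage: the single box it returned is perfect, but it covers just half of the fields, so the score is 0.5.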