{"slug": "can-ai-draw-science-a-benchmark-for-evaluating-scientific-figure-generation-by", "title": "Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models", "summary": "Researchers introduced SciDraw-Bench, a benchmark for evaluating scientific figure generation by text-to-image and multimodal models, covering 32 tasks across eight figure types and ten disciplines. In a pilot study, a domain-specific system, SciDraw AI, substantially outperformed general-purpose models on all evaluation dimensions, with text fidelity remaining the hardest challenge.", "body_md": "arXiv:2606.28406v1\nAbstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing image-generation benchmarks (e.g., GenEval, T2I-CompBench, DPG-Bench) evaluate natural images and measure compositionality, object counting, or photorealism. None of them measure what makes a generated scientific figure usable: correct and legible text labels, faithful depiction of entities and their relations, coherent diagrammatic structure, and adherence to disciplinary drawing conventions. We introduce SciDraw-Bench, a benchmark of 32 structured scientific-figure generation tasks spanning eight figure types and ten disciplines, where each task pairs a natural-language prompt with a machine-checkable specification of required labels, relations, components, conventions, and negative constraints. We propose a four-dimensional evaluation protocol: Text Fidelity (OCR-based label recall and character error rate), Semantic Correctness (vision-language-model judging against the specification), Structural Quality, and Convention Adherence, together with a meta-evaluation protocol and a preliminary inter-judge reliability analysis (human-rating validation is ongoing). 
We evaluate a domain-specific system, SciDraw AI, against representative general-purpose text-to-image models, and outline a code-to-figure baseline as a planned extension. In a pilot over all eight figure types, the domain-specific system substantially outperforms the general-purpose baselines on every dimension and figure type, with the largest gaps on semantic correctness and convention adherence; text fidelity remains the hardest dimension for all systems.", "url": "https://wpnews.pro/news/can-ai-draw-science-a-benchmark-for-evaluating-scientific-figure-generation-by", "canonical_source": "https://arxiv.org/abs/2606.28406", "published_at": "2026-06-30 04:00:00+00:00", "updated_at": "2026-06-30 04:29:07.500697+00:00", "lang": "en", "topics": ["generative-ai", "artificial-intelligence", "machine-learning", "computer-vision", "natural-language-processing"], "entities": ["SciDraw-Bench", "SciDraw AI", "GenEval", "T2I-CompBench", "DPG-Bench"], "alternates": {"html": "https://wpnews.pro/news/can-ai-draw-science-a-benchmark-for-evaluating-scientific-figure-generation-by", "markdown": "https://wpnews.pro/news/can-ai-draw-science-a-benchmark-for-evaluating-scientific-figure-generation-by.md", "text": "https://wpnews.pro/news/can-ai-draw-science-a-benchmark-for-evaluating-scientific-figure-generation-by.txt", "jsonld": "https://wpnews.pro/news/can-ai-draw-science-a-benchmark-for-evaluating-scientific-figure-generation-by.jsonld"}}