Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

A systematic review of 97 studies from 2020 to 2026 found that large AI models in dentistry fall into three categories—language-generative, discriminative vision, and dental-specific foundation models—each with distinct strengths and weaknesses. Language models excel at text-based tasks like clinical reasoning but struggle with image diagnostics, while dental-specific models such as DentVFM and OralGPT outperform general-purpose systems on complex multimodal tasks. The findings highlight that safe autonomous deployment remains hindered by generative model hallucination, scarce annotated dental data, and a lack of standardized clinical benchmarks.

arXiv:2606.02914v1. Abstract: Background: Oral diseases affect nearly 3.5 billion people worldwide, yet the comparative clinical potential of large-scale AI models in dentistry remains poorly understood. Three distinct model categories have emerged: language-generative models, discriminative vision foundation models, and dental-specific foundation models, with no unified review examining their relationships and collective limitations. Methods: Following PRISMA-ScR guidelines, we systematically searched four databases (PubMed, Google Scholar, Scopus, arXiv), with records screened independently by two reviewers. After applying inclusion/exclusion criteria, 97 studies (2020-2026) were included. We propose a two-dimensional classification framework organizing models by architectural paradigm and degree of dental specialization. Results: Language-generative models excel at text-based tasks (clinical reasoning, licensing exams, patient communication) but show inconsistent performance on image-dependent diagnostics. Adapted SAM and CLIP variants achieve strong tooth segmentation and lesion detection results. Dental-specific models (DentVFM, DentVLM, OralGPT) demonstrate the strongest performance on complex multimodal tasks. Integrated pipelines consistently outperform single-model approaches. A data asymmetry is observed: dental-specific pretraining concentrates almost entirely in the vision domain, reflecting the scarcity of large-scale dental text corpora. Conclusions: General-purpose and dental-specific models play complementary roles; the most effective systems combine both within structured pipelines. Safe autonomous deployment requires resolving three persistent barriers: hallucination in generative models, limited annotated dental datasets, and the absence of standardized clinical evaluation benchmarks.