{"slug": "fine-tuning-vision-language-models-for-understanding-current-damage-and-scoring", "title": "Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent", "summary": "Researchers in Japan fine-tuned the LLaVA-1.5-7B vision-language model with QLoRA on up to 4,000 bridge damage images to automate damage assessment and repair priority scoring, addressing significant inter-rater variability in mandatory five-year visual inspections. The study found that 2,000 training samples achieved near-optimal validation loss in 2.9 hours, with quality-curated mid-scale data outperforming larger, noisier corpora. A two-stage Quality Guard using a fine-tuned Swallow-8B small language model rejects low-quality outputs before scoring. Separately, inference optimization combining torch.compile() and batch processing (batch_size=8) reduces inference time by 70.2% to 10.06 seconds per image.", "body_md": "arXiv:2605.27452v1 Announce Type: new\nAbstract: Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability -- a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs).\nWe fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. 
A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining torch.compile() and batch processing (batch_size=8) achieves 10.06 seconds per image -- a 70.2% reduction over the unoptimized baseline.\nOur approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.", "url": "https://wpnews.pro/news/fine-tuning-vision-language-models-for-understanding-current-damage-and-scoring", "canonical_source": "https://arxiv.org/abs/2605.27452", "published_at": "2026-05-28 04:00:00+00:00", "updated_at": "2026-05-28 04:26:52.886107+00:00", "lang": "en", "topics": ["computer-vision", "large-language-models", "machine-learning", "artificial-intelligence", "ai-research"], "entities": ["LLaVA-1.5-7B", "QLoRA", "Japan"], "alternates": {"html": "https://wpnews.pro/news/fine-tuning-vision-language-models-for-understanding-current-damage-and-scoring", "markdown": "https://wpnews.pro/news/fine-tuning-vision-language-models-for-understanding-current-damage-and-scoring.md", "text": "https://wpnews.pro/news/fine-tuning-vision-language-models-for-understanding-current-damage-and-scoring.txt", "jsonld": "https://wpnews.pro/news/fine-tuning-vision-language-models-for-understanding-current-damage-and-scoring.jsonld"}}