{"slug": "exploring-multi-modal-large-language-models-and-two-stage-fine-tuning-for-image", "title": "Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval", "summary": "Researchers propose a new framework integrating LLaVA and two-stage fine-tuning to improve composed image retrieval in fashion, addressing data scarcity and negative sampling limitations. The method enhances contrastive learning and compositional reasoning, showing improved fine-grained retrieval performance.", "body_md": "arXiv:2606.19684v1 Announce Type: new\nAbstract: Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.", "url": "https://wpnews.pro/news/exploring-multi-modal-large-language-models-and-two-stage-fine-tuning-for-image", "canonical_source": "https://arxiv.org/abs/2606.19684", "published_at": "2026-06-19 04:00:00+00:00", "updated_at": "2026-06-19 04:01:24.029545+00:00", "lang": "en", "topics": ["large-language-models", "computer-vision", "machine-learning"], "entities": ["LLaVA", "CLIP-ViT/B32"], "alternates": {"html": "https://wpnews.pro/news/exploring-multi-modal-large-language-models-and-two-stage-fine-tuning-for-image", "markdown": "https://wpnews.pro/news/exploring-multi-modal-large-language-models-and-two-stage-fine-tuning-for-image.md", "text": "https://wpnews.pro/news/exploring-multi-modal-large-language-models-and-two-stage-fine-tuning-for-image.txt", "jsonld": "https://wpnews.pro/news/exploring-multi-modal-large-language-models-and-two-stage-fine-tuning-for-image.jsonld"}}