{"slug": "gap3d-generative-alignment-of-vlm-latents-to-patch-level-embeddings-for-3d", "title": "GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation", "summary": "Researchers have developed GAP3D, a diffusion-based method that aligns vision-language model latents directly to patch-level image embeddings, enabling frozen generative models to use VLMs as prompt encoders without expensive end-to-end training. The approach, tested on 3D asset generation, trains primarily on general-domain image-text pairs and demonstrates emergent zero-shot capabilities for multimodal prompts despite text-only training. GAP3D represents an initial step toward modular integration of foundation models by partially bridging the representation gap between VLM and image-encoder feature spaces through generative alignment to dense embedding spaces.", "body_md": "arXiv:2605.28995v1 Announce Type: new\nAbstract: Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation. To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to use a VLM as a prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. 
Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.", "url": "https://wpnews.pro/news/gap3d-generative-alignment-of-vlm-latents-to-patch-level-embeddings-for-3d", "canonical_source": "https://arxiv.org/abs/2605.28995", "published_at": "2026-05-29 04:00:00+00:00", "updated_at": "2026-05-29 04:15:35.343185+00:00", "lang": "en", "topics": ["generative-ai", "computer-vision", "machine-learning", "artificial-intelligence", "neural-networks"], "entities": ["GAP3D", "VLM", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/gap3d-generative-alignment-of-vlm-latents-to-patch-level-embeddings-for-3d", "markdown": "https://wpnews.pro/news/gap3d-generative-alignment-of-vlm-latents-to-patch-level-embeddings-for-3d.md", "text": "https://wpnews.pro/news/gap3d-generative-alignment-of-vlm-latents-to-patch-level-embeddings-for-3d.txt", "jsonld": "https://wpnews.pro/news/gap3d-generative-alignment-of-vlm-latents-to-patch-level-embeddings-for-3d.jsonld"}}