{"slug": "vortex-multi-modal-fusion-system-for-intelligent-video-retrieval", "title": "Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval", "summary": "FocusOnFun team's Vortex system achieved 90.5% in the Preliminary Round and 'Excellent' overall performance with 'Outstanding' QA results at the Ho Chi Minh City AI Challenge 2025. The multimodal video retrieval system integrates CLIP and SigLIP2 embeddings via Reciprocal Rank Fusion, using Milvus and Elasticsearch for scalable indexing.", "body_md": "arXiv:2606.19682v1 Announce Type: new\nAbstract: This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an 'Excellent' overall performance with 'Outstanding' results in the question-answering (QA) task. These results demonstrate the complementary strengths of CLIP and SigLIP2 and confirm the effectiveness of the hybrid retrieval approach. 
The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.", "url": "https://wpnews.pro/news/vortex-multi-modal-fusion-system-for-intelligent-video-retrieval", "canonical_source": "https://arxiv.org/abs/2606.19682", "published_at": "2026-06-19 04:00:00+00:00", "updated_at": "2026-06-19 04:01:18.290850+00:00", "lang": "en", "topics": ["computer-vision", "natural-language-processing", "ai-research", "machine-learning", "ai-products"], "entities": ["FocusOnFun", "Ho Chi Minh City AI Challenge 2025", "Vortex", "CLIP", "SigLIP2", "Milvus", "Elasticsearch", "Reciprocal Rank Fusion"], "alternates": {"html": "https://wpnews.pro/news/vortex-multi-modal-fusion-system-for-intelligent-video-retrieval", "markdown": "https://wpnews.pro/news/vortex-multi-modal-fusion-system-for-intelligent-video-retrieval.md", "text": "https://wpnews.pro/news/vortex-multi-modal-fusion-system-for-intelligent-video-retrieval.txt", "jsonld": "https://wpnews.pro/news/vortex-multi-modal-fusion-system-for-intelligent-video-retrieval.jsonld"}}