{"slug": "reasoning-text-to-video-retrieval-for-operating-room-clips-via-action-driven", "title": "Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins", "summary": "Researchers propose OR3, a text-to-video retrieval method for operating room clips that converts videos into action-driven digital twins and uses an LLM to generate hypothetical queries for intra-modal matching. The method achieves 57.6% R@1 and 77.3% R@5 on a benchmark of 276 implicit queries over 386 clips from robotic knee procedures, outperforming existing baselines.", "body_md": "arXiv:2606.17298v1 Announce Type: new\nAbstract: Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, text-to-video retrieval can reach its full potential only if it handles implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. 
These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.", "url": "https://wpnews.pro/news/reasoning-text-to-video-retrieval-for-operating-room-clips-via-action-driven", "canonical_source": "https://arxiv.org/abs/2606.17298", "published_at": "2026-06-17 04:00:00+00:00", "updated_at": "2026-06-17 04:25:37.943763+00:00", "lang": "en", "topics": ["computer-vision", "natural-language-processing", "large-language-models", "ai-research"], "entities": ["OR3", "MM-OR", "LLM", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/reasoning-text-to-video-retrieval-for-operating-room-clips-via-action-driven", "markdown": "https://wpnews.pro/news/reasoning-text-to-video-retrieval-for-operating-room-clips-via-action-driven.md", "text": "https://wpnews.pro/news/reasoning-text-to-video-retrieval-for-operating-room-clips-via-action-driven.txt", "jsonld": "https://wpnews.pro/news/reasoning-text-to-video-retrieval-for-operating-room-clips-via-action-driven.jsonld"}}