Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

Researchers propose OR3, a text-to-video retrieval method for operating room clips that converts videos into action-driven digital twins and uses an LLM to generate hypothetical queries for intra-modal matching. The method achieves 57.6% R@1 and 77.3% R@5 on a benchmark of 276 implicit queries over 386 clips from robotic knee procedures, outperforming existing baselines.

arXiv:2606.17298v1 Abstract: Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, text-to-video retrieval can only reach its full potential if it handles implicit queries that require reasoning to identify the right video (e.g., "the step right before clipping"). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval, in which an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6% R@1 and 77.3% R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
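
To make the ActDT idea concrete, the sketch below shows one way that concurrent subject-action-object triplets could be grouped under non-overlapping temporal intervals. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the ActDT format, so the names `TimedTriplet`, `Interval`, `build_actdt`, and `to_text`, the boundary-splitting rule, and the example labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: (subject, action, object)
Triplet = Tuple[str, str, str]


@dataclass(frozen=True)
class TimedTriplet:
    start: float  # seconds from clip start (assumed unit)
    end: float
    triplet: Triplet


@dataclass
class Interval:
    start: float
    end: float
    triplets: List[Triplet]  # all triplets active during [start, end]


def build_actdt(events: List[TimedTriplet]) -> List[Interval]:
    """Group concurrent triplets under non-overlapping intervals.

    Assumed rule: every start or end time is a boundary at which the set
    of active triplets may change. Zero-duration events are ignored.
    """
    bounds = sorted({t for e in events for t in (e.start, e.end)})
    intervals: List[Interval] = []
    for lo, hi in zip(bounds, bounds[1:]):
        active = sorted(
            e.triplet for e in events if e.start <= lo and e.end >= hi
        )
        if not active:
            continue  # gap with no annotated action
        last = intervals[-1] if intervals else None
        if last and last.end == lo and last.triplets == active:
            last.end = hi  # same active set continues: extend the interval
        else:
            intervals.append(Interval(lo, hi, active))
    return intervals


def to_text(actdt: List[Interval]) -> str:
    """Serialize an ActDT to text so a single text encoder could embed it."""
    return "\n".join(
        f"[{iv.start:g}-{iv.end:g}] " + "; ".join(" ".join(t) for t in iv.triplets)
        for iv in actdt
    )


if __name__ == "__main__":
    # Made-up labels for illustration only, not taken from MM-OR
    demo = [
        TimedTriplet(0, 10, ("surgeon", "sawing", "patient")),
        TimedTriplet(5, 15, ("nurse", "handing", "instrument")),
        TimedTriplet(20, 25, ("surgeon", "suturing", "patient")),
    ]
    print(to_text(build_actdt(demo)))
```

Under this assumed format, an ActDT imagined by an LLM from a query and an ActDT extracted from a clip would share the same textual form, which is what would allow both to be embedded by one encoder and compared intra-modally.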