
Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

Researchers propose OR3, a text-to-video retrieval method for operating room clips that converts videos into action-driven digital twins and uses an LLM to generate hypothetical queries for intra-modal matching. The method achieves 57.6% R@1 and 77.3% R@5 on a benchmark of 276 implicit queries over 386 clips from robotic knee procedures, outperforming existing baselines.
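For readers unfamiliar with the metric, R@K (Recall at K) is commonly computed as the percentage of queries whose correct clip appears among the top K retrieved results. The sketch below illustrates that standard definition on toy data. It assumes one correct clip per query, which the source does not state, and the function name `recall_at_k` and the clip IDs are hypothetical.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target clip is in the top-k results.

    ranked_ids: one ranked list of clip IDs per query (best match first).
    target_ids: the single correct clip ID per query (an assumption here).
    """
    if len(ranked_ids) != len(target_ids) or not target_ids:
        raise ValueError("need one non-empty ranking per target")
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


# Toy data: four queries ranked over three hypothetical clips.
rankings = [
    ["c3", "c1", "c2"],  # target c3 found at rank 1
    ["c2", "c3", "c1"],  # target c1 found at rank 3
    ["c1", "c2", "c3"],  # target c1 found at rank 1
    ["c1", "c2", "c3"],  # target c2 found at rank 2
]
targets = ["c3", "c1", "c1", "c2"]

r_at_1 = recall_at_k(rankings, targets, 1)  # 50.0
r_at_2 = recall_at_k(rankings, targets, 2)  # 75.0
```

A higher K can only keep or raise the score, which is why R@5 is at least as large as R@1 for any method.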

Published Jun 17, 2026

arXiv:2606.17298v1. Abstract: Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, because it lets stakeholders retrieve and inspect recordings of specific events. The most safety-critical events, however, may not follow the common structure, so to reach its full potential text-to-video retrieval must handle implicit queries that require reasoning to identify the right video (e.g., "the step right before clipping"). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), which group concurrent subject-action-object triplets under non-overlapping temporal intervals. Rather than matching across modalities through paired encoders, OR3 performs imagination-based retrieval, in which an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises the imagined ActDTs based on their discrepancies with the top candidates, capturing procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6% R@1 and 77.3% R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
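To make the ActDT idea concrete, here is a minimal sketch of one way to group concurrent subject-action-object triplets under non-overlapping temporal intervals. It is not the authors' implementation: the boundary-sweep approach, the `TimedTriplet` type, the `build_actdt` function, and the example actions are all assumptions made for illustration, since the abstract does not describe how ActDTs are built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTriplet:
    """A hypothetical timed annotation: who does what to what, and when."""
    subject: str
    action: str
    obj: str
    start: float  # seconds
    end: float    # seconds


def build_actdt(triplets):
    """Return a list of (start, end, active_triplets) segments.

    Assumed approach: cut the timeline at every triplet boundary, record
    which triplets are active in each slice, and merge adjacent slices
    that share the same active set. Segments never overlap.
    """
    bounds = sorted({t.start for t in triplets} | {t.end for t in triplets})
    segments = []
    for lo, hi in zip(bounds, bounds[1:]):
        active = frozenset(
            (t.subject, t.action, t.obj)
            for t in triplets
            if t.start <= lo and t.end >= hi
        )
        if not active:
            continue  # nothing happens in this slice
        if segments and segments[-1][1] == lo and segments[-1][2] == active:
            # Same concurrent actions as the previous slice: extend it.
            segments[-1] = (segments[-1][0], hi, active)
        else:
            segments.append((lo, hi, active))
    return segments


# Made-up example: two overlapping actions yield three segments.
clip = [
    TimedTriplet("nurse", "holds", "retractor", 0.0, 4.0),
    TimedTriplet("surgeon", "drills", "femur", 2.0, 6.0),
]
twin = build_actdt(clip)
for start, end, active in twin:
    print(f"[{start:.1f}, {end:.1f}) {sorted(active)}")
```

A structure like this makes ordering explicit, which is the kind of information an implicit query such as "the step right before clipping" would need to resolve.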
