Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

wpnews.pro

[ARTICLE · art-30512] src=arxiv.org ↗ pub=2026-06-17T04:00Z topic=computer-vision verified=true sentiment=↑ positive

Researchers propose OR3, a text-to-video retrieval method for operating room clips that converts videos into action-driven digital twins and uses an LLM to generate hypothetical queries for intra-modal matching. The method achieves 57.6% R@1 and 77.3% R@5 on a benchmark of 276 implicit queries over 386 clips from robotic knee procedures, outperforming existing baselines.
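For readers unfamiliar with the metric, R@K (Recall at K) is the percentage of queries whose correct clip appears among the top K retrieved results. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes each query has exactly one correct clip, which the abstract does not state, and the names `recall_at_k`, `rankings`, and `targets` are hypothetical.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]],
                targets: Dict[str, str],
                k: int) -> float:
    """Percentage of queries whose target clip is in the top-k results.

    rankings: query id -> clip ids ordered from best to worst match
    targets:  query id -> the single correct clip id (an assumption)
    """
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for query_id, ranked in rankings.items()
        if targets[query_id] in ranked[:k]
    )
    return 100.0 * hits / len(rankings)


# Toy data with made-up ids, only to show the call shape.
toy_rankings = {
    "q1": ["clip_a", "clip_b", "clip_c"],
    "q2": ["clip_b", "clip_a", "clip_c"],
}
toy_targets = {"q1": "clip_a", "q2": "clip_a"}

r_at_1 = recall_at_k(toy_rankings, toy_targets, k=1)
r_at_2 = recall_at_k(toy_rankings, toy_targets, k=2)
```

Under this reading, a reported 57.6 R@1 would mean that the correct clip ranked first for 57.6% of the 276 queries.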

1 min read · published Jun 17, 2026

arXiv:2606.17298v1. Abstract: Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety because it lets stakeholders retrieve and inspect recordings of specific events. The most safety-critical events, however, may not follow the common structure, so to reach its full potential a retrieval system must handle implicit queries that require reasoning to identify the right video (e.g., "the step right before clipping"). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), which group concurrent subject-action-object triplets under non-overlapping temporal intervals. Rather than matching across modalities through paired encoders, OR3 performs imagination-based retrieval, in which an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises the imagined ActDTs based on their discrepancies with the top candidates, capturing procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
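To make the ActDT idea concrete, here is one possible way to group concurrent subject-action-object triplets under non-overlapping intervals. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the data format or the grouping rule, so the `Triplet` class, the boundary-splitting strategy, the text serialization, and the example actions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One annotated action with its time span in seconds (hypothetical schema)."""
    subject: str
    action: str
    obj: str
    start: float
    end: float


Segment = Tuple[float, float, List[Tuple[str, str, str]]]


def build_actdt(triplets: List[Triplet]) -> List[Segment]:
    """Split the timeline at every start/end time and list the triplets
    active in each resulting interval, so intervals never overlap."""
    bounds = sorted({t.start for t in triplets} | {t.end for t in triplets})
    segments: List[Segment] = []
    for lo, hi in zip(bounds, bounds[1:]):
        active = sorted(
            (t.subject, t.action, t.obj)
            for t in triplets
            if t.start <= lo and t.end >= hi
        )
        if not active:
            continue  # nothing happening in this gap
        # Merge with the previous interval when it is adjacent and identical.
        if segments and segments[-1][1] == lo and segments[-1][2] == active:
            segments[-1] = (segments[-1][0], hi, active)
        else:
            segments.append((lo, hi, active))
    return segments


def serialize(segments: List[Segment]) -> str:
    """Render segments as plain text, one interval per line (assumed format)."""
    return "\n".join(
        f"[{lo:g}-{hi:g}] " + "; ".join(" ".join(t) for t in active)
        for lo, hi, active in segments
    )


# Illustrative, made-up annotations.
example = [
    Triplet("nurse", "hands", "tool", 0.0, 4.0),
    Triplet("surgeon", "drills", "bone", 2.0, 6.0),
]
actdt = build_actdt(example)
actdt_text = serialize(actdt)
```

If both clips and LLM-imagined queries are rendered in the same representation, a single text encoder can compare them directly, which is the intra-modal matching the abstract describes. The encoder and its hard-negative training are out of scope for this sketch.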

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.17298
