CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

wpnews.pro

Source: arxiv.org · Published: 2026-06-15T04:00Z · Topic: machine-learning

Researchers have introduced CineOrchestra, a unified video diffusion model that simultaneously controls subjects, events, cameras, and shot transitions for cinematic video generation. The model uses entity-centric conditioning with two parameter-free rotary positional embeddings to handle heterogeneous cinematic elements, and it outperforms six per-axis specialists on two new benchmarks covering dense caption following and shot-transition timing.

1 min read · Published Jun 15, 2026

arXiv:2606.13768v1

Abstract: Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.
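The abstract describes the conditioning only in words, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes that an entity-centric primitive can be modeled as a record holding a kind, a caption, a time interval, and an optional reference image, and that "interval-sampled" means drawing the same number of evenly spaced positions from every interval regardless of its duration. The names `EntityCondition`, `interval_positions`, and `rope_angles` are hypothetical; only the angle formula follows the standard rotary embedding definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EntityCondition:
    """Hypothetical primitive: one entity acting over one time interval."""
    kind: str                      # "subject", "event", "camera", or "transition"
    caption: str
    start: float                   # interval start, in seconds
    end: float                     # interval end, in seconds
    reference_image: Optional[str] = None  # path, for visual entities only


def interval_positions(cond: EntityCondition, num_samples: int = 4) -> List[float]:
    """Assumed sampling rule: evenly spaced positions across [start, end].

    Every condition yields the same number of positions, so a 1-second
    event and a 6-second event are described at the same resolution.
    """
    if num_samples == 1:
        return [(cond.start + cond.end) / 2]
    step = (cond.end - cond.start) / (num_samples - 1)
    return [cond.start + i * step for i in range(num_samples)]


def rope_angles(position: float, dim: int = 8, base: float = 10000.0) -> List[float]:
    """Standard RoPE rotation angles for one scalar position."""
    return [position / base ** (2 * i / dim) for i in range(dim // 2)]


# A made-up shot description using all four kinds of entity.
shot = [
    EntityCondition("subject", "a woman in a red coat", 0.0, 6.0, "woman.png"),
    EntityCondition("event", "she opens the door", 1.0, 2.0),
    EntityCondition("camera", "slow dolly-in", 0.0, 4.0),
    EntityCondition("transition", "hard cut", 6.0, 6.0),
]

for cond in shot:
    print(cond.kind, interval_positions(cond))
```

Because the rotary embedding itself has no learned weights, a scheme of this shape would be parameter-free in the sense the abstract mentions; how the paper actually samples positions and builds the 2D entity-temporal variant is not stated in this summary.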

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.13768
