{"slug": "cineorchestra-unified-entity-centric-conditioning-for-cinematic-video-generation", "title": "CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation", "summary": "Researchers introduced CineOrchestra, a unified video diffusion model that simultaneously controls subjects, events, cameras, and shot transitions for cinematic video generation. The model uses entity-centric conditioning with novel rotary positional embeddings to handle heterogeneous cinematic elements, outperforming six per-axis specialists on new benchmarks.", "body_md": "arXiv:2606.13768v1 Announce Type: new\nAbstract: Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.", "url": "https://wpnews.pro/news/cineorchestra-unified-entity-centric-conditioning-for-cinematic-video-generation", "canonical_source": "https://arxiv.org/abs/2606.13768", "published_at": "2026-06-15 04:00:00+00:00", "updated_at": "2026-06-15 04:13:08.492191+00:00", "lang": "en", "topics": ["machine-learning", "computer-vision", "generative-ai", "ai-research"], "entities": ["CineOrchestra"], "alternates": {"html": "https://wpnews.pro/news/cineorchestra-unified-entity-centric-conditioning-for-cinematic-video-generation", "markdown": "https://wpnews.pro/news/cineorchestra-unified-entity-centric-conditioning-for-cinematic-video-generation.md", "text": "https://wpnews.pro/news/cineorchestra-unified-entity-centric-conditioning-for-cinematic-video-generation.txt", "jsonld": "https://wpnews.pro/news/cineorchestra-unified-entity-centric-conditioning-for-cinematic-video-generation.jsonld"}}