
CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Researchers introduced CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously for cinematic video generation. The model uses entity-centric conditioning with two parameter-free rotary positional embeddings to handle these heterogeneous elements. On two new benchmarks, it outperforms six per-axis specialist models on dense caption following and shot-transition timing.

1 min read · Published Jun 15, 2026

arXiv:2606.13768v1

Abstract: Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four.

We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval. All of them can therefore be expressed through one shared set of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free, coordinated rotary positional embeddings (RoPE): (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region.

On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.
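The abstract does not give formulas for either embedding, so the Python below is only a minimal sketch of one plausible reading, not the authors' method. `EntityCondition`, `interval_positions`, `rope_1d` and `rope_2d` are hypothetical names. The sketch models each cinematic element as an entity with a temporal interval. It assumes "interval-sampled" means giving every condition the same number of temporal positions spread evenly over its own interval. It then applies a generic 2D RoPE that rotates half of each query/key vector by an entity index and half by time, so attention logits depend only on entity and time offsets.

```python
import math
from dataclasses import dataclass


@dataclass
class EntityCondition:
    """Hypothetical entity-centric primitive: one entity acting over [start, end] frames."""
    entity_id: int  # index used on the entity axis of the 2D RoPE (assumption)
    kind: str       # "subject", "event", "camera" or "shot" (illustrative labels)
    text: str       # caption describing the entity during its interval
    start: float    # first frame of the interval
    end: float      # last frame of the interval


def interval_positions(cond: EntityCondition, num_tokens: int) -> list[float]:
    """Spread a fixed number of temporal positions evenly over the condition's interval.

    Assumption: every condition gets the same token budget, so a 4-frame event and a
    120-frame event both yield num_tokens positions spanning their own intervals.
    """
    if num_tokens == 1:
        return [(cond.start + cond.end) / 2]
    step = (cond.end - cond.start) / (num_tokens - 1)
    return [cond.start + i * step for i in range(num_tokens)]


def rope_1d(vec: list[float], pos: float, base: float = 10000.0) -> list[float]:
    """Standard RoPE: rotate each consecutive pair (x, y) by pos * base**(-2k/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out: list[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2k, so this is base**(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec: list[float], entity: float, time: float) -> list[float]:
    """Generic 2D RoPE: first half rotated by entity index, second half by time."""
    assert len(vec) % 4 == 0, "each half must have an even dimension"
    half = len(vec) // 2
    return rope_1d(vec[:half], entity) + rope_1d(vec[half:], time)


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


# A short and a long event receive the same number of positions.
door = EntityCondition(0, "event", "the door swings open", start=0, end=4)
walk = EntityCondition(1, "event", "she crosses the room", start=0, end=120)
print(interval_positions(door, 4))  # roughly [0.0, 1.33, 2.67, 4.0]
print(interval_positions(walk, 4))  # [0.0, 40.0, 80.0, 120.0]

# Logits under 2D RoPE depend only on the (entity, time) offsets between q and k.
q = [0.3, -1.2, 0.5, 0.8, 1.1, 0.2, -0.4, 0.9]
k = [0.7, 0.1, -0.6, 0.4, 0.2, -1.0, 0.5, 0.3]
s1 = dot(rope_2d(q, 2, 10.0), rope_2d(k, 2, 7.0))
s2 = dot(rope_2d(q, 5, 40.0), rope_2d(k, 5, 37.0))
print(abs(s1 - s2) < 1e-9)  # True: same entity offset (0) and time offset (3)
```

Splitting the head dimension between two axes is the common way 2D RoPE is built for images; using an entity index as one of the axes is our assumption about how per-entity conditions could be kept apart in cross-attention. Like the abstract's embeddings, this construction adds no learnable parameters.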
