Source: runtimewire.com

Head to head: CogVideoX-5B vs Happy Horse 1.1 Image to Video

Happy Horse 1.1 Image to Video defeated CogVideoX-5B in a head-to-head comparison, scoring 16.9 to 0.5 across two video generation tasks. CogVideoX-5B failed to produce coherent scenes for either prompt, while Happy Horse consistently generated usable footage with readable action and scene construction.

Published Jul 3, 2026 · 3 min read

This wasn’t a close contest. The aggregate score says it plainly: 16.9 to 0.5 in favor of Happy Horse 1.1 Image to Video. More importantly, the footage backs it up. Happy Horse produced actual scenes with readable action; CogVideoX-5B mostly produced absence.

In Saffron Bowl Spill, CogVideoX-5B failed at the most basic level: instead of a cat knocking over a bowl of batter in a warm dawn kitchen, its output looked like a uniform beige screen with no discernible scene or motion. Happy Horse, by contrast, actually staged the prompt: the cat, the bowl, the spill, the flour puff, the spoon, the blueberries, and the warm kitchen light all register on screen, in a coherent sequence of events. Some of the motion is a touch stylized, but stylized beats nonexistent every time.

The gap was just as stark in Hallway Lantern Return. CogVideoX-5B again effectively no-showed, rendering frames that were essentially black. Happy Horse delivered the blackout hallway, the child with the lantern, the toy fire engine, the grandfather, and the hanging plant, while preserving the prompt’s reassuring emotional tone. That matters: this wasn’t just object recall, but scene construction, lighting control, and temporal coherence.

What sinks CogVideoX-5B here is not subtle artifacting or slightly weak motion physics. It’s total prompt failure. You can forgive a model for imperfect continuity; you cannot forgive it for not generating the scene at all. Happy Horse 1.1 Image to Video may not be flawless, but it consistently clears the threshold that makes a video model usable.

Final call: Happy Horse 1.1 Image to Video wins decisively. CogVideoX-5B didn’t lose on polish; it lost on basic visibility and prompt execution.

How they were tested

We ran two fresh video tasks, generated on the fly for this matchup so neither model could have seen the prompts in advance, and had gpt-5.4 score each output. The aggregate across both tasks: 16.9 for Happy Horse 1.1 Image to Video, 0.5 for CogVideoX-5B.
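For readers who want to reproduce this kind of tally, here is a minimal sketch of how per-task judge scores could be rolled up into a head-to-head total. It assumes the aggregate is a plain sum of per-task scores, which the write-up does not confirm. The `TaskResult` type, the `tally` helper, and the scores in the example are all hypothetical placeholders, not the per-task numbers from this matchup.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One judged task. All names here are illustrative, not from the article."""
    task: str
    score_a: float  # judge score for the first model
    score_b: float  # judge score for the second model


def tally(results):
    """Sum per-task scores and count task wins for each model.

    Assumption: the aggregate is a simple sum. The real scoring
    pipeline may weight or normalize tasks differently.
    """
    return {
        "total_a": sum(r.score_a for r in results),
        "total_b": sum(r.score_b for r in results),
        "wins_a": sum(r.score_a > r.score_b for r in results),
        "wins_b": sum(r.score_b > r.score_a for r in results),
    }


# Placeholder scores for illustration only (NOT the article's per-task data).
results = [
    TaskResult("task-1", score_a=0.0, score_b=8.0),
    TaskResult("task-2", score_a=1.0, score_b=7.5),
]
summary = tally(results)
print(summary)
```

Tracking task wins alongside the summed score shows whether a lead comes from one lopsided task or holds across every task.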

1. Saffron Bowl Spill

In a sunlit apartment kitchen at blue-gold dawn, a tabby cat’s tail clips a ceramic mixing bowl and a ribbon of saffron-tinted batter sloshes over the rim, droplets arcing onto a wooden counter while a thin veil of flour puffs up and hangs in the air; the camera makes a slow lateral dolly from the espresso machine toward the spill, staying close to counter height as the liquid folds, splashes, and runs naturally around a scattered teaspoon and three blueberries, warm window light catching every particle, with a cozy but slightly chaotic mood, 16:9

Winner: Happy Horse 1.1 Image to Video. The judge found that Model A (CogVideoX-5B) appears to be a uniform beige screen with no discernible scene or motion, failing the prompt entirely. Model B (Happy Horse 1.1 Image to Video) clearly depicts the cat, bowl, batter spill, flour puff, spoon, blueberries, warm dawn kitchen lighting, and coherent temporal progression, though some object motion and continuity are slightly stylized rather than fully natural.

2. Hallway Lantern Return

One continuous shot inside a narrow family hallway during a summer blackout: starting low behind a child in mismatched socks carrying a small battery lantern, the camera slowly tracks backward in front of her as she walks from the dim laundry nook toward the living room, pausing to nudge a rolling toy fire engine with her foot while her grandfather in the background reaches to steady a swaying hanging plant; soft amber lantern light and faint moonlight from a side window shape the scene, shadows shifting across framed drawings as the moment evolves with a hushed, reassuring mood, 16:9

Winner: Happy Horse 1.1 Image to Video. The judge found that Model A (CogVideoX-5B) appears essentially black across all sampled frames and fails to depict the prompt. Model B (Happy Horse 1.1 Image to Video) clearly matches the hallway blackout scene with the child, lantern, toy fire engine, grandfather, and hanging plant, while maintaining coherent motion, lighting, and a reassuring mood.
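Both failures described above, a uniform beige screen and essentially black frames, are the kind of output a simple automated check can flag before a clip ever reaches a judge. The sketch below is one way to do that, not the method used in this test. It assumes each sampled frame is already decoded into a 2D list of grayscale values from 0 to 255, and the function names and thresholds (`std_threshold`, `dark_level`) are illustrative guesses rather than tuned values.

```python
from statistics import mean, pstdev


def is_blank_frame(frame, std_threshold=2.0):
    """Return True if a grayscale frame has almost no pixel variation.

    frame: 2D list of ints in 0..255 (an assumed input format).
    std_threshold: hypothetical cutoff; real footage needs tuning.
    """
    pixels = [p for row in frame for p in row]
    return pstdev(pixels) < std_threshold


def classify_clip(frames, std_threshold=2.0, dark_level=16):
    """Label a clip 'black', 'uniform', or 'has_content' from sampled frames."""
    if not frames:
        raise ValueError("need at least one sampled frame")
    # A single frame with real variation is enough to rule out a blank clip.
    if not all(is_blank_frame(f, std_threshold) for f in frames):
        return "has_content"
    # Every frame is flat; separate near-black from other flat colors.
    avg = mean(p for f in frames for row in f for p in row)
    return "black" if avg < dark_level else "uniform"


# Tiny synthetic clips: three 4x4 frames each.
black_clip = [[[0] * 4 for _ in range(4)] for _ in range(3)]
beige_clip = [[[200] * 4 for _ in range(4)] for _ in range(3)]
scene_clip = [[[0, 255, 0, 255] for _ in range(4)] for _ in range(3)]

print(classify_clip(black_clip))  # black
print(classify_clip(beige_clip))  # uniform
print(classify_clip(scene_clip))  # has_content
```

A check like this only catches total failures. It says nothing about prompt adherence or motion quality, which still need a human or model judge.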

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
