It shoots, directs, and edits social videos for your venue: on autopilot, in every format, without a film crew.
Built for the Yacht Hackathon by @ComposioHQ, @nebius, @tavily-ai & @openclaw.
**Watch the 60s demo on X** · or play it inline below
Thirty seconds. That's all a viewer gives your venue before they swipe. To win those seconds you need a constant stream of video, but a human videographer is expensive, slow, and shoots one thing at a time.
Cosmos Claw is the always-on, AI-native alternative: a videographer and a marketing manager that never sleep. Point it at any venue (a short-let, a café, a bar) and it runs the whole studio on autopilot:
- **Studies your space:** GPT-4o (vision) labels every photo and learns what each room is.
- **Brands it:** invents the positioning and locks in the missing facts (price, story, amenities) so they stay identical across every video.
- **Ideates like a manager:** brainstorms a fresh campaign for each post (angle, hook, photo order, format, music, voice) that's different from everything it has shipped before.
- **Films real motion:** NVIDIA Cosmos 3, a world model built for robotics, doesn't pan over stills; it generates a first-person POV that physically walks into the room.
- **Voices & cuts it:** a unique GPT-written voiceover over a mood-matched music bed, cross-faded into the right aspect ratio.
- **Delivers everywhere:** ready-to-post cards (caption, hashtags, handle, recommended audio) for every feed: Reels/TikTok 9:16, IG 1:1 & 4:5, YouTube 16:9.
No crew, no call sheet, no edit bay. Just your existing photos in, and a full, on-brand social calendar out, on repeat.
As you read this, two workers are filming in parallel, pumping out a stream of ready-to-post Reels and TikToks for two San Francisco venues at once, each with its own AI voiceover, all on a Cosmos 3 model we deployed ourselves on Nebius H200s. It pauses when the network blips and resumes on its own. Truly always-on.
Looking forward to expanding this on the Yacht. SF is soooo amazing!
1. **Raw photos in.** Drop a venue's existing images into the project. That's the only input. (Here: the Alamo Square Hacker House, with its bedrooms, gym, and coworking space.)
2. **The marketing manager's memory.** An OpenClaw-style GPT-4o manager researches the venue, locks in a consistent brand (positioning, audience, tone, pitch), writes the voiceover, and picks the assets & order. This is the durable memory every video is grounded on.
3. **Ready-to-post cuts out.** The Agent Loop streams everything the videographer does in real time, and each published cut is a ready-to-post package: video + caption + recommended audio (music & voice) + handle, ready to download or push to the channel.
| Layer | What we used |
|---|---|
| Video model | NVIDIA Cosmos 3 Nano: a world model (built for robotics/embodied POV), self-deployed by us for first-person walk-throughs |
| Compute | Nebius: NVIDIA® H200 NVLink GPUs |
| Manager + director | GPT-4o (vision): studies the photos, brands the venue, ideates each campaign & storyboard |
| Neighborhood research | Tavily: enriches each venue with real local context |
| Maps & info cards | OpenStreetMap: location, transit & nearby spots |
| Audio | OpenAI TTS: a unique per-cut voiceover over a mood-matched music bed |
| App | FastAPI Studio UI + an always-on `marketing_loop` driver, FFmpeg for cutting/transitions |
We didn't just call a hosted API; we stood up Cosmos 3 Nano ourselves on Nebius H200 NVLink GPUs (vLLM-Omni, OpenAI-compatible) and drove it end-to-end. Tavily researches the surrounding neighborhood so every second of the video carries the context a viewer needs to say yes.
Shout-out to the partners: @ship_builders · @nebiusai · @nvidia · @composio · @tavilyai · @openclaw
```
venue photos + facts
        |
        v
GPT-4o manager ---> brand dossier (positioning + durable assumptions)
        |     |
        |     +-- Tavily --> neighborhood research
        |     +-- ideate --> one fresh campaign (angle · photos ·
        |                    format · music · voice · caption · VO)
        v
NVIDIA Cosmos 3 Nano ---> a short first-person POV clip per beat
        (world model, self-hosted on Nebius H200)
        |
        v
transitions + audio ---> cross-fade · GPT voiceover · mood music · reframe
        |
        v
ready-to-post cut.mp4 ---> Agent Loop feed (caption · hashtags · audio)
        |
        +--------------- loop: next idea, next venue (in parallel)
```
| File | Role |
|---|---|
| `scripts/marketing_loop.py` | The always-on loop: study → ideate → film → voice → publish, per venue (parallel-safe) |
| `scripts/cosmos_montage.py` | Terminal montage: GPT vision per photo → Cosmos clips → fast transitions |
| `app/marketing_agent.py` | GPT-4o marketing manager: research → brand → brief |
| `app/brand.py` | Per-venue brand dossier (memory, durable assumptions, social posts) |
| `app/main.py` | FastAPI server + Studio UI + generation API |
| `app/trailer.py` | GPT-4o "director": storyboard, shot/motion + walk-through mode |
| `app/generation/cosmos.py` | NVIDIA Cosmos 3 image→video adapter (motion → flow-shift) |
| `app/generation/stub.py` | Free local FFmpeg fallback generator |
| `app/transitions.py` | Fast cross-fade montage into any aspect ratio (xfade) |
| `app/curation.py` | Best-of-N take scoring (motion energy + stability) |
| `app/audio.py` | TTS voiceover + mood music bed + duck-and-mux |
| `app/infocards.py` | Map / price / neighborhood cards (OpenStreetMap) |
| `app/pipeline.py` | Orchestrates a single run (best-of-N, info beats, finish) |
| `app/agent.py` | Terminal CLI to drive the manager + fire renders |
| `deploy/tunnel_keeper.sh` | Self-healing SSH tunnel to the Nebius GPU |
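
The cross-fade montage mentioned for `app/transitions.py` relies on FFmpeg's `xfade` filter, where each fade needs an `offset` measured on the already-merged stream. The sketch below shows one way to compute those offsets and chain the filters. It is not the project's actual code: the function name `xfade_filter`, the fixed `fade` transition, and the stream labels are assumptions made for illustration, and it assumes all clips already share resolution and frame rate.

```python
from typing import List


def xfade_filter(durations: List[float], fade: float = 0.5) -> str:
    """Build an FFmpeg filter_complex string that cross-fades N clips in order.

    durations: length of each input clip in seconds (inputs 0..N-1).
    fade: cross-fade length in seconds; must be shorter than every clip.
    """
    if len(durations) < 2:
        raise ValueError("need at least two clips to cross-fade")
    if any(d <= fade for d in durations):
        raise ValueError("every clip must be longer than the fade")

    parts = []
    prev = "[0:v]"   # label of the running, already-merged stream
    elapsed = 0.0    # point on the merged stream where the next fade starts
    for i in range(1, len(durations)):
        # Each fade begins `fade` seconds before the merged stream ends.
        elapsed += durations[i - 1] - fade
        out = f"[v{i}]"
        parts.append(
            f"{prev}[{i}:v]xfade=transition=fade:"
            f"duration={fade:g}:offset={elapsed:g}{out}"
        )
        prev = out
    return ";".join(parts)


if __name__ == "__main__":
    # Three 4-second clips; map the final label [v2] when calling ffmpeg.
    print(xfade_filter([4, 4, 4]))
```

The returned string would be passed to `ffmpeg -filter_complex`, with the last label (here `[v2]`) selected via `-map`. Each overlap shortens the montage by one fade length, which is why the offset accumulates `duration - fade` per clip.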
Cosmos Claw isn't a one-shot tool; it's a loop. A persistent brand dossier (`outputs/listing_{id}_brand.json`) is the single source of truth per venue, and an autonomous manager works against it the way a real social-media manager would, forever:
1. **Study:** GPT-4o vision builds an **asset index**: what every uploaded photo is (cached, so it's paid for once).
2. **Ideate:** brainstorms ONE fresh campaign **distinct from past themes**: the angle, which photos to use (ordered like a story), the social format, music mood, TTS voice, a ready-to-post caption + hashtags, and a ~25s voiceover script.
3. **Film:** turns the chosen photos into short, first-person Cosmos clips.
4. **Cut:** cross-fades them into the campaign's aspect ratio and mixes the GPT voiceover over a mood-matched music bed.
5. **Publish:** drops a ready-to-post card into the **Agent Loop** feed and logs every step to the dossier timeline.

...then it does it again, with a brand-new idea. Run one worker per venue and they generate in parallel, so multiple feeds fill at once:
```bash
python scripts/marketing_loop.py --projects la-house-1 --tag la --max-videos 6
python scripts/marketing_loop.py --projects hacker-house --tag hh --max-videos 6
```
It's built to run unattended: a live endpoint probe before every shot means a Wi-Fi/tunnel blip just pauses the shoot, which resumes when the connection is back. No babysitting, no half-burned campaigns. A self-healing SSH tunnel keeper (`deploy/tunnel_keeper.sh`) keeps the link to the GPU alive underneath.
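
A probe-then-wait step like the one described above could look roughly like this. It is a sketch, not the loop's real implementation: `endpoint_alive` and `wait_until_alive` are hypothetical names, the backoff schedule is an assumption, and which URL to probe on the GPU host is left to the caller.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def endpoint_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers at all, False on a network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server replied (even with 4xx/5xx), so the link is up
    except (urllib.error.URLError, OSError):
        return False  # refused, timed out, or DNS failure: treat it as a blip


def wait_until_alive(
    probe: Callable[[], bool],
    delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until probe() succeeds; return how many failed probes were seen."""
    failures = 0
    while not probe():
        failures += 1
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off, but cap the wait
    return failures
```

A worker would call `wait_until_alive(lambda: endpoint_alive(url))` right before each shot, so a dropped tunnel delays the render instead of failing it. Passing `sleep` as a parameter keeps the waiting logic testable without real delays.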
Consistency is the trick. Whatever the manager makes up, it makes up once: `build_brand` writes the missing facts as durable assumptions that are never overwritten, so price, amenities and host story stay identical across every cut.
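
The "makes it up once" rule amounts to a first-write-wins merge into the dossier. The sketch below illustrates the idea only; the internals of `build_brand` are not shown in this README, so `lock_assumptions`, `update_dossier_file`, and the `"assumptions"` key are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def lock_assumptions(dossier: dict, proposed: Dict[str, str]) -> dict:
    """Add proposed facts to the dossier; facts already locked stay untouched."""
    # "assumptions" is a hypothetical key name, not the real dossier schema.
    locked = dossier.setdefault("assumptions", {})
    for key, value in proposed.items():
        locked.setdefault(key, value)  # first write wins, later proposals are ignored
    return dossier


def update_dossier_file(path: Path, proposed: Dict[str, str]) -> dict:
    """Load the JSON dossier (if it exists), lock new facts, and write it back."""
    dossier = json.loads(path.read_text()) if path.exists() else {}
    lock_assumptions(dossier, proposed)
    path.write_text(json.dumps(dossier, indent=2))
    return dossier
```

With this shape, a later ideation pass that proposes a different price simply has no effect, which is what keeps every cut quoting the same figure.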
```
study (vision asset index) --> ideate (fresh campaign) --> film --> voice + cut --> publish --+
  ^                                                                                           |
  +------------------------------ grounded on the brand dossier ------------------------------+
```
Prefer to drive it by hand? The same brain runs from the Agent Loop tab in the UI, or from the terminal:
```bash
python -m app.agent list                                  # projects + dossier status
python -m app.agent run la-house-1 --format reel          # research → brand → brief
python -m app.agent assume la-house-1 price '$245/night'  # lock a consistent fact (single quotes keep the shell from expanding $2)
python -m app.agent generate la-house-1 --format youtube  # render via the live API
```
Formats: `reel`, `tiktok`, `shorts`, `story`, `snap` (9:16), `youtube` (16:9), `square` (1:1), `portrait` (4:5). The render canvas switches automatically.
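
To make the format-to-canvas switch concrete, here is a small sketch of how a format name could resolve to pixel dimensions. The aspect ratios come from the list above; the 1080-pixel short side, the even-number rounding, and the names `ASPECTS` and `canvas_for` are assumptions for illustration, not the project's actual values.

```python
from typing import Dict, Tuple

# Aspect ratios (width, height) per format name, as listed in the README.
ASPECTS: Dict[str, Tuple[int, int]] = {
    "reel": (9, 16), "tiktok": (9, 16), "shorts": (9, 16),
    "story": (9, 16), "snap": (9, 16),
    "youtube": (16, 9), "square": (1, 1), "portrait": (4, 5),
}


def canvas_for(fmt: str, short_side: int = 1080) -> Tuple[int, int]:
    """Return (width, height) in pixels with the shorter side at `short_side`."""
    try:
        w, h = ASPECTS[fmt]
    except KeyError:
        raise ValueError(f"unknown format {fmt!r}") from None
    unit = min(w, h)
    width = short_side * w // unit
    height = short_side * h // unit
    # Round down to even numbers, which common H.264/yuv420p encodes expect.
    return width - width % 2, height - height % 2


if __name__ == "__main__":
    for name in ("reel", "youtube", "square", "portrait"):
        print(name, canvas_for(name))
```

Keeping the ratios in one table means adding a new feed is a one-line change, and the rest of the pipeline only ever sees a width and a height.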
Requires Python 3.9+ and FFmpeg.
```bash
cd LiveHere
brew install ffmpeg            # one time
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env           # add your keys (OpenAI, Tavily, Cosmos)
python -m app                  # → http://127.0.0.1:8000
```
Open http://127.0.0.1:8000, pick a listing, tweak the auto-filled details, and hit Generate. With no GPU configured it runs on the free local FFmpeg stub; point it at Cosmos for the real thing (below).
The generation backend is swapped purely via env vars, with no code change to the UI or pipeline.
```bash
LIVEHERE_BACKEND=cosmos
COSMOS_API_STYLE=vllm_omni
COSMOS_BASE_URL=http://<your-gpu-host>:8000/v1
COSMOS_API_KEY=...
```
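
One way an env-driven backend switch like this can be wired is sketched below. The four variable names are the ones shown above; everything else is an assumption, including the function name `pick_backend`, the `"stub"` default value, the fallback of any unrecognized value to the stub, and the shape of the returned dict.

```python
import os
from typing import Mapping, Optional


def pick_backend(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve generation-backend settings from environment variables."""
    env = os.environ if env is None else env
    # Assumption: anything other than "cosmos" falls back to the local stub.
    name = env.get("LIVEHERE_BACKEND", "stub").strip().lower()
    if name != "cosmos":
        return {"backend": "stub"}  # free local FFmpeg fallback
    base_url = env.get("COSMOS_BASE_URL", "").rstrip("/")
    if not base_url:
        raise RuntimeError("LIVEHERE_BACKEND=cosmos needs COSMOS_BASE_URL")
    return {
        "backend": "cosmos",
        "api_style": env.get("COSMOS_API_STYLE", "vllm_omni"),
        "base_url": base_url,
        "api_key": env.get("COSMOS_API_KEY", ""),
    }
```

Because the choice is made in one place from plain settings, the UI and pipeline can ask for "a generator" without knowing whether a GPU is attached.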
We self-hosted it on a Nebius H200 NVLink instance with vLLM-Omni:
```bash
vllm serve nvidia/Cosmos3-Nano --omni --host 0.0.0.0 --port 8000 --no-guardrails
```
Full deploy walkthrough (Nebius / Modal / RunPod) is in deploy/DEPLOY.md. Cosmos can't run on Apple Silicon, so it needs a remote GPU instance; keep that instance up only while generating, and tear it down when idle.
Cosmos Claw · made for the Yacht Hackathon · Composio × Nebius × Tavily